
Building Multi-Agent AI Systems with Autogen & AWS Bedrock

A practical guide to building production-ready agentic AI systems with human-in-the-loop workflows.

The rise of Large Language Models has opened new possibilities for building intelligent systems that can reason, plan, and execute complex tasks. In this post, I’ll share my experience building a production multi-agent AI system for customs declaration automation using Microsoft’s Autogen framework and AWS Bedrock.

Why Multi-Agent Architectures?

Single-agent LLM systems hit limitations quickly when dealing with complex, multi-step tasks. Consider customs declaration generation: you need to extract information from invoices, validate against regulations, format according to specific templates, and handle edge cases gracefully.

A multi-agent architecture offers several advantages:

  • Separation of concerns: Each agent specializes in a specific task
  • Better error handling: Agents can critique and correct each other
  • Scalability: Add new capabilities by adding new agents
  • Human-in-the-loop: Easier to insert human oversight at critical points

Architecture Overview

Our system uses four main agents, plus a human-in-the-loop review step:

┌─────────────────┐     ┌─────────────────┐
│  OCR Agent      │────▶│  Extraction     │
│  (Document      │     │  Agent          │
│   Processing)   │     │  (Data Parse)   │
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
┌─────────────────┐     ┌─────────────────┐
│  Human Review   │◀────│  Validation     │
│  Agent          │     │  Agent          │
│  (HITL)         │     │  (Rules Check)  │
└────────┬────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐
│  Declaration    │
│  Generator      │
└─────────────────┘

Implementing with Autogen

Autogen provides a clean abstraction for multi-agent conversations. Here’s a simplified example of our agent setup:

from autogen import AssistantAgent, UserProxyAgent

# Configuration for AWS Bedrock
llm_config = {
    "config_list": [{
        # Bedrock model IDs carry a version suffix
        "model": "anthropic.claude-3-sonnet-20240229-v1:0",
        "api_type": "bedrock",
        "aws_region": "eu-west-1"
    }],
    "temperature": 0.1
}

# OCR Agent - processes documents
ocr_agent = AssistantAgent(
    name="OCRAgent",
    system_message="""You are an OCR specialist.
    Extract all text from invoice images accurately.
    Pay special attention to: item descriptions, quantities,
    values, HS codes, and country of origin.""",
    llm_config=llm_config
)

# Validation Agent - checks extracted data
validation_agent = AssistantAgent(
    name="ValidationAgent",
    system_message="""You are a customs compliance expert.
    Validate extracted invoice data against customs regulations.
    Flag any inconsistencies or missing required fields.""",
    llm_config=llm_config
)

Human-in-the-Loop Design

For customs declarations, accuracy is critical. We implemented a human review step using Autogen’s UserProxyAgent:

human_proxy = UserProxyAgent(
    name="HumanReviewer",
    human_input_mode="ALWAYS",
    code_execution_config=False,
    system_message="""Present validation results to human reviewer.
    Format findings clearly and request confirmation
    before proceeding to declaration generation."""
)

The key insight is that human-in-the-loop shouldn’t just be a final approval step. We insert human review at the validation stage, allowing corrections before the final declaration is generated.
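
One way to wire the agents together is Autogen's GroupChat with a round-robin speaker order, which keeps the human review step between validation and generation. Here is a minimal sketch reusing the agents defined above; the DeclarationGenerator agent, its prompt, and the max_round value are illustrative, and the extraction agent is omitted for brevity:

from autogen import AssistantAgent, GroupChat, GroupChatManager

# Hypothetical final agent that turns validated data into a declaration
declaration_agent = AssistantAgent(
    name="DeclarationGenerator",
    system_message="Generate the customs declaration from the validated invoice data.",
    llm_config=llm_config
)

# Round-robin order mirrors the pipeline: OCR -> validation -> human review -> generation
group_chat = GroupChat(
    agents=[ocr_agent, validation_agent, human_proxy, declaration_agent],
    messages=[],
    max_round=8,
    speaker_selection_method="round_robin"
)
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)

# Kick off the run with the raw invoice text
human_proxy.initiate_chat(manager, message="Process this invoice: <invoice text>")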

Scaling with AWS Bedrock

AWS Bedrock provides managed LLM inference, which simplified our production deployment significantly:

  • No infrastructure management: Focus on agent logic, not GPU clusters
  • Built-in guardrails: Content filtering and PII detection (see the guardrail sketch after this list)
  • Cost control: Pay per token, easy to monitor and budget
  • Multi-model flexibility: Easy to switch between Claude, Titan, or other models
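
If you call Bedrock directly (outside Autogen), a guardrail can be attached to each request through the Converse API; the guardrail identifier and version below are placeholders for one you create in your own account:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": "Extract the line items from this invoice: ..."}]}],
    # Placeholder guardrail - create one in the Bedrock console and reference it here
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1"
    }
)
print(response["output"]["message"]["content"][0]["text"])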

For high-throughput scenarios, we use AWS Lambda for the agent orchestration layer:

# Lambda handler for invoice processing
# (agents are constructed at module scope, as above, so warm invocations reuse them)
def handler(event, context):
    # Invoice payload passed in by the upstream trigger (e.g. S3 or API Gateway)
    invoice_data = event['invoice']

    # Initialize agent conversation
    chat_result = ocr_agent.initiate_chat(
        validation_agent,
        message=f"Process this invoice: {invoice_data}"
    )

    # Return the conversation summary to the caller
    return {
        'statusCode': 200,
        'body': chat_result.summary
    }

Lessons Learned

After several months in production, here are key takeaways:

1. Prompt Engineering is Critical

Each agent’s system prompt went through dozens of iterations. Be specific about the following (an example prompt follows this list):

  • Expected input format
  • Output structure
  • Edge cases to handle
  • When to escalate to human review
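
As a concrete illustration, a validation-agent prompt sketched along these lines covers all four points; the field names, thresholds, and wording are examples rather than our production prompt:

VALIDATION_SYSTEM_PROMPT = """You are a customs compliance expert.

Input: JSON with fields items, total_value, currency, origin_country, destination_country.
Output: JSON only, with fields "status" ("ok" or "needs_review") and "issues" (list of strings).

Edge cases:
- Missing or malformed HS codes: add an issue and set status to "needs_review".
- Currency that is not a three-letter ISO 4217 code: add an issue.
- Totals that do not match the sum of line items: add an issue.

Escalation: if any issue cannot be resolved from the document alone,
set status to "needs_review" so a human reviewer is pulled in."""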

2. Structured Outputs Save Time

Instead of parsing free-form LLM responses, we use structured output schemas:

from pydantic import BaseModel

# Line-item fields mirror what the OCR agent is asked to extract
class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_value: float
    hs_code: str
    country_of_origin: str

class ExtractedInvoice(BaseModel):
    items: list[InvoiceItem]
    total_value: float
    currency: str
    origin_country: str
    destination_country: str
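
With the schema in place, a common pattern (assuming Pydantic v2) is to embed the JSON schema in the extraction prompt and then validate the raw response, so malformed output fails fast instead of propagating downstream:

import json

# Embed the expected schema in the extraction agent's prompt
schema_snippet = json.dumps(ExtractedInvoice.model_json_schema(), indent=2)

# Validate the raw LLM response; a pydantic ValidationError here is a signal
# to route the document to human review rather than continuing the pipeline
raw_response = '{"items": [], "total_value": 1250.0, "currency": "EUR", "origin_country": "CN", "destination_country": "NL"}'
invoice = ExtractedInvoice.model_validate_json(raw_response)
print(invoice.total_value)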

3. Monitoring is Essential

Track these metrics (a minimal logging sketch follows the list):

  • Agent success/failure rates per task
  • Human intervention frequency
  • Processing time per document
  • Token usage and costs
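
Since the orchestration layer already runs on AWS, one lightweight option is to publish these as custom CloudWatch metrics; the namespace and metric names here are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def record_document_metrics(processing_ms: float, tokens_used: int, needed_human: bool) -> None:
    # Placeholder namespace and metric names - adjust to your own conventions
    cloudwatch.put_metric_data(
        Namespace="CustomsAgents",
        MetricData=[
            {"MetricName": "ProcessingTime", "Value": processing_ms, "Unit": "Milliseconds"},
            {"MetricName": "TokensUsed", "Value": float(tokens_used), "Unit": "Count"},
            {"MetricName": "HumanIntervention", "Value": 1.0 if needed_human else 0.0, "Unit": "Count"}
        ]
    )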

4. Graceful Degradation

When the LLM fails or returns low-confidence results, the system should (see the routing sketch after this list):

  • Route to human review automatically
  • Provide clear context for what went wrong
  • Learn from corrections (via fine-tuning or prompt updates)
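
Here is a minimal routing sketch, under the assumption that the validation agent returns a dict with status, confidence, and issues fields; the queue URL and confidence threshold are placeholders:

import json
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
# Placeholder queue URL - point this at your own review queue
REVIEW_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/declaration-review"

def route_result(validation: dict) -> str:
    # Assumed response shape: {"status": ..., "confidence": ..., "issues": [...]}
    low_confidence = validation.get("confidence", 0.0) < 0.8
    if validation.get("status") == "needs_review" or low_confidence:
        # Hand off to a human, keeping the agent's findings as context for the reviewer
        sqs.send_message(QueueUrl=REVIEW_QUEUE_URL, MessageBody=json.dumps(validation))
        return "human_review"
    return "auto_approved"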

What’s Next

We’re exploring:

  • Fine-tuning smaller models on our specific domain
  • Multi-modal agents that process images directly
  • Automated prompt optimization using DSPy
  • Expanded agent capabilities for regulatory updates

The multi-agent paradigm is still evolving rapidly. What excites me most is the potential for systems that genuinely augment human expertise rather than just automating simple tasks.


Have questions about building agentic AI systems? Feel free to reach out on LinkedIn.