Architecture

The system rests on two pillars: multi-Agent collaboration and RAG knowledge enhancement. It uses a three-layer architecture and a "1+5" multi-Agent design (one orchestration Agent plus five specialized Agents), and each layer has a built-in fallback to keep the service available.

System Overview

flowchart TD
    User([User Request]) --> API["Access Layer API v1<br/>/chat · /chat/stream · /agent/*"]
    API --> Cache{"HotQueryCache<br/>hot cache hit?"}
    Cache -- "Hit" --> Reply([Return cached reply])
    Cache -- "Miss" --> Orch["Collaboration Layer Orchestrator<br/>Intent · Routing · Fallback"]

    Orch --> Intent{"Intent Recognition<br/>Rule fast path / IntentCache / LLM"}
    Intent --> Route{"Routing Decision"}
    Route --> KA["KnowledgeAgent<br/>Hybrid Retrieval+RAG"]
    Route --> BA["BusinessAgent<br/>Business Query+Masking"]
    Route --> EA["EmotionAgent<br/>Sentiment+Escalation"]
    Route --> TA["TicketAgent<br/>Ticket Extract+Create"]
    Route --> DA["DialogAgent<br/>Dialog Polish"]

    KA --> Data["Data Layer<br/>ChromaDB · BM25 · Reranker"]
    BA --> Biz["Business System<br/>mock / http"]
    EA --> Esc["Escalation Engine<br/>EscalationCard"]
    TA --> Ticket["Ticket Storage"]

    KA --> Dialog["DialogAgent<br/>Result Merge+Polish"]
    BA --> Dialog
    EA --> Dialog
    TA --> Dialog
    Dialog --> Output([Final Reply])
    Esc --> Output

    style Cache fill:#e8f5e9,stroke:#4caf50
    style Orch fill:#e3f2fd,stroke:#2196f3
    style Data fill:#fff3e0,stroke:#ff9800

Three-Layer Architecture

The system is divided into an access layer, a collaboration layer, and a data layer. Each layer has a clear responsibility, and dependencies run in one direction only.

Access Layer (API v1)

This layer lives in app/api/v1/ and handles HTTP access, authentication, request validation, and response wrapping.

Module	Endpoints	Responsibility
`chat.py`	`/chat`, `/chat/stream`	Sync chat and SSE streaming chat
`agent.py`	`/agent/sessions/*` (8 endpoints)	Agent assist workbench
`knowledge.py`	`/knowledge/ingest`, `/knowledge/stats`	Knowledge base management
`evaluation.py`	`/evaluation/run`	Retrieval evaluation (Recall/Hit/MRR/hallucination rate)
`performance.py`	`/performance/metrics`, `/performance/cache/invalidate`	Performance monitoring and cache cleanup
`observability.py`	`/observability/health`	Component health check
`operations.py`	`/operations/dashboard`	Operations dashboard and canary release

Async-first

All endpoints are declared with async def. IO-bound work (LLM calls, vector retrieval, business API calls) does not block the event loop, so a single process can serve many concurrent requests.

Collaboration Layer (agents)

This layer lives in app/agents/ and is the core of multi-Agent collaboration. It handles intent recognition, task routing, Agent execution, and result merging.

orchestrator.py: Orchestration Agent, intent recognition and routing dispatch
graph.py: LangGraph state machine orchestration (degrades to the synchronous orchestrator when LangGraph is unavailable)
knowledge_agent.py: Knowledge Retrieval Agent (hybrid retrieval + reranking + summary)
business_agent.py: Business Query Agent
emotion_agent.py: Sentiment Analysis Agent
ticket_agent.py: Ticket Processing Agent
dialog_agent.py: Dialog Polish Agent
llm_client.py: LLM client (incl. _MockLLM fallback)

Data Layer (knowledge + core)

This layer lives in app/knowledge/ and app/core/ and provides knowledge retrieval, persistence, and shared infrastructure.

Module	Responsibility
`hybrid_retriever.py`	Hybrid retrieval (vector + BM25 + RRF fusion)
`reranker.py`	CrossEncoder reranking (degrades to cosine similarity)
`vectorstore.py`	ChromaDB wrapper
`embeddings.py`	BGE embedding service (degrades to hash fallback)
`bm25.py`	BM25 keyword retrieval
`query_rewriter.py`	Query rewriting
`pipeline.py`	Document ingestion pipeline
`performance.py`	HotQueryCache / ModelRouter / IntentCache
`circuit_breaker.py`	Circuit breaker fallback
`langfuse_client.py`	Langfuse tracing (degrades to no-op)
`session.py`	Session management

Multi-Agent "1+5" Architecture

The system uses 1 orchestration Agent to coordinate 5 specialized Agents, each with its own responsibility and no overlap.

flowchart LR
    subgraph Orchestration
        O["Orchestrator<br/>Orchestration Agent"]
    end
    subgraph Specialized Agents
        K["KnowledgeAgent<br/>Knowledge Retrieval"]
        B["BusinessAgent<br/>Business Query"]
        E["EmotionAgent<br/>Sentiment Analysis"]
        T["TicketAgent<br/>Ticket Processing"]
        D["DialogAgent<br/>Dialog Polish"]
    end
    O -->|Route dispatch| K
    O -->|Route dispatch| B
    O -->|Route dispatch| E
    O -->|Route dispatch| T
    O -->|Result merge| D
    D --> Output([Final Reply])

    style O fill:#e3f2fd,stroke:#2196f3,stroke-width:2px

Orchestrator (Orchestration Agent)

The "brain" of the multi-Agent architecture, responsible for:

Intent recognition: Three-level mechanism (rule fast path → IntentCache → LLM), see Multi-Agent Collaboration
Routing dispatch: Routes the query to the corresponding specialized Agent based on intent
Sentiment priority: When a sentiment-sensitive intent or an agitated user is detected, the Orchestrator overrides the normal route and switches to sentiment handling
Fallback handling: An unknown intent returns a guidance message; two consecutive unresolved turns escalate to a human agent
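
The three-level intent lookup and the two-turn escalation rule can be sketched as follows. This is a minimal illustration, not the project's orchestrator: the rule table, the intent labels, the `IntentResolver` class, and the `should_escalate` helper are all hypothetical names, and a plain dict stands in for IntentCache.

```python
import re
from typing import Callable, Dict

# Hypothetical rule table (regex -> intent label); not taken from the project.
RULES = [
    (re.compile(r"\b(refund|return)\b", re.I), "ticket"),
    (re.compile(r"\b(order|delivery)\b", re.I), "business_query"),
]


class IntentResolver:
    """Resolve intent via rule fast path, then a cache, then an LLM call."""

    def __init__(self, llm_classify: Callable[[str], str]) -> None:
        self._llm_classify = llm_classify
        self._cache: Dict[str, str] = {}  # stand-in for IntentCache
        self.llm_calls = 0  # counts how often the slow path was taken

    def resolve(self, query: str) -> str:
        # Level 1: cheap keyword rules, no cache or LLM involved.
        for pattern, intent in RULES:
            if pattern.search(query):
                return intent
        # Level 2: reuse a previous LLM decision for the same normalized query.
        key = query.strip().lower()
        if key in self._cache:
            return self._cache[key]
        # Level 3: ask the LLM and remember the answer.
        self.llm_calls += 1
        intent = self._llm_classify(query)
        self._cache[key] = intent
        return intent


def should_escalate(unresolved_turns: int, limit: int = 2) -> bool:
    """Escalate to a human once `limit` consecutive turns stay unresolved."""
    return unresolved_turns >= limit
```

Ordering the levels from cheapest to most expensive means the LLM is only consulted for queries that neither the rules nor the cache can classify.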

KnowledgeAgent (Knowledge Retrieval Agent)

The core of knowledge Q&A. It orchestrates the full RAG chain:

Query rewriting → hybrid retrieval (vector + BM25 + RRF) → reranking → threshold filtering
Optional LLM summary generation (generate_summary=True)
Returns a fallback reply when retrieval comes back empty, avoiding a wasted LLM call
See RAG Retrieval Pipeline
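
The RRF (Reciprocal Rank Fusion) step that merges the vector and BM25 rankings can be illustrated with the standard formula, score(d) = Σ 1 / (k + rank). The sketch below assumes the commonly used constant k = 60; the function name `rrf_fuse` and the value of k are assumptions, not details confirmed by the project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # illustrative ids from the vector retriever
bm25_hits = ["b", "c", "d"]    # illustrative ids from the BM25 retriever
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because RRF uses only rank positions, it needs no score normalization between the vector and keyword retrievers; documents that appear in both lists rise to the top.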

BusinessAgent (Business Query Agent)

Integrates with the business system to query orders/members/returns/accounts:

Supports two modes: mock (in-memory data) and http (real business system)
Phone number masking: phone numbers in results are auto-masked (middle 4 digits replaced with ****)
Write operation confirmation: write operations like refunds/returns require user confirmation before execution
Failure fallback: when the business API is unavailable, degrades to the mock business system
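
Phone number masking can be sketched with a single regular expression. The example assumes 11-digit mobile numbers that start with 1 and treats digits 4 through 7 as the "middle 4 digits"; that number format and the name `mask_phones` are assumptions made for illustration.

```python
import re

# Assumed format: an 11-digit mobile number starting with 1.
# The lookarounds keep the pattern from matching inside longer digit runs
# such as order numbers.
_PHONE = re.compile(r"(?<!\d)(1\d{2})\d{4}(\d{4})(?!\d)")


def mask_phones(text: str) -> str:
    """Replace the middle 4 digits of every matching phone number with ****."""
    return _PHONE.sub(r"\1****\2", text)


masked = mask_phones("Contact: 13812345678, order 20240101123456789")
```

Here the phone number becomes 138****5678 while the 17-digit order number is left untouched.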

EmotionAgent (Sentiment Analysis Agent)

Recognizes user sentiment and triggers corresponding handling:

Keyword sentiment scoring: profanity +5, complaint words +3
When the user is agitated (score > 4), sentiment handling takes priority: soothe first, then resolve
A sentiment-sensitive intent escalates directly to a human agent so that the conflict does not intensify
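
The keyword scoring rule can be sketched as below, using the weights (+5, +3) and the threshold (score > 4) given above. The word lists and function names are hypothetical placeholders; the project's actual lexicon is not shown in this document.

```python
import re

# Hypothetical word lists for illustration only.
PROFANITY = {"damn", "idiot"}
COMPLAINT = {"complain", "terrible", "unacceptable"}


def sentiment_score(text: str) -> int:
    """Score a message: +5 per profanity hit, +3 per complaint word."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for word in words:
        if word in PROFANITY:
            score += 5
        elif word in COMPLAINT:
            score += 3
    return score


def is_agitated(text: str, threshold: int = 4) -> bool:
    """Agitated means the score is strictly greater than the threshold."""
    return sentiment_score(text) > threshold
```

With these weights a single complaint word (3) stays below the threshold, while one profanity (5) or two complaint words (6) crosses it.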

TicketAgent (Ticket Processing Agent)

Extracts ticket information from the conversation and stores it as a ticket:

Recognizes ticket intents like returns/refunds/complaints/after-sales
Extracts key info (order number, problem description) to create a ticket
Returns an acknowledgment reply confirming that the ticket has been accepted

DialogAgent (Dialog Polish Agent)

Unifies the style and annotates sources for the final reply:

Merges the raw results of each Agent into a coherent reply
Annotates citation sources (Source: faq.md page 3)
Performance optimization: chitchat/ticket/business_query intents skip LLM polishing and use the raw reply directly
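
The skip-polish shortcut can be sketched as a simple set lookup before the LLM call. The intent labels come from the list above; the `finalize_reply` function, its parameters, and the exact citation layout are hypothetical.

```python
from typing import Callable, Sequence

# Intents whose raw reply is returned without LLM polishing.
SKIP_POLISH_INTENTS = {"chitchat", "ticket", "business_query"}


def finalize_reply(
    intent: str,
    raw_reply: str,
    polish: Callable[[str], str],
    sources: Sequence[str] = (),
) -> str:
    """Polish the reply unless the intent is on the skip list, then cite sources."""
    reply = raw_reply if intent in SKIP_POLISH_INTENTS else polish(raw_reply)
    if sources:
        # Append a citation line in the style "(Source: faq.md page 3)".
        reply = f"{reply}\n(Source: {', '.join(sources)})"
    return reply
```

Passing the polishing step in as a callable keeps the sketch independent of any particular LLM client.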

Key Design Principles

1. Fallback-First

Each layer has a fallback, so the system stays available when any single component fails:

Layer	Component	Fallback Target
Access	API auth	`API_KEY` empty → no-auth mode
Collaboration	LangGraph	Unavailable → synchronous orchestrator `_SynchOrchestrator`
Collaboration	LLM	Unavailable → `_MockLLM` assembled reply
Collaboration	Real LLM call	Failure → ModelRouter retries with the default model
Data	BGE embedding	Load failure → hash fallback vectors
Data	Reranker	Load failure → cosine similarity reranking
Data	Redis	Unreachable → in-memory queue
Data	Business API	Failure → mock business system
Observability	Langfuse	Not configured → no-op, no impact on main path

See Fallback Strategy for details.
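
The general shape of these fallbacks, try the primary component and switch to a cheap substitute on failure, can be sketched with the embedding row of the table. Everything here is illustrative: `with_fallback`, `hash_embed`, and `_broken_bge_embed` are hypothetical names, and the hashing scheme is an assumption rather than the project's implementation.

```python
import hashlib
from typing import Callable, List


def hash_embed(text: str, dim: int = 8) -> List[float]:
    """Deterministic pseudo-embedding derived from a SHA-256 digest.

    It carries no semantic meaning; it only keeps the pipeline running.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def with_fallback(
    primary: Callable[[str], List[float]],
    fallback: Callable[[str], List[float]],
) -> Callable[[str], List[float]]:
    """Wrap `primary` so that any exception routes the call to `fallback`."""

    def call(text: str) -> List[float]:
        try:
            return primary(text)
        except Exception:  # deliberately broad: any failure triggers the fallback
            return fallback(text)

    return call


def _broken_bge_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model that failed to load."""
    raise RuntimeError("model not loaded")


embed = with_fallback(_broken_bge_embed, hash_embed)
vector = embed("hello")  # served by hash_embed because the primary raises
```

The caller sees the same function signature in both cases, which is what lets the rest of the pipeline stay unchanged when a component degrades.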

2. Async-First

All middleware and API endpoints use async/await, so IO-bound operations do not block the event loop:

LLM calls, vector retrieval, and business API calls run either as native async calls or in a thread pool
Complex questions that split into multiple subtasks run those subtasks in parallel via ThreadPoolExecutor (4 workers)
SSE streaming responses are returned token by token, keeping first-token latency low
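
One common way to combine asyncio with a four-worker ThreadPoolExecutor is `loop.run_in_executor`. The sketch below assumes that wiring; the function names are hypothetical, and `time.sleep` stands in for a blocking LLM or retrieval call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

# Four workers, matching the pool size mentioned above.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def blocking_subtask(name: str) -> str:
    """Stand-in for a blocking call (LLM request, vector search, ...)."""
    time.sleep(0.05)
    return f"{name}: done"


async def run_subtasks(names: Sequence[str]) -> List[str]:
    """Run blocking subtasks in the pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    futures = [
        loop.run_in_executor(_EXECUTOR, blocking_subtask, name) for name in names
    ]
    # gather preserves the order of the submitted subtasks.
    return list(await asyncio.gather(*futures))


results = asyncio.run(run_subtasks(["retrieve", "lookup_order"]))
```

While the worker threads sleep, the event loop remains free to serve other requests, which is the point of offloading blocking calls.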

3. Cache-First

The system is designed with multi-layer caching to reduce latency and LLM call cost:

Cache	Location	Hit Effect
HotQueryCache	`run_graph` entry/exit	A knowledge Q&A cache hit returns in about 0.002s and skips all LLM calls
IntentCache	Intent recognition stage	First-token latency drops from 2.7s to ~800ms
ModelRouter	Intent recognition stage	Simple queries are routed to the small model; first-token latency drops to ~1s

Three-Layer Performance Optimization Combo

The combination of HotQueryCache, IntentCache, and ModelRouter, consulted in that order, is the core of the system's performance optimization:

flowchart TD
    Q[User Query] --> HQC{"HotQueryCache<br/>hot cache"}
    HQC -- "Hit → 0.002s" --> Fast([Return cached directly])
    HQC -- "Miss" --> IC{"IntentCache<br/>intent cache"}
    IC -- "Hit → skip LLM" --> Route[Route dispatch]
    IC -- "Miss" --> MR{"ModelRouter<br/>large/small model routing"}
    MR -- "Simple → small model ~1s" --> Route
    MR -- "Complex → main LLM ~2.7s" --> Route
    Route --> Agent[Agent execution]
    Agent --> Write[Write to HotQueryCache]

    style HQC fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
    style Fast fill:#c8e6c9,stroke:#4caf50
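
The outermost step of this flow, check the hot cache, run the Agents on a miss, then write the reply back, can be sketched as follows. This is not the project's HotQueryCache: the TTL and LRU eviction policy, the default sizes, and the `answer` helper are assumptions chosen to make the example concrete.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class HotQueryCache:
    """Small TTL + LRU cache keyed by the normalized query text (illustrative)."""

    def __init__(self, max_size: int = 128, ttl_seconds: float = 300.0) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._store: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()

    @staticmethod
    def _key(query: str) -> str:
        return query.strip().lower()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, reply = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return reply

    def put(self, query: str, reply: str) -> None:
        key = self._key(query)
        self._store[key] = (time.monotonic() + self._ttl, reply)
        self._store.move_to_end(key)
        while len(self._store) > self._max_size:
            self._store.popitem(last=False)  # evict the least recently used


def answer(query: str, cache: HotQueryCache, run_agents: Callable[[str], str]) -> str:
    """Serve from the cache on a hit; otherwise run the Agents and cache the reply."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    reply = run_agents(query)
    cache.put(query, reply)
    return reply
```

A hit returns before intent recognition or any Agent runs, which is why the cached path is so much faster than the full pipeline.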

Measured performance

Metric	Target	Actual
Recall@5	≥ 0.85	1.0
Hit Rate	≥ 0.90	0.9333
Hallucination Rate	≤ 0.10	0.0
Independent Resolution Rate	≥ 60%	80%
Avg Response Time	≤ 3s	2.27s
Hot Cache Hit	—	0.002s

Project Structure Overview

app/
├── api/v1/              # Access Layer: REST API endpoints
│   ├── chat.py          # Chat endpoint (sync + SSE streaming)
│   ├── agent.py         # Agent assist endpoints (8)
│   ├── knowledge.py     # Knowledge base management
│   ├── evaluation.py    # Retrieval evaluation
│   ├── performance.py   # Performance monitoring
│   ├── observability.py # Observability
│   └── operations.py    # Operations dashboard
├── agents/              # Collaboration Layer
│   ├── orchestrator.py  # Orchestration Agent
│   ├── graph.py         # LangGraph state machine
│   ├── knowledge_agent.py    # Knowledge Retrieval Agent
│   ├── business_agent.py     # Business Query Agent
│   ├── emotion_agent.py      # Sentiment Analysis Agent
│   ├── ticket_agent.py       # Ticket Processing Agent
│   ├── dialog_agent.py       # Dialog Polish Agent
│   └── llm_client.py    # LLM client (mock fallback)
├── core/                # Core infrastructure
│   ├── config.py        # Configuration management
│   ├── session.py       # Session management
│   ├── performance.py   # HotQueryCache / ModelRouter / IntentCache
│   ├── circuit_breaker.py   # Circuit breaker fallback
│   └── langfuse_client.py   # Langfuse tracing
├── knowledge/           # Data Layer
│   ├── hybrid_retriever.py  # Hybrid retrieval
│   ├── reranker.py      # Reranking
│   ├── vectorstore.py   # ChromaDB
│   ├── embeddings.py    # BGE embedding
│   ├── bm25.py          # BM25 retrieval
│   └── pipeline.py      # Document ingestion pipeline
└── schemas/             # Pydantic data models

Topic	Link
Multi-Agent collaboration (LangGraph state machine)	Multi-Agent Collaboration
RAG retrieval pipeline (hybrid retrieval + reranking)	RAG Retrieval Pipeline
Fallback and fault tolerance (7-layer fallback matrix)	Fallback Strategy
All configuration options	Configuration Guide