Architecture
The system combines two engines: multi-Agent collaboration and RAG knowledge enhancement. It uses a three-layer architecture and a "1+5" multi-Agent design (one orchestration Agent plus five specialized Agents), with a built-in fallback at each layer to keep the service available.
System Overview
```mermaid
flowchart TD
    User([User Request]) --> API["Access Layer API v1<br/>/chat · /chat/stream · /agent/*"]
    API --> Cache{"HotQueryCache<br/>hot cache hit?"}
    Cache -- "Hit" --> Reply([Return cached reply])
    Cache -- "Miss" --> Orch["Collaboration Layer Orchestrator<br/>Intent · Routing · Fallback"]
    Orch --> Intent{"Intent Recognition<br/>Rule fast path / IntentCache / LLM"}
    Intent --> Route{"Routing Decision"}
    Route --> KA["KnowledgeAgent<br/>Hybrid Retrieval+RAG"]
    Route --> BA["BusinessAgent<br/>Business Query+Masking"]
    Route --> EA["EmotionAgent<br/>Sentiment+Escalation"]
    Route --> TA["TicketAgent<br/>Ticket Extract+Create"]
    Route --> DA["DialogAgent<br/>Dialog Polish"]
    KA --> Data["Data Layer<br/>ChromaDB · BM25 · Reranker"]
    BA --> Biz["Business System<br/>mock / http"]
    EA --> Esc["Escalation Engine<br/>EscalationCard"]
    TA --> Ticket["Ticket Storage"]
    KA --> Dialog["DialogAgent<br/>Result Merge+Polish"]
    BA --> Dialog
    EA --> Dialog
    TA --> Dialog
    Dialog --> Output([Final Reply])
    Esc --> Output
    style Cache fill:#e8f5e9,stroke:#4caf50
    style Orch fill:#e3f2fd,stroke:#2196f3
    style Data fill:#fff3e0,stroke:#ff9800
```
Three-Layer Architecture
The system is divided into an access layer, a collaboration layer, and a data layer. Each layer has clearly defined responsibilities, and dependencies between layers run in one direction only.
Access Layer (API v1)
Located in app/api/v1/, this layer handles HTTP access, authentication, request validation, and response wrapping.
| Module | Endpoints | Responsibility |
|---|---|---|
| chat.py | /chat, /chat/stream | Sync chat and SSE streaming chat |
| agent.py | /agent/sessions/* (8 endpoints) | Agent assist workbench |
| knowledge.py | /knowledge/ingest, /knowledge/stats | Knowledge base management |
| evaluation.py | /evaluation/run | Retrieval evaluation (Recall/Hit/MRR/hallucination rate) |
| performance.py | /performance/metrics, /performance/cache/invalidate | Performance monitoring and cache cleanup |
| observability.py | /observability/health | Component health check |
| operations.py | /operations/dashboard | Operations dashboard and canary release |
Async-first
All endpoints are declared with async def. I/O-bound work (LLM calls, vector retrieval, business API calls) does not block the event loop, so a single process can serve many concurrent requests.
Collaboration Layer (agents)
Located in app/agents/, this layer is the core of multi-Agent collaboration. It handles intent recognition, task routing, Agent execution, and result merging.
- orchestrator.py: Orchestration Agent, intent recognition and routing dispatch
- graph.py: LangGraph state machine orchestration (degrades to synchronous orchestrator)
- knowledge_agent.py: Knowledge Retrieval Agent (hybrid retrieval + reranking + summary)
- business_agent.py: Business Query Agent
- emotion_agent.py: Sentiment Analysis Agent
- ticket_agent.py: Ticket Processing Agent
- dialog_agent.py: Dialog Polish Agent
- llm_client.py: LLM client (incl. _MockLLM fallback)
Data Layer (knowledge + core)
Located in app/knowledge/ and app/core/, this layer provides knowledge retrieval, persistence, and shared infrastructure.
| Module | Responsibility |
|---|---|
| hybrid_retriever.py | Hybrid retrieval (vector + BM25 + RRF fusion) |
| reranker.py | CrossEncoder reranking (degrades to cosine) |
| vectorstore.py | ChromaDB wrapper |
| embeddings.py | BGE embedding service (degrades to hash fallback) |
| bm25.py | BM25 keyword retrieval |
| query_rewriter.py | Query rewriting |
| pipeline.py | Document ingestion pipeline |
| performance.py | HotQueryCache / ModelRouter / IntentCache |
| circuit_breaker.py | Circuit breaker fallback |
| langfuse_client.py | Langfuse tracing (degrades to no-op) |
| session.py | Session management |
Multi-Agent "1+5" Architecture
The system uses 1 orchestration Agent to coordinate 5 specialized Agents, each with a distinct, non-overlapping responsibility.
```mermaid
flowchart LR
    subgraph Orchestration
        O["Orchestrator<br/>Orchestration Agent"]
    end
    subgraph Specialized Agents
        K["KnowledgeAgent<br/>Knowledge Retrieval"]
        B["BusinessAgent<br/>Business Query"]
        E["EmotionAgent<br/>Sentiment Analysis"]
        T["TicketAgent<br/>Ticket Processing"]
        D["DialogAgent<br/>Dialog Polish"]
    end
    O -->|Route dispatch| K
    O -->|Route dispatch| B
    O -->|Route dispatch| E
    O -->|Route dispatch| T
    O -->|Result merge| D
    D --> Output([Final Reply])
    style O fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
Orchestrator (Orchestration Agent)
The "brain" of the multi-Agent architecture, responsible for:
- Intent recognition: Three-level mechanism (rule fast path → IntentCache → LLM), see Multi-Agent Collaboration
- Routing dispatch: Routes the query to the corresponding specialized Agent based on intent
- Sentiment priority: When a sentiment-sensitive intent or an agitated user is detected, routing is overridden and the query goes to sentiment handling
- Fallback handling: An unknown intent returns a guidance message; two consecutive unresolved turns trigger escalation to a human agent
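The three-level intent lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `RULES` keywords, the `IntentRecognizer` class, and the plain dict standing in for IntentCache are all hypothetical, and the LLM classifier is passed in as an ordinary callable.

```python
from typing import Callable, Dict, Optional

# Hypothetical keyword rules; the real rule set is not shown in this document.
RULES: Dict[str, str] = {
    "refund": "ticket",
    "order": "business_query",
    "hello": "chitchat",
}


def rule_fast_path(query: str) -> Optional[str]:
    """Level 1: return an intent if any rule keyword appears in the query."""
    lowered = query.lower()
    for keyword, intent in RULES.items():
        if keyword in lowered:
            return intent
    return None


class IntentRecognizer:
    """Three-level lookup: rules first, then a cache, then the LLM classifier."""

    def __init__(self, llm_classify: Callable[[str], str]) -> None:
        self._llm_classify = llm_classify
        self._cache: Dict[str, str] = {}  # stand-in for IntentCache (assumption)

    def recognize(self, query: str) -> str:
        # Level 1: cheap keyword rules.
        intent = rule_fast_path(query)
        if intent is not None:
            return intent
        # Level 2: previously classified queries, keyed on normalized text.
        key = query.strip().lower()
        if key in self._cache:
            return self._cache[key]
        # Level 3: the slowest path, an LLM call; remember the answer.
        intent = self._llm_classify(query)
        self._cache[key] = intent
        return intent
```

Ordering the levels from cheapest to most expensive means the LLM is only consulted for queries that neither the rules nor the cache can answer.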
KnowledgeAgent (Knowledge Retrieval Agent)
The core of knowledge Q&A. It orchestrates the full RAG chain:
- Query rewriting → hybrid retrieval (vector + BM25 + RRF) → reranking with the Reranker → threshold filtering
- Optional LLM summary generation (generate_summary=True)
- Returns a fallback reply when retrieval is empty, avoiding meaningless LLM calls
- See RAG Retrieval Pipeline
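The RRF fusion step in the chain above can be illustrated with a small sketch. It uses the standard Reciprocal Rank Fusion formula, score(d) = Σ 1 / (k + rank), with the conventional constant k = 60; whether this project uses the same constant is an assumption, and `rrf_fuse` and the document IDs are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings from the two retrievers (illustrative IDs only).
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists accumulate two terms, so doc_b and doc_a outrank documents found by only one retriever.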
BusinessAgent (Business Query Agent)
Integrates with the business system to query orders/members/returns/accounts:
- Supports both mock (in-memory mock) and http (real business system) modes
- Phone number masking: phone numbers in results are auto-masked (middle 4 digits replaced with ****)
- Write operation confirmation: write operations like refunds/returns require user confirmation before execution
- Failure fallback: when the business API is unavailable, degrades to the mock business system
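The phone-masking rule above might look like the following sketch. It assumes 11-digit mobile numbers that start with 1, which is an assumption about the data rather than something this document states; `mask_phones` and the regular expression are illustrative names.

```python
import re

# Assumption: 11-digit mobile numbers starting with 1, not embedded in a longer digit run.
_PHONE_RE = re.compile(r"(?<!\d)(1\d{2})\d{4}(\d{4})(?!\d)")


def mask_phones(text: str) -> str:
    """Replace the middle four digits of each matched phone number with ****."""
    return _PHONE_RE.sub(r"\1****\2", text)


print(mask_phones("Contact: 13812345678"))  # Contact: 138****5678
```

The lookbehind and lookahead keep longer digit strings, such as order numbers, from being partially masked.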
EmotionAgent (Sentiment Analysis Agent)
Recognizes user sentiment and triggers the corresponding handling:
- Keyword sentiment scoring: profanity +5, complaint words +3
- When sentiment is agitated (score > 4), prioritizes sentiment handling: soothe first, then resolve
- A sentiment-sensitive intent triggers escalation to a human agent directly, so the conflict does not escalate further
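The keyword scoring above can be sketched as follows. The weights (+5, +3) and the threshold (score > 4) come from the text; the word lists, the function names, and the choice to count every occurrence are assumptions made for illustration.

```python
import re

# Hypothetical lexicons; the project's real word lists are not shown here.
PROFANITY_WORDS = {"damn", "stupid"}
COMPLAINT_WORDS = {"complain", "terrible", "unacceptable"}

PROFANITY_WEIGHT = 5     # "profanity +5" from the text above
COMPLAINT_WEIGHT = 3     # "complaint words +3" from the text above
AGITATION_THRESHOLD = 4  # agitated when score > 4


def sentiment_score(text: str) -> int:
    """Sum keyword weights over every matching word (per-occurrence counting is an assumption)."""
    score = 0
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in PROFANITY_WORDS:
            score += PROFANITY_WEIGHT
        elif word in COMPLAINT_WORDS:
            score += COMPLAINT_WEIGHT
    return score


def is_agitated(text: str) -> bool:
    """Apply the score > 4 rule."""
    return sentiment_score(text) > AGITATION_THRESHOLD
```

With these weights, a single complaint word (3) stays below the threshold, while a single profanity (5) or two complaint words (6) cross it.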
TicketAgent (Ticket Processing Agent)
Extracts ticket information from user conversations and ingests it:
- Recognizes ticket intents like returns/refunds/complaints/after-sales
- Extracts key info (order number, problem description) to create a ticket
- Returns an acceptance script after ticket ingestion
DialogAgent (Dialog Polish Agent)
Unifies the style and annotates sources for the final reply:
- Merges the raw results of each Agent into a coherent reply
- Annotates citation sources (Source: faq.md page 3)
- Performance optimization: chitchat/ticket/business_query intents skip LLM polishing and use the raw reply directly
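A rough sketch of the skip-polish rule and source annotation described above. The three intent names and the "Source: ..." format come from the text; `finalize_reply`, the `polish` callable, and the "knowledge" intent used in the usage line are hypothetical.

```python
from typing import Callable, Sequence

# Intents that bypass LLM polishing, as listed in the text above.
SKIP_POLISH_INTENTS = frozenset({"chitchat", "ticket", "business_query"})


def finalize_reply(
    intent: str,
    raw_reply: str,
    sources: Sequence[str],
    polish: Callable[[str], str],
) -> str:
    """Polish the reply unless the intent is exempt, then append citation lines."""
    reply = raw_reply if intent in SKIP_POLISH_INTENTS else polish(raw_reply)
    if sources:
        citations = "\n".join(f"Source: {s}" for s in sources)
        reply = f"{reply}\n{citations}"
    return reply


# Usage with a trivial stand-in for the LLM polish step.
print(finalize_reply("knowledge", "see the faq", ["faq.md page 3"], str.capitalize))
```

Skipping the polish call for short, already well-formed replies saves one LLM round trip per request.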
Key Design Principles
1. Fallback-First
Each layer has fallback guarantees, ensuring the system remains available under any single-point failure:
| Layer | Component | Fallback Target |
|---|---|---|
| Access | API auth | API_KEY empty → no-auth mode |
| Collaboration | LangGraph | Unavailable → synchronous orchestrator _SynchOrchestrator |
| Collaboration | LLM | Unavailable → _MockLLM assembled reply |
| Collaboration | Real LLM call | Failure → ModelRouter falls back to default model retry |
| Data | BGE embedding | Load failure → hash fallback vectors |
| Data | Reranker | Load failure → cosine similarity reranking |
| Data | Redis | Unreachable → in-memory queue |
| Data | Business API | Failure → mock business system |
| Observability | Langfuse | Not configured → no-op, no impact on main path |
See Fallback Strategy for details.
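As one illustration of the pattern in the table, the "BGE embedding → hash fallback vectors" row could be sketched like this. The function names, the SHA-256 choice, and the vector dimension are assumptions; the project's actual fallback may differ.

```python
import hashlib
from typing import Callable, List


def hash_embedding(text: str, dim: int = 8) -> List[float]:
    """Deterministic stand-in vector built from a SHA-256 digest (values in [0, 1])."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def embed_with_fallback(
    text: str,
    primary: Callable[[str], List[float]],
    dim: int = 8,
) -> List[float]:
    """Try the primary embedder; on any failure, degrade to the hash vector."""
    try:
        return primary(text)
    except Exception:
        # The fallback keeps the pipeline running but carries no semantic meaning.
        return hash_embedding(text, dim)
```

A hash vector is stable for identical text, so exact-match lookups still work, but similarity between different texts is meaningless. That trade-off favors availability over retrieval quality.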
2. Async-First
All middleware and API endpoints are async/await. IO-intensive operations do not block the event loop:
- LLM calls, vector retrieval, and business API calls all run asynchronously or in a thread pool
- Complex problems with multiple subtasks run in parallel via ThreadPoolExecutor (4 workers)
- SSE streaming responses return token-by-token with low first-token latency
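The thread-pool bullet above can be sketched roughly as follows. Only the pool size of 4 is taken from the text; `blocking_subtask` and `run_subtasks` are hypothetical names, and the sleep stands in for a synchronous call such as a blocking SDK request.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Pool size of 4 matches the "4 workers" mentioned above.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def blocking_subtask(name: str) -> str:
    """Stand-in for a synchronous, blocking call."""
    time.sleep(0.1)
    return f"{name}: done"


async def run_subtasks(names: List[str]) -> List[str]:
    """Run blocking subtasks in the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(_EXECUTOR, blocking_subtask, n) for n in names]
    # gather preserves the order of the inputs, not the order of completion.
    return await asyncio.gather(*futures)


results = asyncio.run(run_subtasks(["retrieve", "order_lookup", "sentiment"]))
print(results)
```

Because the blocking calls run on worker threads, the three subtasks overlap instead of running one after another, and other coroutines can progress while they wait.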
3. Cache-First
The system is designed with multi-layer caching to reduce latency and LLM call cost:
| Cache | Location | Hit Effect |
|---|---|---|
| HotQueryCache | run_graph entry/exit | Knowledge Q&A cache hit drops to 0.002s, skipping all LLM calls |
| IntentCache | Intent recognition stage | First token from 2.7s down to ~800ms |
| ModelRouter | Intent recognition stage | Simple queries route to small model, first token down to ~1s |
Three-Layer Performance Optimization Combo
The HotQueryCache + ModelRouter + IntentCache three-layer combo is the core of system performance optimization:
```mermaid
flowchart TD
    Q[User Query] --> HQC{"HotQueryCache<br/>hot cache"}
    HQC -- "Hit → 0.002s" --> Fast([Return cached directly])
    HQC -- "Miss" --> IC{"IntentCache<br/>intent cache"}
    IC -- "Hit → skip LLM" --> Route[Route dispatch]
    IC -- "Miss" --> MR{"ModelRouter<br/>large/small model routing"}
    MR -- "Simple → small model ~1s" --> Route
    MR -- "Complex → main LLM ~2.7s" --> Route
    Route --> Agent[Agent execution]
    Agent --> Write[Write to HotQueryCache]
    style HQC fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
    style Fast fill:#c8e6c9,stroke:#4caf50
```
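The first stage of this flow, a hot-query cache checked on entry and written on exit, might be sketched as below. This is not the project's HotQueryCache: the class name, the LRU eviction, the TTL, and the key normalization are all assumptions chosen to make the idea concrete.

```python
import time
from collections import OrderedDict
from typing import Optional, Tuple


class HotQueryCacheSketch:
    """Small in-memory cache with TTL expiry and LRU eviction (both are assumptions)."""

    def __init__(self, max_size: int = 128, ttl_seconds: float = 300.0) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._items: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, reply = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._items[key]  # expired entry
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return reply

    def put(self, query: str, reply: str) -> None:
        key = self._key(query)
        self._items[key] = (time.monotonic(), reply)
        self._items.move_to_end(key)
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used entry
```

A caller would try `get` before any intent recognition and call `put` once the final reply is ready, which mirrors the entry and exit placement shown in the flowchart.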
Measured performance
| Metric | Target | Actual | Pass |
|---|---|---|---|
| Recall@5 | ≥ 0.85 | 1.0 | Yes |
| Hit Rate | ≥ 0.90 | 0.9333 | Yes |
| Hallucination Rate | ≤ 0.10 | 0.0 | Yes |
| Independent Resolution Rate | ≥ 60% | 80% | Yes |
| Avg Response Time | ≤ 3s | 2.27s | Yes |
| Hot Cache Hit | — | 0.002s | |
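For readers unfamiliar with the retrieval metrics in the table, the sketch below shows the standard per-query definitions of Recall@k, hit rate, and reciprocal rank (the term averaged in MRR). The project's evaluation code is not shown in this document and may define them differently; the function names and the toy data are illustrative only.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy query: one of two relevant documents is retrieved, at rank 2.
retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
print(recall_at_k(retrieved, relevant, 3))  # 0.5
print(hit_at_k(retrieved, relevant, 3))     # 1.0
print(reciprocal_rank(retrieved, relevant)) # 0.5
```

Dataset-level figures such as those in the table are typically the mean of these per-query values over the evaluation set.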
Project Structure Overview
```text
app/
├── api/v1/                     # Access Layer: REST API endpoints
│   ├── chat.py                 # Chat endpoint (sync + SSE streaming)
│   ├── agent.py                # Agent assist endpoints (8)
│   ├── knowledge.py            # Knowledge base management
│   ├── evaluation.py           # Retrieval evaluation
│   ├── performance.py          # Performance monitoring
│   ├── observability.py        # Observability
│   └── operations.py           # Operations dashboard
├── agents/                     # Collaboration Layer
│   ├── orchestrator.py         # Orchestration Agent
│   ├── graph.py                # LangGraph state machine
│   ├── knowledge_agent.py      # Knowledge Retrieval Agent
│   ├── business_agent.py       # Business Query Agent
│   ├── emotion_agent.py        # Sentiment Analysis Agent
│   ├── ticket_agent.py         # Ticket Processing Agent
│   ├── dialog_agent.py         # Dialog Polish Agent
│   └── llm_client.py           # LLM client (mock fallback)
├── core/                       # Core infrastructure
│   ├── config.py               # Configuration management
│   ├── session.py              # Session management
│   ├── performance.py          # HotQueryCache / ModelRouter / IntentCache
│   ├── circuit_breaker.py      # Circuit breaker fallback
│   └── langfuse_client.py      # Langfuse tracing
├── knowledge/                  # Data Layer
│   ├── hybrid_retriever.py     # Hybrid retrieval
│   ├── reranker.py             # Reranking
│   ├── vectorstore.py          # ChromaDB
│   ├── embeddings.py           # BGE embedding
│   ├── bm25.py                 # BM25 retrieval
│   └── pipeline.py             # Document ingestion pipeline
└── schemas/                    # Pydantic data models
```
Further Reading
| Topic | Link |
|---|---|
| Multi-Agent collaboration (LangGraph state machine) | Multi-Agent Collaboration |
| RAG retrieval pipeline (hybrid retrieval + reranking) | RAG Retrieval Pipeline |
| Fallback and fault tolerance (7-layer fallback matrix) | Fallback Strategy |
| All configuration options | Configuration Guide |