FAQ
Questions are organized into four categories: Installation, Configuration, Usage, and Performance. If you cannot find an answer, please submit an issue at GitHub Issues.
Installation
Q: What should I do if chromadb installation fails?
A: chromadb depends on onnxruntime and pypika. Common failure causes and solutions:
On Windows, some dependencies (such as hnswlib) require a C++ build environment:
- Download Visual Studio Build Tools
- During installation, select the "Desktop development with C++" workload
- Restart the terminal and re-run pip install
Q: The BGE embedding model downloads very slowly. What should I do?
A: The BGE model is hosted on Hugging Face. If access is slow in your region, use a mirror endpoint.
# Option 1: set the HF mirror endpoint (recommended)
export HF_ENDPOINT=https://hf-mirror.com
pip install -r requirements.txt
# Option 2: manually download the model weights to a local path and specify it in .env
# git clone https://hf-mirror.com/BAAI/bge-large-zh-v1.5 models/bge-large-zh
# In .env set:
# EMBEDDING_MODEL=./models/bge-large-zh
First load is cached
After the first load, the model is cached under ~/.cache/huggingface/; subsequent startups do not need to re-download it.
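The two options above can be combined in startup code. The following is a minimal sketch, not the project's actual loader: `resolve_embedding_model` is a hypothetical helper, and the hub ID is inferred from the clone URL above. It returns a local directory when EMBEDDING_MODEL points at one, and otherwise falls back to a hub ID while setting HF_ENDPOINT only if it is not already set.

```python
from pathlib import Path
from typing import MutableMapping

DEFAULT_MODEL_ID = "BAAI/bge-large-zh-v1.5"  # assumption: hub ID inferred from the clone URL
MIRROR_ENDPOINT = "https://hf-mirror.com"


def resolve_embedding_model(env: MutableMapping[str, str]) -> str:
    """Return a local model directory if one is configured, else a hub model ID."""
    configured = env.get("EMBEDDING_MODEL", "").strip()
    if configured and Path(configured).is_dir():
        # Option 2: weights were downloaded manually; no network access is needed
        return str(Path(configured))
    # Option 1: download through the mirror unless the caller chose another endpoint
    env.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    return configured or DEFAULT_MODEL_ID
```

Call it with `os.environ` before importing any Hugging Face library, since the endpoint is typically read when the library is imported.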
Q: Can I use Python 3.10?
A: We recommend Python 3.11+. Some dependencies may be incompatible on 3.10.
| Version | Support | Notes |
|---|---|---|
| 3.11+ | ✅ Recommended | All dependencies tested successfully |
| 3.10 | ⚠️ Partially compatible | Some chromadb / langfuse features may behave abnormally |
| 3.9 and below | ❌ Not supported | Type annotations and syntax incompatible |
Q: What happens if Redis is not started?
A: The system automatically degrades to an in-memory queue without affecting main-chain functionality, but sessions and caches are not persisted.
# Recommend starting Redis with Docker
docker run -d --name redis -p 6379:6379 redis:7-alpine
# Verify the connection
redis-cli ping # should return PONG
Redis must be started in production
In production, if Redis is not started, all sessions and caches are lost on process restart.
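The degradation described above follows a common try-then-fall-back pattern. Below is a minimal sketch of that pattern, not the project's implementation: `InMemoryQueue`, `make_queue`, and the `connect_redis` factory are hypothetical names.

```python
from collections import deque
from typing import Callable


class InMemoryQueue:
    """Process-local FIFO queue; its contents are lost when the process restarts."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def push(self, item) -> None:
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None


def make_queue(connect_redis: Callable[[], object]):
    """Prefer the Redis-backed queue; degrade to memory if the connection fails."""
    try:
        return connect_redis()
    except OSError as exc:  # covers ConnectionError, refused sockets, and timeouts
        # Note: redis-py raises redis.exceptions.ConnectionError, which is not an
        # OSError; a real integration would need to catch that exception as well.
        print(f"Redis unavailable ({exc}); using in-memory queue")
        return InMemoryQueue()
```

The trade-off is the one stated above: the service keeps running, but anything held in the fallback queue disappears on restart.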
Configuration
Q: Is it safe to leave API_KEY empty?
A: Only for local development mode. Production must set a non-empty API_KEY.
- When API_KEY is empty, the verify_api_key dependency skips authentication; any caller can access the system
- When API_KEY is non-empty, all endpoints marked ✅ require a matching X-API-Key request header
# .env example
# Development mode (local debugging)
API_KEY=
# Production mode (required)
API_KEY=your-strong-random-key-here
Security risk
Leaving API_KEY empty in production fully opens the interface: any caller can invoke it, including sensitive operations such as ingestion and deletion.
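The two authentication cases can be sketched as a single check. This is an illustration under assumptions, not the source of verify_api_key: `is_authorized` is a hypothetical helper, and the constant-time comparison is a design choice of this sketch.

```python
import hmac
from typing import Optional


def is_authorized(configured_key: str, header_value: Optional[str]) -> bool:
    """Return True if a request may proceed given the X-API-Key header value."""
    if not configured_key:
        # Development mode: API_KEY is empty, so authentication is skipped
        return True
    if header_value is None:
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(configured_key.encode(), header_value.encode())
```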
Q: What happens if SMALL_LLM_API_KEY is not configured?
A: ModelRouter automatically falls back to the main LLM. There are no functional side effects; the only cost is higher first-token latency (roughly 800ms instead of 200ms, per the table below).
| Configuration | Behavior | First-Token Latency |
|---|---|---|
| Small model + main model configured | Intent recognition on small model; generation on main model | ~200ms |
| Only main model configured | Everything on the main model | ~800ms |
| Neither configured | Mock mode (no real LLM) | <50ms |
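The routing in the table can be sketched as follows. This is not ModelRouter's code: `pick_model`, the task names, and the return labels are hypothetical. The table does not cover a small model configured without a main model, so this sketch assumes that case also runs in mock mode.

```python
LIGHTWEIGHT_TASKS = {"intent_recognition"}  # assumption: tasks eligible for the small model


def pick_model(task: str, small_key: str, main_key: str) -> str:
    """Choose which model handles a task, following the three rows of the table."""
    if not main_key:
        # Assumption: without a main model there is no real LLM to generate with
        return "mock"
    if task in LIGHTWEIGHT_TASKS and small_key:
        return "small"
    return "main"  # generation, or any task when the small model is not configured
```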
Q: Will incorrect Langfuse configuration block the main chain?
A: No. The Langfuse client has a built-in fallback; on configuration errors it automatically becomes a no-op.
# Degradation logic in app/core/langfuse_client.py (simplified)
if not settings.LANGFUSE_ENABLED or not settings.LANGFUSE_PUBLIC_KEY:
    # All methods return None without raising
    return NoOpLangfuseTrace()
Conditions that trigger fallback:
- LANGFUSE_ENABLED=False
- LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY is empty
- Langfuse service connection timeout (default 3 seconds)
Impact after fallback
After fallback, traces are not reported to Langfuse, but the local Monitor still records trace summaries, viewable via /api/v1/monitor/traces.
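A no-op object of this kind can be written in a few lines. The sketch below illustrates the pattern only; `NoOpTrace`, `get_trace`, and the `make_real_trace` factory are hypothetical names and do not reproduce the project's NoOpLangfuseTrace.

```python
from typing import Callable


class NoOpTrace:
    """Stand-in trace: any method call is accepted and returns None."""

    def __getattr__(self, name):
        def _noop(*args, **kwargs):
            return None
        return _noop


def get_trace(enabled: bool, public_key: str, secret_key: str,
              make_real_trace: Callable[[], object]):
    """Return a real trace when configured and reachable, else a no-op."""
    if not enabled or not public_key or not secret_key:
        return NoOpTrace()
    try:
        return make_real_trace()
    except Exception:
        # Covers connection timeouts and client errors; tracing must never block a request
        return NoOpTrace()
```

Because callers receive an object with the same call surface either way, the request path needs no `if tracing_enabled` branches.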
Q: Should BUSINESS_ADAPTER_MODE be mock or http?
A: Use mock for development and testing; switch to http when integrating with a real business system.
# mock mode: in-memory simulated order/membership/return APIs; out of the box
BUSINESS_ADAPTER_MODE=mock
# http mode: calls the real business system REST API
BUSINESS_ADAPTER_MODE=http
BUSINESS_API_BASE_URL=https://your-business-api.com
BUSINESS_API_KEY=your-business-api-key
BUSINESS_API_TIMEOUT=10
http mode fallback
When BUSINESS_API_BASE_URL is empty, the system automatically falls back to mock and emits a warning; it does not block business queries.
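The mode selection and its fallback can be sketched as below. `resolve_adapter_mode` is a hypothetical helper; rejecting unknown mode values is an assumption of this sketch rather than documented behavior.

```python
import warnings


def resolve_adapter_mode(mode: str, base_url: str) -> str:
    """Return the effective adapter mode, falling back to mock when http is unusable."""
    mode = (mode or "mock").strip().lower()
    if mode not in ("mock", "http"):
        raise ValueError(f"unknown BUSINESS_ADAPTER_MODE: {mode!r}")
    if mode == "http" and not base_url.strip():
        warnings.warn("BUSINESS_API_BASE_URL is empty; falling back to the mock adapter")
        return "mock"
    return mode
```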
Q: How do I adjust working hours (the escalation time window)?
A: Modify WORKING_HOURS_START / WORKING_HOURS_END in .env:
# 24-hour format, half-open interval [START, END). Outside this range, emotion/failure requests are not proactively escalated
WORKING_HOURS_START=9
WORKING_HOURS_END=18
TIMEZONE=Asia/Shanghai
User-initiated transfers are not constrained by the time window
Even outside working hours, if the user explicitly says "transfer to human", escalation is still triggered. Only system-initiated escalations are constrained by the time window.
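The half-open interval and the user-initiated exception can be expressed directly. This is a minimal sketch with hypothetical function names; it assumes `now` has already been converted to the configured TIMEZONE.

```python
from datetime import datetime


def in_working_hours(now: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Half-open check [start_hour, end_hour): 18:00 sharp is already outside."""
    return start_hour <= now.hour < end_hour


def may_escalate(now: datetime, user_requested: bool,
                 start_hour: int = 9, end_hour: int = 18) -> bool:
    """User-initiated transfers always pass; system-initiated ones need working hours."""
    return user_requested or in_working_hours(now, start_hour, end_hour)
```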
Usage
Q: Answers don't change after a knowledge base update. What should I do?
A: You need to call /api/v1/performance/cache/invalidate to clear the hot cache.
# Clear the hot cache; the next query will go through retrieval again
curl -X POST http://localhost:8000/api/v1/performance/cache/invalidate \
-H "X-API-Key: $API_KEY"
Reason: HotQueryCache caches replies for high-frequency queries (default 1000 entries). After a knowledge base update, the cache still returns stale replies.
Auto-clear cache after batch ingestion
The batch ingestion script already includes a cache-clear call, so no manual clearing is needed.
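To see why stale replies persist until invalidation, consider a bounded query cache. The sketch below is an illustration, not HotQueryCache itself: the class name is hypothetical and least-recently-used eviction is an assumption, since the eviction policy is not documented here.

```python
from collections import OrderedDict
from typing import Optional


class HotCacheSketch:
    """Bounded query -> reply cache with LRU eviction (assumed policy)."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, query: str) -> Optional[str]:
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return self._entries[query]

    def put(self, query: str, reply: str) -> None:
        self._entries[query] = reply
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def invalidate(self) -> None:
        """Clear everything, e.g. after a knowledge base update."""
        self._entries.clear()
```

Nothing in `get` consults the knowledge base, so a cached reply is served unchanged until `invalidate` runs or the entry is evicted.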
Q: Streaming responses (SSE) frequently disconnect. What should I do?
A: Usually Nginx / CDN is buffering the SSE stream; you need to disable buffering.
location /api/v1/chat/stream {
    proxy_pass http://backend;
    proxy_http_version 1.1;            # HTTP/1.1 to upstream; needed with the cleared Connection header
    proxy_buffering off;               # disable response buffering
    proxy_cache off;                   # disable caching
    chunked_transfer_encoding on;      # enable chunked transfer
    proxy_read_timeout 300s;           # extend read timeout
    proxy_set_header Connection '';    # clear the Connection header
}
Cloudflare buffers SSE by default; disable it in Page Rules:
- Match: */api/v1/chat/stream*
- Setting: Cache Level: Bypass
The system already sets the X-Accel-Buffering: no response header, but some proxies still require explicit configuration.
Q: Is the session history retained after escalation?
A: Yes. After escalation, the session history is fully retained; the agent can view the entire context.
```bash
# View session details; the history field contains all conversation records
curl http://localhost:8000/api/v1/agent/sessions/$SESSION_ID \
  -H "X-API-Key: $API_KEY" | jq .history
```
The returned `history` includes all user / assistant messages from before the escalation, so the agent can quickly understand the user's request.
Q: How does the system decide whether to escalate to a human?
A: The system automatically triggers escalation in the following three scenarios:
| Trigger | Description |
|---|---|
| User explicit request | Matches keywords such as "transfer to human" or "human agent" |
| Emotion-sensitive intent | Emotion score below threshold; recognized as emotion_sensitive |
| Consecutive failures reach threshold | failed_attempts >= ESCALATE_FAILED_THRESHOLD (default 3) |
After escalation, an EscalationCard is generated with the escalation reason, priority, and context summary, and is passed to the agent workbench.
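The three triggers can be sketched as an ordered check. This is not the project's decision code: `escalation_reason`, the reason labels other than emotion_sensitive, the keyword matching, and the emotion threshold of 0.3 are all assumptions made for illustration.

```python
from typing import Optional

ESCALATION_KEYWORDS = ("transfer to human", "human agent")


def escalation_reason(message: str, emotion_score: float, failed_attempts: int,
                      emotion_threshold: float = 0.3,   # hypothetical value
                      failed_threshold: int = 3) -> Optional[str]:
    """Return why a turn should escalate, or None if the bot should keep handling it."""
    text = message.lower()
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return "user_request"
    if emotion_score < emotion_threshold:
        return "emotion_sensitive"
    if failed_attempts >= failed_threshold:
        return "consecutive_failures"
    return None
```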
Q: How do I enter a human solution to consolidate back to the knowledge base?
A: Enter it via /api/v1/agent/sessions/{session_id}/solution or /api/v1/escalation/solution.
```bash
# Option 1: agent endpoint (recommended; associated with the session)
curl -X POST http://localhost:8000/api/v1/agent/sessions/$SESSION_ID/solution \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I return a product?",
    "solution": "Please click Return on the order page...",
    "intent": "knowledge_qa"
  }'

# Option 2: escalation endpoint (independent of a session)
curl -X POST http://localhost:8000/api/v1/escalation/solution \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "xxx", "question": "...", "solution": "..."}'
```
After entry, the solution enters the pending review queue. Once approved and ingested as a FAQ, the next bot retrieval can match it.
Q: How do I view traces to troubleshoot?
A: The system provides two ways to view traces:
```bash
# View the 10 most recent traces
curl "http://localhost:8000/api/v1/monitor/traces?limit=10" | jq .

# View a single trace's details (including per-step duration)
curl http://localhost:8000/api/v1/monitor/traces/$TRACE_ID | jq .
```
After configuring LANGFUSE_ENABLED=True, visit Langfuse Cloud to view:
- Full-chain trace visualization
- name/version for the 11 prompt markers
- Automatic token / cost / latency statistics
Performance
Q: The first query is very slow (~2.7 seconds). What should I do?
A: The main cost is the DeepSeek query rewrite phase; you can optimize by disabling query_rewrite.
Latency breakdown (typical scenario):
| Phase | Latency | Notes |
|---|---|---|
| Intent recognition | ~300ms | Small model (qwen-turbo) |
| Query rewrite | ~1500ms | Main LLM (DeepSeek); blocks the main chain |
| Hybrid retrieval | ~280ms | Vector + BM25 + RRF |
| Reranking | ~150ms | BGE reranker |
| Generation | ~500ms | Main LLM streaming generation |
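The breakdown accounts for the ~2.7 second figure. Assuming the phases run sequentially (an assumption of this sketch), the arithmetic is:

```python
# Typical per-phase latencies from the table above, in milliseconds
PHASES_MS = {
    "intent_recognition": 300,
    "query_rewrite": 1500,
    "hybrid_retrieval": 280,
    "reranking": 150,
    "generation": 500,
}


def total_latency_ms(phases: dict, skip: frozenset = frozenset()) -> int:
    """Sum phase latencies, optionally leaving out disabled phases."""
    return sum(ms for name, ms in phases.items() if name not in skip)


print(total_latency_ms(PHASES_MS))                                   # 2730
print(total_latency_ms(PHASES_MS, skip=frozenset({"query_rewrite"})))  # 1230
```

Query rewrite alone is more than half of the total, which is why disabling it has the largest effect.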
Optimization options:
```bash
# Option 1: disable query rewrite (sacrifice a little recall for latency)
# Set enable_query_rewrite to false in retrieval_tuner
curl -X PUT http://localhost:8000/api/v1/tuner/params \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enable_query_rewrite": false}'

# Option 2: use the streaming endpoint (first token <1s; users perceive it as faster)
curl -N -X POST http://localhost:8000/api/v1/chat/stream \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "return process"}'
```
Hot cache hit gives first token <100ms
The second request with the same query hits HotQueryCache, skipping all orchestration; first token <100ms.
Q: Responses slow down under heavy concurrency. What should I do?
A: Adjust retrieval parameters and enable the hot cache to reduce per-request overhead.
```bash
# Lower VECTOR_TOP_K and BM25_TOP_K to reduce recall volume
curl -X PUT http://localhost:8000/api/v1/tuner/params \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vector_top_k": 15,
    "bm25_top_k": 15,
    "rerank_top_k": 3
  }'
```
| Parameter | Default | Recommended (high concurrency) | Impact |
|---|---|---|---|
| VECTOR_TOP_K | 25 | 15 | Vector recall volume |
| BM25_TOP_K | 25 | 15 | Keyword recall volume |
| RERANK_TOP_K | 5 | 3 | Number of chunks entering generation |
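To see why smaller top-k values reduce work, consider reciprocal rank fusion (RRF), the fusion method named for hybrid retrieval in this FAQ. The sketch below shows textbook RRF, not the project's retriever: `rrf_fuse` is a hypothetical function and the constant k=60 is the commonly used default, assumed here.

```python
from typing import Dict, List, Sequence


def rrf_fuse(vector_hits: Sequence[str], bm25_hits: Sequence[str],
             vector_top_k: int = 25, bm25_top_k: int = 25, k: int = 60) -> List[str]:
    """Fuse two ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for hits in (vector_hits[:vector_top_k], bm25_hits[:bm25_top_k]):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"])
print(fused[:3])  # the first rerank_top_k entries would go on to reranking
```

Each list is truncated before fusion, so lowering the two top-k values shrinks the candidate set that fusion and reranking have to process.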
```bash
# View the cache hit rate; should be >50% to be considered effective
curl http://localhost:8000/api/v1/performance/cache/stats | jq .
```
If the hit rate is low, check whether HotQueryCache is enabled, or increase the cache capacity.
```bash
curl http://localhost:8000/api/v1/performance/metrics | jq .
# Focus on cache_hit_rate / avg_response_ms / concurrent_requests
```
Q: Memory usage is too high. What should I do?
A: Adjust EMBEDDING_BATCH_SIZE and ChromaDB caching strategy.
```bash
# .env: adjust batch size to lower the per-embedding memory peak
# Default is 32; halve it when memory is tight
EMBEDDING_BATCH_SIZE=16
# ChromaDB persistence directory to avoid in-memory indexes
CHROMA_PERSIST_DIR=./chroma_data
```
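A smaller batch size lowers peak memory because fewer texts are encoded at once. The sketch below shows the batching idea only; `batched`, `embed_all`, and the `embed_batch` callable are hypothetical and stand in for whatever the project uses to call the embedding model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              batch_size: int = 32) -> List[List[float]]:
    """Embed texts batch by batch so only one batch is in flight at a time."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Halving the batch size roughly doubles the number of model calls, so throughput may drop while the memory peak falls.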
Main sources of memory usage:
| Component | Usage | Optimization |
|---|---|---|
| BGE embedding model | ~2GB | Model resident in memory; cannot be released |
| ChromaDB index | ~500MB | Persist to disk; reduce in-memory indexes |
| LLM client connection pool | ~100MB | Reuse connections; avoid repeated creation |
| Session history | Depends on concurrency | Adjust MAX_HISTORY_LENGTH |
Container deployment recommendation
For Docker deployment, reserve 4GB of memory. For Kubernetes, set resources.limits.memory: 4Gi.
Q: How do I monitor token usage to avoid budget overruns?
A: View usage via the /api/v1/observability/token-usage endpoint, combined with the alerting mechanism.
```bash
# View hourly token usage
curl "http://localhost:8000/api/v1/observability/token-usage?window=hour" | jq .

# View the alert list (including token over-budget alerts)
curl "http://localhost:8000/api/v1/observability/alerts?source=token_usage" | jq .
```
ModelRouter automatically reduces cost
After configuring SMALL_LLM_API_KEY, lightweight tasks such as intent recognition use the small model (cost is about 1/10 of GPT-4o-mini); the main LLM is only used for generation.
Q: How do I evaluate whether retrieval effectiveness meets the bar?
A: Call /api/v1/evaluation/run to trigger an evaluation; focus on Recall@5 and Hit Rate.
```bash
# Evaluate with the built-in 30-case test set
curl -X POST http://localhost:8000/api/v1/evaluation/run \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}' | jq .

# View historical reports
curl http://localhost:8000/api/v1/evaluation/reports | jq .
```
Pass criteria (project measurements):
| Metric | Target | Measured |
|---|---|---|
| Recall@5 | ≥ 0.85 | 1.0 |
| Hit Rate | ≥ 0.90 | 0.9333 |
| Hallucination rate | ≤ 0.10 | 0.0 |
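For reference, the two retrieval metrics are often computed as below. These are common textbook definitions with hypothetical function names; the project's evaluator may define them differently, so treat this as a sketch for interpreting the numbers rather than a description of /api/v1/evaluation/run.

```python
from typing import List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int = 5) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def hit_rate(cases: List[Tuple[Sequence[str], Sequence[str]]], k: int = 5) -> float:
    """Fraction of test cases with at least one relevant ID in the top k."""
    if not cases:
        return 0.0
    hits = sum(1 for retrieved, relevant in cases if set(retrieved[:k]) & set(relevant))
    return hits / len(cases)
```

Under this definition, a hit rate of 0.9333 on a 30-case set would correspond to 28 cases with a hit.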
Related Documentation
- API Reference: detailed descriptions of all endpoints
- Examples: 5 hands-on examples
- Configuration: 30+ configuration items explained
- Performance Optimization: three-layer caching and concurrency optimization
- Contributing Guide: guidelines for submitting Issues and PRs