FAQ

Questions are grouped into four categories: Installation, Configuration, Usage, and Performance. If you cannot find an answer, please submit an issue at GitHub Issues.

Installation

Q: What should I do if chromadb installation fails?

A: chromadb depends on onnxruntime and pypika. Common failure causes and solutions:

Upgrade pip and build tools:

# Upgrade pip to the latest version; older versions can fail to resolve the dependency tree
python -m pip install --upgrade pip setuptools wheel
# Reinstall
pip install -r requirements.txt

Install VS Build Tools on Windows. Some dependencies (such as hnswlib) require a C++ build environment:

Download Visual Studio Build Tools
During installation, select the "Desktop development with C++" workload
Restart the terminal and re-run pip install

Use prebuilt wheels:

# Prefer prebuilt wheels to skip source compilation
pip install chromadb --only-binary :all:

Q: The BGE embedding model downloads very slowly. What should I do?

A: The BGE model is hosted on Hugging Face. If access is slow in your region, use a mirror endpoint.

# Option 1: set the HF mirror endpoint (recommended)
export HF_ENDPOINT=https://hf-mirror.com
pip install -r requirements.txt

# Option 2: manually download the model weights to a local path and specify it in .env
# git clone https://hf-mirror.com/BAAI/bge-large-zh-v1.5 models/bge-large-zh
# In .env set:
# EMBEDDING_MODEL=./models/bge-large-zh

First load is cached

After the first load, the model is cached under ~/.cache/huggingface/; subsequent startups do not need to re-download it.

Q: Can I use Python 3.10?

A: We recommend Python 3.11+. Some dependencies may be incompatible on 3.10.

| Version | Support | Notes |
| --- | --- | --- |
| 3.11+ | ✅ Recommended | All dependencies tested successfully |
| 3.10 | ⚠️ Partially compatible | Some `chromadb` / `langfuse` features may behave abnormally |
| 3.9 and below | ❌ Not supported | Type annotations and syntax incompatible |

# Recommend managing the Python version with pyenv or conda
pyenv install 3.11.9
pyenv local 3.11.9
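A startup guard can make the support tiers above explicit before any dependency is imported. The following is a minimal sketch; the function name `python_support_level` and the tier labels are illustrative and not part of the project.

```python
import sys


def python_support_level(version_info=sys.version_info):
    """Map a (major, minor, ...) version tuple to the support tiers in the table above."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 11):
        return "recommended"
    if major_minor == (3, 10):
        return "partial"
    return "unsupported"
```

Calling `python_support_level()` with no arguments checks the running interpreter; a launcher script could refuse to start when the result is `"unsupported"`.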

Q: What happens if Redis is not started?

A: The system automatically degrades to an in-memory queue without affecting main-chain functionality, but sessions and caches are not persisted.

# Recommend starting Redis with Docker
docker run -d --name redis -p 6379:6379 redis:7-alpine

# Verify the connection
redis-cli ping  # should return PONG

Redis must be started in production

In production, if Redis is not started, all sessions and caches are lost on process restart.
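The degradation described above can be pictured as a try-then-fall-back factory. This is a sketch under stated assumptions, not the project's implementation: `connect_redis` is a hypothetical zero-argument factory that raises `ConnectionError` (or another `OSError`) when Redis is unreachable, and `InMemoryQueue` is an illustrative stand-in.

```python
import queue


class InMemoryQueue:
    """Process-local stand-in; its contents are lost when the process restarts."""

    persistent = False

    def __init__(self):
        self._q = queue.Queue()

    def push(self, item):
        self._q.put(item)

    def pop(self):
        return self._q.get_nowait()


def build_queue(connect_redis):
    """Return the Redis-backed queue if reachable, otherwise an in-memory one.

    `connect_redis` is a hypothetical factory supplied by the caller.
    """
    try:
        return connect_redis()
    except OSError:  # ConnectionError is a subclass of OSError
        return InMemoryQueue()
```

The `persistent = False` flag is one way to let a health check report that the process is running in degraded mode.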

Configuration

Q: Is it safe to leave API_KEY empty?

A: Only in local development. Production deployments must set a non-empty API_KEY.

When API_KEY is empty, the verify_api_key dependency skips authentication; any caller can access the system
When API_KEY is non-empty, all endpoints marked ✅ require a matching X-API-Key request header

# .env example
# Development mode (local debugging)
API_KEY=

# Production mode (required)
API_KEY=your-strong-random-key-here

Security risk

Leaving API_KEY empty in production leaves the API fully open: any caller can invoke it, including sensitive operations such as ingestion and deletion.
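The two cases in this answer reduce to a small decision function. The sketch below mirrors the described behavior only; `is_authorized` is a hypothetical name, and the real `verify_api_key` dependency may differ. A constant-time comparison is used here as a common precaution, which the source does not state.

```python
import hmac
from typing import Optional


def is_authorized(configured_key: str, header_key: Optional[str]) -> bool:
    """Return True when the request may proceed.

    configured_key: value of API_KEY from the environment ("" in dev mode).
    header_key: value of the X-API-Key request header, or None if absent.
    """
    if not configured_key:
        return True  # development mode: authentication is skipped entirely
    if header_key is None:
        return False
    # Compare as bytes so non-ASCII keys do not raise in compare_digest
    return hmac.compare_digest(configured_key.encode(), header_key.encode())
```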

Q: What happens if SMALL_LLM_API_KEY is not configured?

A: ModelRouter automatically falls back to the main LLM with no functional side effects. The only cost is first-token latency, which rises from about 200 ms to about 800 ms (see the table below).

| Configuration | Behavior | First-Token Latency |
| --- | --- | --- |
| Small model + main model configured | Intent recognition on small model; generation on main model | ~200ms |
| Only main model configured | Everything on the main model | ~800ms |
| Neither configured | Mock mode (no real LLM) | <50ms |

# .env example (Qwen qwen-turbo)
SMALL_LLM_API_KEY=sk-xxxxx
SMALL_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
SMALL_LLM_MODEL=qwen-turbo
SMALL_MODEL_THRESHOLD=0.5
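The routing table above can be read as the following decision function. It is a sketch of the documented behavior, not ModelRouter's actual code; `route_model` and the task labels `"intent"` and `"generation"` are illustrative names. The case of a small model without a main model is not covered by the table, so the sketch rejects it rather than guess.

```python
def route_model(task: str, small_configured: bool, main_configured: bool) -> str:
    """Pick "small", "main", or "mock" for a task, following the table above."""
    if not small_configured and not main_configured:
        return "mock"  # no real LLM is called
    if not main_configured:
        raise ValueError("small model without a main model is not covered by the table")
    if task == "intent" and small_configured:
        return "small"  # lightweight task goes to the small model
    return "main"  # generation, or fallback when no small model is configured
```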

Q: Will incorrect Langfuse configuration block the main chain?

A: No. The Langfuse client has a built-in fallback; on configuration errors it automatically becomes a no-op.

# Degradation logic in app/core/langfuse_client.py (simplified)
if not settings.LANGFUSE_ENABLED or not settings.LANGFUSE_PUBLIC_KEY:
    # All methods return None without raising
    return NoOpLangfuseTrace()

Conditions that trigger fallback:

LANGFUSE_ENABLED=False
LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY is empty
Langfuse service connection timeout (default 3 seconds)

Impact after fallback

After fallback, traces are not reported to Langfuse, but the local Monitor still records trace summaries, viewable via /api/v1/monitor/traces.
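The three fallback conditions listed above can be sketched with a no-op object that swallows every call. The names `NoOpTrace`, `make_trace`, and `real_factory` are assumptions for illustration; the project's `NoOpLangfuseTrace` may be structured differently.

```python
class NoOpTrace:
    """Accepts any method call and returns None, so call sites need no guards."""

    def __getattr__(self, name):
        def _noop(*args, **kwargs):
            return None

        return _noop


def make_trace(enabled: bool, public_key: str, secret_key: str, real_factory):
    """Return a real trace when possible, otherwise a NoOpTrace.

    `real_factory` is a hypothetical zero-argument callable that builds the
    real client and raises TimeoutError if the service cannot be reached.
    """
    if not enabled or not public_key or not secret_key:
        return NoOpTrace()
    try:
        return real_factory()
    except TimeoutError:
        return NoOpTrace()
```

Because `NoOpTrace` answers any method name, tracing calls elsewhere in the code can stay unconditional.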

Q: Should BUSINESS_ADAPTER_MODE be mock or http?

A: Use mock for development and testing; switch to http when integrating with a real business system.

# mock mode: in-memory simulated order/membership/return APIs; out of the box
BUSINESS_ADAPTER_MODE=mock

# http mode: calls the real business system REST API
BUSINESS_ADAPTER_MODE=http
BUSINESS_API_BASE_URL=https://your-business-api.com
BUSINESS_API_KEY=your-business-api-key
BUSINESS_API_TIMEOUT=10

http mode fallback

When BUSINESS_API_BASE_URL is empty, the system automatically falls back to mock and emits a warning; it does not block business queries.
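The mode selection and its fallback can be expressed in a few lines. This is a sketch assuming the two settings are read as plain strings; `resolve_adapter_mode` is a hypothetical helper, and the warning category is an arbitrary choice.

```python
import warnings


def resolve_adapter_mode(mode: str, base_url: str) -> str:
    """Return the effective adapter mode, "mock" or "http"."""
    mode = mode.strip().lower()
    if mode not in ("mock", "http"):
        raise ValueError(f"unknown BUSINESS_ADAPTER_MODE: {mode!r}")
    if mode == "http" and not base_url.strip():
        # Fall back instead of failing, so business queries keep working
        warnings.warn(
            "BUSINESS_API_BASE_URL is empty; falling back to mock",
            RuntimeWarning,
        )
        return "mock"
    return mode
```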

Q: How do I adjust working hours (the escalation time window)?

A: Modify WORKING_HOURS_START / WORKING_HOURS_END in .env:

# 24-hour format, half-open interval [START, END). Outside this range, emotion/failure requests are not proactively escalated
WORKING_HOURS_START=9
WORKING_HOURS_END=18
TIMEZONE=Asia/Shanghai

User-initiated transfers are not constrained by the time window

Even outside working hours, if the user explicitly says "transfer to human", escalation is still triggered. Only system-initiated escalations are constrained by the time window.
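The half-open interval and the user-request exemption can be checked as follows. The sketch assumes `now` has already been converted to the configured TIMEZONE; the function names are illustrative.

```python
from datetime import datetime


def within_working_hours(now: datetime, start: int = 9, end: int = 18) -> bool:
    """True when now.hour falls in the half-open interval [start, end)."""
    return start <= now.hour < end


def may_escalate(now: datetime, user_requested: bool, start: int = 9, end: int = 18) -> bool:
    """User-initiated transfers bypass the window; system-initiated ones do not."""
    return user_requested or within_working_hours(now, start, end)
```

With the defaults, 17:59 is inside the window and 18:00 is outside it, which is what the half-open interval implies.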

Usage

Q: Answers don't change after a knowledge base update. What should I do?

A: You need to call /api/v1/performance/cache/invalidate to clear the hot cache.

# Clear the hot cache; the next query will go through retrieval again
curl -X POST http://localhost:8000/api/v1/performance/cache/invalidate \
  -H "X-API-Key: $API_KEY"

Reason: HotQueryCache caches replies for high-frequency queries (default 1000 entries). After a knowledge base update, it keeps returning the stale replies until it is invalidated.

Auto-clear cache after batch ingestion

The batch ingestion script already includes a cache-clear call, so no manual clearing is needed.
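To see why stale answers persist until invalidation, consider a bounded reply cache keyed by query text. The sketch below assumes least-recently-used eviction, which the source does not specify; `HotCacheSketch` is an illustrative class, not the project's HotQueryCache.

```python
from collections import OrderedDict


class HotCacheSketch:
    """Bounded query-to-reply cache with LRU eviction (an assumption)."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, query):
        if query not in self._data:
            return None
        self._data.move_to_end(query)  # mark as most recently used
        return self._data[query]

    def put(self, query, reply):
        self._data[query] = reply
        self._data.move_to_end(query)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def invalidate(self):
        """Drop every entry so the next query goes through retrieval again."""
        self._data.clear()
```

Nothing in `get` consults the knowledge base, so a cached reply survives any update until `invalidate` runs.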

Q: Streaming responses (SSE) frequently disconnect. What should I do?

A: The usual cause is Nginx or a CDN buffering the SSE stream; disable buffering at the proxy.

Nginx configuration:

location /api/v1/chat/stream {
    proxy_pass http://backend;
    proxy_buffering off;          # disable response buffering
    proxy_cache off;              # disable caching
    chunked_transfer_encoding on; # enable chunked transfer
    proxy_read_timeout 300s;      # extend read timeout
    proxy_set_header Connection '';  # clear the Connection header
}

Cloudflare CDN: Cloudflare buffers SSE by default; disable it in Page Rules:

- Match: */api/v1/chat/stream*
- Setting: Cache Level: Bypass

Client verification:

# -N disables curl buffering; verify whether SSE returns in real time
curl -N -X POST http://localhost:8000/api/v1/chat/stream \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "test streaming"}'

The system already sets the X-Accel-Buffering: no response header, but some proxies still require explicit configuration.
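When verifying from a script rather than curl, it helps to parse the SSE framing incrementally so that each event is visible as soon as it arrives. The sketch below handles only generic SSE `data:` lines and blank-line event boundaries; the payload format of `/api/v1/chat/stream` is not specified here, so nothing about it is assumed.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event from text lines.

    An event ends at a blank line. Comment lines (starting with ":") and
    fields other than "data" are ignored. Data left over without a closing
    blank line is discarded, as the SSE specification requires.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is part of the framing
            buffer.append(value[1:] if value.startswith(" ") else value)
```

Feeding this generator from a streaming HTTP response and printing a timestamp per event shows whether a proxy is still batching the output.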

Q: Is the session history retained after escalation?

A: Yes. After escalation, the session history is fully retained; the agent can view the entire context.

```bash
# View session details; the history field contains all conversation records
curl http://localhost:8000/api/v1/agent/sessions/$SESSION_ID \
  -H "X-API-Key: $API_KEY" | jq .history
```

The returned `history` includes all user / assistant messages from before the escalation, so the agent can quickly understand the user's request.

Q: How does the system decide whether to escalate to a human?

A: The system automatically triggers escalation in the following three scenarios:

| Trigger | Description |
| --- | --- |
| User explicit request | Matches keywords such as "transfer to human" or "human agent" |
| Emotion-sensitive intent | Emotion score below threshold; recognized as `emotion_sensitive` |
| Consecutive failures reach threshold | `failed_attempts >= ESCALATE_FAILED_THRESHOLD` (default 3) |

After escalation, an EscalationCard is generated with the escalation reason, priority, and context summary, and is passed to the agent workbench.
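The three triggers can be sketched as one function that returns the first matching reason. The keyword list, the reason labels, and the evaluation order are assumptions for illustration; only the `emotion_sensitive` intent name and the default threshold of 3 come from the table above.

```python
ESCALATE_KEYWORDS = ("transfer to human", "human agent")  # illustrative list


def escalation_reason(message, intent, failed_attempts, failed_threshold=3):
    """Return a reason label if the turn should escalate, otherwise None."""
    text = message.lower()
    if any(keyword in text for keyword in ESCALATE_KEYWORDS):
        return "user_request"
    if intent == "emotion_sensitive":
        return "emotion_sensitive"
    if failed_attempts >= failed_threshold:
        return "consecutive_failures"
    return None
```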

Q: How do I record a human agent's solution so that it flows back into the knowledge base?

A: Submit it via /api/v1/agent/sessions/{session_id}/solution or /api/v1/escalation/solution.

```bash
# Option 1: agent endpoint (recommended; associated with the session)
curl -X POST http://localhost:8000/api/v1/agent/sessions/$SESSION_ID/solution \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I return a product?",
    "solution": "Please click Return on the order page...",
    "intent": "knowledge_qa"
  }'

# Option 2: escalation endpoint (independent of a session)
curl -X POST http://localhost:8000/api/v1/escalation/solution \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "xxx", "question": "...", "solution": "..."}'
```

After submission, the solution enters the pending-review queue. Once it is approved and ingested as an FAQ, subsequent bot retrievals can match it.

Q: How do I view traces to troubleshoot?

A: The system provides two ways to view traces:

Local Monitor:

```bash
# View the 10 most recent traces
curl "http://localhost:8000/api/v1/monitor/traces?limit=10" | jq .

# View a single trace's details (including per-step duration)
curl http://localhost:8000/api/v1/monitor/traces/$TRACE_ID | jq .
```

Langfuse console: after configuring LANGFUSE_ENABLED=True, visit Langfuse Cloud to view:

Full-chain trace visualization
Name and version for each of the 11 prompt markers
Automatic token / cost / latency statistics

See Example 5: Integrating with Langfuse.

Performance

Q: The first query is very slow (~2.7 seconds). What should I do?

A: Most of the time goes to the query rewrite phase on the main LLM (DeepSeek), about 1.5 s of the 2.7 s total; disabling query rewrite removes that cost.

Latency breakdown (typical scenario):

| Phase | Latency | Notes |
| --- | --- | --- |
| Intent recognition | ~300ms | Small model (qwen-turbo) |
| Query rewrite | ~1500ms | Main LLM (DeepSeek); blocks the main chain |
| Hybrid retrieval | ~280ms | Vector + BM25 + RRF |
| Reranking | ~150ms | BGE reranker |
| Generation | ~500ms | Main LLM streaming generation |

Optimization options:

```bash
# Option 1: disable query rewrite (sacrifice a little recall for latency)
# Set enable_query_rewrite to false in retrieval_tuner
curl -X PUT http://localhost:8000/api/v1/tuner/params \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enable_query_rewrite": false}'

# Option 2: use the streaming endpoint (first token <1s; users perceive it as faster)
curl -N -X POST http://localhost:8000/api/v1/chat/stream \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "return process"}'
```

Hot cache hit gives first token <100ms

The second request with the same query hits HotQueryCache, skipping all orchestration; first token <100ms.
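The phase latencies in this answer's breakdown add up to the quoted 2.7 seconds, and the same arithmetic shows what disabling query rewrite saves. The sketch assumes the phases run strictly one after another; the dictionary keys are illustrative labels for the phases listed in the breakdown.

```python
# Typical per-phase latencies in milliseconds, taken from the breakdown above
PHASE_LATENCY_MS = {
    "intent_recognition": 300,
    "query_rewrite": 1500,
    "hybrid_retrieval": 280,
    "reranking": 150,
    "generation": 500,
}


def total_latency_ms(phases, skip=()):
    """Sum phase latencies, leaving out any phase named in `skip`."""
    return sum(ms for name, ms in phases.items() if name not in skip)


print(total_latency_ms(PHASE_LATENCY_MS))                           # 2730
print(total_latency_ms(PHASE_LATENCY_MS, skip=("query_rewrite",)))  # 1230
```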

Q: Responses slow down under heavy concurrency. What should I do?

A: Adjust retrieval parameters and enable the hot cache to reduce per-request overhead.

Adjust retrieval parameters:

```bash
# Lower VECTOR_TOP_K and BM25_TOP_K to reduce recall volume
curl -X PUT http://localhost:8000/api/v1/tuner/params \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vector_top_k": 15,
    "bm25_top_k": 15,
    "rerank_top_k": 3
  }'
```

| Parameter | Default | Recommended (high concurrency) | Impact |
| --- | --- | --- | --- |
| `VECTOR_TOP_K` | 25 | 15 | Vector recall volume |
| `BM25_TOP_K` | 25 | 15 | Keyword recall volume |
| `RERANK_TOP_K` | 5 | 3 | Number of chunks entering generation |

Confirm the hot cache is effective:

```bash
# View the cache hit rate; should be >50% to be considered effective
curl http://localhost:8000/api/v1/performance/cache/stats | jq .
```

If the hit rate is low, check whether HotQueryCache is enabled, or increase the cache capacity.

View comprehensive performance metrics:

```bash
curl http://localhost:8000/api/v1/performance/metrics | jq .
# Focus on cache_hit_rate / avg_response_ms / concurrent_requests
```
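The latency breakdown earlier on this page describes hybrid retrieval as "Vector + BM25 + RRF". To show why smaller top-k lists reduce work, here is a sketch of standard reciprocal rank fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the commonly used default and is an assumption here, as is the function name `rrf_fuse`; the project's own fusion code may differ.

```python
def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document id for determinism
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
print(rrf_fuse([vector_hits, bm25_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

The fused list can contain up to `vector_top_k + bm25_top_k` distinct candidates, all of which reach the reranker, so lowering both values shrinks the per-request workload.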

Q: Memory usage is too high. What should I do?

A: Lower EMBEDDING_BATCH_SIZE and persist the ChromaDB index to disk.

```bash
# .env: adjust batch size to lower the per-embedding memory peak
EMBEDDING_BATCH_SIZE=16  # default 32; halve when memory is tight

# ChromaDB persistence directory to avoid in-memory indexes
CHROMA_PERSIST_DIR=./chroma_data
```

Main sources of memory usage:

| Component | Usage | Optimization |
| --- | --- | --- |
| BGE embedding model | ~2GB | Model resident in memory; cannot be released |
| ChromaDB index | ~500MB | Persist to disk; reduce in-memory indexes |
| LLM client connection pool | ~100MB | Reuse connections; avoid repeated creation |
| Session history | Depends on concurrency | Adjust `MAX_HISTORY_LENGTH` |

Container deployment recommendation

For Docker deployment, reserve 4GB of memory. For Kubernetes, set resources.limits.memory: 4Gi.

Q: How do I monitor token usage to avoid budget overruns?

A: View usage via the /api/v1/observability/token-usage endpoint, combined with the alerting mechanism.

```bash
# View hourly token usage
curl "http://localhost:8000/api/v1/observability/token-usage?window=hour" | jq .

# View the alert list (including token over-budget alerts)
curl "http://localhost:8000/api/v1/observability/alerts?source=token_usage" | jq .
```

ModelRouter automatically reduces cost

After configuring SMALL_LLM_API_KEY, lightweight tasks such as intent recognition use the small model (cost is about 1/10 of GPT-4o-mini); the main LLM is only used for generation.

Q: How do I evaluate whether retrieval effectiveness meets the bar?

A: Call /api/v1/evaluation/run to trigger an evaluation; focus on Recall@5 and Hit Rate.

```bash
# Evaluate with the built-in 30-case test set
curl -X POST http://localhost:8000/api/v1/evaluation/run \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}' | jq .

# View historical reports
curl http://localhost:8000/api/v1/evaluation/reports | jq .
```

Pass criteria (project measurements):

| Metric | Target | Measured |
| --- | --- | --- |
| Recall@5 | ≥ 0.85 | 1.0 |
| Hit Rate | ≥ 0.90 | 0.9333 |
| Hallucination rate | ≤ 0.10 | 0.0 |
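For readers who want to reproduce such numbers on their own test set, the sketch below uses the common textbook definitions of Recall@k and hit rate. The project's exact definitions are not given here and may differ, so treat the functions and their names as assumptions.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)


def hit_rate(cases, k=5):
    """Fraction of cases whose top-k results contain at least one relevant id.

    `cases` is a non-empty list of (retrieved_ids, relevant_ids) pairs.
    """
    hits = sum(1 for retrieved, relevant in cases if set(retrieved[:k]) & set(relevant))
    return hits / len(cases)
```

For instance, under this definition 28 hits out of 30 cases gives a hit rate of 0.9333 after rounding to four decimals.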

Related Documentation

API Reference: detailed descriptions of all endpoints
Examples: 5 hands-on examples
Configuration: 30+ configuration items explained
Performance Optimization: three-layer caching and concurrency optimization
Contributing Guide: guidelines for submitting Issues and PRs
