Configuration Guide

System configuration is loaded from environment variables and the .env file via pydantic-settings. The options are defined on the Settings class in app/core/config.py. This document covers every configuration option and the part of the system each one affects.

Configuration file location

The configuration template is .env.example in the project root; the actual configuration goes in .env. Both files are UTF-8 encoded.

Configuration Priority

Configuration loading follows this priority (high to low):

Environment Variables  >  .env File  >  Settings Class Defaults

Environment variables: set with export or as container environment variables at deployment; highest priority
.env file: commonly used for local development; this is the .env file in the project root described above
Default values: Field defaults in the Settings class in app/core/config.py

Global singleton

get_settings() uses @lru_cache to ensure only one Settings instance is created per process, avoiding repeated reads of environment variables. In tests, call get_settings.cache_clear() to reset.
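The precedence and caching behaviour described above can be sketched with the standard library alone. This is only an illustration: the project itself uses pydantic-settings, and the `parse_dotenv` and `load_settings` helpers, the dict return type, and the three sample fields are assumptions made for the example, not the contents of app/core/config.py.

```python
import os
from functools import lru_cache

# Illustrative defaults; the real ones are field defaults on the Settings class.
DEFAULTS = {"APP_PORT": "8000", "DEBUG": "False", "LLM_MODEL": "gpt-4o-mini"}


def parse_dotenv(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_settings(dotenv_text: str, environ: dict) -> dict:
    """Merge the three sources so that environ > .env > defaults."""
    merged = dict(DEFAULTS)
    merged.update(parse_dotenv(dotenv_text))
    # Only keys that are already known settings are taken from the environment.
    merged.update({k: v for k, v in environ.items() if k in merged})
    return merged


@lru_cache
def get_settings() -> dict:
    """Build the settings once per process; later calls return the same object."""
    dotenv_text = "APP_PORT=9000\nDEBUG=True\n"  # stands in for reading .env
    return load_settings(dotenv_text, dict(os.environ))
```

In a test, `get_settings.cache_clear()` drops the cached object so that the next `get_settings()` call rebuilds it from the current environment.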

Application Basic Configuration

Variable	Default	Description	Impact
`APP_NAME`	`Intelligent Customer Service System`	Application name, used for logs and monitoring tags	Global
`APP_HOST`	`0.0.0.0`	Service listen address	Service startup
`APP_PORT`	`8000`	Service listen port	Service startup
`DEBUG`	`False`	Debug mode; when enabled, logs are more verbose and error stack traces are returned directly	Global

APP_NAME=Intelligent Customer Service System
APP_HOST=0.0.0.0
APP_PORT=8000
DEBUG=True

Production environment

In production, always set DEBUG=False to avoid leaking sensitive information through error stack traces.

Authentication Configuration

Variable	Default	Description	Impact
`API_KEY`	`""` (empty)	Server-side API Key; clients must send it in the `X-API-Key` header; empty enables dev no-auth mode	All API endpoints

# Dev mode (no auth)
API_KEY=

# Production mode (auth required)
API_KEY=your-secret-api-key

Must configure in production

When API_KEY is empty, all API endpoints can be accessed anonymously. Production deployments must set a strong random key, and clients must send it with every request:

curl -H "X-API-Key: your-secret-api-key" http://localhost:8000/api/v1/chat ...
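The check described above (an empty key means no-auth mode, otherwise the `X-API-Key` header must match) might look roughly like the following. The function name `is_authorized` and the plain-dict header lookup are assumptions for illustration; the project's actual dependency or middleware may differ, and a real framework would also treat header names case-insensitively.

```python
import hmac


def is_authorized(headers: dict, server_api_key: str) -> bool:
    """Return True when the request may proceed."""
    if not server_api_key:
        # Empty API_KEY: dev no-auth mode, every request is accepted.
        return True
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode(), server_api_key.encode())
```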

LLM Configuration

The main LLM handles core tasks such as query rewriting, RAG answer generation, and dialog polishing.

Variable	Default	Description	Impact
`LLM_API_KEY`	`""` (empty)	LLM API Key; empty auto-enables `_MockLLM` mock mode	LLM calls
`LLM_BASE_URL`	`https://api.openai.com/v1`	LLM API base URL, supports any OpenAI-compatible interface	LLM calls
`LLM_MODEL`	`gpt-4o-mini`	Main model name	LLM calls

DeepSeek (recommended)

LLM_API_KEY=sk-your-deepseek-key
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

Qwen

LLM_API_KEY=sk-your-qwen-key
LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
LLM_MODEL=qwen-plus

OpenAI

LLM_API_KEY=sk-your-openai-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini

Mock mode

When LLM_API_KEY is empty, LLMClient automatically instantiates _MockLLM, which calls no external service and assembles a reply by extracting content from the user message. In this mode, intent recognition uses keyword rules and answer generation concatenates retrieval results. The full chain runs, but without real LLM capability.
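The selection between a mock and a real client can be pictured as a small factory. The sketch below is hypothetical: `MockLLM`, `RemoteLLM`, `make_llm`, and the `chat` method are names invented for the example, and the mock's reply format is a guess at "extracting content from the user message", not the behaviour of the project's _MockLLM.

```python
class MockLLM:
    """Offline stand-in: builds a reply from the user message, no network."""

    def chat(self, user_message: str) -> str:
        snippet = user_message.strip()[:40]
        return f"[mock reply] You asked about: {snippet}"


class RemoteLLM:
    """Placeholder for a real OpenAI-compatible client (call omitted here)."""

    def __init__(self, api_key: str, base_url: str, model: str) -> None:
        self.api_key = api_key
        self.base_url = base_url
        self.model = model

    def chat(self, user_message: str) -> str:
        raise NotImplementedError("network call omitted from this sketch")


def make_llm(api_key: str, base_url: str, model: str):
    """Pick the mock when LLM_API_KEY is empty, otherwise the remote client."""
    if not api_key:
        return MockLLM()
    return RemoteLLM(api_key, base_url, model)
```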

Small Model Configuration

The small model handles simple tasks such as intent recognition to lower first-token latency. ModelRouter routes by complexity score: simple queries go to the small model (~1s), complex queries go to the main LLM.

Variable	Default	Description	Impact
`SMALL_LLM_API_KEY`	`""` (empty)	Small model API Key; empty auto-degrades ModelRouter to the main LLM, no side effects	Intent recognition
`SMALL_LLM_BASE_URL`	`https://ark.cn-beijing.volces.com/api/v3`	Small model API base URL (default Doubao)	Intent recognition
`SMALL_LLM_MODEL`	`doubao-lite-4k`	Small model name	Intent recognition
`SMALL_MODEL_THRESHOLD`	`0.5`	Small model routing threshold: queries with complexity below this go to the small model (0-1)	Intent recognition

Doubao (default)

SMALL_LLM_API_KEY=your-volc-key
SMALL_LLM_BASE_URL=https://ark.cn-beijing.volces.com/api/v3
SMALL_LLM_MODEL=doubao-lite-4k

Qwen qwen-turbo

SMALL_LLM_API_KEY=sk-your-qwen-key
SMALL_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
SMALL_LLM_MODEL=qwen-turbo

Why a small model?

Intent recognition is the first step of every conversation and is latency-sensitive. The main LLM (e.g., DeepSeek-V3) has a first-token time of ~2.7s, while a small model (Doubao/Qwen-turbo) takes ~1s. Routing simple intent recognition to the small model reduces the overall average response time from 2.7s to 2.27s. With IntentCache hits, it can drop further to ~800ms.
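A minimal sketch of the routing rule, assuming the complexity score (0 to 1) has already been computed elsewhere; how ModelRouter derives that score is not covered in this guide. The function name `pick_model` and the `"small"` / `"main"` return values are illustrative.

```python
def pick_model(complexity: float, threshold: float, small_api_key: str) -> str:
    """Return "small" or "main" for a query with the given complexity score."""
    if not small_api_key:
        # SMALL_LLM_API_KEY is empty: everything goes to the main LLM.
        return "main"
    # Queries strictly below SMALL_MODEL_THRESHOLD go to the small model.
    return "small" if complexity < threshold else "main"
```

With the default SMALL_MODEL_THRESHOLD=0.5, a score of 0.2 is routed to the small model, while 0.5 and above stay on the main LLM.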

Langfuse Configuration

Langfuse provides LLM tracing and prompt version management. When it is not configured or LANGFUSE_ENABLED=False, all Langfuse calls degrade to no-ops and the main path is unaffected.

Variable	Default	Description	Impact
`LANGFUSE_ENABLED`	`False`	Whether to enable Langfuse; `False` degrades all to no-op	LLM tracing
`LANGFUSE_PUBLIC_KEY`	`""` (empty)	Langfuse public key (from Project Settings → API Keys)	LLM tracing
`LANGFUSE_SECRET_KEY`	`""` (empty)	Langfuse secret key, keep confidential	LLM tracing
`LANGFUSE_HOST`	`https://cloud.langfuse.com`	Langfuse service address; for self-hosted, use the intranet address	LLM tracing

# Enable Langfuse cloud
LANGFUSE_ENABLED=True
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=https://cloud.langfuse.com

# Self-hosted deployment
LANGFUSE_HOST=http://your-langfuse-server:3000

Degradation mechanism

When LANGFUSE_ENABLED=False or a key is empty, LangfuseClient degrades to a no-op: start_langfuse_trace returns None, and finish_langfuse_trace(None, ...) silently skips. LLM calls fall back to the native OpenAI SDK rather than the langfuse.openai wrapper, so the main path is completely unaffected.
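The no-op pattern can be illustrated without the Langfuse SDK. In the sketch below, `start_trace` and `finish_trace` are simplified stand-ins for the functions named above; their signatures and the dict used as a trace object are assumptions for the example only.

```python
from typing import Optional


def start_trace(enabled: bool, public_key: str, secret_key: str, name: str) -> Optional[dict]:
    """Return a trace handle, or None when tracing is disabled or unconfigured."""
    if not (enabled and public_key and secret_key):
        return None
    return {"name": name, "events": []}


def finish_trace(trace: Optional[dict], output: str) -> None:
    """Record the output on the trace; silently skip when there is no trace."""
    if trace is None:
        return
    trace["events"].append(output)
```

Because callers always pass whatever `start_trace` returned, the request path contains no branching on whether tracing is enabled.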

Embedding and ChromaDB Configuration

Variable	Default	Description	Impact
`EMBEDDING_MODEL`	`BAAI/bge-large-zh-v1.5`	Embedding model name (1024 dimensions)	Vector retrieval
`EMBEDDING_LOCAL_CACHE_DIR`	`./models/bge-large-zh`	BGE local cache directory, loaded from here in offline environments	Vector retrieval
`HF_MIRROR_URL`	`https://hf-mirror.com`	HuggingFace mirror, used as a fallback on mainland China networks	Model download
`EMBEDDING_LOAD_TIMEOUT`	`60`	Model load timeout in seconds	Model loading
`EMBEDDING_BATCH_SIZE`	`32`	Embedding batch size, reduce when memory is tight	Document ingestion
`CHROMA_PERSIST_DIR`	`./chroma_data`	ChromaDB persistence directory	Vector store
`CHROMA_COLLECTION_NAME`	`knowledge_base`	ChromaDB collection name, used to isolate different business knowledge bases	Vector store

EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5
EMBEDDING_LOCAL_CACHE_DIR=./models/bge-large-zh
HF_MIRROR_URL=https://hf-mirror.com
EMBEDDING_BATCH_SIZE=32
CHROMA_PERSIST_DIR=./chroma_data
CHROMA_COLLECTION_NAME=knowledge_base

Dimension consistency

BGE outputs 1024 dimensions, and hash fallback also aligns to 1024 dimensions. All vectors in the same vector store must have consistent dimensions, otherwise retrieval reports a schema conflict. Switching embedding models requires clearing CHROMA_PERSIST_DIR and re-ingesting.

Text Splitting Configuration

Variable	Default	Description	Impact
`CHUNK_SIZE`	`512`	Text split chunk size (characters)	Document ingestion
`CHUNK_OVERLAP`	`128`	Split overlap (characters), ensures context continuity	Document ingestion

CHUNK_SIZE=512
CHUNK_OVERLAP=128

How to choose splitting parameters

CHUNK_SIZE too large: too much info per chunk, retrieval precision drops
CHUNK_SIZE too small: context breaks, LLM generation quality drops
CHUNK_OVERLAP is typically 20%-30% of CHUNK_SIZE
For short FAQ-style documents, reduce CHUNK_SIZE/CHUNK_OVERLAP to 256/64; for long documents, increase them to 1024/256
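To make the interaction between the two parameters concrete, here is a minimal fixed-size character splitter with overlap. It is a sketch under the assumption that splitting is purely character-based; the project's actual splitter may respect sentence or paragraph boundaries, and `split_text` is a name chosen for this example.

```python
def split_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, each new chunk starts 384 characters after the previous one, so a 1000-character document yields three chunks.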

Retrieval Threshold Configuration

Variable	Default	Description	Impact
`SIMILARITY_THRESHOLD`	`0.6`	Retrieval similarity threshold; results scoring below it are treated as weakly relevant and filtered out, and the system does not force an answer when nothing passes	RAG retrieval
`DEDUP_THRESHOLD`	`0.95`	Ingestion dedup threshold; above this is treated as a duplicate document, skipped	Document ingestion

SIMILARITY_THRESHOLD=0.6
DEDUP_THRESHOLD=0.95

Threshold tuning tips

SIMILARITY_THRESHOLD too high (e.g., 0.8): recall drops, many valid results are filtered
SIMILARITY_THRESHOLD too low (e.g., 0.3): false recalls increase, LLM may fabricate based on weakly related fragments
Recommended range: 0.5 to 0.7; in the project's own tests, Recall@5 was 1.0 at 0.6
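The effect of moving the threshold can be seen with a few made-up scores. The sketch assumes retrieval returns `(doc_id, similarity)` pairs with similarity in the 0 to 1 range; `filter_hits` and the sample data are illustrative only.

```python
def filter_hits(hits, threshold=0.6):
    """Keep (doc_id, similarity) pairs whose similarity reaches the threshold."""
    return [(doc_id, score) for doc_id, score in hits if score >= threshold]


# Made-up similarity scores for three retrieved chunks.
sample_hits = [("a", 0.82), ("b", 0.65), ("c", 0.41)]

strict = filter_hits(sample_hits, threshold=0.8)    # only "a" survives
default = filter_hits(sample_hits, threshold=0.6)   # "a" and "b" survive
lenient = filter_hits(sample_hits, threshold=0.3)   # all three survive
```

An empty result is the signal to decline rather than generate an answer from weak matches.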

Hybrid Retrieval Configuration

Hybrid retrieval = vector recall + BM25 recall + RRF fusion + Reranker reranking.

Variable	Default	Description	Impact
`VECTOR_TOP_K`	`25`	Vector recall count	RAG retrieval
`BM25_TOP_K`	`25`	BM25 keyword recall count	RAG retrieval
`RRF_K`	`60`	RRF fusion smoothing parameter, smooths rank differences to avoid top-1 dominance	RAG retrieval
`RRF_VECTOR_WEIGHT`	`0.6`	RRF vector path weight (60%)	RAG retrieval
`RRF_KEYWORD_WEIGHT`	`0.4`	RRF keyword path weight (40%)	RAG retrieval
`RERANK_TOP_K`	`5`	Top-K after Reranker reranking to enter the final answer	RAG retrieval
`RERANKER_MODEL`	`BAAI/bge-reranker-base`	CrossEncoder reranker model name	RAG retrieval

VECTOR_TOP_K=25
BM25_TOP_K=25
RRF_K=60
RRF_VECTOR_WEIGHT=0.6
RRF_KEYWORD_WEIGHT=0.4
RERANK_TOP_K=5
RERANKER_MODEL=BAAI/bge-reranker-base

RRF fusion formula

score = Σ weight_i × 1 / (k + rank_i)

rank_i: the rank of a chunk in the i-th recall path (starting from 1)
k=60: empirical smoothing value, prevents rank=1 results from dominating
Vector weight 0.6 + keyword weight 0.4 = 1.0

Why is the vector weight higher? Vector retrieval excels at semantic matching (synonyms, paraphrases). In customer service scenarios, users ask in diverse ways, so semantic matching is more important than keyword matching. But keyword matching captures proper nouns (product models, order numbers), so a 40% weight is retained.
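The fusion formula above can be written as a short function. This is a sketch of weighted RRF as stated in the formula, not the project's retriever code; the function name `weighted_rrf` and the input shape (a dict mapping each recall path to its ranked chunk ids) are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings: dict, weights: dict, k: int = 60) -> list:
    """Fuse ranked lists: score = sum over paths of weight / (k + rank).

    rankings maps a path name to chunk ids ordered best first;
    weights maps the same path names to their RRF weight.
    """
    scores = defaultdict(float)
    for path, ranked_ids in rankings.items():
        for rank, chunk_id in enumerate(ranked_ids, start=1):  # ranks start at 1
            scores[chunk_id] += weights[path] / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = weighted_rrf(
    rankings={"vector": ["a", "b"], "keyword": ["b", "c"]},
    weights={"vector": 0.6, "keyword": 0.4},
)
```

In this example chunk "b" appears in both paths, so its two contributions (0.6/62 + 0.4/61) add up and it outranks "a", which is first in the vector path only.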

Human Escalation Configuration

Variable	Default	Description	Impact
`WORKING_HOURS_START`	`9`	Human service start time (24-hour format, [START, END))	Escalation decision
`WORKING_HOURS_END`	`18`	Human service end time	Escalation decision
`TIMEZONE`	`Asia/Shanghai`	Timezone, used for working hours check	Escalation decision

WORKING_HOURS_START=9
WORKING_HOURS_END=18
TIMEZONE=Asia/Shanghai

Escalation rules

During working hours [9, 18): sensitive sentiment, consecutive failures, or an explicit user request → escalate to human
Outside working hours: sentiment and failures do not trigger escalation (to avoid escalations that nobody answers)
When a user explicitly requests "escalate to human", the time window is ignored and the request is always escalated, so explicit user requests are never blocked
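The three rules above reduce to a short decision function. The sketch below is an assumption about how they could be combined; `should_escalate` and its boolean inputs are hypothetical names. It takes a timezone-aware `now` and a `tzinfo` object, and the usage lines use a fixed UTC+8 offset in place of `ZoneInfo("Asia/Shanghai")`, which is equivalent for this zone because it has no daylight saving time.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def should_escalate(now: datetime, tz: tzinfo, start_hour: int, end_hour: int,
                    user_requested: bool, sentiment_sensitive: bool,
                    consecutive_failures: bool) -> bool:
    """Decide whether to hand the conversation to a human agent."""
    if user_requested:
        # An explicit request ignores the working-hours window.
        return True
    local_hour = now.astimezone(tz).hour  # now must be timezone-aware
    in_working_hours = start_hour <= local_hour < end_hour  # [START, END)
    return in_working_hours and (sentiment_sensitive or consecutive_failures)


SHANGHAI = timezone(timedelta(hours=8))
nine_am = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)    # 09:00 in UTC+8
six_pm = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)    # 18:00 in UTC+8
```

Because the window is half-open, 09:00 counts as working hours and 18:00 does not.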

Business Adapter Configuration

Variable	Default	Description	Impact
`BUSINESS_ADAPTER_MODE`	`mock`	Adapter mode: `mock`=in-memory mock; `http`=real business system REST API	Business query
`BUSINESS_API_BASE_URL`	`""` (empty)	Real business system API base URL, required for `http` mode; empty auto-degrades to mock with a warning	Business query
`BUSINESS_API_KEY`	`""` (empty)	Real business system API Key, sent in `X-API-Key` header for auth	Business query
`BUSINESS_API_TIMEOUT`	`10`	HTTP call timeout in seconds, avoids long hangs on a single call	Business query

mock mode (default, out of the box)

BUSINESS_ADAPTER_MODE=mock
# The following are ignored in mock mode
BUSINESS_API_BASE_URL=
BUSINESS_API_KEY=
BUSINESS_API_TIMEOUT=10

http mode (real business system)

BUSINESS_ADAPTER_MODE=http
BUSINESS_API_BASE_URL=https://your-business-api.com
BUSINESS_API_KEY=your-business-api-key
BUSINESS_API_TIMEOUT=10

http mode degradation

When BUSINESS_ADAPTER_MODE=http but BUSINESS_API_BASE_URL is empty, the system auto-degrades to mock mode and logs a warning. On call failure, BusinessAgent also degrades to the mock business system.
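Both degradation paths described above can be sketched in a few lines. The names `resolve_adapter_mode` and `query_with_fallback` are hypothetical, as is the idea of passing the real and mock queries as plain callables; the project's adapter classes may be structured differently.

```python
import logging

logger = logging.getLogger("business_adapter")


def resolve_adapter_mode(mode: str, base_url: str) -> str:
    """Return the effective adapter mode after applying the startup fallback."""
    if mode == "http" and not base_url:
        logger.warning("BUSINESS_API_BASE_URL is empty; falling back to mock mode")
        return "mock"
    return mode


def query_with_fallback(primary, fallback, *args):
    """Call the real business system; on any failure, use the mock instead."""
    try:
        return primary(*args)
    except Exception as exc:
        logger.warning("business call failed (%s); using mock result", exc)
        return fallback(*args)
```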

Cache and External Service Configuration

Variable	Default	Description	Impact
`REDIS_URL`	`redis://localhost:6379/0`	Redis connection address (session storage, cache); degrades to in-memory queue when unreachable	Session/cache
`ELASTICSEARCH_URL`	`http://localhost:9200`	Elasticsearch address (full-text retrieval enhancement, optional)	Full-text retrieval

REDIS_URL=redis://localhost:6379/0
ELASTICSEARCH_URL=http://localhost:9200

Recommended Configuration

Development Environment

# .env (development)
APP_NAME=Intelligent Customer Service System
APP_HOST=0.0.0.0
APP_PORT=8000
DEBUG=True

# No auth, for easy debugging
API_KEY=

# Mock mode, no LLM Key needed
LLM_API_KEY=
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

# Mock business system
BUSINESS_ADAPTER_MODE=mock

# Langfuse disabled
LANGFUSE_ENABLED=False

# Default retrieval parameters
CHUNK_SIZE=512
CHUNK_OVERLAP=128
SIMILARITY_THRESHOLD=0.6
VECTOR_TOP_K=25
BM25_TOP_K=25
RERANK_TOP_K=5

Production Environment

# .env (production)
APP_NAME=Customer Service Prod
APP_HOST=0.0.0.0
APP_PORT=8000
DEBUG=False

# Enforce auth
API_KEY=<strong-random-key>

# Real LLM
LLM_API_KEY=sk-your-deepseek-key
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-chat

# Small model routing (lower latency)
SMALL_LLM_API_KEY=sk-your-qwen-key
SMALL_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
SMALL_LLM_MODEL=qwen-turbo
SMALL_MODEL_THRESHOLD=0.5

# Real business system
BUSINESS_ADAPTER_MODE=http
BUSINESS_API_BASE_URL=https://your-business-api.com
BUSINESS_API_KEY=<business-api-key>
BUSINESS_API_TIMEOUT=10

# Enable Langfuse tracing
LANGFUSE_ENABLED=True
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=https://cloud.langfuse.com

# Redis persistence
REDIS_URL=redis://redis:6379/0

# Retrieval parameters (production can increase recall appropriately)
VECTOR_TOP_K=30
BM25_TOP_K=30
RERANK_TOP_K=5

Runtime Effect Notes

The following options require a service restart after modification

Configuration is loaded via the @lru_cache global singleton and is not re-read after process startup. The following changes require a service restart to take effect:

Option	Reason
`LLM_API_KEY` / `LLM_BASE_URL` / `LLM_MODEL`	LLMClient singleton initializes on first call
`SMALL_LLM_*`	Small model client singleton initialization
`EMBEDDING_MODEL` / `EMBEDDING_LOCAL_CACHE_DIR`	EmbeddingService singleton caches after model load
`CHROMA_PERSIST_DIR` / `CHROMA_COLLECTION_NAME`	VectorStore singleton connects to ChromaDB
`LANGFUSE_*`	LangfuseClient singleton initialization
`BUSINESS_ADAPTER_MODE` / `BUSINESS_API_*`	Business adapter singleton initialization
`APP_HOST` / `APP_PORT`	Uvicorn startup parameters

Hot-updatable options (these take effect indirectly, after POST /api/v1/performance/cache/invalidate clears the cache):

SIMILARITY_THRESHOLD: after HotQueryCache is cleared, the next retrieval uses the new threshold
VECTOR_TOP_K / BM25_TOP_K / RERANK_TOP_K: after the cache is cleared, the next retrieval uses the new parameters

Note: Hot updates only affect requests made after cache invalidation; results already cached are not updated automatically. Call the cache invalidation endpoint after modifying retrieval parameters.
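Calling the invalidation endpoint from a script could look like the following, using only the standard library. The helper name `build_invalidate_request`, the empty request body, and the assumption that this endpoint accepts the same `X-API-Key` header as the other endpoints are not confirmed by this guide.

```python
import urllib.request


def build_invalidate_request(base_url: str, api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) the POST request that clears the cache."""
    url = base_url.rstrip("/") + "/api/v1/performance/cache/invalidate"
    headers = {"X-API-Key": api_key} if api_key else {}
    # An empty body is assumed; adjust if the endpoint expects a payload.
    return urllib.request.Request(url, data=b"", headers=headers, method="POST")
```

Sending it is then a single call: `urllib.request.urlopen(build_invalidate_request("http://localhost:8000", "your-secret-api-key"))`.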