Observability Tutorial

The system provides three layers of observability: Langfuse LLM tracing (full-chain visualization), monitoring endpoints (runtime state queries), and circuit breakers and alerts (self-healing and notifications). This tutorial covers configuration and typical troubleshooting scenarios.

Prerequisites

Observability endpoints use the prefixes /api/v1/observability and /api/v1/monitor and are not authenticated, so ops dashboards can access them without credentials.
When Langfuse is not configured, tracing automatically degrades to a no-op without affecting the main chain.

Observability Overview

flowchart TB
    A[Chat request] --> B[Monitor trace]
    A --> C[Langfuse trace]
    B --> D[/api/v1/monitor/*<br/>trace list/details/Agent stats]
    C --> E[Langfuse cloud dashboard<br/>full-chain visualization]
    F[Circuit breaker] --> G[/api/v1/observability/circuit-breakers]
    H[Alert engine] --> I[/api/v1/observability/alerts]
    J[Token tracking] --> K[/api/v1/observability/token-usage]
    L[Health check] --> M[/api/v1/observability/health]

Langfuse LLM Tracing

Langfuse provides full LLM call-chain visualization, covering intent recognition → query rewrite → RAG → summary → polishing. It automatically reports tokens/cost/latency.

Configuration

Enable Langfuse in .env by configuring 4 settings:

# Enable Langfuse (default False, degrades to no-op)
LANGFUSE_ENABLED=True
LANGFUSE_PUBLIC_KEY=pk-lf-xxx
LANGFUSE_SECRET_KEY=sk-lf-xxx
LANGFUSE_HOST=https://cloud.langfuse.com

Fallback mechanism

When any of the following conditions holds, Langfuse automatically degrades to a no-op (returning None) without affecting the main chain:

- LANGFUSE_ENABLED=False
- LANGFUSE_PUBLIC_KEY is empty
- LANGFUSE_SECRET_KEY is empty
- The langfuse library is not installed or fails to initialize

Callers take the no-op branch accordingly, with zero impact on business logic.
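The fallback rules above can be sketched as a small factory function. This is a minimal sketch under stated assumptions, not the project's actual code: `build_langfuse_client` and its `factory` argument are hypothetical names, the way `LANGFUSE_ENABLED` is parsed is assumed, and the real client constructor is injected as a callable so the example runs without the langfuse package installed.

```python
from typing import Any, Callable, Mapping, Optional


def build_langfuse_client(
    env: Mapping[str, str],
    factory: Callable[..., Any],
) -> Optional[Any]:
    """Return a client, or None (the no-op path) when any fallback condition holds.

    `factory` stands in for whatever actually constructs the Langfuse client;
    it is a hypothetical injection point used only for this illustration.
    """
    # Condition 1: tracing is disabled (assumed to be a case-insensitive "true" check).
    if env.get("LANGFUSE_ENABLED", "False").strip().lower() != "true":
        return None

    # Conditions 2 and 3: either key is empty or missing.
    public_key = env.get("LANGFUSE_PUBLIC_KEY", "")
    secret_key = env.get("LANGFUSE_SECRET_KEY", "")
    if not public_key or not secret_key:
        return None

    # Condition 4: the library is missing or initialization fails.
    try:
        return factory(
            public_key=public_key,
            secret_key=secret_key,
            host=env.get("LANGFUSE_HOST"),
        )
    except Exception:
        return None
```

A caller would then guard each tracing call with a simple `if client is not None:` check, which is one way to keep the business path identical in both modes.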

11 Prompt Markers

The system tags 11 key prompts with name and version, making it easy to filter and compare by prompt type in the Langfuse dashboard:

Prompt Name	Purpose	Trigger Scenario
`intent_recognition`	Intent recognition	First step of each turn
`query_rewrite`	Query rewrite	Multi-turn coreference resolution
`rag_qa`	RAG Q&A	Knowledge Q&A intent
`knowledge_summary`	Knowledge summary	Summary after RAG hit
`dialog_polish`	Dialog polishing	Final DialogAgent polishing
`business_extract`	Business parameter extraction	Business query intent
`business_format`	Business result formatting	Business query result
`emotion_analyze`	Emotion analysis	Emotion-sensitive intent
`ticket_extract`	Ticket info extraction	Ticket intent
`turn_summary`	Single-turn summary	End of each turn
`session_summary`	Session summary	Session end / escalation

Purpose of the version field

Each prompt is tagged with a version, which makes it easy to A/B test prompt versions. After modifying a prompt, increment its version; the Langfuse dashboard can then group token, cost, and latency statistics by version.
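To illustrate what grouping by version buys you, the sketch below averages latency per (prompt name, version) pair over a list of exported generation records. The record shape (`name`, `version`, `latency_ms`) and the function name are assumptions made for this example; they are not the Langfuse export format or the system's internal model.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping, Tuple


def avg_latency_by_version(
    records: Iterable[Mapping[str, object]],
) -> Dict[Tuple[str, str], float]:
    """Average latency per (prompt name, version) pair.

    Each record is assumed to look like
    {"name": "rag_qa", "version": "2", "latency_ms": 900}.
    """
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for record in records:
        key = (str(record["name"]), str(record["version"]))
        groups[key].append(float(record["latency_ms"]))
    return {key: mean(values) for key, values in groups.items()}
```

Comparing the averages for two versions of the same prompt name gives a rough A/B signal; the same grouping works for tokens or cost by swapping the field.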

Trace Visualization

The full chain is visible in the Langfuse cloud dashboard:

flowchart LR
    A[trace: stream_chat] --> B[generation: intent_recognition]
    B --> C[generation: query_rewrite]
    C --> D[generation: rag_qa]
    D --> E[generation: knowledge_summary]
    E --> F[generation: dialog_polish]
    F --> G[trace complete]

Each generation automatically reports:

tokens: prompt_tokens + completion_tokens
cost: auto-converted by model pricing
latency: single LLM call duration
metadata: session_id / monitor_trace_id for correlation

No Trace Created on Cache Hit

Avoid empty traces

On a HotQueryCache hit there is no LLM call, so the system does not create a Langfuse trace; this keeps meaningless empty traces out of the dashboard. A trace is created only on a cache miss, when the request goes through LLM orchestration. It is then written to the holder so that the outer layer can finish it and mark its final state.
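The control flow described here can be sketched as follows. Everything in this block is hypothetical: `answer_query`, the dict-based cache and holder, and the injected `create_trace` and `run_llm` callables are stand-ins chosen so the example runs on its own, not the system's real HotQueryCache or trace holder API.

```python
from typing import Any, Callable, MutableMapping


def answer_query(
    query: str,
    cache: MutableMapping[str, str],
    create_trace: Callable[[str], Any],
    run_llm: Callable[[str], str],
    holder: MutableMapping[str, Any],
) -> str:
    """Answer a query, creating a trace only when the cache misses."""
    cached = cache.get(query)
    if cached is not None:
        # Cache hit: no LLM call happens, so no trace is created.
        return cached

    # Cache miss: create the trace and hand it to the outer layer via the holder,
    # which is expected to finish it and mark its final state.
    holder["trace"] = create_trace(query)
    answer = run_llm(query)
    cache[query] = answer
    return answer
```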

Token/Cost/Latency Auto-reporting

Through the langfuse.openai wrapper, all LLM calls are automatically attached to the current trace, with no manual instrumentation needed:

# The system internally wraps LLMClient with langfuse.openai
# LLM calls automatically report token/cost/latency to the current trace
# Business code needs no extra instrumentation; the Langfuse trace is correlated with the monitor trace via metadata

Monitoring Endpoints

GET /api/v1/observability/health: Health Check

Runs all health checks and returns an aggregate report. Each check runs independently; a single failure does not affect the others:

curl http://localhost:8000/api/v1/observability/health

{
  "status": "degraded",
  "checks": {
    "llm": {"status": "up", "latency_ms": 320},
    "vector_store": {"status": "up", "size": 342},
    "redis": {"status": "down", "error": "connection refused"},
    "disk": {"status": "up", "free_gb": 12.5}
  }
}

Check descriptions

llm: LLM service connectivity, including latency
vector_store: ChromaDB availability and data size
redis: Redis connectivity (session storage)
disk: free disk space
status (top-level field): up (all checks pass) / degraded (partial degradation) / down (critical failure)
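The two ideas in this section, isolated checks and an aggregate status, can be sketched like this. The function names are hypothetical, and the rule that only `llm` counts as a critical dependency is an assumption made for the example; the document does not say which checks turn the overall status to down.

```python
from typing import Any, Callable, Dict, Iterable

CheckFn = Callable[[], Dict[str, Any]]


def run_checks(check_fns: Dict[str, CheckFn]) -> Dict[str, Dict[str, Any]]:
    """Run every check in isolation so one failure cannot abort the others."""
    results: Dict[str, Dict[str, Any]] = {}
    for name, fn in check_fns.items():
        try:
            results[name] = {"status": "up", **fn()}
        except Exception as exc:
            # A failing check is recorded as down; the loop keeps going.
            results[name] = {"status": "down", "error": str(exc)}
    return results


def aggregate_status(
    checks: Dict[str, Dict[str, Any]],
    critical: Iterable[str] = ("llm",),  # assumed set of critical dependencies
) -> str:
    """Collapse per-check results into up / degraded / down."""
    failed = {name for name, check in checks.items() if check.get("status") != "up"}
    if not failed:
        return "up"
    if failed & set(critical):
        return "down"
    return "degraded"
```

Under these assumptions, the sample response above (only redis down) aggregates to degraded, which matches its top-level status.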

GET /api/v1/monitor/traces: Trace List

Returns the recent trace list (summary, without step details), sorted by time descending:

curl "http://localhost:8000/api/v1/monitor/traces?limit=50"

[
  {
    "trace_id": "trace-xxx",
    "session_id": "sess-9f3c2a1b",
    "status": "success",
    "duration_ms": 1850,
    "started_at": "2026-07-03T10:00:00Z",
    "steps_count": 5
  }
]

GET /api/v1/monitor/traces/{trace_id}: Trace Details

Returns the details of a single trace, including each step and sub-tasks:

curl http://localhost:8000/api/v1/monitor/traces/trace-xxx

{
  "trace_id": "trace-xxx",
  "session_id": "sess-9f3c2a1b",
  "status": "success",
  "duration_ms": 1850,
  "steps": [
    {"name": "intent_recognition", "duration_ms": 320, "status": "success"},
    {"name": "retrieval", "duration_ms": 180, "status": "success"},
    {"name": "generation", "duration_ms": 1200, "status": "success"}
  ],
  "sub_tasks": [...]
}

Troubleshooting slow queries

Use trace details to locate the slowest step. generation is usually the large-model call; if it is too slow, consider routing to the small model (see the Performance Optimization Tutorial).
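Locating the slowest step can be automated once the trace details are parsed as JSON. The helper below is a minimal sketch: `slowest_step` is a hypothetical name, and it only relies on the `steps`, `name`, and `duration_ms` fields shown in the sample response above.

```python
from typing import Any, Dict, Optional


def slowest_step(trace: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the step with the largest duration_ms, or None if there are no steps."""
    steps = trace.get("steps") or []
    if not steps:
        return None
    return max(steps, key=lambda step: step["duration_ms"])


if __name__ == "__main__":
    sample = {
        "trace_id": "trace-xxx",
        "duration_ms": 1850,
        "steps": [
            {"name": "intent_recognition", "duration_ms": 320, "status": "success"},
            {"name": "retrieval", "duration_ms": 180, "status": "success"},
            {"name": "generation", "duration_ms": 1200, "status": "success"},
        ],
    }
    step = slowest_step(sample)
    # Share of the total trace duration spent in the slowest step.
    print(step["name"], round(step["duration_ms"] / sample["duration_ms"], 2))
```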

GET /api/v1/monitor/agents: Agent Call Statistics

Returns each Agent's current state (call count, average duration, success rate):

curl http://localhost:8000/api/v1/monitor/agents

[
  {
    "agent": "KnowledgeAgent",
    "calls": 156,
    "avg_duration_ms": 850,
    "success_rate": 0.95
  },
  {
    "agent": "BusinessAgent",
    "calls": 42,
    "avg_duration_ms": 1100,
    "success_rate": 0.88
  }
]

Includes uncalled Agents

The result includes registered but uncalled Agents (calls=0), so the dashboard can display the full Agent list.
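Because uncalled Agents appear with calls=0, a dashboard or script that flags unhealthy Agents should skip them rather than treat a missing success rate as a failure. The sketch below shows one way to do that; the function name and the 0.9 threshold are assumptions for illustration, and the field names come from the sample response above.

```python
from typing import Any, Dict, Iterable, List


def flag_unhealthy_agents(
    stats: Iterable[Dict[str, Any]],
    min_success_rate: float = 0.9,  # assumed threshold, tune per deployment
) -> List[str]:
    """Return the names of Agents whose success rate is below the threshold."""
    flagged: List[str] = []
    for entry in stats:
        if entry.get("calls", 0) == 0:
            # Registered but never called: there is nothing to judge yet.
            continue
        if entry.get("success_rate", 1.0) < min_success_rate:
            flagged.append(entry["agent"])
    return flagged
```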

Other Monitoring Endpoints

Endpoint	Description
`GET /api/v1/monitor/overview`	System overview: total traces, success rate, average duration, active sessions
`GET /api/v1/monitor/sessions`	Active session list, sorted by last-active time descending

Circuit Breakers and Fallback

The system configures circuit breakers for external dependencies such as the LLM, vector store, and business systems. After consecutive failures reach the threshold, the breaker opens automatically, degrading to a fallback result to avoid cascading failures.

View Circuit Breaker State: GET /api/v1/observability/circuit-breakers

curl http://localhost:8000/api/v1/observability/circuit-breakers

{
  "llm": {
    "state": "closed",
    "failure_count": 0,
    "failure_threshold": 5,
    "recovery_timeout": 60,
    "last_failure": null
  },
  "vector_store": {
    "state": "open",
    "failure_count": 7,
    "failure_threshold": 5,
    "recovery_timeout": 60,
    "last_failure": "2026-07-03T10:00:00Z"
  }
}

State	Meaning
`closed`	Closed (calls pass through normally)
`open`	Open (tripped; calls degrade to the fallback)
`half_open`	Half-open (probe calls are let through to test recovery)
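The three states fit together as a small state machine. The class below is a generic circuit-breaker sketch, not the system's implementation: the class and method names are hypothetical, `recovery_timeout` is assumed to be in seconds, and the half-open state is simplified to let every call through until one succeeds or fails.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal closed / open / half_open state machine (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,  # assumed to be seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = "closed"
        self.failure_count = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return True if a call may go to the dependency, False to use the fallback."""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "half_open"  # timeout elapsed: start probing
                return True
            return False
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failure_count = 0

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def reset(self) -> None:
        """Force the breaker closed, regardless of the recovery timeout."""
        self.state = "closed"
        self.failure_count = 0
        self._opened_at = None
```

Injecting the clock keeps the timeout logic testable without sleeping; a production breaker would also need locking if it is shared across threads.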

Manually Reset a Circuit Breaker: POST /api/v1/observability/circuit-breakers/{name}/reset

Forces recovery after ops intervention, without waiting for the recovery timeout:

curl -X POST http://localhost:8000/api/v1/observability/circuit-breakers/llm/reset

{"name": "llm", "reset": true}

Manual reset risk

A manual reset immediately sets the breaker to closed, but if the underlying service has not recovered, the breaker will trip again. Confirm that the dependency is healthy before resetting.

Monitoring Alerts

The alert engine detects anomalies by rules and generates alerts. It supports filtering by level and source.

Query Alerts: GET /api/v1/observability/alerts

# Filter by level
curl "http://localhost:8000/api/v1/observability/alerts?level=error"

# Filter by source
curl "http://localhost:8000/api/v1/observability/alerts?source=token_usage"

# Filter by time
curl "http://localhost:8000/api/v1/observability/alerts?since=2026-07-03T00:00:00Z"

[
  {
    "level": "error",
    "source": "token_usage",
    "message": "Token usage exceeded 80% of the daily budget",
    "timestamp": "2026-07-03T14:00:00Z"
  }
]

Alert levels

info: informational alert (e.g., cache hit rate drop)
warn: warning (e.g., response time increase)
error: error (e.g., circuit breaker open)
critical: critical (e.g., key service unavailable)

Invalid level values return 400.
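The filtering and validation rules can be sketched as a plain function over alert records. This is an illustrative sketch only: `filter_alerts` is a hypothetical name, level filtering is assumed to be an exact match, and the `ValueError` stands in for whatever the endpoint does to produce its 400 response.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Optional

VALID_LEVELS = ("info", "warn", "error", "critical")


def _parse_ts(value: str) -> datetime:
    # Normalize a trailing "Z" so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_alerts(
    alerts: Iterable[Dict[str, Any]],
    level: Optional[str] = None,
    source: Optional[str] = None,
    since: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Filter alerts by level, source, and start time; reject unknown levels."""
    if level is not None and level not in VALID_LEVELS:
        raise ValueError(f"invalid level: {level!r}")  # would map to HTTP 400
    since_dt = _parse_ts(since) if since else None

    matched: List[Dict[str, Any]] = []
    for alert in alerts:
        if level is not None and alert["level"] != level:
            continue
        if source is not None and alert["source"] != source:
            continue
        if since_dt is not None and _parse_ts(alert["timestamp"]) < since_dt:
            continue
        matched.append(alert)
    return matched
```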

Token Usage Tracking

GET /api/v1/observability/token-usage

Returns token usage statistics for the specified window:

# Per-minute stats
curl "http://localhost:8000/api/v1/observability/token-usage?window=minute"

# Per-hour stats (default)
curl http://localhost:8000/api/v1/observability/token-usage

# Per-day stats
curl "http://localhost:8000/api/v1/observability/token-usage?window=day"

{
  "window": "hour",
  "total_tokens": 125000,
  "prompt_tokens": 98000,
  "completion_tokens": 27000,
  "estimated_cost_usd": 0.42,
  "by_model": {
    "gpt-4o-mini": {"tokens": 95000, "cost": 0.14},
    "gpt-4o": {"tokens": 30000, "cost": 0.28}
  }
}

window values

window supports minute / hour / day; other values return 400. We recommend monitoring the daily budget by hour and investigating bursts by minute.
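When investigating a burst, the `by_model` breakdown in the response shows which model dominates the cost. The helper below is a small sketch that works on the parsed JSON; `top_cost_model` is a hypothetical name, and it assumes the response shape shown in the sample above.

```python
from typing import Any, Dict, Optional, Tuple


def top_cost_model(usage: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (model name, share of total cost) for the most expensive model."""
    by_model = usage.get("by_model") or {}
    total_cost = sum(entry["cost"] for entry in by_model.values())
    if not by_model or total_cost <= 0:
        return None
    name = max(by_model, key=lambda model: by_model[model]["cost"])
    return name, by_model[name]["cost"] / total_cost
```

With the sample response, gpt-4o accounts for about two thirds of the cost while using under a quarter of the tokens, which is the kind of imbalance worth checking against model routing.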

Token Budget Alerts

Combined with the alert engine, an error-level alert is generated automatically when token usage exceeds the budget threshold. It can be queried via /api/v1/observability/alerts?source=token_usage.
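A budget rule of this kind can be sketched as a single check that returns an alert record or nothing. The function name, the daily budget parameter, and the 0.8 default threshold are assumptions for illustration (the threshold mirrors the "80% of the daily budget" message in the sample alert); the actual rule configuration is not described here.

```python
from typing import Any, Dict, Optional


def check_token_budget(
    used_tokens: int,
    daily_budget: int,
    timestamp: str,
    threshold: float = 0.8,  # assumed fraction of the budget that triggers the alert
) -> Optional[Dict[str, Any]]:
    """Return an error-level alert when usage exceeds threshold * budget, else None."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    if used_tokens <= daily_budget * threshold:
        return None
    return {
        "level": "error",
        "source": "token_usage",
        "message": f"Token usage exceeded {threshold:.0%} of the daily budget",
        "timestamp": timestamp,
    }
```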

Typical Troubleshooting Scenarios

Scenario 1: Response Time Spike

# 1. View the system overview to confirm average duration and success rate
curl http://localhost:8000/api/v1/monitor/overview

# 2. View recent traces to locate slow queries
curl "http://localhost:8000/api/v1/monitor/traces?limit=10"

# 3. View a specific trace's details to find the slowest step (replace trace-xxx with a trace_id from step 2)
curl http://localhost:8000/api/v1/monitor/traces/trace-xxx

# 4. Check whether a circuit breaker has opened
curl http://localhost:8000/api/v1/observability/circuit-breakers

Scenario 2: LLM Service Unavailable

# 1. Run a health check to confirm the LLM status
curl http://localhost:8000/api/v1/observability/health

# 2. Check whether the breaker has opened automatically
curl http://localhost:8000/api/v1/observability/circuit-breakers | jq .llm

# 3. Manually reset after the service recovers (optional)
curl -X POST http://localhost:8000/api/v1/observability/circuit-breakers/llm/reset

Scenario 3: Token Cost Anomaly

# 1. View daily token usage
curl "http://localhost:8000/api/v1/observability/token-usage?window=day"

# 2. View token-related alerts
curl "http://localhost:8000/api/v1/observability/alerts?source=token_usage"

# 3. In the Langfuse dashboard, investigate high-cost prompts by prompt name

Next Steps

Performance Optimization Tutorial: performance metrics and cache tuning
Operations Management Tutorial: operations dashboards and go-live checks
Chat Endpoint Tutorial: when traces are generated in a conversation