The service_chat provides an AI-powered assistant for patient healthcare questions using RAG (Retrieval Augmented Generation).
Base URL: http://localhost:8002 (local development)
GET /health- Health checkPOST /triage- AI-powered patient assistance
Basic health check to verify the service is running.
Request:
curl http://localhost:8002/healthResponse:
{
"status": "healthy",
"service": "service_chat"
}The main AI assistant endpoint. Accepts a patient MRN and a question, retrieves relevant patient data from the DB API, and generates an AI-powered response.
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
patient_mrn |
string | Yes | Patient's medical record number |
query |
string | Yes | The user's question about their health |
Request:
curl -X POST http://localhost:8002/triage \
-H "Content-Type: application/json" \
-d '{
"patient_mrn": "P000123",
"query": "Why did my doctor change my diabetes medication?"
}'Response (Mock LLM Mode):
{
"response": "This is a mock response from the AI assistant. In production, this would be replaced with a real LLM response.",
"trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"patient_mrn": "P000123",
"llm_mode": "mock",
"conversation_id": "CONV-2024-01-20-P000123-a1b2c3d4"
}Response (Real LLM Mode - Qwen3-4B-Thinking-2507):
{
"response": "Based on your medical records, I can see that your doctor changed your diabetes medication during your visit on January 15, 2024. Your recent HbA1c test showed a level of 7.2%, which indicates your diabetes is reasonably well controlled. However, the medication change from Metformin 500mg to Metformin 850mg was likely made to achieve even better blood sugar control and bring your HbA1c closer to the target range of below 7%. This adjustment is a common practice when patients are tolerating their current medication well but could benefit from slightly more aggressive management. If you have concerns about this change or experience any side effects, I recommend discussing them with Dr. Sarah Chen at your next appointment.",
"trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"patient_mrn": "P000123",
"llm_mode": "Qwen3-4B-Thinking-2507",
"conversation_id": "CONV-2024-01-20-P000123-b2c3d4e5"
}A convenient Makefile target is provided for testing the triage endpoint:
# Test with default query ("What are my current medications?")
make test-triage
# Test with a custom query using the 'm' parameter
make test-triage m='Why did my doctor change my diabetes medication?'
# More examples
make test-triage m='What is my diagnosis?'
make test-triage m='When is my next appointment?'import httpx
response = httpx.post(
"http://localhost:8002/triage",
json={
"patient_mrn": "P000123",
"query": "What are my current medications?"
}
)
data = response.json()
print(f"Response: {data['response']}")
print(f"Trace ID: {data['trace_id']}")| Field | Type | Description |
|---|---|---|
response |
string | The AI-generated response to the user's query |
trace_id |
string | Unique identifier for request tracing and debugging |
patient_mrn |
string | Echo of the patient MRN from the request |
llm_mode |
string | The LLM mode used (mock or Qwen3-4B-Thinking-2507) |
conversation_id |
string | ID of the stored chat log (can be used to retrieve the interaction via GET /chat-logs/{conversation_id}) |
When the specified patient MRN doesn't exist in the database.
Request:
curl -X POST http://localhost:8002/triage \
-H "Content-Type: application/json" \
-d '{
"patient_mrn": "INVALID_MRN",
"query": "What is my diagnosis?"
}'Response:
{
"detail": "Patient with MRN INVALID_MRN not found"
}When the request body is missing required fields.
Request:
curl -X POST http://localhost:8002/triage \
-H "Content-Type: application/json" \
-d '{
"patient_mrn": "P000123"
}'Response:
{
"detail": [
{
"loc": ["body", "query"],
"msg": "field required",
"type": "value_error.missing"
}
]
}When the chat service cannot reach the database API.
Response:
{
"detail": "Database API service unavailable"
}- Request Received: The
/triageendpoint receives a patient MRN and query - Trace Started: A unique trace ID is generated for request tracking
- Patient Data Fetched: The service calls
service_db_apito get the patient summary - Prompt Built: Patient data is formatted into a prompt context for the LLM
- LLM Response Generated:
- Mock mode: Returns a placeholder response (fast, for testing)
- Qwen mode: Generates a real response using the Qwen3-4B model (slower, ~5-15s on CPU)
- Chat Log Stored: The interaction (query + response + retrieval events) is stored to MongoDB
- Response Returned: The AI response is returned with trace ID and conversation ID
After a triage request, you can retrieve the stored chat log using the conversation_id:
# Get the specific chat log
curl http://localhost:8001/chat-logs/CONV-2024-01-20-P000123-a1b2c3d4
# List all chat logs for a patient
curl "http://localhost:8001/chat-logs?patient_mrn=P000123"The chat log includes:
- The user query and AI response
- Retrieval events (what data was fetched to answer the query)
- Timestamps and latency metrics
- Trace ID for debugging
The chat service behavior is controlled by environment variables:
| Variable | Default | Description |
|---|---|---|
LLM_MODE |
mock |
LLM mode (see options below) |
DB_API_BASE_URL |
http://localhost:8001 |
URL of the database API service |
MODEL_CACHE_DIR |
./models |
Directory for downloaded models |
VECTOR_MODE |
mock |
Vector DB mode (future use) |
LOG_LEVEL |
INFO |
Logging verbosity |
| Mode | Backend | Description | Typical Latency |
|---|---|---|---|
mock |
- | Returns static test response | <100ms |
gguf |
llama-cpp-python | Fast quantized inference (recommended) | ~2-3 min on CPU |
qwen |
transformers | HuggingFace transformers | ~6-7 min on CPU |
Qwen3-4B-Thinking-2507 |
transformers | Full model name (same as qwen) |
~6-7 min on CPU |
| Variable | Default | Description |
|---|---|---|
GGUF_MODEL_REPO |
Qwen/Qwen2.5-1.5B-Instruct-GGUF |
HuggingFace repo for GGUF model |
GGUF_MODEL_FILE |
qwen2.5-1.5b-instruct-q4_k_m.gguf |
Specific GGUF file to download |
GGUF_N_CTX |
4096 |
Context window size |
GGUF_N_THREADS |
4 |
Number of CPU threads for inference |
GGUF_MAX_TOKENS |
256 |
Maximum tokens to generate |
Fast local development (recommended):
LLM_MODE=gguf make run-chatTesting with mock responses:
LLM_MODE=mock make run-chatUsing transformers backend:
LLM_MODE=qwen make run-chat
# or
LLM_MODE=Qwen3-4B-Thinking-2507 make run-chatCustom GGUF model:
GGUF_MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF \
GGUF_MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf \
LLM_MODE=gguf make run-chatEvery request generates a unique trace_id that can be used for:
- Debugging issues in logs
- Correlating requests across services
- Performance monitoring
Search logs by trace ID:
grep "a1b2c3d4-e5f6-7890-abcd-ef1234567890" logs/*.log| LLM Mode | Model | Typical Latency | Notes |
|---|---|---|---|
mock |
- | <100ms | For testing and development |
gguf |
Qwen2.5-1.5B-Instruct | ~2-3 min | Recommended for CPU inference |
qwen |
Qwen3-4B-Thinking | ~6-7 min | Slower, higher quality reasoning |
Benchmark results (MacBook Pro, CPU):
- GGUF (1.5B params, Q4_K_M): 156s for 256 tokens
- Transformers (4B params, FP16): ~6-7 min for 128 tokens
For production deployments:
- Recommended: Use
ggufmode with smaller quantized models - Model is cached after first load
- GGUF uses Metal GPU on Mac for acceleration
- Consider GPU instances for faster inference in cloud
- Database API Documentation - Data endpoints used by this service
- Model Management - How to configure and deploy LLM models
- AI Service Upgrade Guide - Deploying with real LLM