Intent recognition for voice AI — 50× faster than LLM routing
<100 ms latency · 50+ intents · 1 GPU · 3 commands to run
Quick Start · API Docs · Realtime (main project) · Report Bug
curl -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{"conversation": [{"role": "user", "content": "Will it rain in Beijing tomorrow?"}]}'
# → {"intent": "agent.information.weather"}  # 48 ms

Tip
RealtimeIntent is the intent router for SquadyAI Realtime — the open-source voice AI engine (Rust, <450 ms end-to-end, 100+ concurrent sessions). Use them together for a complete voice assistant, or use RealtimeIntent standalone as a drop-in intent API for any chatbot.
| | LLM Function Calling | RealtimeIntent |
|---|---|---|
| Latency | 500–2 000 ms (full LLM inference) | < 100 ms (embed → vector search → rerank) |
| Cost | Token cost per request | Zero API cost — runs on local 0.6B–4B models |
| Determinism | Prompt-sensitive, may drift | Same input → same output, always |
| Customization | Edit prompts and hope | Add examples via API, instant effect |
| Multi-turn | Context window fills up | Async LLM summary keeps context compact |
RealtimeIntent handles the "what does the user want?" question so your LLM can focus on "how to respond" — faster, cheaper, and more reliably.
"Will it rain in Beijing tomorrow?"
│
▼
┌─────────────┐ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐
│ Embedding │────▶│ Qdrant Vector │────▶│ Reranker │────▶│ Intent Label │
│ (4B model) │ │ Search top-K │ │ (0.6B model) │ │ + Score │
└─────────────┘ └────────────────┘ └──────────────┘ └──────────────┘
│ async
▼
┌──────────────┐
│ LLM Summary │ (optional)
│ for context │ multi-turn
└──────────────┘
No training required. Add intent examples via API or seed script — the system learns from examples, not fine-tuning.
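The retrieve-then-rerank flow above can be sketched in a few lines. This is purely illustrative: a toy bag-of-words cosine score stands in for both the 4B embedding model with Qdrant search and the neural reranker, and the example store and threshold are made up for the demo.

```python
# Illustrative sketch of embed → vector search → rerank → threshold.
# Toy scoring replaces the real embedding model, Qdrant, and reranker.
import math

EXAMPLES = [
    ("agent.information.weather", "will it rain tomorrow"),
    ("agent.music.play", "play some jazz"),
    ("agent.calendar.set", "add a meeting at 3pm"),
]

def embed(text):
    # Toy bag-of-words "embedding" (stand-in for the real model).
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query, top_k=2, threshold=0.2):
    qv = embed(query)
    # 1. Vector search: keep the top-K candidate examples.
    candidates = sorted(EXAMPLES, key=lambda e: cosine(qv, embed(e[1])), reverse=True)[:top_k]
    # 2. Rerank: rescore the candidates and keep the best match.
    label, text = max(candidates, key=lambda e: cosine(qv, embed(e[1])))
    # 3. Below the threshold, fall back to __no_intent__.
    return label if cosine(qv, embed(text)) >= threshold else "__no_intent__"

print(route("will it rain in Beijing tomorrow"))  # agent.information.weather
```

The real service swaps each toy stage for a model call, but the control flow — retrieve candidates cheaply, rerank the shortlist, apply a no-intent threshold — is the same shape.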
┌──────────┐ ┌──────────┐ ┌───────────────────┐ ┌─────────┐ ┌──────────┐
│ User │────▶│ VAD │────▶│ ASR │────▶│ LLM │────▶│ TTS │
│ Audio │ │ (Silero) │ │ (WhisperLive) │ │ (Qwen) │ │(MiniMax) │
└──────────┘ └──────────┘ └──────────┬────────┘ └────▲────┘ └──────────┘
│ │
▼ │
┌─────────────────┐ │
│ RealtimeIntent │──────────┘
│ < 100ms route │ agent.weather → call weather tool
└─────────────────┘ agent.music → call music tool
__no_intent__ → just chat
See SquadyAI Realtime for the full voice conversation engine that orchestrates this pipeline.
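On the application side, the routing step in the diagram typically reduces to a dispatch table keyed by intent label. A minimal sketch (handler names and labels here are illustrative, not part of the RealtimeIntent API):

```python
# Hypothetical dispatch table: intent label → tool handler.
def call_weather_tool(query):
    return f"weather({query})"

def call_music_tool(query):
    return f"music({query})"

HANDLERS = {
    "agent.information.weather": call_weather_tool,
    "agent.music.play": call_music_tool,
}

def dispatch(intent, query):
    handler = HANDLERS.get(intent)
    if handler is None:
        # __no_intent__ or an unmapped label: let the LLM just chat.
        return "chat"
    return handler(query)

print(dispatch("agent.music.play", "play some jazz"))  # music(play some jazz)
print(dispatch("__no_intent__", "hmm"))                # chat
```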
50+ built-in intent categories (click to expand)
| Category | Intents |
|---|---|
| Weather & News | weather, news, date, time, currency, event, movie |
| Q&A | general, domain (wiki), daily, visual (camera) |
| Music & Media | play, query, control, setting, audiobook, podcast, radio |
| Calendar | query, set, remove |
| Reminders | query, set, remove |
| Volume | up, down, mute |
| Lists | query, set, remove |
| Navigation | direction, traffic, taxi, transit tickets |
| Smart Home | lights (on/off/dim/color), plugs, coffee, cleaning |
| Device | battery, camera (photo/video), recorder |
| Language | translate |
| Search | web search, stock, cooking/recipe |
| Social | post, query, email, contacts |
| General | greet, joke, goodbye, creative content, math |
| Special | __no_intent__ (noise / ASR false trigger) |
Custom intents can be added at runtime via the API — no retraining needed.
git clone https://github.com/SquadyAI/RealtimeIntent.git && cd RealtimeIntent
cp .env.example .env
docker compose -f docker-compose.gpu.yml up -d

Wait for models to load (~2 min), then seed the database:
pip install datasets
python scripts/seed/index_massive_to_qdrant.py  # loads ~4,600 intent examples from HuggingFace

Test:
curl -s -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{"conversation": [{"role": "user", "content": "Play some jazz"}]}' | python -m json.tool

GPU config: Both models default to GPU 0. Set `EMBEDDING_GPU=0` and `RERANKER_GPU=1` in `.env` for separate devices.
Already running embedding / reranker services? Just point to them:
git clone https://github.com/SquadyAI/RealtimeIntent.git && cd RealtimeIntent
cp .env.example .env
# Edit .env: set EMBEDDING_API_URL and RERANK_API_URL
docker compose up -d  # starts intent-service + Qdrant only

Or run the service directly without Docker:

pip install -r requirements.txt
export EMBEDDING_API_URL=http://localhost:30003/v1/embeddings
export RERANK_API_URL=http://localhost:30000/score
uvicorn app.main:app --host 0.0.0.0 --port 8000 --loop uvloop

No retraining needed. Add examples at runtime:
# Single
curl -X POST http://localhost:8000/insert_entry \
-H "Content-Type: application/json" \
-d '{"label": "agent.smart_home.light", "text": "Turn off the living room lights"}'
# Batch
curl -X POST http://localhost:8000/batch_insert \
-H "Content-Type: application/json" \
-d '{
"items": [
{"label": "agent.smart_home.light", "text": "把卧室灯调暗一点"},
{"label": "agent.smart_home.light", "text": "Dim the bedroom lights"},
{"label": "agent.smart_home.ac", "text": "Set AC to 24 degrees"}
]
}'

Or bulk-load from HuggingFace MASSIVE (supports zh-CN, en-US, ja-JP, ko-KR, and more):
python scripts/seed/index_massive_to_qdrant.py --dataset-name SetFit/amazon_massive_intent_en-US

Interactive docs at http://localhost:8000/docs (Swagger UI).
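For larger custom datasets, `/batch_insert` beats calling `/insert_entry` in a loop. A minimal chunking client might look like the following; `post_batch` stands in for the HTTP call (here a stub, so the chunking logic can be shown without a running server):

```python
# Sketch: load many examples via /batch_insert in fixed-size chunks.
# post_batch is any callable that POSTs {"items": [...]} to the endpoint
# and returns its JSON; a counting stub replaces the real HTTP call here.
def bulk_insert(post_batch, items, chunk_size=100):
    inserted = 0
    for i in range(0, len(items), chunk_size):
        resp = post_batch({"items": items[i:i + chunk_size]})
        inserted += resp["count"]
    return inserted

calls = []
def fake_post(body):
    calls.append(len(body["items"]))
    return {"success": True, "count": len(body["items"])}

items = [{"label": "agent.smart_home.light", "text": f"example {n}"} for n in range(250)]
print(bulk_insert(fake_post, items, chunk_size=100), calls)  # 250 [100, 100, 50]
```

In a real client, `post_batch` would wrap `requests.post("http://localhost:8000/batch_insert", json=body).json()`.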
| Parameter | Type | Required | Description |
|---|---|---|---|
| `conversation` | `List[Dict]` | Yes | Conversation history in `[{"role": "user", "content": "..."}]` format |
| `timeout` | `float` | No | Request timeout in seconds |
| `debug` | `bool` | No | Return match details (score, source) |
curl -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{
"conversation": [
{"role": "user", "content": "What is the weather like today?"},
{"role": "assistant", "content": "Sunny, 25°C."},
{"role": "user", "content": "What about tomorrow?"}
],
"debug": true
}'

{
"intent": "agent.information.weather",
"payload": {
"payload": { "label": "agent.information.weather", "text": "明天天气怎么样", "source_file": "SetFit/amazon_massive_intent_zh-CN" },
"score": 0.766
}
}

Multi-turn works out of the box — the optional LLM summary keeps context across turns ("What about tomorrow?" → still weather).
GET /labels — List all intents
curl http://localhost:8000/labels
# → {"success": true, "count": 4, "labels": ["agent.information.weather", ...]}

POST /insert_entry — Add one example
curl -X POST http://localhost:8000/insert_entry \
-H "Content-Type: application/json" \
-d '{"label": "agent.information.weather", "text": "明天会下雨吗"}'
# → {"success": true, "point_id": "..."}

POST /batch_insert — Add examples in bulk
curl -X POST http://localhost:8000/batch_insert \
-H "Content-Type: application/json" \
-d '{"items": [
{"label": "agent.information.weather", "text": "今天天气怎么样"},
{"label": "agent.conversation.end", "text": "再见,下次聊"}
]}'
# → {"success": true, "count": 2, "point_ids": ["...", "..."]}

POST /get_label_content — Browse examples by label
| Parameter | Type | Required | Description |
|---|---|---|---|
| `label` | `string` | Yes | Intent label to query |
| `limit` | `int` | No | Page size (1–100) |
| `offset` | `string` | No | Pagination cursor from previous response |
curl -X POST http://localhost:8000/get_label_content \
-H "Content-Type: application/json" \
-d '{"label": "agent.calendar.set", "limit": 10}'
# → {"success": true, "count": 10, "has_next": true, "next_page_offset": "...", "points": [...]}

DELETE /delete_point/{point_id} — Remove one example
curl -X DELETE http://localhost:8000/delete_point/123e4567-e89b-12d3-a456-426614174000

POST /batch_delete_points — Batch remove
curl -X POST http://localhost:8000/batch_delete_points \
-H "Content-Type: application/json" \
-d '{"point_ids": ["id1", "id2"]}'

POST /delete_by_label — Remove all examples for a label
Warning: This deletes all data under the label and cannot be undone.
curl -X POST http://localhost:8000/delete_by_label \
-H "Content-Type: application/json" \
-d '{"label": "agent.calendar.set", "confirm": true}'

Monitoring
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Service health check |
| `GET` | `/metrics` | Latency percentiles, cache hit rates |
| `GET` | `/stats` | Detailed service statistics |
| Status | Description |
|---|---|
| `200` | Success |
| `400` | Bad request — invalid params or missing required fields |
| `404` | Resource not found |
| `408` | Timeout — processing exceeded the specified timeout |
| `500` | Internal error — dependency or database failure |
Error response format: {"detail": "error message"}
Python
import requests
def detect_intent(conversation, timeout=5.0):
resp = requests.post("http://localhost:8000/intent", json={
"conversation": conversation,
"timeout": timeout,
})
resp.raise_for_status()
return resp.json()["intent"]
intent = detect_intent([
{"role": "user", "content": "今天天气怎么样?"},
{"role": "assistant", "content": "今天天气晴朗。"},
{"role": "user", "content": "明天呢?"},
])
print(intent)  # agent.information.weather

JavaScript
async function detectIntent(conversation, timeout = 5.0) {
const res = await fetch("http://localhost:8000/intent", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ conversation, timeout }),
});
if (!res.ok) throw new Error(`${res.status} ${await res.text()}`);
return (await res.json()).intent;
}
const intent = await detectIntent([
{ role: "user", content: "今天天气怎么样?" },
{ role: "assistant", content: "今天天气晴朗。" },
{ role: "user", content: "明天呢?" },
]);
console.log(intent); // agent.information.weather

- Conversation context: Provide 2–3 turns for best accuracy on follow-up queries
- Batch operations: Use `/batch_insert` for bulk data loading instead of calling `/insert_entry` in a loop
- Pagination: Use `/get_label_content` with `limit` + `offset` for large datasets
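The pagination pattern can be wrapped in a small generator that follows `next_page_offset` until `has_next` is false. `fetch_page` below is any callable that POSTs to `/get_label_content`; a stub replaces the live server here so the loop itself can be shown:

```python
# Sketch: iterate all examples for a label via cursor pagination.
def iter_label_points(fetch_page, label, limit=10):
    offset = None
    while True:
        resp = fetch_page({"label": label, "limit": limit, "offset": offset})
        yield from resp["points"]
        if not resp.get("has_next"):
            break
        offset = resp["next_page_offset"]

# Stub standing in for requests.post(".../get_label_content", json=body).json().
PAGES = [
    {"points": ["p1", "p2"], "has_next": True, "next_page_offset": "cur1"},
    {"points": ["p3"], "has_next": False},
]
def fake_fetch(body):
    return PAGES[0] if body["offset"] is None else PAGES[1]

print(list(iter_label_points(fake_fetch, "agent.calendar.set", limit=2)))
# ['p1', 'p2', 'p3']
```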
All via environment variables (.env.example has the full list):
| Variable | Default | What it does |
|---|---|---|
| `EMBEDDING_API_URL` | `http://localhost:30003/v1/embeddings` | OpenAI-compatible embedding endpoint |
| `EMBEDDING_MODEL` | `Qwen/Qwen3-Embedding-4B` | Model name sent to embedding API |
| `RERANK_API_URL` | `http://localhost:30000/score` | Cross-encoder reranker endpoint |
| `RERANK_MODEL` | `Qwen3-Reranker-4B` | Model name sent to reranker API |
| `QDRANT_HOST` | `localhost` | Qdrant server |
| `QDRANT_COLLECTION` | `massive_intents` | Collection name |
| `TOP_K` | `6` | Retrieval candidates before reranking |
| `NO_INTENT_RERANK_THRESHOLD` | `0.55` | Below this score → `__no_intent__` |
| `INSTRUCT_API_URL` | `http://localhost:8001/v1/chat/completions` | (optional) LLM for multi-turn summary |
| `LOG_LEVEL` | `INFO` | Logging verbosity |
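The effect of `NO_INTENT_RERANK_THRESHOLD` mirrors the documented behavior: if the reranker's best score falls below the threshold, the service answers `__no_intent__` instead of the top label. An illustrative sketch (the real logic lives inside the service):

```python
# How the no-intent threshold gates the reranker's top result.
def apply_threshold(best_label, rerank_score, threshold=0.55):
    return best_label if rerank_score >= threshold else "__no_intent__"

print(apply_threshold("agent.information.weather", 0.766))  # agent.information.weather
print(apply_threshold("agent.information.weather", 0.40))   # __no_intent__
```

Raising the threshold makes the router stricter (more noise and ASR false triggers rejected); lowering it makes it more permissive.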
docker-compose.gpu.yml handles this automatically. For manual deployment:
| Service | Model | Framework | Command |
|---|---|---|---|
| Embedding | Qwen3-Embedding-4B | SGLang | python -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-4B --port 30003 --is-embedding |
| Reranker | Qwen3-Reranker-0.6B | vLLM | vllm serve Qwen/Qwen3-Reranker-0.6B --task score --port 30000 |
| LLM (optional) | Qwen3-4B-AWQ | Any OpenAI-compatible | — |
Any OpenAI-compatible embedding API and cross-encoder reranker work as drop-in replacements.
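For a drop-in embedding replacement, the request shape is the standard OpenAI embeddings payload (`model` plus `input`), sent to whatever `EMBEDDING_API_URL` points at. A sketch of constructing that request, using the defaults from the config table above:

```python
# Build an OpenAI-compatible embeddings request (payload shape per the
# OpenAI embeddings API; URL and model are the defaults from the config table).
import json

def embedding_request(texts, model="Qwen/Qwen3-Embedding-4B",
                      url="http://localhost:30003/v1/embeddings"):
    return {
        "url": url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"model": model, "input": texts}),
    }

req = embedding_request(["will it rain tomorrow"])
print(json.loads(req["body"])["model"])  # Qwen/Qwen3-Embedding-4B
```

Any endpoint that accepts this payload and returns the standard `data[*].embedding` response should work without code changes.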
RealtimeIntent is one piece of a larger open-source voice AI platform:
| Project | What it does | Link |
|---|---|---|
| RealtimeAPI | Core voice AI engine — ASR→LLM→TTS pipeline orchestration in Rust, <450 ms E2E | SquadyAI/RealtimeAPI |
| RealtimeIntent | Intent classification — vector search + neural reranking, <100 ms | you are here |
| RealtimeSearch | Multi-engine search gateway with automatic failover | SquadyAI/RealtimeSearch |
┌─────────────────────────────────── RealtimeAPI ───────────────────────────────────┐
│ │
│ Audio ──▶ VAD ──▶ ASR ──▶ ┌──────────────────┐ ──▶ LLM ──▶ TTS ──▶ Audio │
│ │ ★ RealtimeIntent │ │ │
│ │ < 100ms intent │ │ │
│ └──────────────────┘ │ │
│ ┌────▼───────────────┐ │
│ │ RealtimeSearch │ │
│ │ web search tool │ │
│ └────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────────┘
Important
If you find RealtimeIntent useful, check out RealtimeAPI — the full voice conversation engine that powers real-time voice assistants with <450 ms latency and 100+ concurrent sessions.
Built by SquadyAI contributors · Part of the RealtimeAPI voice AI stack