Intent recognition for voice AI — 50× faster than LLM routing
<100 ms latency · 50+ intents · 1 GPU · 3 commands to run
Quick Start · API Docs · Realtime (main project) · Report Bug
curl -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{"conversation": [{"role": "user", "content": "Will it rain in Beijing tomorrow?"}]}'
# → {"intent": "agent.information.weather"}  # 48 ms

Tip
RealtimeIntent is the intent router for SquadyAI Realtime — the open-source voice AI engine (Rust, <450 ms end-to-end, 100+ concurrent sessions). Use them together for a complete voice assistant, or use RealtimeIntent standalone as a drop-in intent API for any chatbot.
| | LLM Function Calling | RealtimeIntent |
|---|---|---|
| Latency | 500–2 000 ms (full LLM inference) | < 100 ms (embed → vector search → rerank) |
| Cost | Token cost per request | Zero API cost — runs on local 0.6B–4B models |
| Determinism | Prompt-sensitive, may drift | Same input → same output, always |
| Customization | Edit prompts and hope | Add examples via API, instant effect |
| Multi-turn | Context window fills up | Async LLM summary keeps context compact |
RealtimeIntent handles the "what does the user want?" question so your LLM can focus on "how to respond" — faster, cheaper, and more reliably.
"Will it rain in Beijing tomorrow?"
│
▼
┌─────────────┐ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐
│ Embedding │────▶│ Qdrant Vector │────▶│ Reranker │────▶│ Intent Label │
│ (4B model) │ │ Search top-K │ │ (0.6B model) │ │ + Score │
└─────────────┘ └────────────────┘ └──────────────┘ └──────────────┘
│ async
▼
┌──────────────┐
│ LLM Summary │ (optional)
│ for context │ multi-turn
└──────────────┘
No training required. Add intent examples via API or seed script — the system learns from examples, not fine-tuning.
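The retrieve-then-rerank flow above can be sketched in a few lines. This is purely illustrative: a toy bag-of-words cosine score stands in for both the 4B embedding model with Qdrant search and the neural reranker, and the example store and threshold are made up for the demo.

```python
# Illustrative sketch of embed → vector search → rerank → threshold.
# Toy scoring replaces the real embedding model, Qdrant, and reranker.
import math

EXAMPLES = [
    ("agent.information.weather", "will it rain tomorrow"),
    ("agent.music.play", "play some jazz"),
    ("agent.calendar.set", "add a meeting at 3pm"),
]

def embed(text):
    # Toy bag-of-words "embedding" (stand-in for the real model).
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query, top_k=2, threshold=0.2):
    qv = embed(query)
    # 1. Vector search: keep the top-K candidate examples.
    candidates = sorted(EXAMPLES, key=lambda e: cosine(qv, embed(e[1])), reverse=True)[:top_k]
    # 2. Rerank: rescore the candidates and keep the best match.
    label, text = max(candidates, key=lambda e: cosine(qv, embed(e[1])))
    # 3. Below the threshold, fall back to __no_intent__.
    return label if cosine(qv, embed(text)) >= threshold else "__no_intent__"

print(route("will it rain in Beijing tomorrow"))  # agent.information.weather
```

The real service swaps each toy stage for a model call, but the control flow — retrieve candidates cheaply, rerank the shortlist, apply a no-intent threshold — is the same shape.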
┌──────────┐ ┌──────────┐ ┌───────────────────┐ ┌─────────┐ ┌──────────┐
│ User │────▶│ VAD │────▶│ ASR │────▶│ LLM │────▶│ TTS │
│ Audio │ │ (Silero) │ │ (WhisperLive) │ │ (Qwen) │ │(MiniMax) │
└──────────┘ └──────────┘ └──────────┬────────┘ └────▲────┘ └──────────┘
│ │
▼ │
┌─────────────────┐ │
│ RealtimeIntent │──────────┘
│ < 100ms route │ agent.weather → call weather tool
└─────────────────┘ agent.music → call music tool
__no_intent__ → just chat
See SquadyAI Realtime for the full voice conversation engine that orchestrates this pipeline.
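On the application side, the routing step in the diagram typically reduces to a dispatch table keyed by intent label. A minimal sketch (handler names and labels here are illustrative, not part of the RealtimeIntent API):

```python
# Hypothetical dispatch table: intent label → tool handler.
def call_weather_tool(query):
    return f"weather({query})"

def call_music_tool(query):
    return f"music({query})"

HANDLERS = {
    "agent.information.weather": call_weather_tool,
    "agent.music.play": call_music_tool,
}

def dispatch(intent, query):
    handler = HANDLERS.get(intent)
    if handler is None:
        # __no_intent__ or an unmapped label: let the LLM just chat.
        return "chat"
    return handler(query)

print(dispatch("agent.music.play", "play some jazz"))  # music(play some jazz)
print(dispatch("__no_intent__", "hmm"))                # chat
```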
50+ built-in intent categories (click to expand)
| Category | Intents |
|---|---|
| Weather & News | weather, news, date, time, currency, event, movie |
| Q&A | general, domain (wiki), daily, visual (camera) |
| Music & Media | play, query, control, setting, audiobook, podcast, radio |
| Calendar | query, set, remove |
| Reminders | query, set, remove |
| Volume | up, down, mute |
| Lists | query, set, remove |
| Navigation | direction, traffic, taxi, transit tickets |
| Smart Home | lights (on/off/dim/color), plugs, coffee, cleaning |
| Device | battery, camera (photo/video), recorder |
| Language | translate |
| Search | web search, stock, cooking/recipe |
| Social | post, query, email, contacts |
| General | greet, joke, goodbye, creative content, math |
| Special | __no_intent__ (noise / ASR false trigger) |
Custom intents can be added at runtime via the API — no retraining needed.
git clone https://github.com/SquadyAI/RealtimeIntent.git && cd RealtimeIntent
cp .env.example .env
docker compose -f docker-compose.gpu.yml up -d

Wait for models to load (~2 min), then seed the database:
pip install datasets
python scripts/seed/index_massive_to_qdrant.py  # loads ~4,600 intent examples from HuggingFace

Test:
curl -s -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{"conversation": [{"role": "user", "content": "Play some jazz"}]}' | python -m json.tool

GPU config: Both models default to GPU 0. Set `EMBEDDING_GPU=0` and `RERANKER_GPU=1` in `.env` for separate devices.
Already running embedding / reranker services? Just point to them:
git clone https://github.com/SquadyAI/RealtimeIntent.git && cd RealtimeIntent
cp .env.example .env
# Edit .env: set EMBEDDING_API_URL and RERANK_API_URL
docker compose up -d  # starts intent-service + Qdrant only

Or run the service directly without Docker:

pip install -r requirements.txt
export EMBEDDING_API_URL=http://localhost:30003/v1/embeddings
export RERANK_API_URL=http://localhost:30000/score
uvicorn app.main:app --host 0.0.0.0 --port 8000 --loop uvloop

No retraining needed. Add examples at runtime:
# Single
curl -X POST http://localhost:8000/insert_entry \
-H "Content-Type: application/json" \
-d '{"label": "agent.smart_home.light", "text": "Turn off the living room lights"}'
# Batch
curl -X POST http://localhost:8000/batch_insert \
-H "Content-Type: application/json" \
-d '{
"items": [
{"label": "agent.smart_home.light", "text": "把卧室灯调暗一点"},
{"label": "agent.smart_home.light", "text": "Dim the bedroom lights"},
{"label": "agent.smart_home.ac", "text": "Set AC to 24 degrees"}
]
}'

Or bulk-load from HuggingFace MASSIVE (supports zh-CN, en-US, ja-JP, ko-KR, and more):
python scripts/seed/index_massive_to_qdrant.py --dataset-name SetFit/amazon_massive_intent_en-US

Interactive docs at http://localhost:8000/docs (Swagger UI).
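For larger custom datasets, `/batch_insert` beats calling `/insert_entry` in a loop. A minimal chunking client might look like the following; `post_batch` stands in for the HTTP call (here a stub, so the chunking logic can be shown without a running server):

```python
# Sketch: load many examples via /batch_insert in fixed-size chunks.
# post_batch is any callable that POSTs {"items": [...]} to the endpoint
# and returns its JSON; a counting stub replaces the real HTTP call here.
def bulk_insert(post_batch, items, chunk_size=100):
    inserted = 0
    for i in range(0, len(items), chunk_size):
        resp = post_batch({"items": items[i:i + chunk_size]})
        inserted += resp["count"]
    return inserted

calls = []
def fake_post(body):
    calls.append(len(body["items"]))
    return {"success": True, "count": len(body["items"])}

items = [{"label": "agent.smart_home.light", "text": f"example {n}"} for n in range(250)]
print(bulk_insert(fake_post, items, chunk_size=100), calls)  # 250 [100, 100, 50]
```

In a real client, `post_batch` would wrap `requests.post("http://localhost:8000/batch_insert", json=body).json()`.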
| Parameter | Type | Required | Description |
|---|---|---|---|
| `conversation` | `List[Dict]` | Yes | Conversation history in `[{"role": "user", "content": "..."}]` format |
| `timeout` | `float` | No | Request timeout in seconds |
| `debug` | `bool` | No | Return match details (score, source) |
curl -X POST http://localhost:8000/intent \
-H "Content-Type: application/json" \
-d '{
"conversation": [
{"role": "user", "content": "What is the weather like today?"},
{"role": "assistant", "content": "Sunny, 25°C."},
{"role": "user", "content": "What about tomorrow?"}
],
"debug": true
}'

{
"intent": "agent.information.weather",
"payload": {
"payload": { "label": "agent.information.weather", "text": "明天天气怎么样", "source_file": "SetFit/amazon_massive_intent_zh-CN" },
"score": 0.766
}
}

Multi-turn works out of the box — the optional LLM summary keeps context across turns ("What about tomorrow?" → still weather).
GET /labels — List all intents
curl http://localhost:8000/labels
# → {"success": true, "count": 4, "labels": ["agent.information.weather", ...]}

POST /insert_entry — Add one example
curl -X POST http://localhost:8000/insert_entry \
-H "Content-Type: application/json" \
-d '{"label": "agent.information.weather", "text": "明天会下雨吗"}'
# → {"success": true, "point_id": "..."}

POST /batch_insert — Add examples in bulk
curl -X POST http://localhost:8000/batch_insert \
-H "Content-Type: application/json" \
-d '{"items": [
{"label": "agent.information.weather", "text": "今天天气怎么样"},
{"label": "agent.conversation.end", "text": "再见,下次聊"}
]}'
# → {"success": true, "count": 2, "point_ids": ["...", "..."]}

POST /get_label_content — Browse examples by label
| Parameter | Type | Required | Description |
|---|---|---|---|
| `label` | `string` | Yes | Intent label to query |
| `limit` | `int` | No | Page size (1–100) |
| `offset` | `string` | No | Pagination cursor from previous response |
curl -X POST http://localhost:8000/get_label_content \
-H "Content-Type: application/json" \
-d '{"label": "agent.calendar.set", "limit": 10}'
# → {"success": true, "count": 10, "has_next": true, "next_page_offset": "...", "points": [...]}

DELETE /delete_point/{point_id} — Remove one example
curl -X DELETE http://localhost:8000/delete_point/123e4567-e89b-12d3-a456-426614174000

POST /batch_delete_points — Batch remove
curl -X POST http://localhost:8000/batch_delete_points \
-H "Content-Type: application/json" \
-d '{"point_ids": ["id1", "id2"]}'

POST /delete_by_label — Remove all examples for a label
Warning: This deletes all data under the label and cannot be undone.
curl -X POST http://localhost:8000/delete_by_label \
-H "Content-Type: application/json" \
-d '{"label": "agent.calendar.set", "confirm": true}'

Monitoring
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Service health check |
| `GET` | `/metrics` | Latency percentiles, cache hit rates |
| `GET` | `/stats` | Detailed service statistics |
| Status | Description |
|---|---|
| `200` | Success |
| `400` | Bad request — invalid params or missing required fields |
| `404` | Resource not found |
| `408` | Timeout — processing exceeded the specified timeout |
| `500` | Internal error — dependency or database failure |
Error response format: {"detail": "error message"}
Python
import requests
def detect_intent(conversation, timeout=5.0):
resp = requests.post("http://localhost:8000/intent", json={
"conversation": conversation,
"timeout": timeout,
})
resp.raise_for_status()
return resp.json()["intent"]
intent = detect_intent([
{"role": "user", "content": "今天天气怎么样?"},
{"role": "assistant", "content": "今天天气晴朗。"},
{"role": "user", "content": "明天呢?"},
])
print(intent)  # agent.information.weather

JavaScript
async function detectIntent(conversation, timeout = 5.0) {
const res = await fetch("http://localhost:8000/intent", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ conversation, timeout }),
});
if (!res.ok) throw new Error(`${res.status} ${await res.text()}`);
return (await res.json()).intent;
}
const intent = await detectIntent([
{ role: "user", content: "今天天气怎么样?" },
{ role: "assistant", content: "今天天气晴朗。" },
{ role: "user", content: "明天呢?" },
]);
console.log(intent); // agent.information.weather

- Conversation context: Provide 2–3 turns for best accuracy on follow-up queries
- Batch operations: Use `/batch_insert` for bulk data loading instead of calling `/insert_entry` in a loop
- Pagination: Use `/get_label_content` with `limit` + `offset` for large datasets
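The pagination pattern can be wrapped in a small generator that follows `next_page_offset` until `has_next` is false. `fetch_page` below is any callable that POSTs to `/get_label_content`; a stub replaces the live server here so the loop itself can be shown:

```python
# Sketch: iterate all examples for a label via cursor pagination.
def iter_label_points(fetch_page, label, limit=10):
    offset = None
    while True:
        resp = fetch_page({"label": label, "limit": limit, "offset": offset})
        yield from resp["points"]
        if not resp.get("has_next"):
            break
        offset = resp["next_page_offset"]

# Stub standing in for requests.post(".../get_label_content", json=body).json().
PAGES = [
    {"points": ["p1", "p2"], "has_next": True, "next_page_offset": "cur1"},
    {"points": ["p3"], "has_next": False},
]
def fake_fetch(body):
    return PAGES[0] if body["offset"] is None else PAGES[1]

print(list(iter_label_points(fake_fetch, "agent.calendar.set", limit=2)))
# ['p1', 'p2', 'p3']
```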
All via environment variables (.env.example has the full list):
| Variable | Default | What it does |
|---|---|---|
| `EMBEDDING_API_URL` | `http://localhost:30003/v1/embeddings` | OpenAI-compatible embedding endpoint |
| `EMBEDDING_MODEL` | `Qwen/Qwen3-Embedding-4B` | Model name sent to embedding API |
| `RERANK_API_URL` | `http://localhost:30000/score` | Cross-encoder reranker endpoint |
| `RERANK_MODEL` | `Qwen3-Reranker-4B` | Model name sent to reranker API |
| `QDRANT_HOST` | `localhost` | Qdrant server |
| `QDRANT_COLLECTION` | `massive_intents` | Collection name |
| `TOP_K` | `6` | Retrieval candidates before reranking |
| `NO_INTENT_RERANK_THRESHOLD` | `0.55` | Below this score → `__no_intent__` |
| `INSTRUCT_API_URL` | `http://localhost:8001/v1/chat/completions` | (optional) LLM for multi-turn summary |
| `LOG_LEVEL` | `INFO` | Logging verbosity |
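The effect of `NO_INTENT_RERANK_THRESHOLD` mirrors the documented behavior: if the reranker's best score falls below the threshold, the service answers `__no_intent__` instead of the top label. An illustrative sketch (the real logic lives inside the service):

```python
# How the no-intent threshold gates the reranker's top result.
def apply_threshold(best_label, rerank_score, threshold=0.55):
    return best_label if rerank_score >= threshold else "__no_intent__"

print(apply_threshold("agent.information.weather", 0.766))  # agent.information.weather
print(apply_threshold("agent.information.weather", 0.40))   # __no_intent__
```

Raising the threshold makes the router stricter (more noise and ASR false triggers rejected); lowering it makes it more permissive.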
docker-compose.gpu.yml handles this automatically. For manual deployment:
| Service | Model | Framework | Command |
|---|---|---|---|
| Embedding | Qwen3-Embedding-4B | SGLang | python -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-4B --port 30003 --is-embedding |
| Reranker | Qwen3-Reranker-0.6B | vLLM | vllm serve Qwen/Qwen3-Reranker-0.6B --task score --port 30000 |
| LLM (optional) | Qwen3-4B-AWQ | Any OpenAI-compatible | — |
Any OpenAI-compatible embedding API and cross-encoder reranker work as drop-in replacements.
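For a drop-in embedding replacement, the request shape is the standard OpenAI embeddings payload (`model` plus `input`), sent to whatever `EMBEDDING_API_URL` points at. A sketch of constructing that request, using the defaults from the config table above:

```python
# Build an OpenAI-compatible embeddings request (payload shape per the
# OpenAI embeddings API; URL and model are the defaults from the config table).
import json

def embedding_request(texts, model="Qwen/Qwen3-Embedding-4B",
                      url="http://localhost:30003/v1/embeddings"):
    return {
        "url": url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"model": model, "input": texts}),
    }

req = embedding_request(["will it rain tomorrow"])
print(json.loads(req["body"])["model"])  # Qwen/Qwen3-Embedding-4B
```

Any endpoint that accepts this payload and returns the standard `data[*].embedding` response should work without code changes.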
RealtimeIntent is one piece of a larger open-source voice AI platform:
| Project | What it does | Link |
|---|---|---|
| RealtimeAPI | Core voice AI engine — ASR→LLM→TTS pipeline orchestration in Rust, <450 ms E2E | SquadyAI/RealtimeAPI |
| RealtimeIntent | Intent classification — vector search + neural reranking, <100 ms | you are here |
| RealtimeSearch | Multi-engine search gateway with automatic failover | SquadyAI/RealtimeSearch |
┌─────────────────────────────────── RealtimeAPI ───────────────────────────────────┐
│ │
│ Audio ──▶ VAD ──▶ ASR ──▶ ┌──────────────────┐ ──▶ LLM ──▶ TTS ──▶ Audio │
│ │ ★ RealtimeIntent │ │ │
│ │ < 100ms intent │ │ │
│ └──────────────────┘ │ │
│ ┌────▼───────────────┐ │
│ │ RealtimeSearch │ │
│ │ web search tool │ │
│ └────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────────┘
Important
If you find RealtimeIntent useful, check out RealtimeAPI — the full voice conversation engine that powers real-time voice assistants with <450 ms latency and 100+ concurrent sessions.
Built by SquadyAI contributors · Part of the RealtimeAPI voice AI stack