feat(telegram): Voice pipeline refactor with STT integration and configurable routing #33

Open
maemreyo wants to merge 9 commits into nextlevelbuilder:main from maemreyo:maemreyo/telegram-voice-stt

Conversation

@maemreyo (Contributor) commented Mar 1, 2026

🎯 Overview

This PR refactors the Telegram voice pipeline to improve code organization and testability, and adds configurable voice agent routing with STT (speech-to-text) integration.

✨ Key Features

1. Nested Voice Configuration Structure

  • Introduced TelegramVoiceConfig to group all voice-related settings under a single voice JSON key
  • Clear separation between base channel settings and voice pipeline configuration
  • Backward compatible with legacy flat config layout via automatic promotion

2. Voice Agent Routing System

  • Configurable voice agent routing with priority-based decision chain:
    1. Audio/Voice Media → Always routes to voice agent (highest priority)
    2. /start Command → Bootstraps voice session with customizable message
    3. Intent Keywords → Text-based routing via configurable keyword matching
    4. Session Affinity → Sticky routing with TTL-based expiration
    5. Affinity Clear Keywords → User-initiated switch back to default agent
  • DM-only routing logic (groups excluded except for audio media)
  • Case-insensitive keyword matching with defensive normalization

3. STT (Speech-to-Text) Integration

  • Multipart form-data contract with /transcribe_audio endpoint
  • Bearer token authentication support
  • Tenant ID forwarding for multi-tenant deployments
  • Configurable timeout (default: 30s)
  • Concurrency control via buffered-channel semaphore (max 4 concurrent calls per channel)
  • Shared HTTP client with connection pooling for performance

4. Audio Guard System

  • Extracted into dedicated voiceguard package for better separation of concerns
  • Zero dependencies on Telegram SDK or message bus
  • Pure string→string transformation for easy unit testing
  • Intercepts technical error language in voice agent replies
  • User-friendly fallback messages with transcript support
  • Customizable error markers (replaces built-in defaults when set)
  • Supports both English and Vietnamese error detection
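As an illustration of the "pure string→string transformation" design, a guard along these lines could look like the sketch below. The function name sanitizeReply, its parameter list, and the default marker list are assumptions made for this example; the real voiceguard package may differ. The behaviour shown (custom markers replace defaults, transcript substituted into the fallback when available) follows the bullets above.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultMarkers is an illustrative list, not the PR's built-in defaults.
var defaultMarkers = []string{"system error", "rate limit", "internal server error"}

// sanitizeReply returns reply unchanged unless it contains an error marker,
// in which case it returns a user-friendly fallback. It performs no I/O.
func sanitizeReply(reply, transcript string, customMarkers []string, withTranscript, withoutTranscript string) string {
	markers := defaultMarkers
	if len(customMarkers) > 0 {
		markers = customMarkers // custom markers replace the defaults
	}

	lower := strings.ToLower(reply)
	hit := false
	for _, m := range markers {
		m = strings.ToLower(strings.TrimSpace(m))
		if m != "" && strings.Contains(lower, m) {
			hit = true
			break
		}
	}
	if !hit {
		return reply
	}
	if strings.TrimSpace(transcript) == "" {
		return withoutTranscript
	}
	// ReplaceAll tolerates templates that lack the %s placeholder.
	return strings.ReplaceAll(withTranscript, "%s", transcript)
}

func main() {
	withT := "🎙️ Got your voice: \"%s\". Please try again!"
	withoutT := "Sorry, something went wrong. Please try again!" // illustrative text

	fmt.Println(sanitizeReply("Great pronunciation!", "hello", nil, withT, withoutT))
	fmt.Println(sanitizeReply("System error: upstream 502", "hello", nil, withT, withoutT))
	fmt.Println(sanitizeReply("System error: upstream 502", "", nil, withT, withoutT))
}
```

Because the function depends only on its arguments, it can be covered by plain table-driven tests without a Telegram client or message bus.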

5. Enhanced Testability

  • resolveTargetAgent() extracted as pure function (no I/O side effects)
  • 14 table-driven test cases covering all routing scenarios
  • Race condition testing with -race flag support
  • 13 unit tests for audio guard logic
  • Comprehensive STT test coverage

6. Agent Loop Improvements

  • Rate limit model fallback support
  • ForwardMedia field for delegation artifact forwarding
  • Improved error handling and tracing (emitLLMSpan now records the model actually used)
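The rate limit fallback can be pictured roughly as below. The commit list names modelCandidates and isRateLimitFailure, but the bodies here are a sketch written for this description, and providerError and callWithFallback are hypothetical stand-ins for the real provider types.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// providerError is a hypothetical error type carrying an HTTP status.
type providerError struct {
	Status int
	Msg    string
}

func (e *providerError) Error() string {
	return fmt.Sprintf("provider error %d: %s", e.Status, e.Msg)
}

// modelCandidates returns primary followed by fallbacks, skipping
// empty names and duplicates while preserving order.
func modelCandidates(primary string, fallbacks []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range append([]string{primary}, fallbacks...) {
		m = strings.TrimSpace(m)
		if m == "" || seen[m] {
			continue
		}
		seen[m] = true
		out = append(out, m)
	}
	return out
}

// isRateLimitFailure reports whether err looks like a 429 / rate limit error.
func isRateLimitFailure(err error) bool {
	if err == nil {
		return false
	}
	var pe *providerError
	if errors.As(err, &pe) && pe.Status == 429 {
		return true
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "rate limit") || strings.Contains(msg, "too many requests")
}

// callWithFallback tries each model in order and moves to the next one
// only when the failure is a rate limit; other errors are returned at once.
func callWithFallback(models []string, call func(model string) (string, error)) (reply, used string, err error) {
	if len(models) == 0 {
		return "", "", errors.New("no models configured")
	}
	for _, m := range models {
		reply, err = call(m)
		if err == nil {
			return reply, m, nil
		}
		if !isRateLimitFailure(err) {
			return "", m, err
		}
	}
	return "", "", err
}

func main() {
	models := modelCandidates("model-a", []string{"model-b", "model-a"})
	reply, used, err := callWithFallback(models, func(m string) (string, error) {
		if m == "model-a" {
			return "", &providerError{Status: 429, Msg: "slow down"}
		}
		return "hello from " + m, nil
	})
	fmt.Println(reply, used, err)
}
```

Returning the model that actually served the request is what allows a tracing span to record it.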

📊 Changes Summary

18 files changed
+2020 insertions
-247 deletions

New Files

  • internal/channels/telegram/voiceguard/guard.go - Audio guard logic
  • internal/channels/telegram/voiceguard/guard_test.go - Audio guard tests (13 tests)
  • internal/channels/telegram/handlers_voice_routing_test.go - Voice routing tests (14 tests)
  • internal/config/config_load_voice_test.go - Voice config tests
  • internal/agent/loop_fallback_test.go - Model fallback tests
  • cmd/gateway_consumer_audio_sanitize_test.go - Audio sanitization tests

Modified Files

  • internal/config/config_channels.go - New TelegramVoiceConfig struct
  • internal/channels/telegram/factory.go - Legacy config promotion logic
  • internal/channels/telegram/handlers.go - Voice routing implementation
  • internal/channels/telegram/stt.go - STT concurrency control & HTTP client pooling
  • internal/agent/loop.go - ForwardMedia support & improved structure
  • cmd/gateway_consumer.go - Integration with voiceguard package

🔄 Migration Path

For Existing Deployments

No immediate action required! The refactor is fully backward compatible:

  • Existing DB rows with flat config layout continue to work
  • Legacy fields are automatically promoted to nested structure on load
  • No database migration needed
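For readers wondering what "automatic promotion" means in practice, the sketch below shows one way it could work. The nested keys match the example configuration in this description; the legacy flat JSON keys (voice_agent_id, stt_proxy_url) and the function name loadInstanceConfig are assumptions for illustration only, and the rule that nested values win over flat ones is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// voiceConfig holds a subset of the nested voice settings.
type voiceConfig struct {
	AgentID     string `json:"agent_id,omitempty"`
	STTProxyURL string `json:"stt_proxy_url,omitempty"`
}

// instanceConfig accepts both layouts. The legacy key names are assumed.
type instanceConfig struct {
	Voice voiceConfig `json:"voice"`

	LegacyVoiceAgentID string `json:"voice_agent_id,omitempty"`
	LegacySTTProxyURL  string `json:"stt_proxy_url,omitempty"`
}

// loadInstanceConfig decodes raw JSON and promotes legacy flat fields
// into the nested struct when the nested value is not set.
func loadInstanceConfig(raw []byte) (instanceConfig, error) {
	var cfg instanceConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	if cfg.Voice.AgentID == "" {
		cfg.Voice.AgentID = cfg.LegacyVoiceAgentID
	}
	if cfg.Voice.STTProxyURL == "" {
		cfg.Voice.STTProxyURL = cfg.LegacySTTProxyURL
	}
	// Clear the legacy fields so later code only reads the nested form.
	cfg.LegacyVoiceAgentID, cfg.LegacySTTProxyURL = "", ""
	return cfg, nil
}

func main() {
	legacy := []byte(`{"voice_agent_id":"speaking-agent","stt_proxy_url":"https://stt.example.com"}`)
	cfg, err := loadInstanceConfig(legacy)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Printf("%+v\n", cfg.Voice)
}
```

Promoting at load time means the rest of the code reads a single layout, while stored rows stay untouched.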

For New Deployments

Use the nested structure for cleaner configuration:

{
  "voice": {
    "agent_id": "speaking-agent",
    "stt_proxy_url": "https://stt.example.com",
    "stt_api_key": "secret-key",
    "intent_keywords": ["speaking", "pronunciation"],
    "affinity_clear_keywords": ["homework", "payment"],
    "affinity_ttl_minutes": 360,
    "dm_context_template": "Context:\n- tenant: {tenant_id}\n- user_id: {user_id}",
    "audio_guard_fallback_transcript": "🎙️ Got your voice: \"%s\". Please try again!",
    "audio_guard_error_markers": ["system error", "rate limit"]
  }
}

🧪 Testing

All tests pass:

✅ internal/channels/telegram - 14 routing tests
✅ internal/channels/telegram/voiceguard - 13 audio guard tests  
✅ internal/channels/telegram - STT tests updated
✅ cmd - audio sanitization tests

Run with race detector:

go test ./internal/channels/telegram/... -race -v

🔧 Environment Variables

New environment variable support:

  • GOCLAW_VOICE_AGENT_ID - Override voice agent ID
  • GOCLAW_STT_TENANT_ID - Override STT tenant ID
  • GOCLAW_VOICE_DM_CONTEXT_TEMPLATE - Override DM context template
  • GOCLAW_AUDIO_GUARD_FALLBACK_TRANSCRIPT - Override transcript fallback
  • GOCLAW_AUDIO_GUARD_FALLBACK_NO_TRANSCRIPT - Override no-transcript fallback

📝 Documentation

Voice Routing Priority Chain

  1. Audio/voice media present → voice agent (applies to groups too)
  2. /start or start text (DM only) → voice agent + rewrite content
  3. Text matches intent keywords (DM only) → voice agent + set affinity
  4. Existing non-expired affinity (DM only) → continue routing to affinity agent
  5. Text matches clear keywords (DM only) → evict affinity, route to default
  6. Fallback → default agent
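The chain above can be expressed as a pure function, which is what makes table-driven testing straightforward. The sketch below is not the PR's resolveTargetAgent: the types and the name resolveRoute are hypothetical, and it makes one explicit assumption, namely that clear keywords are tested before an existing affinity is honored, since otherwise an active affinity could never be cleared by the user. The actual ordering in the code may differ.

```go
package main

import (
	"fmt"
	"strings"
)

type routeInput struct {
	IsDM          bool
	HasAudio      bool
	Text          string
	AffinityAgent string // non-expired affinity for this chat, "" if none
}

type routeConfig struct {
	DefaultAgent   string
	VoiceAgent     string
	IntentKeywords []string
	ClearKeywords  []string
}

type routeDecision struct {
	Agent         string
	SetAffinity   bool
	ClearAffinity bool
	RewriteStart  bool
}

// containsAny lowercases both sides so mixed-case config keywords still match.
func containsAny(text string, keywords []string) bool {
	text = strings.ToLower(text)
	for _, k := range keywords {
		k = strings.ToLower(strings.TrimSpace(k))
		if k != "" && strings.Contains(text, k) {
			return true
		}
	}
	return false
}

// resolveRoute has no I/O side effects; the caller applies the decision.
func resolveRoute(in routeInput, cfg routeConfig) routeDecision {
	if cfg.VoiceAgent == "" {
		return routeDecision{Agent: cfg.DefaultAgent}
	}
	if in.HasAudio { // applies to groups too
		return routeDecision{Agent: cfg.VoiceAgent}
	}
	if !in.IsDM { // every remaining rule is DM-only
		return routeDecision{Agent: cfg.DefaultAgent}
	}
	text := strings.ToLower(strings.TrimSpace(in.Text))
	switch {
	case text == "/start" || text == "start":
		return routeDecision{Agent: cfg.VoiceAgent, RewriteStart: true}
	case containsAny(text, cfg.IntentKeywords):
		return routeDecision{Agent: cfg.VoiceAgent, SetAffinity: true}
	case containsAny(text, cfg.ClearKeywords):
		return routeDecision{Agent: cfg.DefaultAgent, ClearAffinity: true}
	case in.AffinityAgent != "":
		return routeDecision{Agent: in.AffinityAgent}
	}
	return routeDecision{Agent: cfg.DefaultAgent}
}

func main() {
	cfg := routeConfig{
		DefaultAgent:   "default-agent",
		VoiceAgent:     "speaking-agent",
		IntentKeywords: []string{"Speaking", "pronunciation"},
		ClearKeywords:  []string{"homework", "payment"},
	}
	inputs := []routeInput{
		{IsDM: false, HasAudio: true},
		{IsDM: true, Text: "I want SPEAKING practice"},
		{IsDM: true, Text: "about my payment", AffinityAgent: "speaking-agent"},
		{IsDM: false, Text: "speaking"},
	}
	for _, in := range inputs {
		fmt.Printf("%+v\n", resolveRoute(in, cfg))
	}
}
```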

Audio Guard Behavior

  • Only triggers for voice agent on Telegram DMs with audio/voice media
  • Checks reply for technical error language
  • Replaces with user-friendly fallback when error detected
  • Supports custom error markers (replaces defaults when set)
  • Extracts and includes transcript in fallback when available

🐛 Bug Fixes

  • Fixed group affinity leak (affinity no longer stored for group chats)
  • Fixed variable assignment in the resolveTargetAgent call in handleMessage (`=` changed to `:=`)
  • Normalized voice routing keywords to lowercase for case-insensitive matching
  • Fixed STT contract to use audio field (not legacy file field)
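For reference, a request matching the STT contract described in this PR (multipart form-data, audio field, bearer token, /transcribe_audio path, 30s default timeout) could be built as in the sketch below. The helper name buildSTTRequest is hypothetical, and sending the tenant as a tenant_id form field is an assumption; the description only says the tenant ID is forwarded.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"strings"
	"time"
)

// sttClient is shared so connections are pooled across calls.
var sttClient = &http.Client{Timeout: 30 * time.Second}

// buildSTTRequest assembles the multipart request without sending it.
func buildSTTRequest(baseURL, apiKey, tenantID, filename string, audio []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("audio", filename) // "audio", not legacy "file"
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if tenantID == "" {
		tenantID = "default"
	}
	if err := w.WriteField("tenant_id", tenantID); err != nil { // field name assumed
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	url := strings.TrimRight(strings.TrimSpace(baseURL), "/") + "/transcribe_audio"
	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, err := buildSTTRequest("https://stt.example.com/", "secret-key", "", "voice.ogg", []byte("fake-audio"))
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("timeout:", sttClient.Timeout)
	// A real caller would now run sttClient.Do(req) and decode the response.
}
```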

🔍 Code Quality

  • Zero breaking changes for existing deployments
  • Comprehensive test coverage (27 new tests)
  • Clear separation of concerns (voiceguard package)
  • Improved code organization and maintainability
  • Detailed inline documentation
  • Performance optimizations (HTTP client pooling, concurrency control)

📚 Related Issues

Closes: (if any issue numbers)

🙏 Acknowledgments

This refactor builds upon the existing voice pipeline foundation and improves it with better structure, testability, and configurability for production deployments.

maemreyo added 9 commits March 1, 2026 20:04
- Add dmAgentAffinity map for sticky DM routing to voice agent
- Add STT config fields (STTProxyURL, STTAPIKey, STTTenantID, STTTimeoutSec, VoiceAgentID)
- Implement looksLikeSpeakingIntent and looksLikeNonSpeakingIntent for smart routing
- Add session affinity with 6h TTL for DM conversations
- Improve STT URL handling with proper trimming
- Add logging for transcript attachment
- Add modelFallbacks to Loop config for fallback model support
- Implement callProviderWithFallback for automatic model switching on 429 errors
- Add modelCandidates helper to deduplicate primary + fallback models
- Add isRateLimitFailure detection for 429 status and common rate limit error messages
- Update emitLLMSpan to track actual model used in span
- Change form field from 'file' to 'audio' for speaking-service contract
- Add default tenant_id fallback ('default') when not configured
- Add speaking-agent Telegram audio guard for student replies
- Add internal identity prompt for speaking-agent in DM
- Add sanitizeSpeakingAudioStudentReply to handle technical errors
- Update STT tests for new contract
Replace hardcoded speaking-agent logic with configurable Telegram channel settings:
- VoiceStartMessage, VoiceIntentKeywords, VoiceAffinityClearKeywords, VoiceAffinityTTLMinutes
- VoiceDMContextTemplate (injects context with {user_id} substitution)
- AudioGuardFallbackTranscript/NoTranscript for custom fallback messages
- GOCLAW_STT_TENANT_ID and GOCLAW_VOICE_DM_CONTEXT_TEMPLATE env var overrides

This allows deployments to customize voice routing behavior and error fallback messages without code changes. Includes new tests for voice routing logic and audio guard sanitization.
- Replace fmt.Sprintf with strings.ReplaceAll in audio fallback template handling to prevent "%!(EXTRA string=...)" garbage when custom templates lack %s placeholder
- Lowercase config keywords defensively in matchesVoiceIntent and matchesAffinityClear since inbound text is normalized but DB keywords may have mixed case
- Add comprehensive test coverage for custom fallback templates with and without placeholders
- Add test cases for mixed-case keyword matching in voice intent and affinity-clear routing
- Ensure operators can safely configure keywords with any casing without breaking voice routing logic
- Extract voice agent reply sanitization into new voiceguard package with Guard type
- Add voiceguard.SanitizeReply function to handle technical error detection and fallback messaging
- Refactor voice configuration from flat fields (VoiceAgentID, VoiceDMContextTemplate) to nested Voice struct
- Support both nested and flat JSON layouts in telegramInstanceConfig for backward compatibility
- Add sttSem field to Channel for bounding parallel STT HTTP calls
- Update gateway_consumer to use voiceguard package instead of inline sanitization logic
- Remove sanitizeVoiceAgentReply, containsTechnicalErrorLanguage, and extractTranscriptFromInbound functions from gateway_consumer
- Clean up unused imports (html, regexp) from gateway_consumer
- Fix assignment operator from `=` to `:=` in handleMessage for proper variable declaration
- Remove duplicate test code block at end of handlers_voice_routing_test.go
- Clean up test file structure to eliminate redundant package declaration and imports
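The fmt.Sprintf to strings.ReplaceAll change mentioned in the commits above is easiest to see with a small comparison. The function names below are illustrative, not taken from the code.

```go
package main

import (
	"fmt"
	"strings"
)

// renderWithSprintf treats the operator-supplied template as a format string.
func renderWithSprintf(template, transcript string) string {
	return fmt.Sprintf(template, transcript)
}

// renderWithReplace substitutes the placeholder literally and leaves
// templates without a placeholder untouched.
func renderWithReplace(template, transcript string) string {
	return strings.ReplaceAll(template, "%s", transcript)
}

func main() {
	noPlaceholder := "Sorry, please try again!"

	fmt.Println(renderWithSprintf(noPlaceholder, "hello"))
	// prints: Sorry, please try again!%!(EXTRA string=hello)

	fmt.Println(renderWithReplace(noPlaceholder, "hello"))
	// prints: Sorry, please try again!

	fmt.Println(renderWithReplace("Got your voice: \"%s\"", "hello"))
	// prints: Got your voice: "hello"
}
```

Since fallback templates come from configuration, treating them as plain text avoids leaking formatter diagnostics into user-facing messages.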