feat: add multimedia support for Telegram and Discord adapters #196

Open
slysian wants to merge 2 commits into RightNow-AI:main from slysian:pr/multimedia-support

slysian commented Mar 2, 2026

Add comprehensive media handling for channel adapters:

Telegram (receive):

  • Voice messages: download + transcribe via Groq Whisper (fallback: OpenAI Whisper)
  • Photos: download + recognize via Gemini Vision API
  • Documents: download + extract text content or recognize images
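The voice-message fallback described above (try Groq Whisper, then OpenAI Whisper) can be sketched roughly as follows. This is only an illustration of the control flow, not the PR's actual code: `transcribe_with_fallback` is a hypothetical name, and the two providers are passed in as closures so that no real HTTP calls are assumed.

```rust
/// Hypothetical helper: run `primary` on the audio bytes and, only if it
/// fails, run `fallback`. If both fail, both error messages are reported.
fn transcribe_with_fallback<P, F>(
    audio: &[u8],
    primary: P,
    fallback: F,
) -> Result<String, String>
where
    P: FnOnce(&[u8]) -> Result<String, String>,
    F: FnOnce(&[u8]) -> Result<String, String>,
{
    match primary(audio) {
        Ok(text) => Ok(text),
        // The fallback is only invoked when the primary provider errors.
        Err(primary_err) => fallback(audio).map_err(|fallback_err| {
            format!("primary failed: {primary_err}; fallback failed: {fallback_err}")
        }),
    }
}
```

In a real adapter the closures would presumably wrap the Groq and OpenAI transcription requests; keeping them as parameters makes the fallback order easy to unit-test.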

Telegram & Discord (send):

  • File sending via multipart upload (sendDocument / Discord files API)
  • Image sending with optional captions

Discord (receive):

  • Attachment processing: images via Gemini Vision, text files extracted
  • Mixed content (text + attachments) handled correctly

Shared utilities (new media_utils module):

  • Gemini Vision image recognition
  • MIME type detection from magic bytes
  • Text file detection by extension/MIME
  • HTTP download helper
  • Attachment-to-text processing pipeline
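For reviewers unfamiliar with the approach, here is a minimal sketch of MIME detection from magic bytes and text-file detection by extension/MIME. The function names, the set of signatures, and the extension list are assumptions made for illustration; the actual media_utils module may cover different formats.

```rust
use std::path::Path;

/// Hypothetical sketch: guess a MIME type from the leading bytes of a file.
/// Only a handful of well-known signatures are checked.
fn detect_mime(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else if bytes.starts_with(b"\xFF\xD8\xFF") {
        Some("image/jpeg")
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some("image/gif")
    } else if bytes.len() >= 12 && bytes.starts_with(b"RIFF") && bytes[8..].starts_with(b"WEBP") {
        // WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
        Some("image/webp")
    } else if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if bytes.starts_with(b"OggS") {
        Some("audio/ogg")
    } else {
        None
    }
}

/// Hypothetical sketch: treat a file as text if its MIME type starts with
/// "text/" or its extension is in a small, assumed allow-list.
fn is_text_file(filename: &str, mime: Option<&str>) -> bool {
    if mime.is_some_and(|m| m.starts_with("text/")) {
        return true;
    }
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    matches!(
        ext.as_deref(),
        Some("txt" | "md" | "json" | "toml" | "csv" | "log")
    )
}
```

Checking the bytes rather than trusting the sender's declared content type is useful here because chat platforms do not always report an accurate MIME type for uploads.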

Closes #158

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
});

let url = format!(
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key={gemini_key}"
It would be better not to hardcode model names here and instead rely on the global model resolution flow through config.toml.

Replace hardcoded model names and API endpoints with LazyLock statics
that read from environment variables at first use, with sensible defaults:

- VISION_MODEL (default: gemini-2.5-flash)
- VISION_API_BASE (default: generativelanguage.googleapis.com/v1beta)
- GROQ_STT_MODEL (default: whisper-large-v3-turbo)
- GROQ_STT_URL (default: api.groq.com/openai/v1/audio/transcriptions)
- OPENAI_STT_MODEL (default: whisper-1)
- OPENAI_STT_URL (default: api.openai.com/v1/audio/transcriptions)

This allows users to swap models or providers without recompiling.
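
A minimal sketch of the pattern this commit describes, assuming a small helper named `env_or` (hypothetical) and showing only the two vision settings. The default base URL uses the `https://` scheme seen in the originally hardcoded URL; `std::sync::LazyLock` requires Rust 1.80 or newer.

```rust
use std::env;
use std::sync::LazyLock;

/// Hypothetical helper: read `key` from the environment, or use `default`
/// when the variable is unset or not valid Unicode.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

// Each static is initialized once, on first access, and then cached.
static VISION_MODEL: LazyLock<String> =
    LazyLock::new(|| env_or("VISION_MODEL", "gemini-2.5-flash"));
static VISION_API_BASE: LazyLock<String> = LazyLock::new(|| {
    env_or(
        "VISION_API_BASE",
        "https://generativelanguage.googleapis.com/v1beta",
    )
});

/// Build the generateContent URL from the resolved base and model.
fn vision_url(api_key: &str) -> String {
    format!(
        "{}/models/{}:generateContent?key={api_key}",
        *VISION_API_BASE, *VISION_MODEL
    )
}
```

Because the values are cached on first use, changing an environment variable after the first request has no effect until the process restarts.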

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Feature Request: Telegram media sending - agent should send images/videos natively