@apappascs
Contributor

Implements Spring AI TranscriptionModel interface for ElevenLabs Speech-to-Text API, providing audio transcription with support for multiple languages, speaker diarization, word-level timestamps, and advanced transcription features.

Implementation:

  • ElevenLabsAudioTranscriptionModel: Main model implementing TranscriptionModel
  • ElevenLabsAudioTranscriptionOptions: Configuration with 15 transcription parameters
  • ElevenLabsSpeechToTextApi: Low-level API client for ElevenLabs STT endpoints
  • ElevenLabsAudioTranscriptionMetadata: Rich metadata including language detection and word timing

Key Features:
- Full TranscriptionModel interface compliance
- Speaker diarization with configurable speaker count
- Word-level timestamps and audio event tagging
- Async transcription support via webhook integration
- Language detection with confidence scores
- Builder pattern for model and options construction
- Comprehensive error handling with retry support
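
To illustrate the builder-style construction mentioned above, here is a minimal, self-contained sketch. It is not the code from this PR: `SketchTranscriptionOptions`, its four fields, the placeholder default model id, and the `numSpeakers >= 1` check are all assumptions chosen for illustration, and the real `ElevenLabsAudioTranscriptionOptions` exposes 15 parameters whose names may differ.

```java
import java.util.Objects;

/** Hypothetical, simplified options type for illustration; not the class added in this PR. */
final class SketchTranscriptionOptions {

    private final String modelId;
    private final String languageCode; // null: leave language detection to the service
    private final boolean diarize;
    private final Integer numSpeakers; // null: no speaker-count hint

    private SketchTranscriptionOptions(Builder builder) {
        this.modelId = builder.modelId;
        this.languageCode = builder.languageCode;
        this.diarize = builder.diarize;
        this.numSpeakers = builder.numSpeakers;
    }

    static Builder builder() {
        return new Builder();
    }

    String getModelId() { return modelId; }
    String getLanguageCode() { return languageCode; }
    boolean isDiarize() { return diarize; }
    Integer getNumSpeakers() { return numSpeakers; }

    static final class Builder {

        private String modelId = "example-model"; // placeholder default, not a real model id
        private String languageCode;
        private boolean diarize;
        private Integer numSpeakers;

        private Builder() {
        }

        Builder modelId(String modelId) {
            this.modelId = Objects.requireNonNull(modelId, "modelId");
            return this;
        }

        Builder languageCode(String languageCode) {
            this.languageCode = languageCode;
            return this;
        }

        Builder diarize(boolean diarize) {
            this.diarize = diarize;
            return this;
        }

        Builder numSpeakers(Integer numSpeakers) {
            // Assumed validation rule for this sketch: a speaker count, if given, must be positive.
            if (numSpeakers != null && numSpeakers < 1) {
                throw new IllegalArgumentException("numSpeakers must be at least 1");
            }
            this.numSpeakers = numSpeakers;
            return this;
        }

        SketchTranscriptionOptions build() {
            return new SketchTranscriptionOptions(this);
        }
    }
}
```

A caller would chain the setters and finish with `build()`, for example `SketchTranscriptionOptions.builder().languageCode("en").diarize(true).numSpeakers(2).build()`. Because the built object has only final fields and no setters, it can be shared safely between requests.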

Design Decisions:
- Uses transcribe(Resource) convenience method from interface (not call(Resource))
- Result-level metadata for transcription-specific information
- Defensive copying for immutability (List.copyOf, HashMap copy)
- Proper null handling: returns empty AudioTranscription("") instead of null
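
The defensive-copying and null-handling decisions above can be sketched roughly as follows. The class names `SketchMetadata` and `SketchTranscription` and their fields are hypothetical stand-ins, not the types in this PR; only `List.copyOf`, the `HashMap` copy, and the "empty string instead of null" behaviour are taken from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical metadata holder that copies its inputs so later caller-side changes cannot leak in. */
final class SketchMetadata {

    private final List<String> words;
    private final Map<String, Object> extras;

    SketchMetadata(List<String> words, Map<String, Object> extras) {
        // List.copyOf returns an unmodifiable copy; note it rejects null elements.
        this.words = (words == null) ? List.<String>of() : List.copyOf(words);
        // Copy into a fresh HashMap (which tolerates null values), then wrap it read-only.
        this.extras = (extras == null)
                ? Map.<String, Object>of()
                : Collections.unmodifiableMap(new HashMap<String, Object>(extras));
    }

    List<String> getWords() { return words; }
    Map<String, Object> getExtras() { return extras; }
}

/** Hypothetical result type: a missing response text becomes "" rather than null. */
final class SketchTranscription {

    private final String text;

    SketchTranscription(String text) {
        this.text = (text == null) ? "" : text;
    }

    String getText() { return text; }
}
```

With this shape, mutating the list or map passed to the constructor afterwards leaves the metadata unchanged, and callers can call `getText()` without a null check.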

Testing:
- 15 unit tests with MockRestServiceServer
- 4 integration tests against live API
- Tests cover basic transcription, options, diarization, and convenience methods

Documentation:
- Comprehensive Antora reference documentation
- Module README with TTS and STT capabilities
- Alignment report documenting design decisions vs OpenAI/Azure implementations

All tests pass (26 integration tests in total across the module).

Signed-off-by: Alexandros Pappas <apappascs@gmail.com>
@apappascs apappascs force-pushed the feature/elevenlabs-speech-to-text branch from 9d30cf5 to 4f951dd Compare January 2, 2026 15:11
@apappascs
Contributor Author

cc @markpollack
