Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
272 changes: 272 additions & 0 deletions TRANSCRIPTION_CAPABILITY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,272 @@
# Audio Transcription Capability

## Overview

ADK-Java provides an optional audio transcription capability that allows agents to transcribe audio data to text. The feature is designed to be optional, lazily loaded, and configurable via environment variables.

## Features

- **Optional Feature**: Works without configuration, enables when configured
- **Lazy Loading**: Services created only when needed, cached for reuse
- **Multiple Service Support**: Extensible architecture supporting multiple transcription services
- **Async Processing**: Built on RxJava for efficient asynchronous operations
- **Streaming Support**: Supports both batch and streaming transcription
- **Environment Configuration**: All configuration via environment variables (12-Factor App compliant)

## Architecture

The transcription capability follows several design patterns:

- **Strategy Pattern**: Pluggable transcription service implementations
- **Factory Pattern**: Lazy-loaded service creation with caching
- **Builder Pattern**: Flexible configuration management
- **Optional Pattern**: Graceful degradation when not configured

### Package Structure

```
com.google.adk.transcription/
├── ServiceType.java # Service type enumeration
├── AudioFormat.java # Audio format specifications
├── TranscriptionException.java # Custom exception
├── ServiceHealth.java # Health status DTO
├── TranscriptionResult.java # Result DTO
├── TranscriptionEvent.java # Event DTO for streaming
├── TranscriptionService.java # Core interface
├── TranscriptionConfig.java # Configuration class
├── config/
│ └── TranscriptionConfigLoader.java # Environment config loader
├── client/
│ ├── WhisperRequest.java # Request DTO
│ ├── WhisperResponse.java # Response DTO
│ └── WhisperApiClient.java # HTTP client
├── strategy/
│ ├── WhisperTranscriptionService.java # Service implementation
│ └── TranscriptionServiceFactory.java # Factory
└── processor/
└── AudioChunkAggregator.java # Chunk aggregation
```

## Configuration

### Environment Variables

**Required (for transcription to work):**
```bash
ADK_TRANSCRIPTION_ENDPOINT=https://your-transcription-service:port
```

**Optional:**
```bash
# Service type (default: inferred from endpoint)
ADK_TRANSCRIPTION_SERVICE_TYPE=whisper

# API key if required by service
ADK_TRANSCRIPTION_API_KEY=your-api-key

# Language code (default: auto-detect)
ADK_TRANSCRIPTION_LANGUAGE=en

# Timeout in seconds (default: 30)
ADK_TRANSCRIPTION_TIMEOUT_SECONDS=30

# Max retries (default: 3)
ADK_TRANSCRIPTION_MAX_RETRIES=3

# Chunk size in milliseconds for streaming (default: 500)
ADK_TRANSCRIPTION_CHUNK_SIZE_MS=500
```

## Usage

### Basic Usage: Agent Tool

The simplest way to use transcription is through the `TranscriptionTool`, which can be added to any agent:

```java
import com.google.adk.agents.LlmAgent;
import com.google.adk.tools.transcription.TranscriptionTool;
import com.google.adk.tools.FunctionTool;

// Create transcription tool (returns null if not configured)
FunctionTool transcriptionTool = TranscriptionTool.create();

if (transcriptionTool != null) {
LlmAgent agent = LlmAgent.builder()
.name("audio_agent")
.model("gemini-2.0-flash")
.instruction("Analyze audio files. Use transcribe_audio tool when needed.")
.addTool(transcriptionTool)
.build();

// Agent can now automatically call transcribe_audio tool
}
```

### Check Availability

```java
if (TranscriptionTool.isAvailable()) {
// Transcription is configured and available
FunctionTool tool = TranscriptionTool.create();
agent.addTool(tool);
} else {
// Work without transcription
System.out.println("Transcription not configured");
}
```

### Advanced Usage: Direct Service Access

For more control, you can use the transcription service directly:

```java
import com.google.adk.transcription.*;
import com.google.adk.transcription.config.TranscriptionConfigLoader;
import com.google.adk.transcription.strategy.TranscriptionServiceFactory;

// Load configuration from environment
Optional<TranscriptionConfig> config = TranscriptionConfigLoader.loadFromEnvironment();

if (config.isPresent()) {
// Get service (lazy loaded, cached)
TranscriptionService service = TranscriptionServiceFactory.getOrCreate(config.get());

// Synchronous transcription
byte[] audioData = ...; // Your audio bytes
TranscriptionResult result = service.transcribe(audioData, config.get());
System.out.println("Transcribed: " + result.getText());
}
```

### Async Transcription

```java
// Use RxJava Single for async transcription
Single<TranscriptionResult> resultFuture =
service.transcribeAsync(audioData, config.get());

resultFuture.subscribe(
result -> System.out.println("Transcribed: " + result.getText()),
error -> System.err.println("Error: " + error.getMessage())
);
```

### Streaming Transcription

```java
// Stream audio chunks and get transcription events
Flowable<byte[]> audioStream = ...; // Your audio stream
Flowable<TranscriptionEvent> transcriptionEvents =
service.transcribeStream(audioStream, config.get());

transcriptionEvents.subscribe(
event -> {
if (event.isFinished()) {
System.out.println("Final: " + event.getText());
} else {
System.out.println("Partial: " + event.getText());
}
}
);
```

## Tool Function Signature

When used as an agent tool, transcription exposes the following function:

**Function Name:** `transcribe_audio`

**Parameters:**
- `audio_data` (required): Base64-encoded audio data
- `language` (optional): Language code (e.g., "en", "es", "fr")

**Returns:**
```json
{
"text": "Transcribed text",
"language": "en",
"confidence": 0.95,
"duration": 5000
}
```

## Supported Services

Currently implemented:
- **Whisper**: HTTP-based Whisper API integration

Future support planned:
- Gemini Live API
- Azure Speech Services
- AWS Transcribe

## Error Handling

Transcription operations throw `TranscriptionException` for errors. The service includes:
- Retry logic with exponential backoff
- Health check support
- Comprehensive error messages

## Thread Safety

- Service factory uses thread-safe caching
- Services are stateless and thread-safe
- Configuration objects are immutable

## Testing

### Compilation

```bash
mvn compile -DskipTests
```

### Unit Tests

```bash
mvn test -Dtest=TranscriptionConfigTest
```

## Implementation Details

### Service Factory

The `TranscriptionServiceFactory` implements lazy loading:
- Services are created only when first accessed
- Services are cached and reused
- Thread-safe implementation using `ConcurrentHashMap`

### HTTP Client

The Whisper implementation uses OkHttp for HTTP requests:
- Connection pooling
- Configurable timeouts
- Retry logic with exponential backoff
- Health check support

### Audio Processing

- Supports multiple audio formats (PCM, WAV, MP3)
- Configurable sample rates and channels
- Chunk aggregation for efficient streaming

## Limitations

- Live streaming integration (real-time audio) is not yet implemented
- PostgreSQL storage integration is not yet implemented
- Additional service implementations (Gemini, Azure, AWS) are planned

## Future Enhancements

- Real-time streaming handler integration
- Persistent storage for audio and metadata
- Additional transcription service implementations
- Enhanced error handling and retry strategies
- Performance optimizations and caching

## License

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0.
Loading