
feat: Add OpenAI Whisper transcription and Gemini translation support #3

Open
danyuchn wants to merge 1 commit into op7418:main from danyuchn:feature/multi-api-support

@danyuchn

🎯 Motivation

  • Many YouTube videos lack existing subtitles
  • Claude API costs can be prohibitive for batch processing
  • Users need more flexible API options

✨ New Features

1. Auto-transcription with OpenAI Whisper API

  • Handles videos without existing subtitles
  • Supports long audio (no token limits)
  • Cost: ~$0.006/minute
  • Tested: 73-minute video → 209 seconds processing, 2427 segments

2. Gemini API integration for translation

  • 93% cost reduction vs Claude API
  • Gemini 2.5 Flash Lite for batch translation (30 items/batch)
  • Gemini 2.5 Flash for content generation
  • Maintains translation quality

3. YouTube HTTP 403 bypass

  • Uses iOS/Android client parameters
  • Documented in references/yt-dlp-guide.md
  • Tested successfully on multiple videos
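
The exact parameters are documented in references/yt-dlp-guide.md, which is not shown in this description. As a rough sketch, assuming the fix relies on yt-dlp's `player_client` extractor argument, the equivalent Python options could look like this (the function names and the `bestaudio/best` format choice are illustrative, and which clients work can change between yt-dlp releases):

```python
def build_ydl_options(output_template: str) -> dict:
    """Build yt-dlp options that request the iOS/Android player clients."""
    return {
        "format": "bestaudio/best",  # assumption: audio is enough for transcription
        "outtmpl": output_template,
        # Assumed to match the "iOS/Android client parameters" in this PR.
        "extractor_args": {"youtube": {"player_client": ["ios", "android"]}},
    }


def download_audio(url: str, output_template: str = "%(id)s.%(ext)s") -> None:
    """Download one video's audio using the options above."""
    import yt_dlp  # imported lazily so the options can be inspected without yt-dlp

    with yt_dlp.YoutubeDL(build_ydl_options(output_template)) as ydl:
        ydl.download([url])
```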

📊 Tested On

  • Video: 73-minute Chinese GMAT lecture (no existing subtitles)
  • Chapters: 18 processed
  • Success rate: 100%
  • Total cost: $0.74-0.89 (vs $6.30 with Claude API only)
  • Processing time: 25-30 minutes

Cost Comparison (18 chapters)

| API | Translation | Content | Total |
|---|---|---|---|
| Claude API | ~$2.70 | ~$3.60 | ~$6.30 |
| Gemini API | ~$0.15 | ~$0.30 | ~$0.45 |
| Savings | 94% | 92% | 93% |
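
The savings row follows from the per-stage estimates. A minimal sketch of the arithmetic, using only figures quoted in this PR (the function and variable names are illustrative):

```python
def savings_percent(baseline: float, new: float) -> float:
    """Return the cost reduction of `new` relative to `baseline`, in percent."""
    return (baseline - new) / baseline * 100


# Estimated costs in USD for 18 chapters, as quoted in the cost comparison.
claude = {"translation": 2.70, "content": 3.60}
gemini = {"translation": 0.15, "content": 0.30}

for stage in claude:
    print(f"{stage}: {savings_percent(claude[stage], gemini[stage]):.0f}%")

total = savings_percent(sum(claude.values()), sum(gemini.values()))
print(f"total: {total:.0f}%")

# Whisper transcription: 73 minutes at ~$0.006/minute.
whisper = 73 * 0.006
print(f"whisper + gemini: ${whisper + sum(gemini.values()):.2f}")
```

This prints 94%, 92%, and 93% for the three columns. The last line gives $0.89, which matches the upper end of the reported $0.74-0.89 total if that total includes transcription; the PR does not state this explicitly.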

🔧 Technical Details

New Scripts

  • scripts/transcribe_with_openai.py - Whisper API transcription
  • scripts/translate_with_gemini.py - Gemini batch translation (30 items/batch)
  • scripts/merge_bilingual_from_json.py - JSON to SRT format conversion
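
To illustrate the JSON to SRT step, here is a minimal sketch. The JSON schema is an assumption (a list of objects with `start` and `end` in seconds plus `source` and `translation` text), as is the line order within each cue; the actual format used by scripts/merge_bilingual_from_json.py is not shown in this description.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_bilingual_srt(segments: list) -> str:
    """Render segments as SRT cues with the translation above the source line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['translation']}\n"
            f"{seg['source']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```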

Updated Documentation

  • TECHNICAL_NOTES.md: Added sections 11-15 for new technical issues
    • Section 11: YouTube HTTP 403 Forbidden
    • Section 12: Whisper API transcription
    • Section 13: Gemini batch translation optimization
    • Section 14: Content generation anti-truncation
    • Section 15: JSON → SRT format conversion
  • FIXES_AND_IMPROVEMENTS.md: Added 2026-01-25 version with complete test results
  • README.md & README.zh-CN.md: Added API keys configuration section
  • references/yt-dlp-guide.md: Added HTTP 403 solution
  • .env.example: Added OPENAI_API_KEY and GEMINI_API_KEY

API Keys Required

# OpenAI API Key (for Whisper transcription)
OPENAI_API_KEY=sk-proj-...

# Gemini API Key (for translation and content generation)
GEMINI_API_KEY=AIza...
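
Since each key enables a separate feature, a script can check which ones are available before starting. A small sketch, assuming the keys are read from environment variables (the function name and feature labels are illustrative):

```python
import os


def available_features(env=os.environ) -> dict:
    """Report which optional features have their API key configured."""
    return {
        # OPENAI_API_KEY enables Whisper transcription.
        "whisper_transcription": bool(env.get("OPENAI_API_KEY")),
        # GEMINI_API_KEY enables translation and content generation.
        "gemini_translation": bool(env.get("GEMINI_API_KEY")),
    }
```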

📋 Breaking Changes

None - This PR only adds new optional features:

  • Original Claude API translation support is retained
  • All new features are opt-in via API keys
  • Existing workflows continue to work unchanged

🔍 Key Implementation Details

Whisper Transcription

  • No token limit on audio length (unlike Gemini 2.0 Flash, which has a 1M-token context limit)
  • Automatic language detection
  • High-quality timestamps in VTT format
  • Handles 73-minute video without issues
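
For reference, a minimal sketch of a VTT transcription call with the official `openai` Python SDK. This is not the code from scripts/transcribe_with_openai.py; the function names are illustrative. Note that the API limits uploads to 25 MB, so long recordings generally need a low-bitrate encoding or chunking, and how the script handles that is not covered here.

```python
from typing import Optional


def transcribe_to_vtt(audio_path: str, api_key: Optional[str] = None) -> str:
    """Send one audio file to the Whisper API and return the transcript as VTT text."""
    from openai import OpenAI  # imported lazily so the helper below works without the SDK

    # With api_key=None the client falls back to the OPENAI_API_KEY variable.
    client = OpenAI(api_key=api_key)
    with open(audio_path, "rb") as audio:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            response_format="vtt",
        )


def count_vtt_cues(vtt_text: str) -> int:
    """Count cues in a VTT document by counting timestamp lines."""
    return sum(1 for line in vtt_text.splitlines() if "-->" in line)
```

`count_vtt_cues` gives a quick sanity check on the result, for example to compare against the segment count reported for the test video.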

Gemini Translation

  • Batch size optimized from 20 → 30 items
  • Temperature: 0.3 for consistency
  • JSON output format for easy validation
  • About 97% fewer API calls than single-item requests (95% at the previous batch size of 20)
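
The batching and validation steps can be sketched as follows. The helper names are hypothetical and the Gemini request itself is omitted; the sketch only assumes that each response is a JSON array with one translated string per input item.

```python
import json
from typing import Iterator, List

BATCH_SIZE = 30  # batch size stated in this PR


def batched(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_translations(raw: str, expected: int) -> List[str]:
    """Parse a JSON array response and check that no item was dropped."""
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != expected:
        raise ValueError(f"expected a JSON array of {expected} items")
    return [str(item) for item in data]
```

For the 2427 segments of the test video, `batched` yields 81 batches, so 81 requests replace 2427 single-item requests. Requesting JSON output is what makes the length check in `parse_translations` possible.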

Anti-truncation Mechanism

  • max_output_tokens: Increased from 3000 → 8000
  • 3-retry system with completeness validation
  • Checks for all required sections (小红书/抖音/微信公众号)
  • 100% success rate after optimization
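
A minimal sketch of the retry-with-validation idea, assuming "3-retry" means at most three generation attempts and that completeness is checked by looking for each platform name in the output. The `generate` callable stands in for the actual Gemini call and is hypothetical.

```python
from typing import Callable, List

REQUIRED_SECTIONS = ("小红书", "抖音", "微信公众号")  # section names listed in this PR
MAX_ATTEMPTS = 3  # assumption: three attempts in total


def missing_sections(text: str) -> List[str]:
    """Return the required section names that do not appear in `text`."""
    return [name for name in REQUIRED_SECTIONS if name not in text]


def generate_complete(generate: Callable[[], str], attempts: int = MAX_ATTEMPTS) -> str:
    """Call `generate` until the output contains every required section."""
    missing: List[str] = []
    for _ in range(attempts):
        text = generate()
        missing = missing_sections(text)
        if not missing:
            return text
    raise RuntimeError(f"output still missing {missing} after {attempts} attempts")
```

Raising after the final attempt, rather than returning partial output, keeps a truncated chapter from silently reaching the final result.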

🧪 Testing

  • ✅ Fully tested on production workload
  • ✅ 73-minute video, 18 chapters, 100% success
  • ✅ All scripts validated with real API calls
  • ✅ Documentation verified and cross-referenced

💡 Future Improvements (Not in this PR)

  • Support for more Gemini models
  • Parallel chapter processing
  • Auto-retry on API failures

Note: This PR represents real-world usage and optimization based on processing a complete 73-minute video with 18 chapters. All features have been tested and validated in production scenarios.

## New Features
- **Auto-transcription**: OpenAI Whisper API for videos without subtitles
  - Supports long audio (no token limits)
  - Cost: ~$0.006/minute
  - Tested on 73-minute video (2427 segments, 209 seconds)

- **Gemini API integration**: 93% cost reduction vs Claude API
  - Gemini 2.5 Flash Lite for translation (batch size: 30)
  - Gemini 2.5 Flash for content generation
  - Cost: ~$0.45 vs ~$6.30 for 18 chapters

- **YouTube HTTP 403 bypass**: iOS/Android client parameters
  - Documented in references/yt-dlp-guide.md

## New Scripts
- scripts/transcribe_with_openai.py: Whisper API transcription
- scripts/translate_with_gemini.py: Gemini batch translation
- scripts/merge_bilingual_from_json.py: JSON to SRT conversion

## Updated Documentation
- TECHNICAL_NOTES.md: Added sections 11-15 for new technical issues
- FIXES_AND_IMPROVEMENTS.md: Added 2026-01-25 version with test results
- README.md & README.zh-CN.md: Added API keys configuration
- references/yt-dlp-guide.md: Added HTTP 403 solution
- .env.example: Added OPENAI_API_KEY and GEMINI_API_KEY

## Test Results
- 73-minute Chinese video (no subtitles)
- 18 chapters processed
- 100% success rate
- Total cost: $0.74-0.89
- Processing time: 25-30 minutes

## Backward Compatibility
- Original Claude API support retained
- All new features are optional
- No breaking changes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>