🎥 YouTube Transcript Processor

Transform YouTube videos into language learning materials by extracting transcripts and adapting them to different proficiency levels

✨ Features

  • 📝 Automatic Transcript Extraction - Fetch transcripts from YouTube videos
  • 🌏 Multi-language Support - Specialized for Cantonese (粵語) transcripts
  • 🧠 AI-Powered Processing - Transform content to match different language proficiency levels
  • ✂️ Smart Text Chunking - Intelligently split content based on token limits
  • 📊 Token Counting - Precise token management using tiktoken
  • 💾 File Output - Save processed results to text files
  • 🎧 Podcast Integration - Compatible with ElevenReader to convert YouTube videos into English podcasts

🚀 Quick Start

Basic Usage

from main import process_youtube

# Process a YouTube video
video_url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
results = process_youtube(video_url, level="b1", max_tokens=4000)

# The processed chunks are returned as a list and are also saved
# automatically to text files in the text/ output directory

Command Line Usage

python main.py

📋 How It Works

  1. 🔗 URL Parsing - Extracts video ID from YouTube URLs
  2. 📜 Transcript Retrieval - Fetches Cantonese transcripts using the YouTube Transcript API
  3. ✂️ Smart Chunking - Splits text into manageable chunks while preserving sentence integrity
  4. 🤖 AI Processing - Sends each chunk to an AI model for language level adaptation
  5. 💾 File Export - Saves processed content to organized text files
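
As an illustration of step 1, here is a minimal sketch of pulling a video ID out of common YouTube URL shapes using only the standard library. The function name `extract_video_id` and the set of supported URL forms are assumptions made for this example; the parsing in main.py may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found.

    Hypothetical helper for illustration; not necessarily what main.py does.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    if host in ("youtube.com", "m.youtube.com"):
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        # Embed and Shorts links: /embed/<id> and /shorts/<id>
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1] or None

    # Unknown host or unsupported URL shape
    return None
```

Returning None for unrecognized URLs lets the caller decide whether to skip the video or report an error.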

🛠️ Configuration

Language Levels

  • a1 - Beginner
  • a2 - Elementary
  • b1 - Intermediate
  • b2 - Upper Intermediate
  • c1 - Advanced
  • c2 - Proficient
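
If you pass the level in from user input, it can help to validate it before starting a long processing run. The sketch below is one way to do that; `LEVELS` and `normalize_level` are hypothetical names for this example, and process_youtube may already do its own checking.

```python
# Level codes and labels as listed above (they match the CEFR scale).
LEVELS = {
    "a1": "Beginner",
    "a2": "Elementary",
    "b1": "Intermediate",
    "b2": "Upper Intermediate",
    "c1": "Advanced",
    "c2": "Proficient",
}


def normalize_level(level: str) -> str:
    """Return the lowercase level code, or raise ValueError if it is unknown."""
    key = level.strip().lower()
    if key not in LEVELS:
        valid = ", ".join(sorted(LEVELS))
        raise ValueError(f"Unknown level {level!r}; expected one of: {valid}")
    return key
```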

Token Limits

Default: 4000 tokens per chunk

  • Adjustable based on your AI model's context window
  • Automatically handles sentences that exceed token limits
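
To make the chunking behavior concrete, here is a rough sketch of sentence-preserving chunking under a token budget. It is an assumption about how such a splitter could work, not the code in main.py: the names `chunk_text` and `_hard_split`, the punctuation set, and the fallback of cutting an oversized sentence character by character are all illustrative. The token counter is passed in as a function so the example runs without extra dependencies.

```python
import re
from typing import Callable, List

# Split after a run of sentence-ending punctuation (Chinese or Latin).
# Zero-width split, so joining the pieces reproduces the input exactly.
_SENTENCE_END = re.compile(r"(?<=[。！？!?.])(?![。！？!?.])")


def _hard_split(sentence: str, max_tokens: int,
                count_tokens: Callable[[str], int]) -> List[str]:
    """Cut one oversized sentence into pieces that fit the budget."""
    pieces, current = [], ""
    for char in sentence:
        if current and count_tokens(current + char) > max_tokens:
            pieces.append(current)
            current = char
        else:
            current += char
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_tokens: int,
               count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if count_tokens(sentence) > max_tokens:
            # A single sentence is over budget: flush, then split it alone.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(_hard_split(sentence, max_tokens, count_tokens))
            continue
        if current and count_tokens(current + sentence) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

With tiktoken installed, a counter such as `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))` could be passed as `count_tokens`; which encoding this project actually uses is not stated here, so treat that choice as an assumption. A plain `len` works as a stand-in that counts one token per character.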

📁 Project Structure

youtube-transcript-processor/
├── main.py              # Main processing script
├── robot.py            # AI model interface
├── text/               # Output directory for processed files
├── requirements.txt    # Python dependencies
└── README.md          # This file

🔧 API Reference

process_youtube(link, level, max_tokens=4000, is_chinese=True)

Parameters:

  • link (str): YouTube video URL
  • level (str): Target language proficiency level
  • max_tokens (int): Maximum tokens per chunk
  • is_chinese (bool): Enable Chinese text processing

Returns:

  • List of processed text chunks
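
process_youtube already saves its results, but if you want to write the returned list somewhere else, a sketch like the following would do it. The helper name `save_chunks` and the file naming scheme are hypothetical and chosen only for this example; they do not describe the files the tool itself produces.

```python
from pathlib import Path
from typing import List


def save_chunks(chunks: List[str], video_id: str, level: str,
                out_dir: str = "text") -> List[Path]:
    """Write each chunk to its own UTF-8 text file and return the paths.

    The "<video_id>_<level>_partNN.txt" naming is an assumption for this sketch.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks, start=1):
        path = directory / f"{video_id}_{level}_part{index:02d}.txt"
        # Explicit UTF-8 so Chinese characters survive on every platform.
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```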

get_youtube_transcript(video_url)

Parameters:

  • video_url (str): YouTube video URL

Returns:

  • Full transcript text (str), or None if an error occurs

🎯 Use Cases

  • Language Learning - Adapt YouTube content to your proficiency level
  • Content Creation - Generate educational materials from videos
  • Research - Process video content for analysis
  • Accessibility - Create readable transcripts from video content
  • 🎧 Podcast Creation - Use with ElevenReader to transform YouTube videos into English podcasts for on-the-go learning

🔍 Example Output

Input: Complex Cantonese YouTube video
Output: Simplified text adapted to B1 level with proper sentence structure and vocabulary

Original: 今日我哋要講嘅係一個好複雜嘅概念...
Processed: Today what we're going to talk about is a very complex concept...

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📞 Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check the Wiki for detailed documentation
  • Join our Discussions for community support

🔄 Integration with ElevenReader

Transform your processed transcripts into engaging audio content:

  1. Process YouTube Video - Extract and adapt transcript using this tool
  2. Export Text File - Save the processed content to a text file
  3. Upload to ElevenReader - Visit ElevenReader.io and upload your text file
  4. Generate Podcast - Convert your adapted transcript into an English podcast
  5. Listen & Learn - Enjoy your personalized audio content on any device

Perfect Workflow:

YouTube Video → Transcript Extraction → AI Processing → Text File → ElevenReader → English Podcast

This integration allows you to:

  • Turn any YouTube video into an English learning transcript
  • Create audio content at your desired proficiency level (use with ElevenReader)
  • Learn through multiple modalities (reading + listening)

Made with ❤️ for language learners worldwide