This project builds an automated data pipeline to extract, store, process, and visualize trending YouTube video data. It fetches daily trending video data using the YouTube Data API v3, stores it in a PostgreSQL database, processes it with Pandas, and visualizes insights using Streamlit. The pipeline is scheduled with cron jobs for automation, making it a lightweight, cost-free solution for tracking YouTube trends.
- Extract daily trending YouTube videos (titles, views, likes, categories, etc.).
- Store and manage data in a structured PostgreSQL database.
- Clean and transform data for analysis.
- Automate the pipeline for daily updates.
- Visualize trends via an interactive dashboard.
- Create a portfolio-worthy project with clear documentation.
- Python: Core scripting language.
- YouTube Data API v3: Data source for trending videos (free with Google Cloud account).
- PostgreSQL: Relational database for data storage.
- Pandas: Data cleaning and transformation.
- Streamlit: Interactive dashboard for visualization.
- Cron Jobs: Scheduling daily pipeline runs.
- Google Colab / Local Machine: Development environment.
```
youtube-trending-pipeline/
├── scripts/
│   ├── youtube_data_ingestion.py    # Fetches data from YouTube API
│   ├── postgres_setup.py            # Loads data into PostgreSQL
│   ├── data-cleaning-transform.py   # Cleans and enriches data
│   ├── dashboard.py                 # Streamlit dashboard
│   └── run_pipeline.sh              # Bash script for cron scheduling
├── dataset/                         # Generated dataset
├── requirements.txt                 # Python dependencies
└── README.md                        # Project documentation
```
- Python 3.8+
- Google Cloud account with YouTube Data API v3 enabled
- PostgreSQL installed locally or hosted (e.g., free tier on services like Heroku or Supabase)
- Git
- **Clone the Repository:**

  ```bash
  git clone https://github.com/your-username/youtube-trending-pipeline.git
  cd youtube-trending-pipeline
  ```

- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```

  Note: Ensure `psycopg2` or `psycopg2-binary` is included in `requirements.txt` for PostgreSQL connectivity.
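For reference, a minimal `requirements.txt` for this stack might look like the following. The exact package set is an assumption (for example, the ingestion script may use the official Google API client or plain HTTP requests):

```text
pandas
streamlit
psycopg2-binary
google-api-python-client
python-dotenv
```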
- **Set Up YouTube API:**
  - Create a Google Cloud project in the Google Cloud Console.
  - Enable the YouTube Data API v3.
  - Generate an API key and store it securely in a `.env` file:

    ```
    YOUTUBE_API_KEY=your-api-key-here
    ```
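The scripts can then read the key from the environment at runtime. A minimal sketch (assuming the key has been exported, e.g. by `python-dotenv` loading the `.env` file; `get_api_key` is an illustrative helper, not a function in this repo):

```python
import os

def get_api_key() -> str:
    """Read the YouTube API key from the environment.

    Assumes YOUTUBE_API_KEY has been set, e.g. loaded from the
    .env file with python-dotenv before this is called.
    """
    key = os.environ.get("YOUTUBE_API_KEY")
    if not key:
        raise RuntimeError("YOUTUBE_API_KEY is not set")
    return key
```

Failing fast here keeps a missing key from surfacing later as a confusing HTTP 403 from the API.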
- **Set Up PostgreSQL:**
  - Install PostgreSQL locally or use a hosted service.
  - Create a database (e.g., `youtube_trending`):

    ```bash
    psql -U postgres -c "CREATE DATABASE youtube_trending;"
    ```

  - Configure database credentials in the `.env` file:

    ```
    DB_HOST=localhost
    DB_NAME=youtube_trending
    DB_USER=your-username
    DB_PASSWORD=your-password
    DB_PORT=5432
    ```

  - Run `postgres_setup.py` to create the tables (`videos`, `categories`, `fetch_log`) and load data.
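For a sense of what `postgres_setup.py` creates, the three tables could be defined roughly as below. The column names and types here are illustrative assumptions, not the script's actual schema:

```python
# Hypothetical DDL mirroring the three tables the pipeline uses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS categories (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS videos (
    video_id      TEXT,
    title         TEXT,
    channel_title TEXT,
    category_id   INTEGER REFERENCES categories (category_id),
    view_count    BIGINT,
    like_count    BIGINT,
    fetched_at    DATE,
    PRIMARY KEY (video_id, fetched_at)  -- one row per video per day
);

CREATE TABLE IF NOT EXISTS fetch_log (
    fetch_id    SERIAL PRIMARY KEY,
    fetched_at  TIMESTAMP DEFAULT now(),
    rows_loaded INTEGER
);
"""

def create_tables(cursor) -> None:
    """Execute the DDL on an open database cursor (e.g. psycopg2)."""
    cursor.execute(SCHEMA)
```

The composite key on `videos` lets the same video appear on multiple days without duplicating rows within a day.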
- **Optional: Scheduling:**
  - Set up a cron job to run the pipeline daily. Open your crontab:

    ```bash
    crontab -e
    ```

  - Add the following line to run `run_pipeline.sh` daily at 1 AM:

    ```
    0 1 * * * /path/to/youtube-trending-pipeline/scripts/run_pipeline.sh
    ```
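A `run_pipeline.sh` along these lines would run the three stages in order (a hypothetical sketch, not necessarily the script shipped in `scripts/`):

```shell
#!/usr/bin/env bash
# run_pipeline.sh (sketch): run the pipeline stages sequentially,
# aborting on the first failure so cron logs a non-zero exit.
set -euo pipefail

# Resolve the repo root relative to this script's location.
cd "$(dirname "$0")/.."

python scripts/youtube_data_ingestion.py
python scripts/postgres_setup.py
python scripts/data-cleaning-transform.py
```

Using `set -euo pipefail` ensures a failed ingestion step does not silently load stale data downstream.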
- **Fetch Data:** Run the ingestion script to fetch trending videos:

  ```bash
  python scripts/youtube_data_ingestion.py
  ```

  Output: raw data saved in `dataset/`.
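Under the hood, fetching the trending chart amounts to a `videos.list` request with `chart=mostPopular`. A minimal sketch of building that request with the standard library (the real script may use `google-api-python-client` instead of raw URLs; `trending_url` is an illustrative helper):

```python
from urllib.parse import urlencode

BASE = "https://www.googleapis.com/youtube/v3/videos"

def trending_url(api_key: str, region: str = "US", max_results: int = 50) -> str:
    """Build a videos.list request URL for the trending chart."""
    params = {
        "part": "snippet,statistics",   # titles/categories + views/likes
        "chart": "mostPopular",          # the trending chart
        "regionCode": region,
        "maxResults": max_results,       # API caps this at 50 per page
        "key": api_key,
    }
    return f"{BASE}?{urlencode(params)}"
```

Requesting `snippet` and `statistics` together covers the fields this pipeline stores (title, category, views, likes).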
- **Load Data:** Load the raw data into the PostgreSQL database:

  ```bash
  python scripts/postgres_setup.py
  ```
- **Transform Data:** Clean and enrich the data with Pandas:

  ```bash
  python scripts/data-cleaning-transform.py
  ```

  Output: processed data saved in `dataset/`.
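A typical cleaning pass for this kind of data coerces the API's string counts to numbers and derives the like/view ratio used by the dashboard. The column names below are assumptions for illustration, not necessarily what `data-cleaning-transform.py` uses:

```python
import pandas as pd

def enrich(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a cleaning/enrichment step for trending-video rows."""
    out = df.copy()
    # The API returns counts as strings; coerce bad values to NaN.
    for col in ("view_count", "like_count"):
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=["view_count", "like_count"])
    out = out[out["view_count"] > 0]  # guard against divide-by-zero
    out["like_view_ratio"] = out["like_count"] / out["view_count"]
    return out
```

Dropping unparseable rows before the ratio calculation keeps a single malformed record from poisoning the aggregate charts.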
- **Visualize Data:** Launch the Streamlit dashboard:

  ```bash
  streamlit run scripts/dashboard.py
  ```

  View insights like top 10 trending videos, category trends, and view/like ratios.
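The "top 10 trending videos" view reduces to a sort over the view counts, which the dashboard can chart directly. A sketch of that aggregation (column names assumed, as above):

```python
import pandas as pd

def top_videos(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most-viewed rows for the dashboard's top-N chart."""
    return df.nlargest(n, "view_count")[["title", "view_count"]]
```

In `dashboard.py` the result could be passed straight to `st.bar_chart` or `st.dataframe`.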
- **Automate Pipeline:** Use `run_pipeline.sh` to run all scripts sequentially:

  ```bash
  bash scripts/run_pipeline.sh
  ```
- Database: The PostgreSQL database `youtube_trending` contains structured data in the `videos`, `categories`, and `fetch_log` tables.
- Visualizations: Dashboard charts show trends such as:
  - Top 10 videos by views.
  - Trending categories over time.
  - View/like ratio analysis.
- Dashboard: Interactive Streamlit app at `http://localhost:8501` (default).
Sample dashboard views: *Top 10 Trending Videos* and *Video and Age Distribution*.
- Add support for multiple regions using the `regionCode` API parameter.
- Implement error handling for API rate limits with exponential backoff.
- Enhance the dashboard with filters for date ranges or categories.
- Store historical trends for long-term analysis.
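The backoff idea above could be implemented as a small retry wrapper around each API call. A generic sketch (not part of the current codebase; `with_backoff` and its parameters are illustrative):

```python
import random
import time

def with_backoff(fetch, max_tries: int = 5, base_delay: float = 1.0):
    """Retry `fetch` on failure with exponential backoff plus jitter.

    `fetch` is any zero-argument callable, e.g. one YouTube API request.
    Delay doubles per attempt: base, 2*base, 4*base, ...
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter term spreads out retries so a daily cron run does not hammer the API at fixed intervals after a rate-limit response.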
Feel free to fork this repository, submit issues, or open pull requests with improvements. Please follow the existing coding style and include tests for new features.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, reach out via GitHub Issues.

