Computer Science 6705. Fall 2025. Applied Cloud Computing. Weber State University. Professor Joshua Jensen. Final Project - YouTube Analytics using AWS. Collaborator: Jacob Child

bell-kevin/YouTubePipelineAWS

CS 6705 Final Project

YouTube Logo Powered by AWS Cloud Computing

Overview

This repository contains the artifacts for the Cloud-Based YouTube Trending Analytics Pipeline on AWS, a serverless data and ML workflow that ingests daily trending-video metadata and comments, curates analytics-ready datasets, and produces next-day trending predictions. The system relies on managed AWS services to minimize operational overhead while keeping costs predictable.

Key Capabilities

  • Daily ingestion: EventBridge triggers Lambda functions that call the YouTube Data API and land raw JSON in S3, partitioned by region and ingest date.
  • Curated ETL layers: AWS Glue jobs flatten, type-cast, deduplicate, and partition trending and comment data into Parquet, enabling efficient Athena queries and downstream feature computation.
  • Sentiment + feature engineering: Comments are scored with Amazon Comprehend; trending metrics are joined with sentiment aggregates to build labeled feature sets capturing engagement momentum and audience tone.
  • Model training and predictions: Glue jobs retrain models on updated features and emit next-day view-growth estimates and stay-trending probabilities for analytics dashboards.
  • Secure, modular infrastructure: A dedicated VPC, Secrets Manager–backed credentials, and clearly separated scripts/notebooks keep ingestion, ETL, ML, and presentation assets organized.
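
As a rough illustration of the ingestion and curation steps listed above, the sketch below builds a partitioned object key for a raw landing file, flattens one trending-video record into a typed row, and removes duplicates. It is a minimal, plain-Python sketch, not the repository's Glue code: the key layout (`raw/trending/region=.../ingest_date=.../videos.json`), the function names, and the chosen output columns are all assumptions made for illustration. The input field names (`snippet.title`, `statistics.viewCount`, and so on) follow the YouTube Data API `videos` resource, which returns counts as strings.

```python
from datetime import date


def raw_object_key(region: str, ingest_date: date,
                   prefix: str = "raw/trending") -> str:
    """Build a Hive-style partitioned S3 key (hypothetical layout)."""
    return (
        f"{prefix}/region={region.upper()}/"
        f"ingest_date={ingest_date.isoformat()}/videos.json"
    )


def flatten_video(item: dict, region: str, ingest_date: date) -> dict:
    """Flatten one YouTube Data API `videos` item into a flat, typed row."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "video_id": item["id"],
        "title": snippet.get("title"),
        "channel_id": snippet.get("channelId"),
        # The API returns counts as strings, so cast them to integers.
        "view_count": int(stats.get("viewCount", 0)),
        "like_count": int(stats.get("likeCount", 0)),
        "comment_count": int(stats.get("commentCount", 0)),
        # Partition columns (assumed names) carried on every row.
        "region": region.upper(),
        "ingest_date": ingest_date.isoformat(),
    }


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row seen per (video_id, region, ingest_date)."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["video_id"], row["region"], row["ingest_date"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

In a Glue job the same flatten, cast, and deduplicate logic would more likely be expressed as Spark DataFrame operations writing Parquet; the plain-Python form here is only meant to make the shape of the data at each stage concrete.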

Repository Map

  • Python/ETL/: Glue scripts for trending ingestion, comments processing, sentiment integration, and feature labeling used throughout the workflow.
  • Notebooks/: Exploratory analysis, validation, and visualization notebooks.
  • Diagrams/: Architecture and network diagrams referenced in the presentation materials.
  • Resources/: Branding assets displayed in this README.
  • FinalReport.md: Full project write-up covering scope, architecture, results, and cost considerations.
  • PresentationScript.md and presentation assets: Slide narrative outlining workflow stages and infrastructure.
  • Notes.md and TODO.md: Working design notes, schemas, and milestone tracking.

How to Use This Repo

  1. Review FinalReport.md for the full architecture description, operational flow, and future work ideas.
  2. Browse Python/ETL/ to see Glue job implementations for trending ingestion, comments curation, and sentiment/feature engineering.
  3. Open Notebooks/ for exploratory analyses and validation steps used during model development.
  4. Check Diagrams/ alongside PresentationScript.md for visual references to the pipeline and network layout.
  5. Track outstanding tasks or design decisions in TODO.md and Notes.md when iterating on the pipeline.

https://docs.google.com/document/d/1DZMIBqV4MBL5r5YGobFmRfblJUcRrJ7MrwSeqJXpMOs/edit?tab=t.0#heading=h.abprv9xacjmy
