Youtube Shorts Dagster pipeline #1734
Conversation
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9430286 | Triggered | Generic Password | f0445ad | docker-compose.yaml |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflows, and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and to ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Pull Request Overview
This PR implements a Dagster pipeline for processing YouTube Shorts videos, extracting metadata, downloading video content and thumbnails, and uploading them to S3. The pipeline includes sensor-based monitoring of YouTube playlists to automatically detect and process new videos.
Key changes:
- Added YouTube Shorts processing pipeline with three assets: video metadata, content, and thumbnails
- Implemented sensor to monitor YouTube channels/playlists for new videos and trigger processing
- Added webhook notifications to MIT Learn API when videos are processed or deleted
Reviewed Changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| dg_projects/learning_resources/learning_resources/assets/youtube_shorts.py | Core asset definitions for downloading and uploading video metadata, content, and thumbnails to S3 |
| dg_projects/learning_resources/learning_resources/sensors/youtube_shorts.py | Sensor for monitoring YouTube channels and triggering video processing jobs |
| dg_projects/learning_resources/learning_resources/resources/youtube_client.py | YouTube API client resource with Vault integration |
| dg_projects/learning_resources/learning_resources/resources/youtube_config.py | Configuration provider for YouTube playlist monitoring |
| dg_projects/learning_resources/learning_resources/resources/api_client_factory.py | Factory for creating API clients with Vault credentials |
| dg_projects/learning_resources/learning_resources/definitions.py | Updated Dagster definitions with YouTube assets, jobs, and sensors |
| packages/ol-orchestrate-lib/src/ol_orchestrate/resources/learn_api.py | Added webhook notification methods for video processing events |
| dg_projects/learning_resources/pyproject.toml | Added dependencies for YouTube API, YAML parsing, and video downloading |
| dg_projects/learning_resources/Dockerfile | Added ffmpeg installation for video processing |
| docker-compose.yaml | Added environment variables for AWS credentials and YouTube API configuration |
```python
),
"yt_s3file_io_manager": S3FileObjectIOManager(
    bucket=os.environ.get("YOUTUBE_SHORTS_BUCKET"),
    path_prefix="youtube_shorts",
```
Should this be /frontend/static/youtube_shorts for MIT-Learn, to be publicly accessible and cached by Fastly?
Pull Request Overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.
```python
description=(
    "Monitor YouTube playlists daily and process the 16 most recent videos."
),
minimum_interval_seconds=60,  # Check once per minute
```
Copilot AI · Oct 22, 2025
The comment states 'Check once per minute' but line 71 in the description says 'Monitor YouTube playlists daily'. This creates inconsistency between the intended behavior and the configured interval. Update the comment to clarify this is for testing purposes, or adjust the interval to match the daily monitoring intent (e.g., 86400 seconds).
Suggested change:
```diff
-minimum_interval_seconds=60,  # Check once per minute
+minimum_interval_seconds=86400,  # Check once per day
```
Stray commit of a change made for testing, fixed
```python
# This detects changes in metadata (e.g., title edits) and triggers re-processing
# Hash ensures consistent 16-char version string for file paths
version_string = f"{video_id}|{video_title}|{video_published_at}"
data_version = hashlib.sha256(version_string.encode()).hexdigest()[:16]
```
What's the motivation for truncating to 16 characters?
Was originally thinking of using the version in the S3 file path; at 16 characters it would make the path shorter and more readable while keeping collision risk low. But I ultimately ended up not including the version in the file path, so I will get rid of that truncation.
```python
materialization = get_latest_materialization(
    context, metadata_asset_key, partition_key=video_id
)
```
Rather than fetching the materialization, the metadata file object is passed as an input to this function automatically by Dagster, so you can just use `json.loads(video_metadata.read_text())` to get the actual contents.
Will do, that's actually how I originally had it, but changed it because I thought that function was doing a read from S3 (is it reading from a local file instead?), which seemed a bit wasteful since the data was already present locally.
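For reference, a rough sketch of what that suggestion could look like — the asset name, input parameter, and the `video_id` field in the metadata are assumptions for illustration, not the actual implementation:

```python
import json
from typing import Any

from dagster import AssetExecutionContext, asset


@asset
def video_webhook(context: AssetExecutionContext, video_metadata: Any) -> None:
    # video_metadata is the file object (a UPath) handed to this asset by the
    # IO manager; read it directly instead of querying materializations.
    metadata = json.loads(video_metadata.read_text())
    context.log.info(f"Loaded metadata for video {metadata.get('video_id')}")
```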
```python
metadata_content = metadata_value.value

# Convert S3 paths to strings for webhook payload
video_content_path = str(video_content)
```
The video_content object is a UPath object, so you should be able to just use the .path attribute (https://github.com/fsspec/universal_pathlib)
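A quick illustration of that suggestion, assuming the object is a `UPath` backed by S3 (the bucket and key are made up, and the exact string `.path` returns depends on the universal_pathlib version and filesystem):

```python
from upath import UPath

video_content = UPath("s3://example-bucket/youtube_shorts/abc123/abc123.mp4")

print(str(video_content))  # the full URI, protocol included
print(video_content.path)  # just the path portion, without the protocol
```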
Getting rid of this from the payload altogether since it's not going to be used on the other end.
```python
video_content: Any,
video_thumbnail: Any,
video_metadata: Any,
```
Suggested change:
```diff
-video_content: Any,
-video_thumbnail: Any,
-video_metadata: Any,
+video_content: UPath,
+video_thumbnail: UPath,
+video_metadata: UPath,
```
```python
def fetch_youtube_shorts_config(config_url: str) -> list[dict[str, Any]]:
    """
    Fetch YouTube shorts playlist configuration from GitHub.

    Args:
        config_url: URL to the YAML configuration file

    Returns:
        List of configuration dictionaries containing channel and playlist info
    """
    response = httpx.get(config_url)
    response.raise_for_status()
    return yaml.safe_load(response.text)
```
Rather than being a method, this may be better modeled as an asset itself to return the data directly.
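A minimal sketch of what that could look like, assuming a module-level config URL constant and an asset key similar to the `playlist_config_key` that appears later in the diff (names and URL are hypothetical):

```python
import httpx
import yaml
from dagster import asset

# Hypothetical constant; the real URL comes from configuration.
SHORTS_CONFIG_URL = "https://example.com/youtube/shorts.yaml"


@asset(key=["youtube_shorts", "playlist_config"], group_name="youtube_shorts")
def playlist_config() -> list[dict]:
    """Fetch the YouTube Shorts playlist configuration and return it directly."""
    response = httpx.get(SHORTS_CONFIG_URL)
    response.raise_for_status()
    return yaml.safe_load(response.text)
```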
I think this may complicate things because the sensor relies on this config and resulting list of video ids to determine what the run requests should be:

```python
run_requests = [
    RunRequest(
        asset_selection=asset_keys,
        partition_key=video_id,
    )
    for video_id in videos_to_process
]
```

How about just moving most of the code that calls this function and determines the value of videos_to_process into a separate helper function?
`result = get_videos_to_process(....)`, which calls `fetch_youtube_shorts_config()` and ultimately returns the list of video IDs (`videos_to_process`)?
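Something along these lines, as a rough sketch — the function and argument names are assumed, it relies on the module's existing `fetch_youtube_shorts_config()` and `extract_video_id_from_playlist_item()` helpers, and the "most recent N" selection is simplified:

```python
def get_videos_to_process(
    config_url: str,
    youtube_client,
    max_videos: int = 16,
) -> list[str]:
    """Return the video IDs the discovery sensor should request runs for."""
    config = fetch_youtube_shorts_config(config_url)

    playlist_ids = [
        playlist["id"]
        for channel_config in config
        for playlist in channel_config.get("playlists", [])
    ]

    all_video_ids: set[str] = set()
    for playlist_id in playlist_ids:
        items = youtube_client.get_playlist_items(playlist_id)
        all_video_ids.update(
            extract_video_id_from_playlist_item(item) for item in items
        )

    # Simplified cap; the real selection of "most recent" videos may differ.
    return sorted(all_video_ids)[:max_videos]
```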
It looks like there is an asset `external_playlist_config_key = AssetKey(["youtube_shorts", "external_playlist_config"])`, but it's materialized / triggered by the sensor only?
@blarghmatey ready for another look
```python
for channel_config in config:
    if playlists := channel_config.get("playlists"):
        playlist_ids.extend([p["id"] for p in playlists])

# Fetch all videos from all playlists
all_video_ids = set()
for playlist_id in playlist_ids:
    playlist_items = youtube_client.get_playlist_items(playlist_id)
    video_ids = [
        extract_video_id_from_playlist_item(item) for item in playlist_items
    ]
    all_video_ids.update(video_ids)
```
It's not a functional difference, but for the sake of visibility and full lineage tracking it might be useful to track the playlists and videos as external assets. That way they get included in the full lineage graph of these assets. The video metadata etc. can also be modeled against those external assets https://docs.dagster.io/guides/build/assets/external-assets
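A rough sketch of what that modeling might look like with `AssetSpec`, using keys similar to those that show up later in the diff (hypothetical, not the final implementation):

```python
from dagster import AssetKey, AssetSpec, Definitions

# External (unmaterialized) assets representing the upstream YouTube objects.
external_playlist = AssetSpec(
    key=AssetKey(["youtube_shorts", "external_playlist"]),
    group_name="youtube_shorts",
)
external_video = AssetSpec(
    key=AssetKey(["youtube_shorts", "external_video"]),
    deps=[external_playlist.key],
    group_name="youtube_shorts",
)

# Downstream software-defined assets can then declare deps=[external_video.key]
# so the playlists and videos appear in the lineage graph.
defs = Definitions(assets=[external_playlist, external_video])
```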
Added as external assets
```python
external_playlist_config_key = AssetKey(["youtube_shorts", "external_playlist_config"])
external_playlist_api_key = AssetKey(
    [
        "youtube_shorts",
        "external_playlist_api",
    ]
)
external_playlist_key = AssetKey(["youtube_shorts", "external_playlist"])
external_video_key = AssetKey(["youtube_shorts", "external_video"])
```
Minor thing: for clarity, should we move these external assets into an `external` group? Right now, they are all in the same group. Also, I think it might be clearer to have separate playlist and video external assets. Do we need the external_playlist_config and external_playlist_api assets in the graph?
Yes, it looks better now. One thing I noticed that I forgot to comment on is that there is no lineage between playlist_api and video_metadata.
```python
    bucket=s3_uploads_bucket(DAGSTER_ENV)["bucket"],
    path_prefix=s3_uploads_bucket(DAGSTER_ENV)["prefix"],
),
"yt_s3file_io_manager": S3FileObjectIOManager(
```
We probably should use the default IO manager to make it easier to test in a local environment. I had to modify this as I ran into an S3 permission issue:

```python
"yt_s3file_io_manager": default_file_object_io_manager(
    dagster_env=DAGSTER_ENV,
    bucket=os.environ.get(
        "YOUTUBE_SHORTS_BUCKET", f"ol-mitlearn-app-storage-{DAGSTER_ENV}"
    ),
    path_prefix=os.environ.get("LEARN_SHORTS_PREFIX", "shorts/"),
),
```
```python
def fetch_youtube_shorts_config(config_url: str) -> list[dict[str, Any]]:
    """
    Fetch YouTube shorts playlist configuration from GitHub.

    Args:
        config_url: URL to the YAML configuration file

    Returns:
        List of configuration dictionaries containing channel and playlist info
    """
    response = httpx.get(config_url)
    response.raise_for_status()
    return yaml.safe_load(response.text)
```
It looks like there is an asset `external_playlist_config_key = AssetKey(["youtube_shorts", "external_playlist_config"])`, but it's materialized / triggered by the sensor only?
rachellougee left a comment
Nice work! Functionality works well overall.
Yes, I'm not quite sure how else to do it. It's just a yaml file at https://raw.githubusercontent.com/mitodl/open-video-data/refs/heads/mitopen/youtube/shorts.yaml and is pretty static (it might change once or twice a year if ever). Is there a more dagster-ish way to do it, or should I just remove it as an external asset to avoid any confusion?
I moved code to retrieve the yaml config and youtube api results from the discovery sensor to the assets, but now it takes 2 runs of the discovery sensor to retrieve everything. On the first run it just materializes the config and api, on the 2nd run it materializes the video assets.
I did some more refactoring and reduced the number of external assets and changed the dependencies a bit; now it seems to be working as expected.
More refactoring - since the playlist config and api results are actually materialized, I removed their "external" status. And the discovery sensor does less work. Best tested locally by reducing the sensor's frequency to every 1-2 minutes.
rachellougee left a comment
Looks good. I ran another test - the youtube_shorts_api_schedule and youtube_shorts_discovery_sensor seem to work as expected. But the new video didn't trigger a run, which I've commented below. Otherwise, everything looks good to me.
```python
automation_condition=(
    upstream_or_code_changes()
    | AutomationCondition.on_missing()
    | AutomationCondition.on_cron("0 * * * *")  # Check hourly for metadata changes
```
Can you check if this is working as expected? There is one video ID partition added by the youtube_shorts_discovery_sensor, but I don't see that it got picked up by this automation condition.

My understanding is that AutomationCondition.on_cron will only trigger runs if the default_automation_condition_sensor is enabled and running.
Since youtube_shorts_discovery_sensor detects the new videos and adds new partitions, why not trigger RunRequest directly from the sensor instead of relying on a cron job to pick it up?
I assumed default_automation_condition_sensor is supposed to be running for the automation_condition parameters to be in effect? And ideally that should handle triggering materializations for new videos instead of the sensor?
I tested it this way:
- Changed max # videos to process to 8
- Changed frequency of `youtube_shorts_api_schedule` and `youtube_shorts_sensor` each to 60 seconds
- Enabled `default_automation_condition_sensor`, `youtube_shorts_api_schedule`, and `youtube_shorts_discovery_sensor`
- Waited a few minutes - eventually, 8 videos got fully processed/materialized
- Changed max # videos to process back to 12
- Restarted containers
- After another few minutes, 4 additional videos were processed, bringing the total to 12
If it is okay for the sensor to do more work and start run requests, then I can add that back.
You are right. The default_automation_condition_sensor needs to be enabled, which I hadn't done. In that case, the new video materialization should be triggered by the upstream_or_code_changes() condition, because newly_missing = AutomationCondition.newly_missing() should be evaluated as True for a new asset.
I initially got confused by the additional AutomationCondition.on_missing() and AutomationCondition.on_cron("0 * * * *") conditions, but the metadata changes should already be handled by upstream_or_code_changes() unless I am overlooking something.
Either way, if the new asset is already handled by the automation policy, we don't need to add it to the sensor. But in my opinion, it doesn't hurt to request it from the sensor, since it doesn't have to wait for the automation to pick it up. That said, it's up to you.
Just added run requests in the sensor
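For context, a hedged sketch of what requesting runs from the discovery sensor might look like, combining dynamic-partition registration with RunRequests. The partitions-definition name, the helper that builds the result, and the `videos_to_process` / `asset_keys` inputs are assumptions, not the actual implementation:

```python
from dagster import DynamicPartitionsDefinition, RunRequest, SensorResult

# Hypothetical partitions definition for discovered video IDs.
video_partitions = DynamicPartitionsDefinition(name="youtube_shorts_videos")


def build_discovery_result(context, videos_to_process, asset_keys) -> SensorResult:
    """Register newly discovered video IDs as partitions and request runs for them."""
    new_ids = [
        video_id
        for video_id in videos_to_process
        if not context.instance.has_dynamic_partition(video_partitions.name, video_id)
    ]
    return SensorResult(
        dynamic_partitions_requests=[video_partitions.build_add_request(new_ids)],
        run_requests=[
            RunRequest(asset_selection=asset_keys, partition_key=video_id)
            for video_id in videos_to_process
        ],
    )
```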
```python
@asset(
    key=playlist_config_key,
    group_name="youtube_shorts",
    automation_condition=upstream_or_code_changes(),
```
Since this asset defines an automation_condition, we should add code_version here. Otherwise, Dagster will generate a different code version on each run.
Added code_version, but got rid of the automation_condition here and for playlist_api, because those should be freshly retrieved on every youtube_shorts_api_schedule job run.
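A minimal illustration of pinning the code version, assuming the decorator arguments seen in the diff (the asset key, body, and version string are placeholders):

```python
from dagster import AssetKey, asset

playlist_config_key = AssetKey(["youtube_shorts", "playlist_config"])


@asset(
    key=playlist_config_key,
    group_name="youtube_shorts",
    # Pin an explicit code_version so Dagster does not derive a new code version
    # on every run; bump the string when the asset's logic actually changes.
    code_version="1",
)
def playlist_config() -> list[dict]:
    ...
```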
rachellougee left a comment
👍
```python
json.dump(processed_metadata, f, indent=2)

# S3 path: youtube_shorts/{video_id}/{video_id}.json
metadata_s3_path = f"{video_id}/{video_id}.json"
```
Just a question - Is it intended that we don't track the data version for each video on S3? If a video file changes, we only store the latest version?

What are the relevant tickets?
Part of https://github.com/mitodl/hq/issues/8880
Description (What does it do?)
Creates a Dagster pipeline for uploading YouTube Shorts videos, metadata, and thumbnails to S3, then sending a webhook to mit-learn.
Each of the above plus the webhook is modeled as an asset.
How can this be tested?
- Set the `minimum_interval_seconds` value for the `youtube_shorts_sensor` to something more frequent like `60` (every minute)
- `docker compose up --build`
- Enable the `youtube_shorts_discovery_sensor` and `youtube_shorts_version_sensor`
- The `video_content`, `video_thumbnail`, and `video_metadata` assets should all be created successfully, but the `video_webhook` assets will all fail because they're trying to reach a non-existent endpoint.
- Check that the files under the `youtube_shorts` prefix have the expected data.