Feature/library etl#153

Open
AyBruno wants to merge 8 commits into main from feature/library-etl
@AyBruno AyBruno commented Feb 13, 2026

Summary

  • Add cronjob deployment support for jobs/* targets in deploy-base.yml, including cron schedule setup, validation, and verification on EC2.
  • Add Dockerfile.library-etl to build/run the library ETL job container and update workflow detection to include job targets.
  • Introduce the new library-etl job implementation with legacy ETL logic, format parsing, artist/genre handling, and cronjob run tracking.
  • Extend legacy database access utilities (e.g., MirrorSQL cleanup/timeout handling) and update schema/migrations to support new ETL fields (cronjob runs, code volume letters, etc.).

@AyBruno force-pushed the feature/library-etl branch 2 times, most recently from b3b6edf to 900fefd, on February 13, 2026 04:52
@AyBruno force-pushed the feature/library-etl branch from 900fefd to bcb14fa on February 13, 2026 05:02
) => {
const normalizedLetters = codeLetters ?? '??';
const normalizedNumber = codeArtistNumber ?? 0;
const artistKey = `${artistName.toLowerCase()}|${normalizedLetters}|${normalizedNumber}`;
Member

ensureArtist cache key does not account for genre

The artist cache key at line 216 is:

const artistKey = `${artistName.toLowerCase()}|${normalizedLetters}|${normalizedNumber}`;

But the DB query for non-various artists filters by genre_id (line 229):

.where(isVarious ? and(...baseConditions) : and(...baseConditions, eq(artists.genre_id, genreId)))

If the same artist name+code appears in multiple genres, the cache returns the ID from the first genre without querying the DB for the second. There are 13 artists in the legacy database that would be affected:

norm_name          | norm_letters | genre_ids
-------------------+--------------+----------
a.m. architect     | AM           | 6,15
boris kovac        | KO           | 4,9
brian harnetty     | HA           | 4,5
dan crary          | CR           | 9,14
das torpedoes      | DA           | 6,15
davenport          | DA           | 9,11
jpp                | JP           | 4,9
kwame              | KW           | 6,11
patrick o'hearn    | OH           | 7,11
randy greif        | GR           | 13,15
rhythm & sound     | RH           | 10,15
various            | Z-           | 7,10
wzt hearts         | WZ           | 4,11

For these artists, releases in the second genre encountered would be linked to an artist row with the wrong genre_id. The genre_artist_crossreference entry would still be created correctly, but the artist row's primary genre would be wrong.

Fix: include genreId in the cache key for non-various artists:

const artistKey = isVarious
  ? `${artistName.toLowerCase()}|${normalizedLetters}|${normalizedNumber}`
  : `${artistName.toLowerCase()}|${normalizedLetters}|${normalizedNumber}|${genreId}`;
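
To make the collision concrete, here is a minimal sketch that compares the two key schemes on one of the affected artists from the table above. It is not the PR's code: `ArtistInput`, `keyWithoutGenre`, `keyWithGenre`, and `countCacheEntries` are hypothetical names, and the artist number `1` is made up for illustration.

```typescript
// Hypothetical, simplified stand-in for the inputs to ensureArtist.
type ArtistInput = {
  artistName: string;
  codeLetters: string | null;
  codeArtistNumber: number | null;
  genreId: number;
  isVarious: boolean;
};

// Key as currently written in the PR: genre is ignored.
const keyWithoutGenre = (a: ArtistInput): string =>
  `${a.artistName.toLowerCase()}|${a.codeLetters ?? '??'}|${a.codeArtistNumber ?? 0}`;

// Key as proposed in this comment: genre is appended for non-various artists.
const keyWithGenre = (a: ArtistInput): string =>
  a.isVarious ? keyWithoutGenre(a) : `${keyWithoutGenre(a)}|${a.genreId}`;

// Counts how many distinct cache entries a list of inputs would produce.
const countCacheEntries = (
  inputs: ArtistInput[],
  keyFn: (a: ArtistInput) => string,
): number => new Set(inputs.map(keyFn)).size;

// "boris kovac" / KO is listed under genres 4 and 9.
// The artist number 1 is an assumption for the example.
const sample: ArtistInput[] = [
  { artistName: 'Boris Kovac', codeLetters: 'KO', codeArtistNumber: 1, genreId: 4, isVarious: false },
  { artistName: 'Boris Kovac', codeLetters: 'KO', codeArtistNumber: 1, genreId: 9, isVarious: false },
];

const entriesWithoutGenre = countCacheEntries(sample, keyWithoutGenre);
const entriesWithGenre = countCacheEntries(sample, keyWithGenre);
console.log({ entriesWithoutGenre, entriesWithGenre });
```

With the current key, both genres collapse into one cache entry, so the second lookup never reaches the genre-filtered query. With the genre-aware key they stay separate.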

Collaborator Author

Oh, I meant to get rid of the genre column in the artists table entirely. The genre cross-reference table is all we need.

Collaborator Author

@AyBruno AyBruno Feb 13, 2026

We don't want multiple rows in the artist table for the same artist in different genres.
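
A rough sketch of that alternative, assuming the genre column is dropped from the artist row and genre membership lives only in the cross-reference table. This is an in-memory model for illustration, not the PR's Drizzle code: `ArtistStore`, `ArtistRow`, and the method names are hypothetical, and the artist number `1` is made up.

```typescript
// Hypothetical artist row with no genre column.
type ArtistRow = {
  id: number;
  name: string;
  codeLetters: string;
  codeArtistNumber: number;
};

class ArtistStore {
  // One row per artist, keyed by name + code only (same key as the PR's cache).
  private artists = new Map<string, ArtistRow>();
  // Stand-in for the genre cross-reference table: one "artistId|genreId" entry per link.
  private genreLinks = new Set<string>();
  private nextId = 1;

  ensureArtist(
    name: string,
    codeLetters: string | null,
    codeArtistNumber: number | null,
    genreId: number,
  ): number {
    const letters = codeLetters ?? '??';
    const num = codeArtistNumber ?? 0;
    const key = `${name.toLowerCase()}|${letters}|${num}`;

    // Reuse the existing row if this artist was already seen in any genre.
    let row = this.artists.get(key);
    if (!row) {
      row = { id: this.nextId++, name, codeLetters: letters, codeArtistNumber: num };
      this.artists.set(key, row);
    }

    // Record the genre link; the Set makes repeated links a no-op.
    this.genreLinks.add(`${row.id}|${genreId}`);
    return row.id;
  }

  artistCount(): number {
    return this.artists.size;
  }

  genreLinkCount(): number {
    return this.genreLinks.size;
  }
}

// "jpp" / JP is listed under genres 4 and 9 in the review comment's table.
const store = new ArtistStore();
const firstId = store.ensureArtist('JPP', 'JP', 1, 4);
const secondId = store.ensureArtist('jpp', 'JP', 1, 9);
console.log({ firstId, secondId, artists: store.artistCount(), links: store.genreLinkCount() });
```

In this model the genre-free cache key is correct as it stands: both calls resolve to the same artist row, and the two genres are captured as two cross-reference entries.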
