1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -16,6 +16,7 @@ jobs:
python-version: "3.12"
- run: pip install ruff
- run: ruff check .
- run: ruff format --check .

test:
runs-on: ubuntu-latest
27 changes: 16 additions & 11 deletions CLAUDE.md
@@ -14,17 +14,18 @@ ETL pipeline for building and maintaining a PostgreSQL cache of Discogs release
1. **Download** Discogs monthly data dumps (XML) from https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html
2. **Convert** XML to CSV using [discogs-xml2db](https://github.com/philipmat/discogs-xml2db) (not a PyPI package; must be cloned separately)
3. **Fix newlines** in CSV fields (`scripts/fix_csv_newlines.py`)
4. **Filter** CSVs to library-matching artists only (`scripts/filter_csv.py`) -- ~70% data reduction
5. **Create schema** (`schema/create_database.sql`)
6. **Import** filtered CSVs into PostgreSQL (`scripts/import_csv.py`)
7. **Create indexes** including trigram GIN indexes (`schema/create_indexes.sql`)
8. **Deduplicate** by master_id (`scripts/dedup_releases.py`)
9. **Prune** to library matches (`scripts/verify_cache.py --prune`) -- ~89% data reduction (3 GB -> 340 MB)
10. **Vacuum** to reclaim disk space (`VACUUM FULL`)
4. **Enrich** `library_artists.txt` with WXYC cross-references (`scripts/enrich_library_artists.py`, optional)
5. **Filter** CSVs to library-matching artists only (`scripts/filter_csv.py`) -- ~70% data reduction
6. **Create schema** (`schema/create_database.sql`)
7. **Import** filtered CSVs into PostgreSQL (`scripts/import_csv.py`)
8. **Create indexes** including trigram GIN indexes (`schema/create_indexes.sql`)
9. **Deduplicate** by master_id (`scripts/dedup_releases.py`)
10. **Prune** to library matches (`scripts/verify_cache.py --prune`) -- ~89% data reduction (3 GB -> 340 MB)
11. **Vacuum** to reclaim disk space (`VACUUM FULL`)

`scripts/run_pipeline.py` supports two modes:
- `--xml` mode: runs steps 2-10 (XML conversion through vacuum)
- `--csv-dir` mode: runs steps 5-10 (database build from pre-filtered CSVs)
- `--xml` mode: runs steps 2-11 (XML conversion through vacuum)
- `--csv-dir` mode: runs steps 6-11 (database build from pre-filtered CSVs)

Step 1 (download) is always manual.

@@ -54,7 +55,8 @@ docker compose up db -d # just the database (for tests)

### Key Files

- `scripts/run_pipeline.py` -- Pipeline orchestrator (--xml for steps 2-10, --csv-dir for steps 5-10)
- `scripts/run_pipeline.py` -- Pipeline orchestrator (--xml for steps 2-11, --csv-dir for steps 6-11)
- `scripts/enrich_library_artists.py` -- Enrich artist list with WXYC cross-references (pymysql)
- `scripts/filter_csv.py` -- Filter Discogs CSVs to library artists
- `scripts/import_csv.py` -- Import CSVs into PostgreSQL (psycopg COPY)
- `scripts/dedup_releases.py` -- Deduplicate releases by master_id (copy-swap with `DROP CASCADE`)
@@ -86,12 +88,15 @@ pytest tests/unit/ -v
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m postgres -v

# MySQL integration tests (needs WXYC MySQL on port 3307)
pytest -m mysql -v

# E2E tests (runs full pipeline as subprocess against test Postgres)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m e2e -v
```

Markers: `postgres` (needs PostgreSQL), `e2e` (full pipeline), `integration` (needs library.db). Integration and E2E tests are excluded from the default `pytest` run via `addopts` in `pyproject.toml`.
Markers: `postgres` (needs PostgreSQL), `mysql` (needs WXYC MySQL), `e2e` (full pipeline), `integration` (needs library.db). Integration and E2E tests are excluded from the default `pytest` run via `addopts` in `pyproject.toml`.

Test fixtures are in `tests/fixtures/` (CSV files, library.db, library_artists.txt). Regenerate with `python tests/fixtures/create_fixtures.py`.

3 changes: 2 additions & 1 deletion Dockerfile
@@ -13,7 +13,8 @@ RUN pip install --no-cache-dir \
"psycopg[binary]>=3.1.0" \
"asyncpg>=0.29.0" \
"rapidfuzz>=3.0.0" \
"lxml>=4.9.0"
"lxml>=4.9.0" \
"pymysql>=1.0.0"

# Copy application code
COPY scripts/ scripts/
18 changes: 18 additions & 0 deletions README.md
@@ -32,6 +32,7 @@ All 9 steps are automated by `run_pipeline.py` (or Docker Compose). The script s
|------|--------|-------------|
| 1. Convert | discogs-xml2db | XML data dump to CSV |
| 2. Fix newlines | `scripts/fix_csv_newlines.py` | Clean embedded newlines in CSV fields |
| 2.5. Enrich | `scripts/enrich_library_artists.py` | Enrich artist list with cross-references (optional) |
| 3. Filter | `scripts/filter_csv.py` | Keep only library artists (~70% reduction) |
| 4. Create schema | `schema/create_database.sql` | Set up tables and constraints |
| 5. Import | `scripts/import_csv.py` | Bulk load CSVs via psycopg COPY |
@@ -40,6 +41,8 @@ All 9 steps are automated by `run_pipeline.py` (or Docker Compose). The script s
| 8. Prune | `scripts/verify_cache.py --prune` | Remove non-library releases (~89% reduction) |
| 9. Vacuum | `VACUUM FULL` | Reclaim disk space |

Step 2.5 generates `library_artists.txt` from `library.db` and optionally enriches it with alternate artist names and cross-references from the WXYC MySQL catalog database. This reduces false negatives at the filtering stage for artists known by multiple names (e.g., "Body Count" filed under Ice-T).

### Docker Compose

The easiest way to run the full pipeline:
@@ -70,6 +73,18 @@ python scripts/run_pipeline.py \
--database-url postgresql://localhost:5432/discogs
```

To enrich `library_artists.txt` with alternate names and cross-references from the WXYC catalog database, add `--wxyc-db-url`:

```bash
python scripts/run_pipeline.py \
--xml /path/to/releases.xml.gz \
--xml2db /path/to/discogs-xml2db/ \
--library-artists /path/to/library_artists.txt \
--library-db /path/to/library.db \
--wxyc-db-url mysql://user:pass@host:port/wxycmusic \
--database-url postgresql://localhost:5432/discogs
```

Database build from pre-filtered CSVs (steps 4-9):

```bash
@@ -161,6 +176,9 @@ pytest tests/unit/ -v
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m postgres -v

# MySQL integration tests (needs WXYC MySQL on port 3307)
pytest -m mysql -v

# E2E tests (needs PostgreSQL, runs full pipeline as subprocess)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m e2e -v
221 changes: 221 additions & 0 deletions docs/discogs-cache-technical-overview.md
@@ -0,0 +1,221 @@
# Discogs Cache: Technical Overview

Several WXYC tools — song search, catalog lookup, flowsheet entries, and the request bot — need artist and album metadata from Discogs. Previously, every lookup hit the Discogs API in real time. A single request could require 2-22 API calls depending on the search path, and the API rate limit is 60 requests/minute. This made searches slow (6-30 seconds) and bulk operations impractical.

## Solution

The Discogs cache is an ETL pipeline that produces a local PostgreSQL database from Discogs' monthly data dumps. It filters the full 48 GB Discogs database down to a 340 MB database containing only releases by artists in the WXYC library.

**Key results** (benchmarked on staging, 2026-02-10):

- **52x average speedup**: average response drops from 19.6 seconds (uncached) to 379 ms (cached median)
- **Worst case**: 30s song lookup down to 454 ms (66x)
- **Best case**: artist-only search down from 6.3s to 216 ms (29x)
- **Zero Discogs API calls** on cached requests (previously 2-22 per request)

Full benchmark methodology and results are in the [Performance](#performance) section.

## Pipeline

The pipeline runs monthly when Discogs publishes new data dumps. It can be run via Docker Compose or the orchestration script.

| Step | Description | Tool |
|------|-------------|------|
| 1. Download | Fetch monthly XML dump from Discogs | Manual |
| 2. Convert | XML to CSV | discogs-xml2db |
| 3. Fix newlines | Clean embedded newlines in CSV fields | `fix_csv_newlines.py` |
| 4. Enrich artists | Add alternate names and cross-references to artist list | `enrich_library_artists.py` |
| 5. Filter | Keep only releases by library artists (~70% reduction) | `filter_csv.py` |
| 6. Create schema | Set up PostgreSQL tables and constraints | `create_database.sql` |
| 7. Import | Bulk load CSVs via psycopg COPY | `import_csv.py` |
| 8. Create indexes | Trigram GIN indexes for fuzzy text search | `create_indexes.sql` |
| 9. Deduplicate | Keep best release per master_id (see sketch below) | `dedup_releases.py` |
| 10. Prune | Remove releases that don't match library entries (~89% reduction) | `verify_cache.py` |
| 11. Vacuum | Reclaim disk space | `VACUUM FULL` |
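
Step 9 is described elsewhere in this repo as a copy-swap with `DROP CASCADE`. The sketch below shows that general idea only; the "best release" ordering, the handling of releases without a `master_id`, the connection URL, and the re-creation of indexes and constraints after the swap are all assumptions, and `dedup_releases.py`'s real logic may differ.

```python
import psycopg

# Sketch of the copy-swap dedup in step 9 (not the real dedup_releases.py logic).
with psycopg.connect("postgresql://localhost:5432/discogs") as conn:  # placeholder URL
    with conn.cursor() as cur:
        # Copy: build a deduplicated table keeping one row per master_id
        # (lowest id wins here -- the real "best release" rule is an assumption)
        # plus every release that has no master_id at all.
        cur.execute(
            """
            CREATE TABLE release_dedup AS
            SELECT * FROM (
                SELECT DISTINCT ON (master_id) *
                FROM release
                WHERE master_id IS NOT NULL
                ORDER BY master_id, id
            ) keep_one
            UNION ALL
            SELECT * FROM release WHERE master_id IS NULL
            """
        )
        # Swap: drop the original (CASCADE removes dependent objects) and rename.
        # Indexes and constraints would need to be re-created afterwards.
        cur.execute("DROP TABLE release CASCADE")
        cur.execute("ALTER TABLE release_dedup RENAME TO release")
```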

### Two-stage filtering

The pipeline filters data in two stages to make the 48 GB dump manageable:

**Stage 1 (step 5):** Filters by artist name. If an artist in the Discogs data matches an artist in the WXYC library, all of that artist's releases are kept. This is a coarse cut that removes ~70% of the data.

**Stage 2 (step 10):** Filters by release. Uses multi-index fuzzy matching to compare each remaining release against the WXYC library catalog. Releases that don't match any library entry are pruned. This is a fine-grained cut that removes another ~89% of what survived Stage 1.
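
As a rough illustration of the Stage 1 cut (not `filter_csv.py`'s actual implementation): normalize the library artist list, stream a converted CSV, and keep only rows credited to a library artist. The CSV filenames and the `artist_name` column are assumptions, and the real script also has to propagate the surviving release IDs to the other CSVs.

```python
import csv

# Stage 1 sketch: keep only CSV rows credited to a library artist.
# File and column names are assumptions, not filter_csv.py's real interface.
with open("library_artists.txt", encoding="utf-8") as f:
    library = {line.strip().casefold() for line in f if line.strip()}

with open("release_artist.csv", newline="", encoding="utf-8") as src, open(
    "release_artist.filtered.csv", "w", newline="", encoding="utf-8"
) as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Exact (case-insensitive) name match: the coarse cut that the
        # enrichment step below tries to make less lossy.
        if row["artist_name"].casefold() in library:
            writer.writerow(row)
```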

### Artist name enrichment (step 4)

Stage 1 uses exact name matching, which misses releases credited under alternate names. The enrichment step addresses this by expanding the artist list with data from the WXYC catalog database:

#### Alternate artist names
Releases filed under one artist but credited to another (e.g., "Body Count" filed under Ice-T, "Bobby Digital" filed under RZA). Source: `LIBRARY_RELEASE.ALTERNATE_ARTIST_NAME` (~3,935 names).

#### Artist cross-references
Links between related artists: solo projects, band members, name variations (e.g., "Crooked Fingers" cross-referenced with Eric Bachmann). Source: `LIBRARY_CODE_CROSS_REFERENCE` (~189 names).

#### Release cross-references
Artists linked to specific releases filed under other artists, such as collaborations and remixes. Source: `RELEASE_CROSS_REFERENCE` (~29 names).
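
A minimal sketch of the first of these lookups, using `pymysql` (the dependency added for this step). The connection values are placeholders, only the `LIBRARY_RELEASE.ALTERNATE_ARTIST_NAME` source named above is shown, and `enrich_library_artists.py`'s real queries, deduplication, and output handling may differ.

```python
import pymysql

# Sketch: pull alternate artist names from the WXYC catalog and append them to
# the artist list used by Stage 1 filtering. Connection details are placeholders.
conn = pymysql.connect(
    host="localhost", port=3307, user="user", password="pass", database="wxycmusic"
)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT DISTINCT ALTERNATE_ARTIST_NAME FROM LIBRARY_RELEASE "
            "WHERE ALTERNATE_ARTIST_NAME IS NOT NULL"
        )
        alternates = {row[0].strip() for row in cur.fetchall() if row[0] and row[0].strip()}
finally:
    conn.close()

# Appended without checking for duplicates; the real script presumably merges
# these with the names generated from library.db.
with open("library_artists.txt", "a", encoding="utf-8") as f:
    for name in sorted(alternates):
        f.write(name + "\n")
```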

### Fuzzy text search

The database uses PostgreSQL's `pg_trgm` extension with GIN indexes for fuzzy matching. This handles:

- Spelling variations ("Thee Oh Sees" vs "OHSEES")
- Data entry inconsistencies in the WXYC catalog
- Partial matches and typos

Four trigram indexes cover track titles, artist names on releases, artist names on tracks (for compilations), and release titles.
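
For example, one of the four indexes and a similarity query against it might look like the following; the index name and query text are illustrative, and the real DDL lives in `schema/create_indexes.sql`. The doubled `%%` is how a literal trigram `%` operator is written in a parameterized psycopg query.

```python
import psycopg

with psycopg.connect("postgresql://localhost:5432/discogs") as conn:  # placeholder URL
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
        # One of the four trigram GIN indexes: fuzzy matching on release titles.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_release_title_trgm "
            "ON release USING gin (title gin_trgm_ops)"
        )
        # Trigram similarity search: %% is a literal % (the pg_trgm operator)
        # because psycopg reserves %s for parameters.
        cur.execute(
            "SELECT id, title, similarity(title, %s) AS score "
            "FROM release WHERE title %% %s "
            "ORDER BY score DESC LIMIT 5",
            ("thee oh sees", "thee oh sees"),
        )
        print(cur.fetchall())
```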

## Database schema

| Table | Description |
|-------|-------------|
| `release` | Release metadata: id, title, year, artwork URL |
| `release_artist` | Artists credited on releases |
| `release_track` | Tracks with position and duration |
| `release_track_artist` | Artists on specific tracks (compilations) |
| `cache_metadata` | Data freshness tracking |

Consumers connect via the `DATABASE_URL_DISCOGS` environment variable.
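
A hypothetical consumer query over this schema, using asyncpg (one of the listed dependencies). The join columns (`release_id`, `artist_name`) are guesses based on the table descriptions above, not the actual DDL, and the query shape is not taken from `cache_service.py`.

```python
import asyncio
import os

import asyncpg


async def find_release(artist: str, album: str):
    # DATABASE_URL_DISCOGS is the documented connection variable; the column
    # names below are assumptions, not the real schema.
    conn = await asyncpg.connect(os.environ["DATABASE_URL_DISCOGS"])
    try:
        # % is the pg_trgm similarity operator, served by the GIN indexes above.
        return await conn.fetch(
            """
            SELECT r.id, r.title, ra.artist_name
            FROM release r
            JOIN release_artist ra ON ra.release_id = r.id
            WHERE ra.artist_name % $1 AND r.title % $2
            LIMIT 5
            """,
            artist,
            album,
        )
    finally:
        await conn.close()


print(asyncio.run(find_release("Thee Oh Sees", "Floating Coffin")))
```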

## Performance

### What's being measured

The benchmarks measure end-to-end request latency through the full `/request` pipeline: AI parsing, library search, Discogs lookups, and artwork fetching. Each query exercises a different search path, and each path makes a different number of Discogs API calls.

Two modes are compared:

- **Cached**: Normal operation. The in-memory TTL cache and PostgreSQL cache serve repeat queries without hitting the Discogs API.
- **Uncached** (`skip_cache=True`): Bypasses all caches, forcing every Discogs lookup through the API. This simulates a cold start or first-time query.

### Network flow

#### Cached request

When caches are warm, most Discogs data is served from the in-memory TTL cache. No external API calls are needed for repeat queries.

```mermaid
sequenceDiagram
participant Client
participant FastAPI
participant Groq as Groq AI
participant MemCache as In-Memory Cache
participant Library as Library DB (SQLite)

Client->>FastAPI: POST /request
FastAPI->>Groq: Parse message
Groq-->>FastAPI: {artist, song, album}

FastAPI->>MemCache: Album lookup?
MemCache-->>FastAPI: Cached result

FastAPI->>Library: Search (artist + album)
Library-->>FastAPI: Library results

FastAPI->>MemCache: Artwork search
MemCache-->>FastAPI: Cached artwork

FastAPI-->>Client: Response (~300ms)
```

#### Uncached request (`skip_cache=True`)

With caches bypassed, every Discogs lookup hits the external API. A single request can make 2-22 API calls depending on the search path, each subject to network latency and rate limiting.

```mermaid
sequenceDiagram
participant Client
participant FastAPI
participant Groq as Groq AI
participant Discogs as Discogs API
participant Library as Library DB (SQLite)

Client->>FastAPI: POST /request (skip_cache=true)
FastAPI->>Groq: Parse message
Groq-->>FastAPI: {artist, song, album}

rect rgb(255, 240, 240)
note right of Discogs: 1-2 API calls
FastAPI->>Discogs: Search releases by track
Discogs-->>FastAPI: Release list
end

FastAPI->>Library: Search (artist + album)
Library-->>FastAPI: Library results

rect rgb(255, 240, 240)
note right of Discogs: 1-5 API calls per result
loop Each library result
FastAPI->>Discogs: Search for artwork
Discogs-->>FastAPI: Artwork URL
end
end

rect rgb(255, 240, 240)
note right of Discogs: 2-3 API calls per result (Path C only)
loop Track validation (if fallback)
FastAPI->>Discogs: Search release
Discogs-->>FastAPI: Release ID
FastAPI->>Discogs: Get release tracklist
Discogs-->>FastAPI: Tracklist
end
end

FastAPI-->>Client: Response (~6-30 sec)
```

### Search paths

| Path | Description | Trigger | Discogs API Calls |
|------|-------------|---------|-------------------|
| **A** | Artist + Album | Album provided in query | 1-5 (artwork only) |
| **B** | Song lookup | Song without album; Discogs resolves album | 2-7 |
| **C** | Track validation | Library falls back to artist-only; validates each album's tracklist | 12-22 |
| **D** | Compilation search | Primary search finds nothing; cross-references Discogs tracklists | 3-9 |
| **E** | Artist only | No song or album parsed | 1-5 (artwork only) |

### Results

- Server: `https://request-o-matic-staging.up.railway.app`
- Date: 2026-02-10

| Path | Label | Uncached | Cached (median) | Cached (p95) | Speedup | API Calls |
|------|-------|----------|-----------------|--------------|---------|-----------|
| A | Artist + Album | 18,804 ms | 273 ms | 308 ms | 68.8x | 0 (1-5) |
| B | Song lookup | 30,137 ms | 454 ms | 492 ms | 66.4x | 23 (2-7) |
| C | Track validation | 19,331 ms | 551 ms | 580 ms | 35.1x | 18 (12-22) |
| D | Compilation | 23,624 ms | 402 ms | 461 ms | 58.8x | 22 (3-9) |
| E | Artist only | 6,335 ms | 216 ms | 219 ms | 29.3x | 7 (1-5) |
| | **Average** | **19,646 ms** | **379 ms** | | **51.8x** | |

Cached iterations per query: 5. Uncached iterations per query: 1 (to preserve API rate limits).

#### Notes

- **API calls column** shows actual calls observed (uncached), with the expected range in parentheses. Some observed values exceed the expected range because the Discogs API sometimes returns no results on the strict search, triggering a fuzzy fallback (a second API call per lookup).
- **Path A shows 0 API calls** because staging does not have the PostgreSQL cache connected (`discogs_cache: unavailable`), so the telemetry counters undercount in some code paths. With PG cache enabled, this column would be more accurate.
- **Path C** is marked `xfail` in integration tests due to a known inconsistency in the track validation fallback. Despite this, the benchmark completes successfully.

### Reproducing

```bash
# Against staging (default 10 cached iterations)
venv/bin/python scripts/benchmark_requests.py --staging

# More iterations for tighter confidence
venv/bin/python scripts/benchmark_requests.py --staging -n 50

# Against local server
venv/bin/python scripts/benchmark_requests.py --local

# Skip warmup if caches are already populated
venv/bin/python scripts/benchmark_requests.py --staging --skip-warmup
```

## Consumers

- **request-parser** (Python/FastAPI) — `discogs/cache_service.py` queries the cache for album lookups, track validation, and artwork
- **Backend-Service** (TypeScript/Node.js) — planned

## Repository

https://github.com/WXYC/discogs-cache
36 changes: 36 additions & 0 deletions hooks/pre-commit
@@ -0,0 +1,36 @@
#!/usr/bin/env bash
# Pre-commit hook: run ruff check and ruff format on staged Python files.
# Install: ln -sf ../../hooks/pre-commit .git/hooks/pre-commit

set -euo pipefail

# Collect staged .py files (added, copied, modified, renamed)
staged_files=$(git diff --cached --name-only --diff-filter=ACMR -- '*.py')

if [ -z "$staged_files" ]; then
exit 0
fi

# Use venv ruff if available, otherwise fall back to system ruff
if [ -x ".venv/bin/ruff" ]; then
RUFF=".venv/bin/ruff"
elif command -v ruff &>/dev/null; then
RUFF="ruff"
else
echo "ruff not found. Install it or create a .venv with ruff."
exit 1
fi

# shellcheck disable=SC2086
$RUFF check $staged_files || {
echo ""
echo "ruff check failed. Fix the issues above before committing."
exit 1
}

# shellcheck disable=SC2086
$RUFF format --check $staged_files || {
echo ""
echo "ruff format check failed. Run 'ruff format .' to fix."
exit 1
}