Batch processing for Longest Listening Session #555
base: master
Conversation
Hello, thanks a lot for this contribution. I am wondering, and maybe this is a nitpick, but am I right in saying that this would not take into account listening sessions spanning New Year's Eve, since the batching would cut such a session in two? Also, setting aside the issue of Mongo holding too much data in memory, do you have a comparison of the HTTP request time between the old and the new implementation? Many thanks again for this pull request.
Yep, for sure, but here are some numbers:
It's faster because it's built with async requests: all requests are sent at the same time and Mongo is able to process them at the same time.
/api/spotify/top/sessions?start=2020-09-14T00:00:00.000Z&end=2025-09-14T00:00:00.000Z
New method for 5 years: ~3 sec (2.8-3.2)
Sending small async requests to Mongo will always be faster and less resource hungry.
Hi @Yooooomi, just a reminder in case you forgot.
Hoping this will be pulled. Nice.
Pull request overview
This PR introduces batch processing for the longest listening session query to handle large datasets that previously caused timeouts. The implementation splits the date range into 1-year batches, processes each batch independently, and aggregates the results to return the top 5 longest sessions.
Key changes:
- Refactored the aggregation pipeline into a reusable buildPipeline function that processes data in yearly batches (a condensed sketch follows this list)
- Added date validation and window creation logic to split the processing into manageable chunks
- Implemented post-aggregation sorting and limiting to combine results from all batches
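Below is a condensed sketch of that flow, assuming the names used in the diff (InfosModel, buildPipeline, sessionLength) and omitting the date-validation details; it is a simplification, not the exact implementation.

```typescript
// Sketch only: InfosModel, buildPipeline and the sessionLength field are
// assumed to behave as in the diff excerpts below.
async function getLongestSessionsBatched(
  startDate: Date,
  endDate: Date,
): Promise<any[]> {
  const yearsStep = 1;

  // Split [startDate, endDate) into windows of at most one year.
  const windows: { from: Date; to: Date }[] = [];
  let cursor = new Date(startDate);
  while (cursor < endDate) {
    const next = new Date(cursor);
    next.setFullYear(next.getFullYear() + yearsStep);
    if (next > endDate) next.setTime(endDate.getTime());
    windows.push({ from: new Date(cursor), to: new Date(next) });
    cursor = next;
  }

  // Run the same aggregation once per window, then keep the 5 longest sessions.
  const allSessions: any[] = [];
  for (const w of windows) {
    const chunkSessions = await InfosModel.aggregate(buildPipeline(w.from, w.to));
    allSessions.push(...chunkSessions);
  }
  allSessions.sort((a, b) => b.sessionLength - a.sessionLength);
  return allSessions.slice(0, 5);
}
```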
```typescript
}

const allSessions: any[] = [];
```
Copilot AI · Jan 3, 2026
The 'buildPipeline' function lacks documentation explaining its purpose, parameters, and the complex MongoDB aggregation logic it contains. Consider adding a JSDoc comment that describes the function's purpose, the meaning of the 'from' and 'to' parameters, and a brief explanation of the session detection algorithm (e.g., sessions are detected when there's a gap greater than sessionBreakThreshold between consecutive tracks).
Suggested change:

```typescript
/**
 * Builds a MongoDB aggregation pipeline that finds listening sessions for a given
 * time window. The pipeline filters tracks between the provided dates, orders
 * them chronologically, and uses a gap-based algorithm to split them into
 * sessions.
 *
 * A new session is started whenever the gap between the end of one track and
 * the start of the next track is greater than `sessionBreakThreshold`
 * (computed as the time between `$$value.last.played_at + durationMs` and
 * `$$this.played_at`). Consecutive tracks with a gap less than or equal to the
 * threshold are considered part of the same session.
 *
 * @param from - Start of the time window (inclusive) for which sessions are
 *   computed.
 * @param to - End of the time window (exclusive or clamped to this boundary)
 *   for which sessions are computed.
 * @returns A MongoDB aggregation pipeline array that, when run, returns the
 *   detected sessions and their associated tracks within the given window.
 */
```
```typescript
const yearsStep = 1;

const item = { subtract, info: "$$this" };
const startDate = new Date(start);
const endDate = new Date(end);

const longestSessions = await InfosModel.aggregate([
  ...basicMatch(userId, start, end),
  { $sort: { played_at: 1 } },
  {
    $group: {
      _id: "$owner",
      infos: { $push: "$$ROOT" },
    },
  },
  {
    $addFields: {
      distanceToLast: {
        $reduce: {
          input: "$infos",
          initialValue: { distance: [], current: [] },
          in: {
            distance: {
              $concatArrays: [
                "$$value.distance",
                {
                  $cond: {
                    if: {
                      $gt: [subtract, sessionBreakThreshold],

if (isNaN(startDate.getTime()) || isNaN(endDate.getTime()) || startDate >= endDate) {
  return [];
}

const windows: { from: Date; to: Date }[] = [];
let cursor = new Date(startDate);
while (cursor < endDate) {
  const next = new Date(cursor);
  next.setFullYear(next.getFullYear() + yearsStep);
  if (next > endDate) next.setTime(endDate.getTime());
  windows.push({ from: new Date(cursor), to: new Date(next) });
  cursor = next;
}
```
Copilot AI · Jan 3, 2026
The batching logic lacks documentation explaining why data is processed in 1-year chunks. Consider adding a comment that explains this design decision, particularly noting that it's a performance optimization to handle large datasets (as mentioned in the PR description, it fixes timeout issues for datasets larger than 7 years).
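One possible wording for such a comment, placed next to the batch-size constant (the phrasing below is only a suggestion, not part of the diff):

```typescript
// Process the range in 1-year windows: aggregating the whole history in a
// single pipeline forces Mongo to hold every listened track in memory and
// times out for datasets longer than ~7 years (see #499). Smaller windows keep
// each aggregation bounded; results are merged and re-sorted afterwards.
const yearsStep = 1;
```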
```typescript
allSessions.sort((a, b) => b.sessionLength - a.sessionLength);
return allSessions.slice(0, 5);
```
Copilot AI · Jan 3, 2026
The magic number '5' is hardcoded in two places (lines 788 and 813) representing the number of top sessions to return. Consider extracting this into a named constant at the function level (e.g., 'TOP_SESSIONS_LIMIT = 5') to improve maintainability and make it clear that these two values should be kept in sync.
Suggested change:

```typescript
const TOP_SESSIONS_LIMIT = 5;
allSessions.sort((a, b) => b.sessionLength - a.sessionLength);
return allSessions.slice(0, TOP_SESSIONS_LIMIT);
```
```typescript
    sessionBreakThreshold + 1,
  ],
};
const yearsStep = 1;
```
Copilot AI · Jan 3, 2026
The variable name 'yearsStep' is defined but could be more descriptive. Consider renaming it to 'batchSizeYears' or 'yearsBatchSize' to better convey that it represents the batch size in years for processing the data.
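For illustration, the rename would only touch the constant and its single use in the windowing loop (a fragment, not part of the diff):

```typescript
// Purely illustrative rename: same value, clearer intent.
const batchSizeYears = 1;
// ...inside the windowing loop shown below:
next.setFullYear(next.getFullYear() + batchSizeYears);
```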
```typescript
const allSessions: any[] = [];

const buildPipeline = (from: Date, to: Date): any[] => {
```
Copilot AI · Jan 3, 2026
The pipeline array type is declared as 'any[]' which bypasses TypeScript's type checking. Consider defining a proper type for the MongoDB aggregation pipeline stages to improve type safety and code maintainability.
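One way to do that is with Mongoose's built-in pipeline type. A sketch assuming Mongoose ≥ 6 (which exports PipelineStage) and that basicMatch returns compatible stages; the stage objects themselves stay as in the diff:

```typescript
import type { PipelineStage } from "mongoose";

const buildPipeline = (from: Date, to: Date): PipelineStage[] => [
  ...basicMatch(userId, from, to), // assumed to return PipelineStage[]
  { $sort: { played_at: 1 } },
  // ...remaining stages from the diff, unchanged...
];
```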
```typescript
  cursor = next;
}

const allSessions: any[] = [];
```
Copilot AI · Jan 3, 2026
The 'allSessions' array is typed as 'any[]' which bypasses TypeScript's type checking. Consider defining a proper interface for session objects to improve type safety and code maintainability.
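A minimal interface for the session results could look like the sketch below; field names beyond sessionLength and full_tracks are guesses from the surrounding code and would need to match what the pipeline actually projects.

```typescript
// Hypothetical shape; full_tracks is re-keyed by track id after aggregation.
interface ListeningSession {
  sessionLength: number;
  full_tracks: Record<string, unknown>;
  [key: string]: unknown;
}

const allSessions: ListeningSession[] = [];
```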
```typescript
const next = new Date(cursor);
next.setFullYear(next.getFullYear() + yearsStep);
if (next > endDate) next.setTime(endDate.getTime());
windows.push({ from: new Date(cursor), to: new Date(next) });
```
Copilot AI · Jan 3, 2026
Sessions that span across year boundaries may be incorrectly split into separate sessions. When a listening session starts in one year batch and continues into the next, the current implementation will treat them as two separate sessions instead of one continuous session. Consider implementing logic to handle cross-boundary sessions by either checking the last session of a batch against the first session of the next batch, or by overlapping the batch windows by the sessionBreakThreshold duration.
Suggested change:

```typescript
// Overlap windows by sessionBreakThreshold to avoid splitting sessions
const isFirstWindow = windows.length === 0;
const overlapFrom = isFirstWindow
  ? new Date(cursor)
  : new Date(Math.max(startDate.getTime(), cursor.getTime() - sessionBreakThreshold));
windows.push({ from: overlapFrom, to: new Date(next) });
```
```typescript
for (const w of windows) {
  const chunkSessions = await InfosModel.aggregate(buildPipeline(w.from, w.to));
  chunkSessions.forEach((s: any) => {
    s.full_tracks = Object.fromEntries(
      s.full_tracks.map((t: any) => [t.id, t]),
    );
    allSessions.push(s);
  });
}
```
Copilot AI · Jan 3, 2026
The aggregation queries are executed sequentially in a loop, which means the total processing time will be the sum of all individual query times. Consider using Promise.all to execute all batch queries in parallel, which could significantly reduce the overall processing time especially for large date ranges spanning multiple years.
Suggested change:

```typescript
const sessionPromises = windows.map((w) =>
  InfosModel.aggregate(buildPipeline(w.from, w.to)),
);
const chunkSessionsArrays = await Promise.all(sessionPromises);
chunkSessionsArrays.forEach((chunkSessions) => {
  chunkSessions.forEach((s: any) => {
    s.full_tracks = Object.fromEntries(
      s.full_tracks.map((t: any) => [t.id, t]),
    );
    allSessions.push(s);
  });
});
```
Split the processing into ranges of 1 year, using exactly the same pipeline, but after looping over every year, aggregate the results and return the top 5 based on sessionLength.

Docker image (based on LSIO) using this code:
docker pull ghcr.io/webysther/your_spotify:nightly

Build:
docker build --build-arg BUILD_DATE=2025-09-12 --build-arg VERSION=nightly --pull -t ghcr.io/webysther/your_spotify:nightly .

Performance:
GET /api/spotify/top/sessions?start=2015-12-13T03:12:42.000Z&end=2025-09-12T19:49:25.889Z: ~6 secs / 3 MB

Fixes #499

PS: Without this, it only worked for ~7 years of data.
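For reference, a minimal sketch of calling the endpoint from the web app, assuming the browser already attaches the session cookie; the dates are simply the ones from the timing above.

```typescript
// Sketch: returns at most 5 sessions, longest first, per the batching logic above.
async function fetchTopSessions(): Promise<unknown[]> {
  const res = await fetch(
    "/api/spotify/top/sessions" +
      "?start=2015-12-13T03:12:42.000Z&end=2025-09-12T19:49:25.889Z",
  );
  return res.json();
}
```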