
Conversation


@Tvpower Tvpower commented Jun 13, 2025

Tvpower added 5 commits June 13, 2025 10:22
…datasets in the data page.

- Add loading indicators to "Export QR Codes" and "Download ZIP" buttons for improved UX.
- Refactor ZIP generation for tokens and trajectories to optimize performance and memory usage.
- Update `DB_HOST` and `STUDY_CONFIG` in `docker-compose-dev.yml` for NREL Commute study configuration.
…ta, including summaries per day and overall.
@TeachMeTW
Contributor

@Tvpower
Just some feedback on the commits that I am seeing -- I do not wish for you to make the same mistakes I did, so in general, keep your commit messages to a consistent format, i.e.:

  • Keep the title under 70 (?) characters for better readability.
  • Use imperative mood - write as if completing "This commit will..."
  • Squash things/commits that are related/incremental

Now, your latest commit is pretty good. I do advise, however, squashing or removing commits that are just cleanup/mistake fixes that happened during development, such as 17d8f43 or 2bfedf4.

For example, a good commit is this one of @JGreenlee's:

add config-update workflow and update_admin_access script

The config-update workflow runs one of the scripts in bin/config_update, commits and PRs the config changes if there are any, and auto-merges that PR if it passes checks.
The workflow can be triggered from op-admin-dashboard given that it has credentials to trigger workflows (which are provided through a Github app: https://github.com/settings/apps/op-config-updates)
Currently, this works to add/remove admin users from the admin_access list:
https://github.com/e-mission/op-admin-dashboard/issues/167#issuecomment-2801984007

Tested end-to-end from admin dash:
https://github.com/e-mission/op-admin-dashboard/pull/168

@Tvpower
Author

Tvpower commented Jun 16, 2025

In query_trajectories:

- Collect the first chunk of 250k and save it as a list. When the limit is hit at 249,999, record that query's timestamp.

- Using that last query timestamp, request the trajectories again from that date to the selected end date.

- With the next 250k request, add the results to the zip and check whether the chunk is less than 250k; if it is, end at that point. If it is greater than or equal to 250k, check again for the next set of trajectories until the end date is reached, with each subsequent query adjusted to use the new start date.

@Tvpower
Author

Tvpower commented Jun 16, 2025

A better framing to keep in mind to finish this:
Problem:

  • The system has a 250k limit for trajectory queries, which prevents retrieving all trajectories in a single request when the dataset exceeds this limit.

Why it's a problem:

  • When users request trajectories spanning a large date range, the system fails to return complete data because of the 250k limit.
  • This results in incomplete data retrieval and potential gaps in trajectory analysis.

Solution:

  1. Implement pagination using timestamps (a rough sketch of this loop follows this list):
    • Collect the first 250k trajectories and store the last timestamp
    • Use this timestamp as the start point for the next query
    • Continue this process until reaching the end date
  2. Combine results:
    • Add each 250k chunk to a zip file
    • Stop when either:
      a) The remaining data is less than 250k
      b) The end date is reached
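
A rough sketch of this pagination loop at the raw pymongo level: the function name paginate_trajectories, the chunk_limit/start_ts/end_ts parameters, and the edb import are illustrative assumptions, not the PR's actual code, and it assumes fewer than chunk_limit entries share any single timestamp.

import emission.core.get_database as edb

def paginate_trajectories(key_list, start_ts, end_ts, chunk_limit=250_000):
    db = edb.get_analysis_timeseries_db()
    current_start_ts = start_ts
    prev_boundary_ids = set()   # entries already returned at the boundary timestamp
    all_entries = []
    while True:
        mongo_query = {
            "metadata.key": {"$in": key_list},
            "data.ts": {"$gte": current_start_ts, "$lt": end_ts},
        }
        chunk = list(db.find(mongo_query).sort("data.ts", 1).limit(chunk_limit))
        # Skip entries already returned by the previous chunk (they share the
        # boundary timestamp that this query resumed from).
        all_entries.extend(e for e in chunk if e["_id"] not in prev_boundary_ids)
        if len(chunk) < chunk_limit:
            break   # partial chunk: the selected end date has been reached
        # Resume the next query from the last timestamp seen in this chunk
        current_start_ts = chunk[-1]["data"]["ts"]
        prev_boundary_ids = {e["_id"] for e in chunk
                             if e["data"]["ts"] == current_start_ts}
    return all_entries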

…g support for large datasets with timestamp-based pagination and detailed summaries.
Comment on lines 552 to 565
# Build MongoDB query for this chunk
mongo_query = {
    "metadata.key": {"$in": key_list},
    "data.ts": {"$gte": current_start_ts, "$lt": end_ts}
}

# Add UUID exclusion to the query
if excluded_uuids:
    excluded_uuid_objects = [UUID(uuid) for uuid in excluded_uuids]
    mongo_query["user_id"] = {"$nin": excluded_uuid_objects}

# Query this chunk with limit
db = edb.get_analysis_timeseries_db()
cursor = db.find(mongo_query).sort("data.ts", 1).limit(chunk_limit)
Collaborator

You can use the original query_trajectories as an example; it calls ts.find_entries.
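
For reference, going through the timeseries layer rather than a raw MongoDB query looks roughly like this (the key and the start/end timestamps are placeholders, and the exact call in query_trajectories may differ):

import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt

# Aggregate timeseries = data across all users, which is what the admin dashboard needs
ts = esta.TimeSeries.get_aggregate_time_series()
time_query = estt.TimeQuery("data.ts", start_ts, end_ts)
entries = ts.find_entries(key_list=["analysis/recreated_location"],
                          time_query=time_query)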

)

# Stage 2: Iterate through chunks using timestamp pagination
while current_start_ts < end_ts:
Collaborator

Everything inside the while loop seems to be the core of the chunking, and I think that can be extracted to a generic function that will allow us to work around this 250k entry limit for ANY type of entry (i.e. not specific to trajectories), and not specific to Plotly Dash (i.e. it could be used outside the admin dashboard).

This function will basically be a wrapper around ts.find_entries with the chunking logic around it. It should essentially take the same arguments as find_entries:

  • key_list
    • for trajectories, that key was analysis/recreation_location or background/location. In your generic function, it could be anything, and that gets passed to ts.find_entries
  • time_query
  • geo_query
  • extra_query_list
  • the query limit (default = 250k)

return a list (or iterator) of all the combined entries (which may be > 250k)
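
In code form, the proposed contract might look like the stub below (query_entries_chunked is the name used in later commits on this PR; everything else is just a sketch of the spec above):

def query_entries_chunked(key_list, time_query, geo_query=None,
                          extra_query_list=None, limit=250_000):
    """
    Generic wrapper around ts.find_entries that works around the 250k-entry
    query limit by issuing multiple chunked queries.

    Takes the same arguments as find_entries, plus the query limit, and
    returns a list (or iterator) of all combined entries, which may be > 250k.
    """
    ...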

Tvpower added 2 commits June 30, 2025 10:54
…ries_chunked`, introducing adaptive time windows and cleaner emission library integration.
… to use `date_query` parameter, remove adaptive time windows, and simplify chunked data processing with a fixed record limit.

I was able to get 499k entries with this, which is weird. I might need a reminder of how many there were originally.
return esds.cleaned2inferred_section_list(sections)


def query_entries_chunked(key_list, date_query,
Collaborator

The current version of query_entries_chunked does not meet the requirements:

This function will basically be a wrapper around ts.find_entries with the chunking logic around it. It should essentially take the same arguments as find_entries:

  • key_list
    • for trajectories, that key was analysis/recreation_location or background/location. In your generic function, it could be anything, and that gets passed to ts.find_entries
  • time_query
  • geo_query
  • extra_query_list
  • the query limit (default = 250k)

return a list (or iterator) of all the combined entries (which may be > 250k)

Namely, it has date_query instead of time_query, and it yields and returns dataframes instead of simply returning a list.

@JGreenlee
Collaborator

@Tvpower I found out why this approach is not working.

BuiltinTimeseries is where find_entries and all our other timeseries query methods are implemented. It's meant to be used for one particular user.
In op-admin-dashboard, we are dealing with data from many users at once, so it uses AggregateTimeseries (which inherits from BuiltinTimeseries).

However, AggregateTimeseries behaves differently in that sorting is disabled. I was not aware of this override.
https://github.com/e-mission/e-mission-server/blob/c684fca916a63f15d0cc8bd8bcf2f553353e2a1f/emission/storage/timeseries/aggregate_timeseries.py#L22-L23

This seems to be a deliberate decision made in e-mission/e-mission-server@453b6a8, so I think we should keep this as the default behavior of AggregateTimeseries.

But we can add an optional sort_key parameter to find_entries, which will force it to sort when sort_key is defined:

diff --git a/emission/storage/timeseries/builtin_timeseries.py b/emission/storage/timeseries/builtin_timeseries.py
index 93fd46bc..058dcbdb 100644
--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -200,8 +200,9 @@ class BuiltinTimeSeries(esta.TimeSeries):
         return (orig_ts_db_keys, analysis_ts_db_keys)
 
     def find_entries(self, key_list = None, time_query = None, geo_query = None,
-                     extra_query_list=None):
-        sort_key = self._get_sort_key(time_query)
+                     extra_query_list=None, sort_key=None):
+        if sort_key is None:
+            sort_key = self._get_sort_key(time_query)
         logging.debug("curr_query = %s, sort_key = %s" % 
             (self._get_query(key_list, time_query, geo_query,
                              extra_query_list), sort_key))

@JGreenlee
Collaborator

JGreenlee commented Jul 15, 2025

With this fix applied, I wrote a basic implementation of query_entries_chunked and it is working!

DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1690965755.4742825}, '_id': {'$ne': ObjectId('64cb4651c55a8df6377acfb8')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1705958489.0731728}, '_id': {'$ne': ObjectId('65aee72212e59054b80d8eb0')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1741815238.2157795}, '_id': {'$ne': ObjectId('67d2225310a95f519f4e8b67')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 3898
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 3898
753898

(753898 matches the actual number of analysis/recreated_location entries reported by MongoDB)
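
For illustration, a chunked wrapper with the sort_key fix applied could look roughly like the sketch below, pieced together from the debug output above (advance data.ts to the last seen timestamp and exclude that entry's _id). This is a sketch, not @JGreenlee's actual implementation: it assumes each find_entries call is capped at `limit` results by the backend, as the log suggests, and it simplifies tie handling at the boundary timestamp.

import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt

def query_entries_chunked(key_list, time_query, geo_query=None,
                          extra_query_list=None, limit=250_000):
    ts = esta.TimeSeries.get_aggregate_time_series()
    base_extra = list(extra_query_list or [])
    extra = base_extra
    while True:
        chunk = list(ts.find_entries(key_list, time_query, geo_query,
                                     extra, sort_key="data.ts"))
        yield from chunk
        if len(chunk) < limit:
            break   # partial chunk: nothing left in the requested time range
        last = chunk[-1]
        # Resume from the last seen timestamp, excluding the last document itself
        # so it is not returned twice (mirrors the '$ne': ObjectId(...) clauses
        # visible in the debug log above).
        time_query = estt.TimeQuery("data.ts",
                                    last["data"]["ts"], time_query.endTs)
        extra = base_extra + [{"_id": {"$ne": last["_id"]}}]

Because it yields one chunk at a time, at most `limit` entries are materialized in memory at once, which also speaks to the memory-safety question below.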

@Tvpower Let me know whether you want to keep working on query_entries_chunked or if you want me to give you my solution and we can move on to the next part of this. Some questions we will need to consider:

  • how are we implementing caching? zip files? day-by-day basis? where are we storing that on the server?
  • is this memory safe? I think ideally we do not want more than 250k entries read into memory at once
