
Conversation


@Tvpower Tvpower commented Jun 13, 2025

Tvpower added 5 commits June 13, 2025 10:22
…datasets in the data page.

- Add loading indicators to "Export QR Codes" and "Download ZIP" buttons for improved UX.
- Refactor ZIP generation for tokens and trajectories to optimize performance and memory usage.
- Update `DB_HOST` and `STUDY_CONFIG` in `docker-compose-dev.yml` for NREL Commute study configuration.
…ta, including summaries per day and overall.
@TeachMeTW
Contributor

@Tvpower
Just some feedback on the commits that I am seeing -- I do not wish for you to make the same mistakes I did, so in general, keep your commit messages to a consistent format, i.e.:

  • Keep the title under 70 (?) characters for better readability.
  • Use imperative mood - write as if completing "This commit will..."
  • Squash things/commits that are related/incremental

Now, your latest commit is pretty good. I do advise, however, squashing or removing commits that are just cleanup/mistake fixes that happened during development, such as 17d8f43 or 2bfedf4.

For example, a good commit is this one of @JGreenlee's:

add config-update workflow and update_admin_access script

The config-update workflow runs one of the scripts in bin/config_update, commits and PRs the config changes if there are any, and auto-merges that PR if it passes checks.
The workflow can be triggered from op-admin-dashboard given that it has credentials to trigger workflows (which are provided through a Github app: https://github.com/settings/apps/op-config-updates)
Currently, this works to add/remove admin users from the admin_access list:
https://github.com/e-mission/op-admin-dashboard/issues/167#issuecomment-2801984007

Tested end-to-end from admin dash:
https://github.com/e-mission/op-admin-dashboard/pull/168

@Tvpower
Author

Tvpower commented Jun 16, 2025

In query_trajectories:

- Collect the first chunk of 250k and save it as a list. When the limit is hit at 249,999, record that query's timestamp.

- Using that last query timestamp, request the trajectories again from that date to the selected end date.

- With the next 250k request, add the results to the zip and check whether the chunk is less than 250k; if it is, end at that point. If it is greater than or equal to 250k, check again for the next set of trajectories until the end date is reached, with each subsequent query adjusted to use the new start date.

@Tvpower
Author

Tvpower commented Jun 16, 2025

A better framing to keep in mind to finish this:
Problem:

  • The system has a 250k limit for trajectory queries, which prevents retrieving all trajectories in a single request when the dataset exceeds this limit.

Why it's a problem:

  • When users request trajectories spanning a large date range, the system fails to return complete data because of the 250k limit.
  • This results in incomplete data retrieval and potential gaps in trajectory analysis.

Solution:

  1. Implement pagination using timestamps (a rough sketch of this loop follows this list):
    • Collect the first 250k trajectories and store the last timestamp
    • Use this timestamp as the start point for the next query
    • Continue this process until reaching the end date
  2. Combine results:
    • Add each 250k chunk to a zip file
    • Stop when either:
      a) The remaining data is less than 250k
      b) The end date is reached
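
A rough sketch of this pagination loop at the raw pymongo level: the function name paginate_trajectories, the chunk_limit/start_ts/end_ts parameters, and the edb import are illustrative assumptions, not the PR's actual code, and it assumes fewer than chunk_limit entries share any single timestamp.

import emission.core.get_database as edb

def paginate_trajectories(key_list, start_ts, end_ts, chunk_limit=250_000):
    db = edb.get_analysis_timeseries_db()
    current_start_ts = start_ts
    prev_boundary_ids = set()   # entries already returned at the boundary timestamp
    all_entries = []
    while True:
        mongo_query = {
            "metadata.key": {"$in": key_list},
            "data.ts": {"$gte": current_start_ts, "$lt": end_ts},
        }
        chunk = list(db.find(mongo_query).sort("data.ts", 1).limit(chunk_limit))
        # Skip entries already returned by the previous chunk (they share the
        # boundary timestamp that this query resumed from).
        all_entries.extend(e for e in chunk if e["_id"] not in prev_boundary_ids)
        if len(chunk) < chunk_limit:
            break   # partial chunk: the selected end date has been reached
        # Resume the next query from the last timestamp seen in this chunk
        current_start_ts = chunk[-1]["data"]["ts"]
        prev_boundary_ids = {e["_id"] for e in chunk
                             if e["data"]["ts"] == current_start_ts}
    return all_entries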

…g support for large datasets with timestamp-based pagination and detailed summaries.
Comment on lines 552 to 565
# Build MongoDB query for this chunk
mongo_query = {
    "metadata.key": {"$in": key_list},
    "data.ts": {"$gte": current_start_ts, "$lt": end_ts}
}

# Add UUID exclusion to the query
if excluded_uuids:
    excluded_uuid_objects = [UUID(uuid) for uuid in excluded_uuids]
    mongo_query["user_id"] = {"$nin": excluded_uuid_objects}

# Query this chunk with limit
db = edb.get_analysis_timeseries_db()
cursor = db.find(mongo_query).sort("data.ts", 1).limit(chunk_limit)
Collaborator

You can use the original query_trajectories as an example; it calls ts.find_entries.
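
For reference, going through the timeseries layer rather than a raw MongoDB query looks roughly like this (the key and the start/end timestamps are placeholders, and the exact call in query_trajectories may differ):

import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt

# Aggregate timeseries = data across all users, which is what the admin dashboard needs
ts = esta.TimeSeries.get_aggregate_time_series()
time_query = estt.TimeQuery("data.ts", start_ts, end_ts)
entries = ts.find_entries(key_list=["analysis/recreated_location"],
                          time_query=time_query)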

)

# Stage 2: Iterate through chunks using timestamp pagination
while current_start_ts < end_ts:
Collaborator

Everything inside the while loop seems to be the core of the chunking, and I think that can be extracted to a generic function that will allow us to work around this 250k entry limit for ANY type of entry (i.e. not specific to trajectories), and not specific to Plotly Dash (i.e. it could be used outside the admin dashboard).

This function will basically be a wrapper around ts.find_entries with the chunking logic around it. It should essentially take the same arguments as find_entries:

  • key_list
    • for trajectories, that key was analysis/recreation_location or background/location. In your generic function, it could be anything, and that gets passed to ts.find_entries
  • time_query
  • geo_query
  • extra_query_list
  • the query limit (default = 250k)

return a list (or iterator) of all the combined entries (which may be > 250k)
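
In code form, the proposed contract might look like the stub below (query_entries_chunked is the name used in later commits on this PR; everything else is just a sketch of the spec above):

def query_entries_chunked(key_list, time_query, geo_query=None,
                          extra_query_list=None, limit=250_000):
    """
    Generic wrapper around ts.find_entries that works around the 250k-entry
    query limit by issuing multiple chunked queries.

    Takes the same arguments as find_entries, plus the query limit, and
    returns a list (or iterator) of all combined entries, which may be > 250k.
    """
    ...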

Tvpower added 2 commits June 30, 2025 10:54
…ries_chunked`, introducing adaptive time windows and cleaner emission library integration.
… to use `date_query` parameter, remove adaptive time windows, and simplify chunked data processing with a fixed record limit.

I was able to get 499k entries with this, which is weird. I might need a reminder of how many there were originally.
return esds.cleaned2inferred_section_list(sections)


def query_entries_chunked(key_list, date_query,
Collaborator

The current version of query_entries_chunked does not meet the requirements:

This function will basically be a wrapper around ts.find_entries with the chunking logic around it. It should essentially take the same arguments as find_entries:

  • key_list
    • for trajectories, that key was analysis/recreation_location or background/location. In your generic function, it could be anything, and that gets passed to ts.find_entries
  • time_query
  • geo_query
  • extra_query_list
  • the query limit (default = 250k)

return a list (or iterator) of all the combined entries (which may be > 250k)

Namely, it has date_query instead of time_query, and it yields and returns dataframes instead of simply returning a list.

@JGreenlee
Collaborator

@Tvpower I found out why this approach is not working.

BuiltinTimeseries is where find_entries and all our other timeseries query methods are implemented. It's meant to be used for one particular user.
In op-admin-dashboard, we are dealing with data from many users at once, so it uses AggregateTimeseries (which inherits from BuiltinTimeseries).

However, AggregateTimeseries behaves differently in that sorting is disabled. I was not aware of this override.
https://github.com/e-mission/e-mission-server/blob/c684fca916a63f15d0cc8bd8bcf2f553353e2a1f/emission/storage/timeseries/aggregate_timeseries.py#L22-L23

This seems to be a deliberate decision made in e-mission/e-mission-server@453b6a8, so I think we should keep this as the default behavior of AggregateTimeseries.

But we can add an optional sort_key parameter to find_entries, which will force it to sort when sort_key is defined:

diff --git a/emission/storage/timeseries/builtin_timeseries.py b/emission/storage/timeseries/builtin_timeseries.py
index 93fd46bc..058dcbdb 100644
--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -200,8 +200,9 @@ class BuiltinTimeSeries(esta.TimeSeries):
         return (orig_ts_db_keys, analysis_ts_db_keys)
 
     def find_entries(self, key_list = None, time_query = None, geo_query = None,
-                     extra_query_list=None):
-        sort_key = self._get_sort_key(time_query)
+                     extra_query_list=None, sort_key=None):
+        if sort_key is None:
+            sort_key = self._get_sort_key(time_query)
         logging.debug("curr_query = %s, sort_key = %s" % 
             (self._get_query(key_list, time_query, geo_query,
                              extra_query_list), sort_key))

@JGreenlee
Collaborator

JGreenlee commented Jul 15, 2025

With this fix applied, I wrote a basic implementation of query_entries_chunked and it is working!

DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1690965755.4742825}, '_id': {'$ne': ObjectId('64cb4651c55a8df6377acfb8')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1705958489.0731728}, '_id': {'$ne': ObjectId('65aee72212e59054b80d8eb0')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 250000
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 250000
DEBUG:root:curr_query = {'invalid': {'$exists': False}, '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1752599636.790563, '$gte': 1741815238.2157795}, '_id': {'$ne': ObjectId('67d2225310a95f519f4e8b67')}}, sort_key = data.ts
DEBUG:root:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
DEBUG:root:finished querying values for [], count = 0
DEBUG:root:finished querying values for ['analysis/recreated_location'], count = 3898
DEBUG:root:orig_ts_db_matches = 0, analysis_ts_db_matches = 3898
753898

(753898 matches the actual number of analysis/recreated_location entries reported by MongoDB)
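
For illustration, a chunked wrapper with the sort_key fix applied could look roughly like the sketch below, pieced together from the debug output above (advance data.ts to the last seen timestamp and exclude that entry's _id). This is a sketch, not @JGreenlee's actual implementation: it assumes each find_entries call is capped at `limit` results by the backend, as the log suggests, and it simplifies tie handling at the boundary timestamp.

import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt

def query_entries_chunked(key_list, time_query, geo_query=None,
                          extra_query_list=None, limit=250_000):
    ts = esta.TimeSeries.get_aggregate_time_series()
    base_extra = list(extra_query_list or [])
    extra = base_extra
    while True:
        chunk = list(ts.find_entries(key_list, time_query, geo_query,
                                     extra, sort_key="data.ts"))
        yield from chunk
        if len(chunk) < limit:
            break   # partial chunk: nothing left in the requested time range
        last = chunk[-1]
        # Resume from the last seen timestamp, excluding the last document itself
        # so it is not returned twice (mirrors the '$ne': ObjectId(...) clauses
        # visible in the debug log above).
        time_query = estt.TimeQuery("data.ts",
                                    last["data"]["ts"], time_query.endTs)
        extra = base_extra + [{"_id": {"$ne": last["_id"]}}]

Because it yields one chunk at a time, at most `limit` entries are materialized in memory at once, which also speaks to the memory-safety question below.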

@Tvpower Let me know whether you want to keep working on query_entries_chunked or if you want me to give you my solution and we can move on to the next part of this. Some questions we will need to consider:

  • how are we implementing caching? zip files? day-by-day basis? where are we storing that on the server?
  • is this memory safe? I think ideally we do not want more than 250k entries read into memory at once
