
Feat: Unified Scene Data Flow #56

Open
WCJ-BERT wants to merge 9 commits into NVlabs:main from WCJ-BERT:adapt_to_new_runtime

Conversation

@WCJ-BERT

Summary

This PR migrates the AlpaSim runtime from USDZ artifact-based scene loading to trajdata's UnifiedDataset. It introduces a unified interface for loading scenes from multiple autonomous driving datasets (e.g., USDZ, NuPlan, Waymo) and improves memory efficiency through lazy loading.

Key Changes

Breaking Changes

  • RuntimeContext no longer stores full SceneDataSource objects; it now uses a lightweight scene_id_to_idx mapping.
  • DaemonEngine now loads scenes on demand in each worker.
  • build_pending_jobs_from_request() now takes a callable instead of a dict.
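The callable-based signature can be sketched as follows; only the function name `build_pending_jobs_from_request` comes from this PR, while the parameter names and the request/job shapes are invented for illustration.

```python
from typing import Any, Callable

# Hypothetical sketch of the new callback-based flow; the real types in
# alpasim_runtime differ -- this only shows the dict-to-callable change.
def build_pending_jobs_from_request(
    scene_ids: list[str],
    get_data_source: Callable[[str], Any],
) -> list[dict[str, Any]]:
    """Resolve each scene lazily through the callable instead of a preloaded dict."""
    return [
        {"scene_id": scene_id, "data_source": get_data_source(scene_id)}
        for scene_id in scene_ids
    ]

# Any callable works, e.g. a cached loader; previously a full dict was required.
jobs = build_pending_jobs_from_request(["scene_a"], lambda sid: f"source:{sid}")
```

The benefit is that the caller no longer has to materialize every data source up front; the engine can pass a lazy loader instead.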

New Functionality

  • Added a SceneDataSource protocol to abstract scene data access.
  • Added TrajdataDataSource for direct trajdata-based scene loading.
  • Added a prepare_data CLI tool for preprocessing trajdata caches.
  • Enabled lazy scene loading in workers to reduce memory usage.
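A structural protocol like the one described can be sketched as below; the property set (rig, traffic_objects, map, metadata) follows the PR description, while the concrete types are placeholder assumptions.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SceneDataSource(Protocol):
    """Sketch of the protocol; real return types in alpasim_runtime will differ."""

    @property
    def rig(self) -> Any: ...
    @property
    def traffic_objects(self) -> Any: ...
    @property
    def map(self) -> Any: ...
    @property
    def metadata(self) -> Any: ...

# Any object exposing these attributes satisfies the protocol structurally,
# without inheriting from it.
class FakeSource:
    rig = "rig"
    traffic_objects = []
    map = None
    metadata = {}

print(isinstance(FakeSource(), SceneDataSource))  # runtime_checkable allows isinstance
```

This is why both the USDZ-backed Artifact and the new TrajdataDataSource can satisfy the same interface without a shared base class.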

Code and Documentation Updates

  • Updated existing tests for the new data flow and added integration tests for trajdata functionality.
  • Added data preparation instructions to README and TUTORIAL.md.
  • Documented the trajdata-alpasim dependency in ONBOARDING.
  • Updated wizard config defaults for data_source.

WCJ-BERT and others added 9 commits March 19, 2026 14:54
Introduce SceneDataSource protocol to abstract scene data loading,
enabling support for multiple data sources (Artifact, trajdata, etc).

Add DataSourceConfig to runtime config for unified data source
configuration via trajdata's UnifiedDataset.

Changes:
- Add SceneDataSource protocol with standard interface (rig, traffic_objects, map, metadata)
- Add DataSourceConfig with trajdata UnifiedDataset parameters
- Support for both USDZ files and standard trajdata datasets
Implement TrajdataDataSource as a SceneDataSource that loads data
directly from trajdata datasets without requiring USDZ conversion.

Features:
- Lazy loading of rig, traffic_objects, map, and metadata
- Pre-created scene_cache to avoid pickle errors in multiprocessing
- Support for coordinate frame transformations (world to local NRE)
- Trajectory smoothing with cubic splines
- Camera calibration extraction from scene metadata
- Map loading and transformation to local coordinates

Benefits:
- Eliminate USDZ conversion overhead
- Reduce startup time with on-demand loading
- Lower memory usage per worker
Replace heavy scene_id_to_data_source dict with lightweight
scene_id_to_idx mapping in RuntimeContext. This enables efficient
serialization and reduces memory overhead when passing context
to worker processes.

Changes:
- RuntimeContext.scene_id_to_data_source → scene_id_to_idx
- Build scene_id to trajdata index mapping once at startup
- Workers can reconstruct data sources on-demand using the mapping

Benefits:
- Lightweight RuntimeContext (dict[str, int] vs dict[str, DataSource])
- Fully serializable with pickle for multiprocessing
- Avoids duplicating heavy data objects across workers
- Aligns with trajdata's index-based API

BREAKING CHANGE: RuntimeContext.scene_id_to_data_source field
replaced with scene_id_to_idx. Code accessing the old field must
be updated to use the new scene_id_to_idx mapping and load data
sources on-demand.
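A minimal before/after sketch of this change; the field names come from the commit message, while the surrounding classes are simplified stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class RuntimeContext:
    # Old: scene_id_to_data_source: dict[str, SceneDataSource] (heavy, awkward to pickle)
    # New: an int index into the trajdata dataset, cheap to serialize
    scene_id_to_idx: dict[str, int] = field(default_factory=dict)

def load_in_worker(ctx: RuntimeContext, scene_id: str, dataset: list) -> object:
    # Workers reconstruct the data source on demand from the index mapping.
    return dataset[ctx.scene_id_to_idx[scene_id]]

ctx = RuntimeContext(scene_id_to_idx={"scene_a": 0, "scene_b": 1})
print(load_in_worker(ctx, "scene_b", ["src_a", "src_b"]))  # → src_b
```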
Adapt DaemonEngine to work with new RuntimeContext and support
on-demand scene data source loading from trajdata.

Changes:
- Store scene_id_to_idx mapping from RuntimeContext
- Create UnifiedDataset at engine startup for scene loading
- Add _get_data_source() method for lazy loading with caching
- Update build_pending_jobs_from_request to use callback pattern
- Pre-create scene_cache to avoid pickle errors

This enables daemon mode to efficiently handle multiple scenes
without loading all data upfront.
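The lazy-loading-with-caching pattern described for `_get_data_source()` could look roughly like this; the class body is a simplified assumption, not the actual DaemonEngine.

```python
class EngineSketch:
    """Simplified stand-in for DaemonEngine's on-demand scene loading."""

    def __init__(self, scene_id_to_idx: dict[str, int], dataset: list):
        self._scene_id_to_idx = scene_id_to_idx
        self._dataset = dataset  # stand-in for trajdata's UnifiedDataset
        self._cache: dict[str, object] = {}

    def _get_data_source(self, scene_id: str) -> object:
        # Each scene is materialized at most once; later calls hit the cache.
        if scene_id not in self._cache:
            self._cache[scene_id] = self._dataset[self._scene_id_to_idx[scene_id]]
        return self._cache[scene_id]

engine = EngineSketch({"s0": 0, "s1": 1}, ["source_0", "source_1"])
first = engine._get_data_source("s1")
second = engine._get_data_source("s1")
print(first is second)  # → True, the cached object is reused
```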
Add command-line tool for preprocessing scene data and building
trajdata cache before running simulations.

Features:
- Basic mode: preprocess all scenes in a dataset
- YAML config mode: batch process specific scenes from YAML files
- Central token mode: process scenes around specific tokens (NuPlan)
- Support for smooth_trajectories parameter
- Configurable cache rebuilding and vector map inclusion

Usage:
  # Basic preprocessing
  python -m alpasim_runtime.prepare_data --user-config user.yaml

  # With explicit parameters
  python -m alpasim_runtime.prepare_data \
    --desired-data nuplan_test \
    --data-dir /path/to/data \
    --cache-location /tmp/cache

This preprocessing step improves simulation startup time by
pre-building the trajdata cache.
Update worker processes and simulation entry points to use the
new SceneDataSource abstraction instead of direct Artifact access.

Changes:
- Worker IPC: PendingRolloutJob and AssignedRolloutJob use SceneDataSource
- Worker main: Pass data_source from job to rollout execution
- UnboundRollout: Accept SceneDataSource parameter
- Simulate CLI: Use RuntimeContext.scene_id_to_idx for scene lookup

This completes the migration from Artifact-based to SceneDataSource-based
data loading, enabling support for multiple data source backends.
Follow CONTRIBUTING.md naming conventions for coordinate frames.

Changes in trajdata_data_source.py:
- poses_vec3 → positions_agent_world
- poses_quat → quaternions_agent_world
- first_pose_position → position_ego_first_world
- first_pose_local → position_ego_first_local

These changes improve code readability by making coordinate frames
explicit in variable names, following the position_{what}_{frame}
naming pattern required by CONTRIBUTING.md.
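The pattern can be illustrated with a toy translation-only transform; the array values and the world-to-local math here are made up for demonstration.

```python
import numpy as np

# position_{what}_{frame}: the suffix names the coordinate frame explicitly.
positions_agent_world = np.array([[10.0, 5.0, 0.0], [11.0, 5.5, 0.0]])
position_ego_first_world = positions_agent_world[0]

# Toy world-to-local transform: translate so the first ego pose is the origin.
position_ego_first_local = position_ego_first_world - position_ego_first_world
print(position_ego_first_local)  # → [0. 0. 0.]
```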
Update all runtime tests to work with the new trajdata-based data flow:

- test_daemon_engine: Replace scene_id_to_artifact_path with scene_id_to_idx
  and mock scene_id_to_data_source cache in engine tests
- test_daemon_main: Remove usdz_glob parameter from DaemonEngine construction
- test_daemon_request_plumbing: Update build_pending_jobs_from_request tests
  to use get_data_source callable instead of scene_id_to_artifact_path dict
- test_config: Add unit tests for new DataSourceConfig class
- test_trajdata_integration: New integration tests for trajdata data source
  functionality, including caching, error handling, and job creation
- test_runtime_integration_replay: Add TODO note for future update (this
  manual integration test needs significant rework for new data flow)

All tests now reflect the shift from USDZ artifacts to lazy-loaded
SceneDataSource instances via trajdata UnifiedDataset.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation and wizard configuration to support the new
trajdata-based data flow:

Documentation updates:
- README: Add Quick Start section explaining data preparation
- ONBOARDING: Document trajdata dependency (alpasim branch)
- TUTORIAL: Add Data Preparation section with prepare_data examples
- TUTORIAL: Fix deprecated --usdz-glob references in debug examples

Wizard configuration:
- Add defines.trajdata_cache for unified cache location
- Add runtime.data_source with USDZ defaults
- Configure recursive scanning of all-usdzs directory
- Set sensible defaults: 10Hz, vector maps, 4 workers

This ensures users understand the data preparation workflow and
wizard-generated configs work without manual edits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Collaborator

@mwatson-nvidia left a comment


Thanks for the submission!

I have not tested this internally, and that is something that we'll want to do, I think, before proceeding with this PR. However, before I do this, I think it's worth addressing/discussing this first round of comments.

@@ -0,0 +1,1053 @@
# SPDX-License-Identifier: Apache-2.0
# Copyright (c) 2025-2026 NVIDIA Corporation
Collaborator


nit: for the new files, should be 2026 only. Applies elsewhere, too

Author


Sure, will apply.

rebuild_maps: false # Set to true to force rebuild maps

# Trajectory processing
desired_dt: 0.1 # 10 Hz sampling rate (matches control_timestep_us)
Collaborator


The comment seems to indicate that this needs to match the control_timestep_us. Is this the case, or is this just a conveniently high frequency? I suspect it might just be the latter, in which case I would drop this part of the comment as it may be misleading.

If this does need to match the control timestep, we might want to see about some derived parameter

Author


Good catch on the misleading comment. I'll remove it.

Regarding whether desired_dt should match control_timestep_us: they serve different purposes and should remain independent:

  • desired_dt: Data-layer parameter that controls trajectory sampling rate during cache preparation (via trajdata). This happens before simulation runs.
  • control_timestep_us: Runtime parameter that controls the simulation time step and control command frequency.

# Optional: Base path for MTGS rendering assets
asset_base_path: null

# YAML config mode parameters (for NuPlan central token processing)
Collaborator


It looks like this data structure is pretty flat, meaning that, as we add additional data source types, this may grow unwieldy and it will be hard for readers to understand which fields are needed for which data sources and which fields can affect which types of data sources. I wonder if it makes sense to try and switch this to a more hierarchical structure now. Do you have any thoughts on this?

Author

@WCJ-BERT Mar 20, 2026


What about this structure?

runtime:
  data_source:
    cache_location: "/path/to/cache"
    incl_vector_map: true
    rebuild_cache: false
    rebuild_maps: false
    num_workers: 4

    sources:
      usdz:
        enabled: true
        data_dir: "/path/to/usdz"
        desired_dt: 0.1
        extra_params:
          smooth_trajectories: true

      nuplan:
        enabled: false
        data_dir: "/path/to/nuplan"
        extra_params:
          config_dir: "/path/to/yaml"
          asset_base_path: null
          num_timesteps_before: 30
          num_timesteps_after: 80

      waymo:
        enabled: false
        data_dir: "/path/to/waymo"

I also introduced a GenericSourceConfig in config.py to support the hierarchical structure.

# Imports added for completeness; the original snippet omitted them. MISSING is
# assumed to come from omegaconf (Hydra-style configs).
from dataclasses import dataclass, field
from typing import Any, Dict

from omegaconf import MISSING


@dataclass
class GenericSourceConfig:
    """Generic configuration for any trajdata-supported dataset.

    This unified config supports all trajdata datasets (USDZ, NuPlan, Waymo,
    nuScenes, Lyft, Argoverse, etc.) with a flexible extra_params field for
    dataset-specific options.

    Attributes:
        enabled: Whether this data source is enabled
        data_dir: Path to dataset directory
        desired_dt: Desired time delta between trajectory frames in seconds
        incl_vector_map: Whether to load vector maps (roads, lanes, etc.)
        extra_params: Dataset-specific parameters (e.g., NuPlan's config_dir,
                      USDZ's asset_base_path, etc.)

    Example extra_params:
        - NuPlan: {"config_dir": "/path", "num_timesteps_before": 30, "num_timesteps_after": 80}
        - USDZ: {"asset_base_path": "/assets"}
        - Waymo: {} (no extra params needed)
    """

    enabled: bool = True
    data_dir: str = MISSING
    desired_dt: float = 0.1  # 10 Hz sampling
    extra_params: Dict[str, Any] = field(default_factory=dict)

camera_configs = list(simulation_config.cameras)

# Get time range from data source rig
rig_time_range_start = data_source.rig.trajectory.time_range_us.start
Collaborator


nit: I suspect the reason you added this was to make the lines shorter? but now we have some new variable that readers of the code need to internalize to understand the meaning. Suggest just using data_source.rig.trajectory.time_range_us.start and data_source.rig.trajectory.time_range_us.stop rather than introducing a new variable on the stack here

Author


Good suggestion - will change this.


enable_autoresume: false
# How many scenes (in particular maps) to cache in the worker local artifact cache.
artifact_cache_size: 10
Collaborator


is this still used?

Author


Not used. Will delete.

for unified data loading across different autonomous driving datasets. The trajdata library is
automatically installed via `uv` when you run `setup_local_env.sh`.

Before running simulations, you need to prepare a trajdata cache from your scene data.
Collaborator


This comment seems inaccurate (specifically the word "need"). As I understand it, this is an optional pre-step for optimization?

Author


Yeah, I'll change the wording.

self._config = None # Will be set during startup
self._dataset: UnifiedDataset | None = None
self._scene_id_to_idx: dict[str, int] = {}
self._scene_id_to_data_source: dict[str, SceneDataSource] = {}
Collaborator


we have 3 fields: self._dataset, self._scene_id_to_idx, and self._scene_id_to_data_source. I wonder if it would be better to package up these three concepts into a single class so we can pass a single argument? it could also help with readabiltiy

Author


Good suggestion - I plan to refactor the three related fields (_dataset, _scene_id_to_idx, _scene_id_to_data_source) into a single SceneLoader class. This should improve encapsulation and readability.

The SceneLoader is initialized from RuntimeContext and provides a single get_data_source(scene_id) method for lazy loading.

@@ -0,0 +1,720 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator


Concerns from Claude:

Several concerns:

  1. load_config_from_file unpacks a dataclass into a dict, then everything uses config["key"] / config.get("key", default). This throws away the type safety from DataSourceConfig. The function should just return the DataSourceConfig (or UserSimulatorConfig) and the callers should use it directly. The dict intermediary means defaults are re-specified in multiple places (e.g., num_workers defaults to 8 on lines 653 and 710, but 1 in the dataclass).

  2. Inconsistent defaults across paths. desired_dt defaults to 0.1 via CLI (line 415), 0.5 in preprocess_from_yaml_configs (line 164) and run_yaml_batch_preprocessing (line 555), and 0.1 in DataSourceConfig. num_workers defaults to 1 in the dataclass, 1 in CLI, but 8 in the config.get() calls (lines 653, 710). These will silently produce different behavior depending on which code path you take.

  3. Commented-out code. Lines 135-141 have a commented-out "extract every central_token" block — this is dead code that should either be implemented or removed.

  4. run_yaml_batch_preprocessing is a trivial wrapper. It just calls preprocess_from_yaml_configs and converts bool → int. It doesn't justify being a separate function.

  5. Deferred import in load_config_from_file (line 487). The codebase convention is imports at the top of the file.

  6. --verbose defaults to True with store_true. This means --verbose is always true unless --quiet is passed. The flag is a no-op — you can never not be verbose except via --quiet. Confusing UX.

  7. The --smooth-trajectories flag accepts string choices ["true", "false", "True", "False"] then parses them manually. A BooleanOptionalAction or simply --smooth-trajectories / --no-smooth-trajectories would be cleaner.

  8. The whole file is ~720 lines for what amounts to "call UnifiedDataset(...) with the right args." The real work is in trajdata. Most of the complexity here is plumbing config between two input formats (CLI args vs. YAML) and two modes (basic vs. YAML batch), all through a lossy dict intermediate. Simplifying the config flow (just use the dataclass end-to-end) would cut this significantly.
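Points 6 and 7 can both be addressed with `argparse.BooleanOptionalAction` (stdlib, Python 3.9+), which generates a paired `--flag` / `--no-flag` automatically; the flag name below follows the review, the rest is a sketch.

```python
import argparse

parser = argparse.ArgumentParser()
# Replaces choices=["true", "false", "True", "False"] plus manual parsing:
# this one declaration yields both --smooth-trajectories and
# --no-smooth-trajectories, with a real boolean default.
parser.add_argument(
    "--smooth-trajectories",
    action=argparse.BooleanOptionalAction,
    default=True,
)

print(parser.parse_args([]).smooth_trajectories)  # → True
print(parser.parse_args(["--no-smooth-trajectories"]).smooth_trajectories)  # → False
```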

# artifacts = {data_source.scene_id: data_source}
"""

from __future__ import annotations
Collaborator


Claude comments:

  1. The map property is a 250-line monster (lines 670-999). It has two completely different code paths (scene.map_data vs. dataset._map_api) with duplicated transformation logic, followed by ~100 lines of post-transform verification and data-type fixups. This should be broken into separate methods. The transform logic itself (transform_map_points) is defined as a nested closure, which makes it harder to test.

  2. Excessive defensive hasattr checks everywhere. The code is littered with hasattr(lane, "center"), hasattr(lane.center, "points"), hasattr(agent.extent, "length"), etc. If these are typed trajdata objects, these checks shouldn't be necessary. If the schema is genuinely unstable, that's a bigger problem. This reads like the author wasn't sure what the trajdata API actually guarantees.

  [mwatson]: this might actually be expected--I recall that trajdata has a lot of optional stuff. At the same time, we might do better to assume we have what we need and catch exceptions when fields don't exist? I'd like to hear your thoughts on this.

  3. The _extract_agent_trajectory state accessor pattern (lines 305-330) is wild. It does state.get_attr("x") if hasattr(state, "get_attr") else state.x, then checks if the result is scalar, wraps it in an array, then immediately unwraps it with float(x[0] if x.ndim > 0 else x). This suggests uncertainty about what get_raw_state returns — it should be pinned down once and handled simply.

  4. n/a

  5. Coordinate transformation is copy-pasted three times. Ego trajectory (line 424), traffic objects (line 609), and map (lines 805-838) all apply positions + translation. This should be a single utility function.

  6. from_agent_batch is dead code. It calls _load_from_batch which immediately raises NotImplementedError. This should be removed — it's not a stub, it's misleading.

  7. The smoothing code in traffic_objects (lines 618-642) has a deferred import of csaps inside a loop body. This violates the codebase convention (imports at the top of the file), and the import happens once per traffic object if smoothing is enabled.

  8. metadata generates random UUIDs and datetime.now() (lines 1038-1039). This makes it non-deterministic — running the same scene twice produces different metadata. dataset_hash being a random UUID defeats the purpose of a hash.

  9. Implicit ordering dependency. traffic_objects silently forces self.rig to be loaded first (line 605) to get world_to_nre. map does the same (lines 691-693, 766-768). This coupling is hidden — if someone accesses map or traffic_objects before rig, it triggers a chain of lazy loading with side effects. The dependency should be explicit (e.g., require world_to_nre as a constructor argument, or compute it once eagerly).

  10. _scene_id as a mutable dataclass field with a property setter (lines 82-187) is odd for what should be an immutable identifier. The scene_id property has three resolution paths (the field, _scene.name, or raise). This complexity isn't justified.
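Point 5 suggests one shared helper; a sketch of what that utility could look like, with simplified rigid-transform semantics assumed here:

```python
import numpy as np

def transform_points_world_to_local(
    points_world: np.ndarray, rotation: np.ndarray, translation: np.ndarray
) -> np.ndarray:
    """Apply p_local = R @ p_world + t, vectorized over an (N, 3) array.

    One utility callable from the ego-trajectory, traffic-object, and map
    paths instead of three copy-pasted loops.
    """
    return points_world @ rotation.T + translation

# Identity rotation plus a translation: each point just shifts by the offset.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
out = transform_points_world_to_local(pts, np.eye(3), np.array([1.0, 2.0, 0.0]))
print(out)
```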




@runtime_checkable
class SceneDataSource(Protocol):
Collaborator


I feel like there should probably be a tie between this and the existing Artifact. For instance, should we indicate that Artifact implements this protocol, or just keep with the current duck typed approach?

Author


Good question about the relationship between Artifact and SceneDataSource!

Design Intent:
Our goal is to unify all data sources (USDZ, NuPlan, Waymo) through trajdata, using the SceneDataSource protocol as the user-facing interface.

Current Architecture:

  • SceneDataSource - Protocol defining the unified interface for Runtime
  • TrajdataDataSource - Implements SceneDataSource, provides unified access to all data formats
  • Artifact - Internal utility for reading USDZ files, used by trajdata during USDZ data conversion

Why Artifact Still Exists:
Artifact is still needed as an internal component. When trajdata processes USDZ data, it uses Artifact under the hood to read the USDZ file format.

In terms of protocols:

  • Artifact and SceneDataSource operate at different abstraction levels
  • Artifact is an implementation detail (USDZ file reader)
  • SceneDataSource is the public interface (unified data access)

What do you think of this design?
