Move central run_etl into each example by saikrishnanc-nv · Pull Request #38 · NVIDIA/physicsnemo-curator

saikrishnanc-nv · 2025-11-19T00:01:46Z

When Curator was first designed, we centralized a run_etl.py script in the core Curator package; it handled most of the orchestration of the ETL pipeline.
However, this required users to add the examples directory (or whichever directory held their data processing code) to PYTHONPATH, so that run_etl.py could import the code it needed to execute.
This is error prone, and we have received multiple pieces of feedback about it.

Additionally, other PhysicsNeMo repositories do NOT have such a central run_*.py in the core package; instead, they keep similar scripts in each example directory (typically train.py, infer.py, etc.).

This PR therefore removes run_etl.py from the core package and moves it into each of the respective example directories.
Several other changes had to be made to the overall codebase as a result.
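
As a rough illustration of why a per-example script avoids the PYTHONPATH requirement: when Python runs a script by path, it puts the script's own directory at the front of sys.path, so sibling modules import without extra setup. The sketch below builds a throwaway example directory and runs its script with PYTHONPATH removed. The layout (examples/demo), the file contents, and the helper names are hypothetical and are not taken from the Curator repository.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_demo_example(root: Path) -> Path:
    """Create a throwaway example directory (hypothetical layout, not Curator's)."""
    example_dir = root / "examples" / "demo"
    example_dir.mkdir(parents=True)
    (example_dir / "data_sources.py").write_text("NAME = 'demo data source'\n")
    # The per-example script imports its sibling module by bare name.
    (example_dir / "run_etl.py").write_text(
        "import data_sources\nprint(data_sources.NAME)\n"
    )
    return example_dir


def run_example_script(example_dir: Path) -> str:
    """Run <example_dir>/run_etl.py without PYTHONPATH and return its stdout."""
    # Drop PYTHONPATH (and PYTHONSAFEPATH, which would suppress the
    # script-directory entry in sys.path) from the child environment.
    env = {
        k: v
        for k, v in os.environ.items()
        if k not in {"PYTHONPATH", "PYTHONSAFEPATH"}
    }
    result = subprocess.run(
        [sys.executable, str(example_dir / "run_etl.py")],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_example_script(make_demo_example(Path(tmp))))
```

The import succeeds because the script lives next to data_sources.py, not because of any environment configuration.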

Alexey-Kamenev

LGTM!

Alexey-Kamenev · 2025-11-19T00:57:15Z

Makefile


 pytest:
 	pip install -e ".[dev]" && \
-	pytest


Can also do pytest ./tests/?

Aha, this is a good point to discuss.
Previously that is effectively what was happening, and it worked fine because the imports were very explicit (full paths relative to the root of the repo).
Now, however, that doesn't work, because we don't add examples to the PYTHONPATH anymore.
Instead, the import paths are modified using sys.path.
Multiple examples have files with the same name (data_sources.py, for example), and even if I remove the sys.path modifications at what I think are the right locations, this somehow creates some path pollution.
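
One plausible mechanism for this kind of pollution (an assumption; the thread does not diagnose the actual cause) is Python's module cache. Once a module named data_sources has been imported, it stays in sys.modules, so undoing the sys.path change does not stop a later import of the same bare name from returning the first module. A minimal sketch, with hypothetical directory names example_a and example_b:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(directory: Path, module_name: str):
    """Import module_name with directory temporarily at the front of sys.path."""
    sys.path.insert(0, str(directory))
    try:
        return importlib.import_module(module_name)
    finally:
        # The path entry is removed again, but sys.modules keeps the module.
        sys.path.remove(str(directory))


def demo_collision() -> tuple[str, str, str]:
    """Return the OWNER seen on first import, on a cached import, and after eviction."""
    with tempfile.TemporaryDirectory() as tmp:
        dirs = []
        for example in ("example_a", "example_b"):  # hypothetical example names
            d = Path(tmp) / example
            d.mkdir()
            (d / "data_sources.py").write_text(f"OWNER = {example!r}\n")
            dirs.append(d)

        sys.modules.pop("data_sources", None)  # start from a clean cache
        first = import_from_dir(dirs[0], "data_sources").OWNER
        # Same bare name, different directory: the cached module is returned.
        stale = import_from_dir(dirs[1], "data_sources").OWNER
        # Evicting the cache entry lets the second directory's file load.
        sys.modules.pop("data_sources", None)
        fresh = import_from_dir(dirs[1], "data_sources").OWNER
        sys.modules.pop("data_sources", None)  # leave no residue behind
    return first, stale, fresh


if __name__ == "__main__":
    print(demo_collision())
```

Within a single interpreter, the second import returns example_a's module even though only example_b's directory is on sys.path at that point.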
So my options were:

1. Ensure file names are unique (hard to guarantee given that users might only care about one example)
2. Separate the tests out and run them one module at a time

I opted for the 2nd option.
However, if you know of a way to overcome this, I'd be very grateful 😄
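
The second option can be sketched as one pytest invocation per test directory, each in its own interpreter, so that nothing cached in sys.path or sys.modules carries over between runs. This is only an illustration of the idea, not the Makefile change made in this PR; the function names and the directory names in the usage note are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def pytest_commands(test_dirs: list[Path]) -> list[list[str]]:
    """Build one pytest command per directory, each for a fresh interpreter."""
    return [[sys.executable, "-m", "pytest", str(d)] for d in test_dirs]


def run_isolated(test_dirs: list[Path]) -> int:
    """Run the commands in order; return the first non-zero exit code, else 0."""
    for cmd in pytest_commands(test_dirs):
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code  # stop at the first failing directory
    return 0
```

A call such as run_isolated([Path("tests"), Path("examples/demo/tests")]) would run the two suites back to back and stop at the first failure, much like chaining shell commands with &&.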

saikrishnanc-nv · 2025-11-21T19:53:10Z

/blossom-ci

saikrishnanc-nv added 2 commits November 18, 2025 15:48

Refactored out run_etl.py

bc089b4

Modified pytest section in CI

fda5485

saikrishnanc-nv marked this pull request as draft November 19, 2025 00:02

saikrishnanc-nv requested a review from Alexey-Kamenev November 19, 2025 00:02

Alexey-Kamenev previously approved these changes Nov 19, 2025

Updated documentation

37bf178

saikrishnanc-nv dismissed Alexey-Kamenev’s stale review via 37bf178 November 21, 2025 19:52

saikrishnanc-nv self-assigned this Nov 21, 2025

saikrishnanc-nv marked this pull request as ready for review November 21, 2025 19:53

saikrishnanc-nv merged commit 1cde7d8 into NVIDIA:main Nov 21, 2025
1 check passed

saikrishnanc-nv merged 3 commits into NVIDIA:main from saikrishnanc-nv:saikrishnanc/refactor_run_etl
