Move central run_etl into each example#38
Conversation
|
|
||
| pytest: | ||
| pip install -e ".[dev]" && \ | ||
| pytest |
There was a problem hiding this comment.
Can also do pytest ./tests/?
There was a problem hiding this comment.
Aha - This is a good point to discuss.
Previously that's what was in effect happening, and that worked fine, because the imports were very explicit (full paths relative to the root of the repo).
Now however, that doesn't work because, we don't add examples to the PYTHONPATH anymore.
And therefore, the paths are being modified using sys.path.
Multiple examples have files with the same name (data_sources.py for example). And even if I remove the sys.path modifications at what I think are the right locations - This somehow creates some path pollution.
So my options were:
- Ensure file names are unique (hard to guarantee given that users might only care about one example)
- Separate the tests out and run them one module at a time
I opted for the 2nd option.
However, if you know of a way to overcome this, I'd be very grateful 😄
|
/blossom-ci |
When Curator was first designed, we centralized a
run_etl.pyscript in the core Curator package, which did most of the orchestration of the ETL pipeline.However, this required users to add the
examplesdirectory (or any other directory where their data processing code lived) to thePYTHONPATH, so that the run_etl.py script could import the code necessary to execute it.This is not ideal, and is generally error prone (and we have received multiple pieces of feedback around this).
Additionally, other PhysicsNeMo repositories do NOT have such a central
run_*.pyin the core package, and instead, they have similar scripts in each example directory (typically,train.py,infer.py, etc.).This PR therefore removes the
run_etl.pyfrom the core package, and moves them into the respective examples directories.Several other changes had to be made to the overall codebase as a result of this.