Skip to content

Prevent util.execute from reading output files#296

Open
anabiman wants to merge 11 commits intoMolSSI:masterfrom
anabiman:execute
Open

Prevent util.execute from reading output files#296
anabiman wants to merge 11 commits intoMolSSI:masterfrom
anabiman:execute

Conversation

@anabiman
Copy link
Copy Markdown
Collaborator

@anabiman anabiman commented Apr 12, 2021

Description

Adds an extra keyword (outfiles_load) to util.execute that instructs disk_files how to handle output files. If outfiles_load=True (default), the output files are read and their contents returned in outfiles. When outfiles_load=False, only the file paths are returned instead. This is useful and can be necessary when output files are large so keeping track of the output files while retaining the scratch dir becomes a viable alternative:

inp = {
    "command": ...,
    "outfiles": [huge_data_file, ...],
    "outfiles_load": False, # do not load outfiles in memory
    "scratch_messy": True, # do not delete scratch dir
}

_, proc = execute(**inp)

proc["outfiles"][huge_data_file] -> Path

The default behavior is preserved (outfiles_load=True in util.execute), so by default outfiles stores the file contents i.e.

inp = {
    "command": ...,
    "outfiles": [small_data_file, ...],
}

_, proc = execute(**inp)

proc["outfiles"][small_data_file] -> Union[str, bytes]

Changelog description

util.execute supports tracking output files without loading them in memory

Status

  • Code base linted
  • Ready to go

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 12, 2021

Codecov Report

Merging #296 (a3d0ace) into master (43639f2) will increase coverage by 0.19%.
The diff coverage is 100.00%.

@loriab
Copy link
Copy Markdown
Collaborator

loriab commented May 3, 2021

What about making it like binary where only the named files are not read into memory?

Perhaps add a note that scratch_messy=True required to avoid surprise.

@anabiman
Copy link
Copy Markdown
Collaborator Author

anabiman commented May 4, 2021

I like your suggestion. I'll refine this and improve the docstring.

qcengine/util.py Outdated
infiles: Optional[Dict[str, str]] = None,
outfiles: Optional[List[str]] = None,
*,
outfiles_track: Optional[List[str]] = [],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point but shouldn't matter here since outfiles_track should never be modified, but if you're concerned someone might I could change its type to an immutable object like a Tuple OR if you prefer a list for consistency that is by default None, I could add a single line to disk_files:

outfiles_track = outfiles_track or []

that would make the current code work with no additional changes.

qcengine/util.py Outdated
except (OSError, FileNotFoundError):
if "*" in fl:
filename = lwd / fl
if fl not in outfiles_track:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you handling the case when outfiles=["subdir/*"], outfiles_track=["subdir/bigfile"]?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this will not work (bigfile will be loaded in memory). For outfiles_track to work, it has to be a subset of outfiles. So for the following input:

{
    ...
    "outfiles": ["*"],
    "outfiles_track": ["bigfile"],
    ...
}

or the opposite:

{
    ...
    "outfiles": ["bigfile", ...],
    "outfiles_track": ["*"],
    ...
}

outfiles_track will be ignored. There are also no sanity checks being done on outfiles_track so if one specifies a filename not part of outfiles, it will be ignored without any warnings.

I guess the main advantage in setting outfiles_load: bool is we don't have to worry about individual cases like these.

@loriab loriab added the LocalCompute Using qcng as front-end for large jobs, rather than distributed compute label Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

LocalCompute Using qcng as front-end for large jobs, rather than distributed compute

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants