PostProcess.py: add guardrail tests and fix OOM memory inefficiencies #2

Merged
dprim7 merged 3 commits into hh4b-refactor from copilot/add-memory-guardrails-tests on Feb 20, 2026
Copilot AI commented Feb 20, 2026

PostProcess.py caused OOM crashes due to redundant in-memory DataFrame copies accumulating across the per-sample processing loop in load_process_run3_samples.

Tests (tests/test_postprocess.py)

35 new unit and integration tests covering all key pure-Python functions:

  • add_bdt_scores — binary, 3-class, 4-class, 5-class; with/without JEC shift suffix
  • get_jets_for_txbb_sf — all process type branches
  • get_nevents_data/signal/nosignal — sideband counting, mass window selection, cut masking
  • fom_classic / fom_update — basic cases, zero-signal/background edges, ABCD correction
  • Category assignment — VBF/Bin1/Bin2/Bin3/Fail logic, priority flag behavior
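As an illustration of the shape such guardrail tests can take, here is a minimal sketch for mass-window counting with cut masking. The helper `count_in_mass_window`, its signature, and the column names are hypothetical stand-ins written for this example; they are not the signatures of `get_nevents_data/signal/nosignal` in PostProcess.py.

```python
import pandas as pd


def count_in_mass_window(events, mass_col, window, cut_mask=None):
    """Hypothetical stand-in: weighted event count inside [lo, hi]."""
    lo, hi = window
    selected = (events[mass_col] >= lo) & (events[mass_col] <= hi)
    if cut_mask is not None:
        selected = selected & cut_mask  # apply the extra selection
    return float(events.loc[selected, "weight"].sum())


def make_events():
    # Four toy events with distinct weights so each selection is identifiable.
    return pd.DataFrame(
        {"mass": [90.0, 120.0, 130.0, 200.0], "weight": [1.0, 2.0, 3.0, 4.0]}
    )


def test_mass_window_selection():
    # Only the events at 120 and 130 fall inside the window.
    assert count_in_mass_window(make_events(), "mass", (110.0, 140.0)) == 5.0


def test_cut_mask_is_applied():
    # The mask removes the event at 120, leaving only the one at 130.
    mask = pd.Series([True, False, True, True])
    assert count_in_mass_window(make_events(), "mass", (110.0, 140.0), mask) == 3.0


def test_empty_selection_returns_zero():
    # No event lies in this window, so the weighted count is zero.
    assert count_in_mass_window(make_events(), "mass", (300.0, 400.0)) == 0.0
```

Small hand-built frames with distinct weights make it obvious which events contributed to each count, which is useful when a later refactor changes the selection logic.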

Memory fixes (load_process_run3_samples)

1. Progressive jshift merge — eliminates holding all JEC/JMR shift DataFrames simultaneously before concat:

# Before: holds all copies in memory until concat completes
bdt_events = pd.concat([bdt_events[jshift] for jshift in jshifts], axis=1)

# After: each DataFrame freed as soon as its new columns are transferred
bdt_events_merged = bdt_events.pop(jshifts[0])
for jshift in jshifts[1:]:
    _df = bdt_events.pop(jshift)
    for col in _df.columns:
        if col not in bdt_events_merged.columns:
            bdt_events_merged[col] = _df[col]
bdt_events = bdt_events_merged
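The progressive merge can be tried on toy data. The sketch below assumes `bdt_events` is a dict mapping each shift name to a DataFrame on a shared index; the shift names and column names are made up for illustration. One behavioral difference is worth noting: `pd.concat(..., axis=1)` keeps duplicate column labels, whereas the progressive version keeps only the first occurrence of each column.

```python
import numpy as np
import pandas as pd


def merge_shifts_progressively(frames, shifts):
    """Merge per-shift DataFrames column-wise, emptying `frames` in place."""
    merged = frames.pop(shifts[0])
    for shift in shifts[1:]:
        df = frames.pop(shift)  # the dict no longer references this frame
        for col in df.columns:
            if col not in merged.columns:  # first occurrence of a column wins
                merged[col] = df[col]
        del df  # drop the local reference before handling the next shift
    return merged


index = pd.RangeIndex(4)
frames = {
    "": pd.DataFrame(
        {"weight": np.ones(4), "bdt_score": np.arange(4.0)}, index=index
    ),
    "JES_up": pd.DataFrame(
        {"weight": np.ones(4), "bdt_score_JES_up": np.arange(4.0) + 0.1},
        index=index,
    ),
    "JES_down": pd.DataFrame(
        {"weight": np.ones(4), "bdt_score_JES_down": np.arange(4.0) - 0.1},
        index=index,
    ),
}

merged = merge_shifts_progressively(frames, ["", "JES_up", "JES_down"])
print(list(merged.columns))
```

Here the shared `weight` column appears once in the result, and the input dict is empty afterwards, so at most one shift frame is alive besides the merged one at any time.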

2. Direct column assignment for more_vars and variation_vars — removes two intermediate pd.DataFrame() + pd.concat() round-trips:

# Before (two intermediate copies per block)
temp_df = pd.DataFrame(more_vars, index=bdt_events.index)
bdt_events = pd.concat([bdt_events, temp_df], axis=1)

# After
for col, values in more_vars.items():
    if col not in bdt_events.columns:
        bdt_events[col] = values
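A small runnable sketch of the direct-assignment pattern follows; the column name `H1Msd` and the values are invented for the example. It assumes each entry in `more_vars` is an array of the same length as `bdt_events`; a `pd.Series` would instead be aligned on the index.

```python
import numpy as np
import pandas as pd

bdt_events = pd.DataFrame({"bdt_score": [0.1, 0.7, 0.9]})

more_vars = {
    "H1Msd": np.array([110.0, 125.0, 131.0]),  # new column, gets added
    "bdt_score": np.array([0.0, 0.0, 0.0]),    # already present, gets skipped
}

# Assign each new column in place instead of building a temporary
# DataFrame and concatenating it onto bdt_events.
for col, values in more_vars.items():
    if col not in bdt_events.columns:
        bdt_events[col] = values

print(bdt_events)
```

The `if col not in bdt_events.columns` guard means existing columns are never overwritten, so the original `bdt_score` values survive.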

3. Eager events_dict deletion — frees the full raw sample DataFrame (the largest object per iteration) immediately after its last use (calculate_trigger_weights), rather than holding it alongside the growing bdt_events for the rest of the loop body.
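The eager-deletion pattern might look roughly like this in isolation. `load_sample` and `toy_trigger_weights` are hypothetical stand-ins for the real loader and for `calculate_trigger_weights`, and the columns are invented. Note that `del` only removes the name; the memory is reclaimed only if nothing else still references the frame.

```python
import numpy as np
import pandas as pd


def load_sample(name, n=1_000):
    """Hypothetical loader standing in for reading one raw sample."""
    rng = np.random.default_rng(0)
    return pd.DataFrame(
        {"pt": rng.uniform(300, 1000, n), "payload": rng.normal(size=n)}
    )


def toy_trigger_weights(events):
    """Hypothetical stand-in for calculate_trigger_weights; returns a new array."""
    return np.where(events["pt"].to_numpy() > 500, 1.0, 0.9)


results = {}
for name in ["sample_a", "sample_b"]:
    events_dict = load_sample(name)
    bdt_events = pd.DataFrame(index=events_dict.index)
    bdt_events["trigger_weight"] = toy_trigger_weights(events_dict)

    # Last use of the raw frame was the line above: drop the name now so the
    # frame is not kept alive for the rest of the loop body.
    del events_dict

    # The remaining per-sample work only needs the smaller bdt_events.
    bdt_events["weight"] = bdt_events["trigger_weight"] * 2.0
    results[name] = bdt_events
```

Because the helper returns a fresh array rather than a view into the raw frame, nothing in `bdt_events` holds on to the raw sample's data after the `del`.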



Copilot AI and others added 2 commits February 20, 2026 06:05
Co-authored-by: dprim7 <38478631+dprim7@users.noreply.github.com>
Co-authored-by: dprim7 <38478631+dprim7@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add unit and integration tests for PostProcess functionalities" to "PostProcess.py: add guardrail tests and fix OOM memory inefficiencies" Feb 20, 2026
Copilot AI requested a review from dprim7 February 20, 2026 06:11
dprim7 (Owner) commented Feb 20, 2026

@copilot make PR into new branch called hh4b-dev

dprim7 changed the base branch from main to hh4b-refactor Feb 20, 2026 19:48
dprim7 marked this pull request as ready for review Feb 20, 2026 19:49
dprim7 merged commit 1750413 into hh4b-refactor Feb 20, 2026
3 of 8 checks passed

2 participants