Fix 1 — Double parquet I/O (utils.py:392–414)
The old code read every .parquet file twice: once with no
arguments (a full read of all columns) just to check .empty,
then again with columns= and filters=. For samples with many
parquet files (QCD has hundreds), this doubled all disk I/O.
The empty check also hid a pointless duplicate in the error
handling: the entire except block was identical to the try
block.
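A minimal sketch of the single-read pattern. The function name load_sample is hypothetical, and the reader is parameterized here so the example is self-contained; the real code passes columns= and filters= to pd.read_parquet:

```python
import pandas as pd


def load_sample(paths, columns, read_fn=pd.read_parquet):
    """Read each file once with column projection and check .empty
    on that same result, instead of doing a full read first."""
    frames = []
    for p in paths:
        df = read_fn(p, columns=columns)  # single read, projected columns
        if df.empty:
            continue
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```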
Fix 2 — Repeated .to_numpy() in TXbb SF loop
(PostProcess.py:807–828)
For each (wp, j) bin pair and each jet (ijet), the same
H{ijet}TXbb and H{ijet}Pt columns were converted to numpy
three times: once for nominal, once for stat_up, once for
stat_dn. With 4 WPs × 1 pT bin × 2 jets, that is 24 redundant
array copies. The arrays are now cached as txbb_arr / pt_arr /
pt_range before the three calls.
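The caching pattern, sketched with a stand-in SF lookup (txbb_sf_variations and the lambda below are illustrative, not the actual analysis code; the column names follow the H{ijet}TXbb / H{ijet}Pt convention from the fix):

```python
import numpy as np
import pandas as pd


def txbb_sf_variations(events, sf_lookup, ijet):
    """Convert each column to numpy once, then reuse the cached arrays
    for all three variations instead of calling .to_numpy() per variation."""
    txbb_arr = events[f"H{ijet}TXbb"].to_numpy()
    pt_arr = events[f"H{ijet}Pt"].to_numpy()
    return {
        var: sf_lookup(txbb_arr, pt_arr, var)
        for var in ("nominal", "stat_up", "stat_dn")
    }
```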
Fix 3 — deepcopy → plain boolean indexing
(postprocessing.py:741)
deepcopy(df[mask]) recursively copies every numpy array
inside the DataFrame. But boolean indexing already creates a
new DataFrame with its own data — deepcopy is never needed
here since sig_events is only read (not mutated) in
subsequent code.
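This can be verified directly: the result of boolean indexing owns its data, so mutating it leaves the original untouched (toy example, not the analysis code):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
sub = df[df["x"] > 2]   # new DataFrame; no deepcopy required
sub.loc[:, "x"] = 0     # mutate the selection...
print(df["x"].tolist())  # ...the original is unchanged: [1, 2, 3, 4]
```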
Fix 4 — .copy() before pd.concat (postprocessing.py:554)
pd.concat always creates a new DataFrame regardless of
whether the inputs are copied first. The .copy() on every
per-year frame in scale_processes paths was wasting O(N_years
× frame_size) memory before the concat result was even
created.
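A quick demonstration that pd.concat already detaches its output from the inputs, so pre-copying each per-year frame only wastes memory (year keys are illustrative):

```python
import pandas as pd

frames = {
    "2022": pd.DataFrame({"w": [1.0]}),
    "2023": pd.DataFrame({"w": [2.0]}),
}
combined = pd.concat(frames.values(), ignore_index=True)  # no per-frame .copy()
combined.loc[0, "w"] = 99.0
print(frames["2022"]["w"].tolist())  # inputs untouched: [1.0]
```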
Fix 5 — list(set(columns)) → dict.fromkeys()
(PostProcess.py:1097)
set() destroys insertion order, meaning df[columns] could
silently reorder output columns each run. dict.fromkeys()
deduplicates while preserving the order columns were first
added.
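The two deduplication idioms compared (column names are made up):

```python
cols = ["pt", "eta", "pt", "msd", "eta"]

ordered = list(dict.fromkeys(cols))  # keeps first-seen order
print(ordered)                       # ['pt', 'eta', 'msd']

# list(set(cols)) also deduplicates, but its order is an
# implementation detail and can vary between runs/interpreters.
```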
Fix 6 — del events_dict (PostProcess.py:740)
The raw parquet DataFrame (events_dict) was held in memory
through the entire categorization loop — even though all its
data had already been extracted into bdt_events. For QCD or
ttbar this is O(GB). Deleting it immediately after the
training-event-removal block (the last place it's accessed)
frees that memory before the per-jshift categorization
starts.
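The effect of the early del can be seen with a weak reference: under CPython's reference counting, the object is reclaimed as soon as its last reference is dropped (stand-in object, not the real DataFrame):

```python
import weakref


class RawEvents:  # stand-in for the large parquet DataFrame dict
    pass


events_dict = RawEvents()
probe = weakref.ref(events_dict)
bdt_events = {"extracted": True}  # pretend the needed columns were copied out
del events_dict                   # last reference dropped before the long loop
print(probe() is None)            # True: the memory is freed immediately
```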
Refactor the template production loop in PostProcess.py to process one sample at a time rather than passing all samples to get_templates simultaneously: for each sample, histograms are generated individually and accumulated via addition into the final templates dict, and the sample's dataframe is freed after processing to reduce memory pressure.

To support this, get_templates in postprocessing.py accepts a new optional all_hist_samples parameter. When provided, it is used directly for the StrCategory histogram axis, ensuring all per-sample histograms share the same axis and can be accumulated. The sig_key access is also guarded so that get_templates handles single-sample dicts correctly, and a helper _compute_all_hist_samples pre-computes the full set of histogram sample names (base samples + txbb signal shifts + weight-shift variations) needed to populate the axis up front.

For the single-year case, events_combined is now a shallow copy of events_dict_postprocess, so that per-sample deletion cannot corrupt the event_list section.
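A minimal sketch of the accumulate-and-free loop, using plain numpy histograms in place of the actual hist objects. The names all_hist_samples and get_sample_templates mirror the description above but are illustrative; the per-row layout stands in for the shared StrCategory axis:

```python
import numpy as np


def get_sample_templates(name, values, bins, all_hist_samples):
    """One histogram per call, laid out on the full sample axis so
    results from different samples can simply be added together."""
    out = np.zeros((len(all_hist_samples), len(bins) - 1))
    out[all_hist_samples.index(name)] = np.histogram(values, bins=bins)[0]
    return out


all_hist_samples = ["qcd", "ttbar", "hh4b"]  # precomputed up front
bins = np.linspace(0.0, 10.0, 6)
events = {"qcd": np.array([1.0, 2.0]), "ttbar": np.array([9.0])}

templates = np.zeros((len(all_hist_samples), len(bins) - 1))
for name in list(events):
    templates += get_sample_templates(name, events[name], bins, all_hist_samples)
    del events[name]  # free each sample's data right after it is histogrammed
```

Because every per-sample histogram is defined on the same full axis, the running sum is equivalent to histogramming all samples at once, while only one sample's dataframe needs to be resident at a time.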
### Update 02/19/2026:
### Update 02/25/2026: