Fix 1 — Double parquet I/O (utils.py:392–414)
The old code read every .parquet file twice: once with no
arguments (a full read of all columns) just to check .empty,
then again with columns= and filters=. For samples with many
parquet files (QCD has hundreds), this doubled all disk I/O.
The empty check also hid a pointless duplicate in the error
handling: the entire except block was identical to the try
block.
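A minimal sketch of the single-read pattern. The function name load_sample is hypothetical, and the reader is parameterized here so the example is self-contained; the real code passes columns= and filters= to pd.read_parquet:

```python
import pandas as pd


def load_sample(paths, columns, read_fn=pd.read_parquet):
    """Read each file once with column projection and check .empty
    on that same result, instead of doing a full read first."""
    frames = []
    for p in paths:
        df = read_fn(p, columns=columns)  # single read, projected columns
        if df.empty:
            continue
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```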
Fix 2 — Repeated .to_numpy() in TXbb SF loop
(PostProcess.py:807–828)
For each (wp, j) bin pair and each jet (ijet), the same
H{ijet}TXbb and H{ijet}Pt columns were converted to numpy
three times: once for nominal, once for stat_up, once for
stat_dn. With 4 WPs × 1 pT bin × 2 jets, that is 24 redundant
array copies. The arrays are now cached as txbb_arr / pt_arr /
pt_range before the three calls.
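The caching pattern, sketched with a stand-in SF lookup (txbb_sf_variations and the lambda below are illustrative, not the actual analysis code; the column names follow the H{ijet}TXbb / H{ijet}Pt convention from the fix):

```python
import numpy as np
import pandas as pd


def txbb_sf_variations(events, sf_lookup, ijet):
    """Convert each column to numpy once, then reuse the cached arrays
    for all three variations instead of calling .to_numpy() per variation."""
    txbb_arr = events[f"H{ijet}TXbb"].to_numpy()
    pt_arr = events[f"H{ijet}Pt"].to_numpy()
    return {
        var: sf_lookup(txbb_arr, pt_arr, var)
        for var in ("nominal", "stat_up", "stat_dn")
    }
```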
Fix 3 — deepcopy → plain boolean indexing
(postprocessing.py:741)
deepcopy(df[mask]) recursively copies every numpy array
inside the DataFrame. But boolean indexing already creates a
new DataFrame with its own data — deepcopy is never needed
here since sig_events is only read (not mutated) in
subsequent code.
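This can be verified directly: the result of boolean indexing owns its data, so mutating it leaves the original untouched (toy example, not the analysis code):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
sub = df[df["x"] > 2]   # new DataFrame; no deepcopy required
sub.loc[:, "x"] = 0     # mutate the selection...
print(df["x"].tolist())  # ...the original is unchanged: [1, 2, 3, 4]
```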
Fix 4 — .copy() before pd.concat (postprocessing.py:554)
pd.concat always creates a new DataFrame regardless of
whether the inputs are copied first. The .copy() on every
per-year frame in scale_processes paths was wasting O(N_years
× frame_size) memory before the concat result was even
created.
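A quick demonstration that pd.concat already detaches its output from the inputs, so pre-copying each per-year frame only wastes memory (year keys are illustrative):

```python
import pandas as pd

frames = {
    "2022": pd.DataFrame({"w": [1.0]}),
    "2023": pd.DataFrame({"w": [2.0]}),
}
combined = pd.concat(frames.values(), ignore_index=True)  # no per-frame .copy()
combined.loc[0, "w"] = 99.0
print(frames["2022"]["w"].tolist())  # inputs untouched: [1.0]
```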
Fix 5 — list(set(columns)) → dict.fromkeys()
(PostProcess.py:1097)
set() destroys insertion order, meaning df[columns] could
silently reorder output columns each run. dict.fromkeys()
deduplicates while preserving the order columns were first
added.
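The two deduplication idioms compared (column names are made up):

```python
cols = ["pt", "eta", "pt", "msd", "eta"]

ordered = list(dict.fromkeys(cols))  # keeps first-seen order
print(ordered)                       # ['pt', 'eta', 'msd']

# list(set(cols)) also deduplicates, but its order is an
# implementation detail and can vary between runs/interpreters.
```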
Fix 6 — del events_dict (PostProcess.py:740)
The raw parquet DataFrame (events_dict) was held in memory
through the entire categorization loop — even though all its
data had already been extracted into bdt_events. For QCD or
ttbar this is O(GB). Deleting it immediately after the
training-event-removal block (the last place it's accessed)
frees that memory before the per-jshift categorization
starts.
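The effect of the early del can be seen with a weak reference: under CPython's reference counting, the object is reclaimed as soon as its last reference is dropped (stand-in object, not the real DataFrame):

```python
import weakref


class RawEvents:  # stand-in for the large parquet DataFrame dict
    pass


events_dict = RawEvents()
probe = weakref.ref(events_dict)
bdt_events = {"extracted": True}  # pretend the needed columns were copied out
del events_dict                   # last reference dropped before the long loop
print(probe() is None)            # True: the memory is freed immediately
```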
Refactor the template production loop in PostProcess.py to process one sample at a time rather than passing all samples to get_templates simultaneously: for each sample, histograms are generated individually and accumulated via addition into the final templates dict, and the sample's dataframe is freed after processing to reduce memory pressure.

To support this, get_templates in postprocessing.py accepts a new optional all_hist_samples parameter. When provided, it is used directly for the StrCategory histogram axis, ensuring all per-sample histograms share the same axis and can be accumulated. The sig_key access is also guarded so that get_templates handles single-sample dicts correctly, and a helper _compute_all_hist_samples pre-computes the full set of histogram sample names (base samples + txbb signal shifts + weight-shift variations) needed to populate the axis up front.

For the single-year case, events_combined is now a shallow copy of events_dict_postprocess, so that per-sample deletion cannot corrupt the event_list section.
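A minimal sketch of the accumulate-and-free loop, using plain numpy histograms in place of the actual hist objects. The names all_hist_samples and get_sample_templates mirror the description above but are illustrative; the per-row layout stands in for the shared StrCategory axis:

```python
import numpy as np


def get_sample_templates(name, values, bins, all_hist_samples):
    """One histogram per call, laid out on the full sample axis so
    results from different samples can simply be added together."""
    out = np.zeros((len(all_hist_samples), len(bins) - 1))
    out[all_hist_samples.index(name)] = np.histogram(values, bins=bins)[0]
    return out


all_hist_samples = ["qcd", "ttbar", "hh4b"]  # precomputed up front
bins = np.linspace(0.0, 10.0, 6)
events = {"qcd": np.array([1.0, 2.0]), "ttbar": np.array([9.0])}

templates = np.zeros((len(all_hist_samples), len(bins) - 1))
for name in list(events):
    templates += get_sample_templates(name, events[name], bins, all_hist_samples)
    del events[name]  # free each sample's data right after it is histogrammed
```

Because every per-sample histogram is defined on the same full axis, the running sum is equivalent to histogramming all samples at once, while only one sample's dataframe needs to be resident at a time.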
### Update 02/19/2026:
### Update 02/25/2026: