
Run3 limits #325

Draft

dprim7 wants to merge 61 commits into LPC-HH:main from dprim7:run3_limits

Conversation

Collaborator

@dprim7 dprim7 commented Feb 19, 2026

  • Fixed Run-3 v13 BDT dataframe construction in src/HH4b/boosted/bdt_trainings_run3/v13_glopartv2.py by explicitly converting VBF/AK4 inputs to 1D NumPy arrays (to_numpy()[:, i]) before building vectors.
  • Resolved pandas shape issues in v13 features by casting VBF-derived outputs (VBFjjMass, VBFjjDeltaEta) to NumPy arrays before DataFrame creation.
  • Fixed duplicate bdt_score column loading in src/HH4b/postprocessing/postprocessing.py by copying columns_to_load[...]/load_columns_syst before mutation, avoiding in-place global list growth across calls.
  • Added glopart-v3 support to the postprocessing.py load/filter config:
      • new columns_to_load["glopart-v3"]
      • new filters_to_apply["glopart-v3"]
      • updated the load_run3_samples txbb_version allowlist and assertion message.
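The duplicate bdt_score fix above is a classic list-aliasing bug. A minimal sketch (illustrative names, not the exact module globals from postprocessing.py): extending a module-level list in place makes every later call see the previously appended columns, while copying first keeps the global intact.

```python
# Illustrative stand-in for the module-level columns_to_load config.
COLUMNS_TO_LOAD = {"glopart-v2": [("weight", 1), ("bbFatJetPt", 2)]}


def load_columns_buggy(txbb_version, extra):
    cols = COLUMNS_TO_LOAD[txbb_version]  # alias of the global list, not a copy
    cols.extend(extra)                    # grows the global across calls
    return cols


def load_columns_fixed(txbb_version, extra):
    cols = list(COLUMNS_TO_LOAD[txbb_version])  # shallow copy before mutation
    cols.extend(extra)
    return cols
```

With the fix, repeated calls return identical column lists and the global config never accumulates duplicates.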

### Update 02/19/26:

  • Added first-pass glopart-v3 runtime compatibility for Run-3 postprocessing by updating branch usage from ParT v2 names to ParT3 names in key paths: BDT input config (v13_glopartv2.py), mass mapping (mreg_strings), trigger-SF branch selection (corrections.trigger_SF), and TXbb preselection handling (PostProcess.py).
  • Aligned glopart-v3 systematics column loading with available ntuple schema by remapping JMS/JMR mass-shift requests from bbFatJetParTmassVis* to bbFatJetParT3massX2p* inside load_run3_samples, resolving parquet read failures on ttbar/systematics samples.
  • Added minimal xsec key alias resolution in utils._normalize_weights to support NanoV15 sample names like QCD-4Jets_HT-* without modifying canonical entries in xsecs.py, preventing fallback-scaling KeyErrors during MC load.
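The xsec alias resolution could look something like the following sketch (hypothetical helper and table values; the real logic lives in utils._normalize_weights and xsecs.py): unknown sample names are rewritten to canonical keys by pattern, so the canonical table stays untouched.

```python
import re

# Illustrative canonical xsec table (values are made up).
XSECS = {"QCD_HT-200to300": 1234.0}

# Alias patterns mapping NanoV15-style names back to canonical keys.
ALIASES = [
    # e.g. "QCD-4Jets_HT-200to300" -> "QCD_HT-200to300"
    (re.compile(r"^QCD-4Jets_HT-(\S+)$"), r"QCD_HT-\1"),
]


def resolve_xsec(sample):
    """Return the xsec for a sample, resolving known name aliases."""
    if sample in XSECS:
        return XSECS[sample]
    for pattern, repl in ALIASES:
        canonical = pattern.sub(repl, sample)
        if canonical != sample and canonical in XSECS:
            return XSECS[canonical]
    raise KeyError(sample)
```

This keeps fallback scaling from raising KeyError on renamed samples without duplicating entries in the canonical table.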

### Update 02/25/2026:

  • Fixed more naming inconsistencies across the various eras; templates now generate correctly.
  • Added a unit-test suite for postprocessing.
  • Refactored and tested the postprocessing pipeline for better memory efficiency.

@dprim7 dprim7 requested review from jmduarte and zichunhao February 19, 2026 02:11
pre-commit-ci bot and others added 17 commits February 19, 2026 02:13
  Fix 1 — Double parquet I/O (utils.py:392–414)

  The old code read every .parquet file twice: once with no
  arguments (full read of all columns) just to check .empty,
  then again with columns= and filters=. For samples with many
  parquet files (QCD has hundreds), this doubled all disk I/O.
  The error handling was also redundant: the whole except
  block was identical to the try block.
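The single-read pattern can be sketched as follows (the reader is injected here so the logic is testable without parquet files; in utils.py the reader is pandas.read_parquet, and the function name is illustrative):

```python
def load_sample(read_fn, path, columns, filters):
    """Read a parquet file exactly once, with column/row pushdown."""
    # One read with columns= and filters= replaces the old pattern of a
    # bare full read (just to check .empty) followed by a second read.
    events = read_fn(path, columns=columns, filters=filters)
    if len(events) == 0:
        return None  # empty after filtering; caller skips this file
    return events
```

Checking emptiness on the already-filtered result halves the disk I/O for samples with many files.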

  Fix 2 — Repeated .to_numpy() in TXbb SF loop
  (PostProcess.py:807–828)

  For each (wp, j) bin pair and each jet (ijet), the same
  H{ijet}TXbb and H{ijet}Pt columns were converted to numpy 3
  times — once for nominal, once for stat_up, once for stat_dn.
  With 4 WPs × 1 pT bin × 2 jets, that is 24 conversions where
  8 cached ones suffice. Now cached as txbb_arr / pt_arr /
  pt_range before the 3 calls.
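A minimal sketch of the hoisting (column names follow the H{ijet}TXbb / H{ijet}Pt pattern but the data and selection are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"H1TXbb": [0.9, 0.7, 0.99], "H1Pt": [350.0, 420.0, 510.0]})

# Convert each column to numpy once, outside the per-variation loop...
txbb_arr = df["H1TXbb"].to_numpy()
pt_arr = df["H1Pt"].to_numpy()

sfs = {}
for variation in ("nominal", "stat_up", "stat_dn"):
    # ...and reuse the cached arrays for all three variations instead of
    # calling .to_numpy() inside every iteration.
    mask = (txbb_arr > 0.8) & (pt_arr > 300.0)
    sfs[variation] = np.where(mask, 1.0, 0.0)
```

The per-variation bodies see identical arrays, so the results are unchanged; only the redundant conversions disappear.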

  Fix 3 — deepcopy → plain boolean indexing
  (postprocessing.py:741)

  deepcopy(df[mask]) recursively copies every numpy array
  inside the DataFrame. But boolean indexing already creates a
  new DataFrame with its own data — deepcopy is never needed
  here since sig_events is only read (not mutated) in
  subsequent code.
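That boolean indexing yields an independent frame can be seen in a small pandas sketch (toy data; mutating the selection does not touch the original):

```python
import pandas as pd

df = pd.DataFrame({"bdt_score": [0.2, 0.9, 0.95]})

# Boolean indexing already returns a new DataFrame backed by its own data,
# so deepcopy(df[mask]) would only duplicate the copy a second time.
sig_events = df[df["bdt_score"] > 0.8].reset_index(drop=True)

sig_events.loc[0, "bdt_score"] = -1.0  # mutate the selection...
```

...and the source frame is unaffected, which is why the deepcopy was pure overhead.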

  Fix 4 — .copy() before pd.concat (postprocessing.py:554)

  pd.concat always creates a new DataFrame regardless of
  whether the inputs are copied first. The .copy() on every
  per-year frame in the scale_processes paths was wasting
  O(N_years × frame_size) memory before the concat result was
  even created.
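A quick pandas sketch of why the per-frame .copy() is unnecessary (toy per-year frames):

```python
import pandas as pd

per_year = {
    "2022": pd.DataFrame({"weight": [1.0, 2.0]}),
    "2023": pd.DataFrame({"weight": [3.0]}),
}

# pd.concat builds a new frame from its inputs, so no per-year .copy()
# is needed before concatenating.
combined = pd.concat(per_year.values(), ignore_index=True)

combined["weight"] *= 10.0  # mutating the result...
```

...leaves the per-year inputs untouched, so copying them first only doubled peak memory.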

  Fix 5 — list(set(columns)) → dict.fromkeys()
  (PostProcess.py:1097)

  set() destroys insertion order, meaning df[columns] could
  silently reorder output columns each run. dict.fromkeys()
  deduplicates while preserving the order columns were first
  added.
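The order-preserving deduplication in one line (toy column list):

```python
columns = ["H1Pt", "H2Pt", "bdt_score", "H1Pt", "weight", "bdt_score"]

# dict.fromkeys keeps first-seen order, while set() can reorder
# the columns between runs.
deduped = list(dict.fromkeys(columns))
# -> ["H1Pt", "H2Pt", "bdt_score", "weight"]
```

Since Python 3.7, dict preserves insertion order by language guarantee, so this is a drop-in, deterministic replacement for list(set(...)).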

  Fix 6 — del events_dict (PostProcess.py:740)

  The raw parquet DataFrame (events_dict) was held in memory
  through the entire categorization loop — even though all its
  data had already been extracted into bdt_events. For QCD or
  ttbar this is O(GB). Deleting it immediately after the
  training-event-removal block (the last place it's accessed)
  frees that memory before the per-jshift categorization
  starts.
  Refactor the template production loop in PostProcess.py to process one
  sample at a time rather than passing all samples to get_templates
  simultaneously. For each sample, histograms are generated individually
  and accumulated via addition into the final templates dict. The sample's
  dataframe is freed after processing to reduce memory pressure.
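The per-sample accumulation loop could be sketched like this (plain dicts of summed counts stand in for hist.Hist objects, and all names are illustrative, not the actual get_templates signature):

```python
def get_templates_single(sample, values):
    # Per-sample "histogram": here just a total count keyed by sample name.
    return {sample: sum(values)}


events = {"qcd": [1.0, 2.0], "ttbar": [3.0], "hh4b": [0.5, 0.5]}

templates = {}
for sample in list(events):
    hists = get_templates_single(sample, events[sample])
    for key, h in hists.items():
        # Accumulate via addition into the final templates dict.
        templates[key] = templates.get(key, 0.0) + h
    del events[sample]  # free this sample's dataframe before the next one
```

Peak memory then scales with the largest single sample rather than the sum of all samples.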

  To support this, get_templates in postprocessing.py accepts a new
  optional all_hist_samples parameter. When provided, it is used directly
  for the StrCategory histogram axis, ensuring all per-sample histograms
  share the same axis and can be accumulated. The sig_key access is also
  guarded so get_templates handles single-sample dicts correctly.

  A helper _compute_all_hist_samples pre-computes the full set of
  histogram sample names (base samples + txbb signal shifts + weight-shift
  variations) needed to populate the axis upfront.

  For the single-year case, events_combined is now a shallow copy of
  events_dict_postprocess to prevent per-sample deletion from corrupting
  the event_list section.
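Why the shallow copy protects the caller, in a minimal sketch (toy dicts; per-sample deletion removes keys only from the copy):

```python
events_dict_postprocess = {"qcd": [1, 2], "ttbar": [3]}

# Shallow copy of the mapping: same values, independent set of keys.
events_combined = dict(events_dict_postprocess)

for sample in list(events_combined):
    del events_combined[sample]  # drops the copy's reference only
```

The original dict still holds every sample afterwards, so later sections that read it remain correct.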
