Skip to content

fix(metacat): handle unmapped entity tokens and training crashes#399

Open
bgriffen wants to merge 3 commits intoCogStack:mainfrom
bgriffen:fix/metacat-training-crashes
Open

fix(metacat): handle unmapped entity tokens and training crashes#399
bgriffen wants to merge 3 commits intoCogStack:mainfrom
bgriffen:fix/metacat-training-crashes

Conversation

@bgriffen
Copy link
Copy Markdown

@bgriffen bgriffen commented Apr 5, 2026

Summary

Fixes three bugs in MetaCAT supervised training that cause crashes during train_supervised_raw(train_addons=True):

  • Unmapped entity tokens crash inference — entities whose character offsets don't align with any tokenised token cause UnboundLocalError and KeyError during MetaCAT inference within NER training epochs
  • Same crash in training data preparation — identical pattern in prepare_from_json_prepare_from_json_loop causes IndexError during train_raw
  • _train_meta_cat missing save_dir_pathtrain_raw requires a save directory when auto_save_model=True (the default config), but _train_meta_cat never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction

All three bugs are triggered by calling train_supervised_raw(data, train_addons=True) with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The save_dir_path bug (3) occurs unconditionally with the default MetaCAT config.

Changes

Commit 1: meta_cat.py — inference path

  • prepare_document: skip entities with empty ctoken_idx (prevents UnboundLocalError on ind)
  • _set_meta_anns: guard ent_id2ind lookup for skipped entities (prevents KeyError)

Commit 2: data_utils.py — training data path

  • _prepare_from_json_loop: skip annotations with empty ctoken_idx (prevents IndexError on ctoken_idx[0])

Commit 3: trainer.py — MetaCAT addon training

  • _train_meta_cat: create a temporary directory and pass it as save_dir_path to train_raw when auto_save_model is enabled

Test plan

  • train_supervised_raw(train_addons=True) completes without crash on documents containing entities near text boundaries
  • MetaCAT inference does not crash on entities that fail to map to BPE tokens
  • train_raw receives a valid save_dir_path and saves best checkpoint during training
  • Entities that successfully map to tokens are unaffected (no behaviour change)

bgriffen added 3 commits April 5, 2026 20:06
When an entity's character offsets fall past the end of the tokenised
text (or into a whitespace-only region), the token-mapping loop in
`prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739 — the loop variable is
   never assigned because `offset_mapping[last_ind:]` is empty or no
   pair matches the entity span.

2. `KeyError` in `_set_meta_anns` — skipped entities are absent from
   `ent_id2ind` but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`,
and guard the dict lookup in `_set_meta_anns`.
`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping
pattern as `prepare_document` in `meta_cat.py` (fixed in previous
commit), but in the `train_raw` → `prepare_from_json` code path.

When an annotation's character offsets don't align with any token in
the offset_mapping, `ctoken_idx` is empty and `ctoken_idx[0]` raises
`IndexError`.

Fix: skip annotations with empty token mappings, consistent with the
inference path fix.
`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing
`save_dir_path`. When the MetaCAT config has `auto_save_model=True`
(the default), `train_raw` raises an exception because it needs a
directory to save the best checkpoint during training.

This crash occurs after all NER supervised training epochs have
completed (potentially hours of work), since `_train_addons` is
called at the very end of `train_supervised_raw`.

Fix: create a temporary directory for the MetaCAT save checkpoint
when `auto_save_model` is enabled, and pass it to `train_raw`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant