fix(metacat): handle unmapped entity tokens and training crashes by bgriffen · Pull Request #399 · CogStack/cogstack-nlp

bgriffen · 2026-04-05T10:15:52Z

Summary

Fixes three bugs in MetaCAT supervised training that cause crashes during train_supervised_raw(train_addons=True):

1. Unmapped entity tokens crash inference: entities whose character offsets don't align with any token produced by the tokeniser cause `UnboundLocalError` and `KeyError` during MetaCAT inference within NER training epochs.
2. Same crash in training data preparation: the identical pattern in `prepare_from_json` → `_prepare_from_json_loop` causes `IndexError` during `train_raw`.
3. `_train_meta_cat` is missing `save_dir_path`: `train_raw` requires a save directory when `auto_save_model=True` (the default config), but `_train_meta_cat` never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction

All three bugs are triggered by calling train_supervised_raw(data, train_addons=True) with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The save_dir_path bug (3) occurs unconditionally with the default MetaCAT config.

Changes

Commit 1: `meta_cat.py` — inference path

- `prepare_document`: skip entities with empty `ctoken_idx` (prevents `UnboundLocalError` on `ind`)
- `_set_meta_anns`: guard the `ent_id2ind` lookup for skipped entities (prevents `KeyError`)

Commit 2: `data_utils.py` — training data path

_prepare_from_json_loop: skip annotations with empty ctoken_idx (prevents IndexError on ctoken_idx[0])

Commit 3: `trainer.py` — MetaCAT addon training

_train_meta_cat: create a temporary directory and pass it as save_dir_path to train_raw when auto_save_model is enabled

Test plan

- `train_supervised_raw(train_addons=True)` completes without crashing on documents containing entities near text boundaries
- MetaCAT inference does not crash on entities that fail to map to BPE tokens
- `train_raw` receives a valid `save_dir_path` and saves the best checkpoint during training
- Entities that successfully map to tokens are unaffected (no behaviour change)

When an entity's character offsets fall past the end of the tokenised text (or into a whitespace-only region), the token-mapping loop in `prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739: the loop variable is never assigned because `offset_mapping[last_ind:]` is empty or no pair matches the entity span.
2. `KeyError` in `_set_meta_anns`: skipped entities are absent from `ent_id2ind`, but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`, and guard the dict lookup in `_set_meta_anns`.
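The skip-and-guard pattern described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the code from `meta_cat.py`: the function `map_entities_to_tokens`, the overlap test, and the sample offsets are assumptions made for illustration, and only the names `ctoken_idx`, `offset_mapping`, and `ent_id2ind` come from the PR.

```python
from typing import Dict, List, Tuple


def map_entities_to_tokens(
    offset_mapping: List[Tuple[int, int]],
    entities: Dict[int, Tuple[int, int]],
) -> Dict[int, List[int]]:
    """Hypothetical helper: map each entity's character span to token indices.

    Entities that overlap no token are omitted instead of raising.
    """
    ent_id2ind: Dict[int, List[int]] = {}
    for ent_id, (start, end) in entities.items():
        # Collect every token whose character span overlaps the entity span.
        ctoken_idx = [
            ind
            for ind, (tok_start, tok_end) in enumerate(offset_mapping)
            if tok_start < end and tok_end > start
        ]
        if not ctoken_idx:
            # Nothing to index into: skip the entity rather than crash later.
            continue
        ent_id2ind[ent_id] = ctoken_idx
    return ent_id2ind


# Text "chest pain  ": two tokens followed by trailing whitespace (assumed offsets).
offsets = [(0, 5), (6, 10)]
entities = {
    1: (6, 10),   # aligns with the second token
    2: (10, 12),  # whitespace-only region, no token
    3: (20, 25),  # past the end of the tokenised text
}
ent_id2ind = map_entities_to_tokens(offsets, entities)

# Guarded lookup: entities skipped above are simply left without a result.
annotated = {
    ent_id: ent_id2ind[ent_id] for ent_id in entities if ent_id in ent_id2ind
}
```

Here only entity 1 ends up in `annotated`; entities 2 and 3 are dropped silently, which mirrors the "no behaviour change for mapped entities" goal in the test plan.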

`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping pattern as `prepare_document` in `meta_cat.py` (fixed in previous commit), but in the `train_raw` → `prepare_from_json` code path. When an annotation's character offsets don't align with any token in the offset_mapping, `ctoken_idx` is empty and `ctoken_idx[0]` raises `IndexError`. Fix: skip annotations with empty token mappings, consistent with the inference path fix.
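To show why `ctoken_idx[0]` is the failure point on the training path, here is a rough sketch of a sample-building loop with the skip in place. The function `build_samples`, the annotation dict keys, and the context-window parameters are hypothetical and are not taken from `data_utils.py`.

```python
from typing import Dict, List, Tuple, Union

Annotation = Dict[str, Union[int, str]]


def build_samples(
    input_ids: List[int],
    offset_mapping: List[Tuple[int, int]],
    annotations: List[Annotation],
    cntx_left: int = 1,
    cntx_right: int = 1,
) -> List[Tuple[List[int], int, str]]:
    """Hypothetical loop: build (context ids, centre position, label) samples."""
    samples = []
    for ann in annotations:
        start, end = int(ann["start"]), int(ann["end"])
        ctoken_idx = [
            i for i, (s, e) in enumerate(offset_mapping) if s < end and e > start
        ]
        if not ctoken_idx:
            # Without this check, ctoken_idx[0] below raises IndexError.
            continue
        first, last = ctoken_idx[0], ctoken_idx[-1]
        lo = max(0, first - cntx_left)
        hi = min(len(input_ids), last + 1 + cntx_right)
        samples.append((input_ids[lo:hi], first - lo, str(ann["value"])))
    return samples


input_ids = [10, 11, 12, 13]
offsets = [(0, 2), (3, 5), (6, 8), (9, 11)]
annotations = [
    {"start": 3, "end": 5, "value": "A"},    # maps to token 1
    {"start": 11, "end": 14, "value": "B"},  # past the last token, skipped
]
samples = build_samples(input_ids, offsets, annotations)
```

The second annotation produces no sample, so the number of training samples can be smaller than the number of annotations in the input JSON.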

`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing `save_dir_path`. When the MetaCAT config has `auto_save_model=True` (the default), `train_raw` raises an exception because it needs a directory to save the best checkpoint during training. This crash occurs after all NER supervised training epochs have completed (potentially hours of work), since `_train_addons` is called at the very end of `train_supervised_raw`. Fix: create a temporary directory for the MetaCAT save checkpoint when `auto_save_model` is enabled, and pass it to `train_raw`.
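The temporary-directory approach can be sketched with the standard library alone. `train_raw_stub` and `train_addon` below are stand-ins invented for this example; they only imitate the described contract (a save directory is required when `auto_save_model` is enabled) and are not the MedCAT functions.

```python
import os
import tempfile
from typing import Optional


def train_raw_stub(data: list, save_dir_path: Optional[str], auto_save_model: bool) -> dict:
    """Stand-in for a trainer that needs a directory to save its best checkpoint."""
    if auto_save_model and save_dir_path is None:
        raise ValueError("save_dir_path is required when auto_save_model is enabled")
    if auto_save_model:
        # Pretend to write the best checkpoint found during training.
        with open(os.path.join(save_dir_path, "best.ckpt"), "w") as f:
            f.write("checkpoint")
    return {"saved": auto_save_model}


def train_addon(data: list, auto_save_model: bool = True) -> dict:
    """Hypothetical wrapper: supply a temporary save directory when one is needed."""
    if not auto_save_model:
        return train_raw_stub(data, None, False)
    with tempfile.TemporaryDirectory() as tmp_dir:
        return train_raw_stub(data, tmp_dir, True)


result = train_addon([], auto_save_model=True)
```

In this sketch the directory is deleted when the `with` block exits, so the checkpoint only lives for the duration of training. Whether the PR cleans up its temporary directory in the same way is not stated in the description.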

bgriffen added 3 commits April 5, 2026 20:06

bgriffen wants to merge 3 commits into CogStack:main from bgriffen:fix/metacat-training-crashes
