fix(metacat): handle unmapped entity tokens and training crashes #399
Open
bgriffen wants to merge 3 commits into CogStack:main from
When an entity's character offsets fall past the end of the tokenised text (or into a whitespace-only region), the token-mapping loop in `prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739: the loop variable is never assigned because `offset_mapping[last_ind:]` is empty or no pair matches the entity span.
2. `KeyError` in `_set_meta_anns`: skipped entities are absent from `ent_id2ind`, but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`, and guard the dict lookup in `_set_meta_anns`.
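The two guards described in this commit can be sketched in isolation. This is a simplified illustration, not the code in `meta_cat.py`: the helper names (`token_indices_for_span`, `build_ent_id2ind`, `assign_predictions`) are hypothetical, and the overlap test stands in for the real token-mapping loop.

```python
from typing import Dict, List, Tuple

Offsets = List[Tuple[int, int]]


def token_indices_for_span(offset_mapping: Offsets, start: int, end: int) -> List[int]:
    """Return indices of tokens whose character range overlaps [start, end).

    Hypothetical helper: a simplified stand-in for the real mapping loop.
    """
    return [
        i for i, (tok_start, tok_end) in enumerate(offset_mapping)
        if tok_start < end and tok_end > start
    ]


def build_ent_id2ind(offset_mapping: Offsets, entities: Dict[int, Tuple[int, int]]):
    """Map entity ids to sample indices, skipping entities with no tokens."""
    ent_id2ind: Dict[int, int] = {}
    samples: List[List[int]] = []
    for ent_id, (start, end) in entities.items():
        ctoken_idx = token_indices_for_span(offset_mapping, start, end)
        if not ctoken_idx:
            # Guard 1: the span maps to no token, so skip the entity.
            continue
        ent_id2ind[ent_id] = len(samples)
        samples.append(ctoken_idx)
    return ent_id2ind, samples


def assign_predictions(entities, ent_id2ind: Dict[int, int], predictions: List[str]):
    """Attach predictions to entities, ignoring the ones skipped earlier."""
    out = {}
    for ent_id in entities:
        ind = ent_id2ind.get(ent_id)
        if ind is None:
            # Guard 2: a skipped entity has no entry, so avoid the KeyError.
            continue
        out[ent_id] = predictions[ind]
    return out
```

With two tokens covering characters 0-5 and 6-11, an entity at (20, 25) maps to no token and is silently left without a meta-annotation, while an entity at (0, 5) is processed as usual.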
`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping pattern as `prepare_document` in `meta_cat.py` (fixed in the previous commit), but in the `train_raw` → `prepare_from_json` code path. When an annotation's character offsets don't align with any token in the `offset_mapping`, `ctoken_idx` is empty and `ctoken_idx[0]` raises `IndexError`. Fix: skip annotations with empty token mappings, consistent with the inference path fix.
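For the training path, the same idea might look like the sketch below. It is not the code in `data_utils.py`: the function name `build_samples`, the `start`/`end` annotation keys, and the context-window parameters are assumptions made for illustration. Counting skipped annotations is an optional extra that makes the dropped data visible.

```python
from typing import Dict, List, Tuple


def build_samples(
    offset_mapping: List[Tuple[int, int]],
    annotations: List[Dict[str, int]],
    cntx_left: int = 2,
    cntx_right: int = 2,
):
    """Build (context_token_indices, centre_position) pairs per annotation.

    Hypothetical sketch: annotations are dicts with 'start' and 'end'
    character offsets; the real data layout may differ.
    """
    samples = []
    skipped = 0
    for ann in annotations:
        start, end = ann["start"], ann["end"]
        ctoken_idx = [
            i for i, (s, e) in enumerate(offset_mapping) if s < end and e > start
        ]
        if not ctoken_idx:
            # Without this guard, ctoken_idx[0] below raises IndexError.
            skipped += 1
            continue
        first, last = ctoken_idx[0], ctoken_idx[-1]
        lo = max(0, first - cntx_left)
        hi = min(len(offset_mapping), last + 1 + cntx_right)
        samples.append((list(range(lo, hi)), first - lo))
    return samples, skipped
```

Returning the skip count lets the caller log how many annotations were dropped, which helps distinguish a handful of boundary cases from a systematic offset mismatch.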
`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing `save_dir_path`. When the MetaCAT config has `auto_save_model=True` (the default), `train_raw` raises an exception because it needs a directory to save the best checkpoint during training. This crash occurs after all NER supervised training epochs have completed (potentially hours of work), since `_train_addons` is called at the very end of `train_supervised_raw`. Fix: create a temporary directory for the MetaCAT save checkpoint when `auto_save_model` is enabled, and pass it to `train_raw`.
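A minimal sketch of the temporary-directory approach, using only the standard library. `FakeMetaCAT` is a stand-in written for this example, not the real MedCAT class, and the flat `auto_save_model` attribute is an assumption about where the flag lives; only `train_raw` and its `save_dir_path` parameter come from the description above.

```python
import os
import tempfile


class FakeMetaCAT:
    """Stand-in for a MetaCAT-like object (hypothetical, for illustration)."""

    def __init__(self, auto_save_model: bool = True):
        self.auto_save_model = auto_save_model
        self.seen_dir_existed = None

    def train_raw(self, data, save_dir_path=None):
        # Mimics the described behaviour: saving needs a directory.
        if self.auto_save_model and save_dir_path is None:
            raise ValueError("save_dir_path is required when auto_save_model is enabled")
        self.seen_dir_existed = (
            save_dir_path is not None and os.path.isdir(save_dir_path)
        )
        return {"n_docs": len(data)}


def train_meta_cat(mc, data):
    """Call train_raw, supplying a temporary save directory when needed."""
    if mc.auto_save_model:
        # The directory exists for the duration of training only.
        with tempfile.TemporaryDirectory() as tmp_dir:
            return mc.train_raw(data, save_dir_path=tmp_dir)
    return mc.train_raw(data)
```

`tempfile.TemporaryDirectory` removes the directory when the `with` block exits, so anything that must outlive training has to be loaded or copied before that point.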
Summary

Fixes three bugs in MetaCAT supervised training that cause crashes during `train_supervised_raw(train_addons=True)`:

1. `UnboundLocalError` and `KeyError` during MetaCAT inference within NER training epochs.
2. `prepare_from_json` → `_prepare_from_json_loop` causes `IndexError` during `train_raw`.
3. `_train_meta_cat` missing `save_dir_path`: `train_raw` requires a save directory when `auto_save_model=True` (the default config), but `_train_meta_cat` never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction

All three bugs are triggered by calling `train_supervised_raw(data, train_addons=True)` with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The `save_dir_path` bug (3) occurs unconditionally with the default MetaCAT config.

Changes

Commit 1: `meta_cat.py` (inference path)
- `prepare_document`: skip entities with empty `ctoken_idx` (prevents `UnboundLocalError` on `ind`)
- `_set_meta_anns`: guard `ent_id2ind` lookup for skipped entities (prevents `KeyError`)

Commit 2: `data_utils.py` (training data path)
- `_prepare_from_json_loop`: skip annotations with empty `ctoken_idx` (prevents `IndexError` on `ctoken_idx[0]`)

Commit 3: `trainer.py` (MetaCAT addon training)
- `_train_meta_cat`: create a temporary directory and pass it as `save_dir_path` to `train_raw` when `auto_save_model` is enabled

Test plan

- `train_supervised_raw(train_addons=True)` completes without crash on documents containing entities near text boundaries
- `train_raw` receives a valid `save_dir_path` and saves best checkpoint during training