
Add term processing module and counters #6

Open
artvandelay wants to merge 1 commit into main from
codex/build-text-processing-module-with-n-gram-support

Conversation

@artvandelay
Owner

Motivation

  • Provide a dedicated text-processing layer to tokenize added/removed diff text and surface candidate terms for signals.
  • Capture multi-token phrases and candidate proper nouns/capitalized phrases for better term detection and entity signals.
  • Maintain time-windowed counters to compare short-term spikes against 7d/30d baselines for anomaly detection.

Description

  • Add src/nlbt/terms/tokenizer.py with tokenize, tokenize_text, and extract_capitalized_phrases using a word regex for robust tokenization.
  • Add src/nlbt/terms/ngrams.py with generate_ngrams (1–4 grams by default) and extract_term_candidates to produce n-grams and proper-noun candidates.
  • Add src/nlbt/terms/diff.py with tokenize_diff and extract_diff_terms helpers to process added/removed text as diffs.
  • Add src/nlbt/terms/counters.py implementing TermCounterStore and TermBucketStats for time-bucketed storage (hourly by default) and rollups via get_rollups (24h/7d/30d) while tracking distinct pages and editors.
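The tokenizer and n-gram helpers described above might look roughly like the following sketch. The function names are taken from the PR description, but the specific regexes and defaults here are assumptions, not the PR's actual code; the diff helpers in `src/nlbt/terms/diff.py` would presumably apply the same functions to added/removed hunk text.

```python
import re

# Hypothetical sketch of the tokenizer/n-gram helpers; the word regex and
# n-gram defaults are illustrative assumptions.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize_text(text):
    """Lowercased word tokens extracted with a word regex."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]

def generate_ngrams(tokens, min_n=1, max_n=4):
    """Contiguous n-grams (1-4 grams by default), joined with spaces."""
    out = []
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def extract_capitalized_phrases(text):
    """Runs of capitalized words, e.g. candidate proper nouns."""
    return re.findall(r"[A-Z][a-zA-Z']*(?:\s+[A-Z][a-zA-Z']*)*", text)
```

For example, `extract_capitalized_phrases("The New York Times reported it")` keeps the capitalized run and drops the lowercase tail, which is the behavior the entity-signal bullet suggests.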
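The time-bucketed counter could be sketched along these lines. This is a minimal illustration assuming epoch-second timestamps and hourly buckets; the PR's `TermCounterStore` also tracks distinct pages and editors, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical sketch of a time-bucketed term counter with 24h/7d/30d
# rollups; the actual TermCounterStore API and storage may differ.
class TermCounterStore:
    def __init__(self, bucket_seconds=3600):  # hourly buckets by default
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(lambda: defaultdict(int))  # bucket -> term -> count

    def _bucket(self, ts):
        return int(ts // self.bucket_seconds)

    def add(self, term, ts, count=1):
        self.buckets[self._bucket(ts)][term] += count

    def get_rollups(self, term, now):
        """Sum counts over trailing 24h/7d/30d windows ending at `now`."""
        rollups = {}
        for label, seconds in (("24h", 86400), ("7d", 7 * 86400), ("30d", 30 * 86400)):
            start, end = self._bucket(now - seconds), self._bucket(now)
            rollups[label] = sum(
                self.buckets[b].get(term, 0) for b in range(start, end + 1)
            )
        return rollups
```

Comparing the 24h rollup against the 7d/30d baselines gives the short-term-spike signal the Motivation section describes.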

Testing

  • Added tests/test_terms.py which exercises diff tokenization, n-gram extraction, and proper-noun capture via tokenize_diff and extract_diff_terms.
  • Ran python tests/test_terms.py and the test script completed successfully (all assertions passed).

Codex Task

