
Add term processing module and counters #6

Open
artvandelay wants to merge 1 commit into main from
codex/build-text-processing-module-with-n-gram-support

Conversation

@artvandelay
Owner

Motivation

  • Provide a dedicated text-processing layer to tokenize added/removed diff text and surface candidate terms for signals.
  • Capture multi-token phrases and candidate proper nouns/capitalized phrases for better term detection and entity signals.
  • Maintain time-windowed counters to compare short-term spikes against 7d/30d baselines for anomaly detection.

Description

  • Add src/nlbt/terms/tokenizer.py with tokenize, tokenize_text, and extract_capitalized_phrases using a word regex for robust tokenization.
  • Add src/nlbt/terms/ngrams.py with generate_ngrams (1–4 grams by default) and extract_term_candidates to produce n-grams and proper-noun candidates.
  • Add src/nlbt/terms/diff.py with tokenize_diff and extract_diff_terms helpers to process added/removed text as diffs.
  • Add src/nlbt/terms/counters.py implementing TermCounterStore and TermBucketStats for time-bucketed storage (hourly by default) and rollups via get_rollups (24h/7d/30d) while tracking distinct pages and editors.
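The tokenizer and n-gram helpers described above might look roughly like the following sketch. The function names are taken from the PR description, but the specific regexes and defaults here are assumptions, not the PR's actual code; the diff helpers in `src/nlbt/terms/diff.py` would presumably apply the same functions to added/removed hunk text.

```python
import re

# Hypothetical sketch of the tokenizer/n-gram helpers; the word regex and
# n-gram defaults are illustrative assumptions.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize_text(text):
    """Lowercased word tokens extracted with a word regex."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]

def generate_ngrams(tokens, min_n=1, max_n=4):
    """Contiguous n-grams (1-4 grams by default), joined with spaces."""
    out = []
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def extract_capitalized_phrases(text):
    """Runs of capitalized words, e.g. candidate proper nouns."""
    return re.findall(r"[A-Z][a-zA-Z']*(?:\s+[A-Z][a-zA-Z']*)*", text)
```

For example, `extract_capitalized_phrases("The New York Times reported it")` keeps the capitalized run and drops the lowercase tail, which is the behavior the entity-signal bullet suggests.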
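The time-bucketed counter could be sketched along these lines. This is a minimal illustration assuming epoch-second timestamps and hourly buckets; the PR's `TermCounterStore` also tracks distinct pages and editors, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical sketch of a time-bucketed term counter with 24h/7d/30d
# rollups; the actual TermCounterStore API and storage may differ.
class TermCounterStore:
    def __init__(self, bucket_seconds=3600):  # hourly buckets by default
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(lambda: defaultdict(int))  # bucket -> term -> count

    def _bucket(self, ts):
        return int(ts // self.bucket_seconds)

    def add(self, term, ts, count=1):
        self.buckets[self._bucket(ts)][term] += count

    def get_rollups(self, term, now):
        """Sum counts over trailing 24h/7d/30d windows ending at `now`."""
        rollups = {}
        for label, seconds in (("24h", 86400), ("7d", 7 * 86400), ("30d", 30 * 86400)):
            start, end = self._bucket(now - seconds), self._bucket(now)
            rollups[label] = sum(
                self.buckets[b].get(term, 0) for b in range(start, end + 1)
            )
        return rollups
```

Comparing the 24h rollup against the 7d/30d baselines gives the short-term-spike signal the Motivation section describes.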

Testing

  • Added tests/test_terms.py which exercises diff tokenization, n-gram extraction, and proper-noun capture via tokenize_diff and extract_diff_terms.
  • Ran python tests/test_terms.py and the test script completed successfully (all assertions passed).

Codex Task

