Recall: keyword fallback should support short Unicode/CJK tokens #51
Open
Labels
bug, rust
Description
Summary
Keyword fallback currently drops tokens shorter than 3 characters:
`.filter(|s| s.chars().count() >= 3)`
This can suppress valid short Unicode terms (especially common CJK 1–2 character words), making fallback a no-op for multilingual queries where short tokens are meaningful.
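To make the failure mode concrete, here is a minimal sketch that applies the quoted predicate to a few tokens. The function name `current_filter` and the sample tokens are hypothetical and only illustrate the behavior; the real fallback code may tokenize differently.

```rust
/// Hypothetical stand-in for the current fallback filter.
/// It applies the same predicate quoted above: keep tokens with >= 3 chars.
fn current_filter<'a>(tokens: &[&'a str]) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        // `chars().count()` counts Unicode scalar values, not bytes,
        // so "東京" counts as 2 even though it is 6 bytes in UTF-8.
        .filter(|s| s.chars().count() >= 3)
        .collect()
}
```

With this predicate, a query made up only of tokens such as `東京` or `猫` yields no fallback terms at all, while a Latin token like `rust` passes.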
Why this matters
The fallback path is explicitly intended to improve multilingual/short-phrase recall in text-only or weak-semantic scenarios. The fixed length threshold is language-biased and can miss relevant matches.
Proposed direction
- Replace the fixed `>= 3` rule with a language-aware or script-aware token policy.
- At minimum, allow short non-ASCII tokens (or CJK script classes) while still filtering noisy short Latin stopwords.
- Add tests for short CJK-token fallback behavior.
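One possible shape for the script-aware policy is sketched below. All names (`is_cjk`, `keep_token`, `fallback_terms`) are hypothetical, the set of code point ranges is an assumption and not exhaustive, and whitespace splitting is used only for illustration, since CJK text is often not whitespace-delimited.

```rust
/// Hypothetical helper: true if `c` is in one of a few common CJK blocks.
/// The list of ranges is an assumption and deliberately incomplete.
fn is_cjk(c: char) -> bool {
    matches!(
        c,
        '\u{3040}'..='\u{309F}'     // Hiragana
        | '\u{30A0}'..='\u{30FF}'   // Katakana
        | '\u{3400}'..='\u{4DBF}'   // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{AC00}'..='\u{D7AF}'   // Hangul Syllables
        | '\u{F900}'..='\u{FAFF}'   // CJK Compatibility Ideographs
    )
}

/// Hypothetical token policy: keep any non-empty token containing a CJK
/// character, and keep other tokens only if they have at least 3 chars.
fn keep_token(token: &str) -> bool {
    let len = token.chars().count();
    if len == 0 {
        return false;
    }
    if token.chars().any(is_cjk) {
        return true;
    }
    len >= 3
}

/// Hypothetical fallback term extraction: split on whitespace, trim
/// surrounding punctuation, apply the policy, then lowercase.
fn fallback_terms(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .map(|s| s.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|s| keep_token(s))
        .map(|s| s.to_lowercase())
        .collect()
}
```

Under these assumptions, short Latin tokens such as `to` are still dropped by the length rule, while one- and two-character CJK tokens survive. A production version would likely need a proper stopword list and a segmenter for unspaced CJK text.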
Acceptance criteria
- Query terms using short CJK tokens can produce fallback terms and recover related matches.
- Existing noise control for short Latin stopwords remains acceptable.
- Regression test(s) added for multilingual short-token fallback.
Context
Found during review of PR #49.