Recall: keyword fallback should support short Unicode/CJK tokens #51

@GoZumie

Summary

Keyword fallback currently drops tokens shorter than 3 characters:

.filter(|s| s.chars().count() >= 3)

Because the check counts characters, it drops valid short Unicode terms, especially the 1–2 character words common in CJK text. For multilingual queries where every meaningful token is short, the fallback produces no terms and becomes a no-op.
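A minimal sketch of the failure mode, for illustration only. The function name `fallback_terms_current` and the whitespace splitting are assumptions, not the project's actual tokenizer; only the `.filter(...)` line is taken from this issue.

```rust
/// Hypothetical stand-in for the fallback tokenizer: splits on whitespace,
/// lowercases, and applies the length filter quoted in this issue.
fn fallback_terms_current(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .map(|s| s.to_lowercase())
        // `chars().count()` counts Unicode scalar values, so a two-character
        // CJK word such as "東京" has length 2 and is filtered out.
        .filter(|s| s.chars().count() >= 3)
        .collect()
}
```

Under these assumptions, a query like `"東京 天気"` yields an empty term list, while `"tokyo weather"` keeps both tokens.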

Why this matters

The fallback path is explicitly intended to improve multilingual/short-phrase recall in text-only or weak-semantic scenarios. The fixed length threshold is language-biased and can miss relevant matches.

Proposed direction

  • Replace the fixed >= 3 rule with a language-aware or script-aware token policy.
  • At minimum, allow short non-ASCII tokens (or CJK script classes) while still filtering noisy short Latin stopwords.
  • Add tests for short CJK-token fallback behavior.
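One possible shape for a script-aware policy, sketched under stated assumptions. The names `is_cjk` and `keep_token` are hypothetical, and the hard-coded code point ranges cover only a few common blocks, so this is not a complete script classification.

```rust
/// Rough CJK check over a handful of common Unicode blocks.
/// Assumption: these ranges are sufficient for a sketch; a real policy
/// would likely rely on proper script properties instead.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{3040}'..='\u{309F}' // Hiragana
        | '\u{30A0}'..='\u{30FF}' // Katakana
        | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
    )
}

/// Hypothetical token policy: keep any non-empty token containing a CJK
/// character, and keep other tokens only if they have at least 3 characters.
fn keep_token(token: &str) -> bool {
    let len = token.chars().count();
    if len == 0 {
        return false;
    }
    if token.chars().any(is_cjk) {
        return true;
    }
    len >= 3
}
```

A caveat with this sketch: it also keeps single kana particles such as "の", which may be as noisy as short Latin stopwords, so a stopword list or a per-script minimum length might still be needed.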

Acceptance criteria

  • Query terms using short CJK tokens can produce fallback terms and recover related matches.
  • Existing noise control for short Latin stopwords remains acceptable.
  • Regression test(s) added for multilingual short-token fallback.
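The criteria above could be pinned down with tests along these lines. This is a self-contained sketch: `fallback_terms` and `matching_docs` are hypothetical stand-ins for the real tokenizer and keyword search, and the "keep every non-ASCII token" rule is a simplifying assumption rather than the agreed policy.

```rust
/// Hypothetical fallback tokenizer with a simple script-aware length rule.
fn fallback_terms(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .map(|s| s.to_lowercase())
        // Assumption: non-ASCII tokens are kept at any length,
        // ASCII tokens still need at least 3 characters.
        .filter(|s| !s.is_ascii() || s.chars().count() >= 3)
        .collect()
}

/// Hypothetical substring match standing in for the real keyword search.
fn matching_docs<'a>(terms: &[String], docs: &[&'a str]) -> Vec<&'a str> {
    docs.iter()
        .copied()
        .filter(|d| terms.iter().any(|t| d.contains(t.as_str())))
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn short_cjk_tokens_produce_terms_and_matches() {
        let terms = fallback_terms("東京 天気");
        assert_eq!(terms, vec!["東京", "天気"]);

        let docs = ["東京の天気は晴れ", "weather in paris"];
        assert_eq!(matching_docs(&terms, &docs), vec!["東京の天気は晴れ"]);
    }

    #[test]
    fn short_latin_tokens_are_still_dropped() {
        assert_eq!(fallback_terms("go to the zoo"), vec!["the", "zoo"]);
    }
}
```

The second test only guards the length rule for ASCII tokens; whether a three-letter stopword like "the" should also be filtered is a separate policy question.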

Context

Found during review of PR #49.

Labels: bug, rust
