Release dev to main: recall refinements, stability fixes, and CI/security updates #49

Merged
GoZumie merged 32 commits into main from dev
Mar 6, 2026

Conversation

GoZumie commented Mar 6, 2026

Summary

Promote dev to main so we can proceed with next integration/release steps.

This includes recent recall and stability work, notably:

  • deterministic recall fixture testing for regression safety
  • sqlite-vec loading/threading safety fix on DB connect
  • multi-pass recall strategy improvements
  • recall output budget controls
  • keyword fallback + temporal intent/recency ranking improvements
  • security/CI permission and dependency alert fixes

Notes

  • Branch: dev → main
  • This PR is intended as the next-step release gate for downstream work (including openclaw-wagl update/release follow-up).

PR Review by Greptile

Greptile Summary

This PR promotes dev → main, bringing a substantial batch of recall-quality improvements that refine how wagl recall retrieves, ranks, and budgets its output. The core changes introduce configurable scoring weights (--salience-weight, --dscore-weight), a keyword-fallback expansion step for multilingual queries, temporal-intent detection (yesterday, last week, etc.) with per-window score boosts, multi-pass context packs (high-valence and open-todo passes), and output budget controls (--max-canon, --min-relevant). On the infrastructure side, sqlite-vec loading moves from sqlite3_auto_extension (registered before any connection is opened, which caused a threading panic) to a per-connection load_extension call, and the CI workflow gains a permissions: contents: read scope restriction.

Key points:

  • Temporal scoring comment is inaccurate: the comment claims salience=0.15 in temporal mode (it uses w_salience, default 0.2) and states out-of-window items cap at 0.70 (actual max with defaults is 0.85).
  • last_week temporal window starts at 0 h: unlike yesterday (24–48 h) and last_night (8–32 h), the last_week window includes items created moments ago (window_start_hours: 0.0), which is semantically inconsistent.
  • RECALL_FIXTURES.md scoring formula is stale: the new norm_abs_dscore × w_dscore term is omitted, and the table does not reflect that salience and dscore weights are now configurable; the example JSON in docs/cli/recall.md also omits the new dscore key from meta.weights.
  • Redundant double-clamp on composite_score (already clamped inside the match arm, then shadowed by an identical clamp).
  • The fixture-driven regression harness (recall_quality.rs) and the comprehensive smoke tests are a strong addition for ongoing regression safety.
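To make the last_week point concrete, here is a minimal sketch of a window check. It is not the code in crates/core/src/temporal_intent.rs: the field names come from the commit message, while the `contains` helper, the inclusive bounds, the 0.30 boost, and the 168 h far edge for last_week are assumptions made for illustration.

```rust
/// Illustrative stand-in for the `TemporalHint` struct named in this PR.
/// Field names follow the commit message; everything else is assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TemporalHint {
    /// Near edge of the window, in hours before "now".
    pub window_start_hours: f64,
    /// Far edge of the window, in hours before "now".
    pub window_end_hours: f64,
    /// Score bonus applied to in-window items.
    pub boost: f64,
}

impl TemporalHint {
    /// Hypothetical helper: true when an item of the given age (in hours)
    /// falls inside the window. Inclusive bounds are an assumption.
    pub fn contains(&self, age_hours: f64) -> bool {
        age_hours >= self.window_start_hours && age_hours <= self.window_end_hours
    }
}

/// "yesterday" as described in the review: 24 to 48 hours old.
pub fn yesterday() -> TemporalHint {
    TemporalHint { window_start_hours: 24.0, window_end_hours: 48.0, boost: 0.30 }
}

/// "last week" as flagged in the review: the window starts at 0 h.
/// The 168 h far edge is assumed; the review only states the start.
pub fn last_week_as_reviewed() -> TemporalHint {
    TemporalHint { window_start_hours: 0.0, window_end_hours: 168.0, boost: 0.30 }
}
```

With these definitions, an item created a few minutes ago is outside `yesterday()` but inside `last_week_as_reviewed()`, which is the inconsistency the review describes.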

Confidence Score: 3/5

  • Mergeable with minor fixes recommended — no data loss or security issues, but two logic-level inaccuracies (temporal comment and last_week window) and stale documentation should be addressed before or shortly after merge.
  • The sqlite-vec threading fix and the recall improvements are well-tested with extensive new fixture/smoke tests. However, the last_week temporal window starting at 0 h is a semantic bug that will silently boost items created seconds ago for "last week" queries, and the inaccurate comment about the temporal scoring ceiling could mislead future contributors tuning weights. The stale scoring formulas in newly introduced docs compound the confusion. These are non-trivial correctness concerns in the core ranking logic, though they do not affect data integrity or security.
  • Pay close attention to crates/core/src/temporal_intent.rs (last_week window bounds) and crates/cli/src/main.rs (temporal scoring comment and double-clamp). Also review docs/RECALL_FIXTURES.md and docs/cli/recall.md for the stale scoring formula.

Important Files Changed

| Filename | Overview |
|---|---|
| crates/cli/src/main.rs | Major expansion of the recall command: adds configurable score weights, keyword fallback, temporal intent, multi-pass context, and output budget controls. Two issues found: an inaccurate comment about temporal-mode max score ceiling (salience weight stated as 0.15 in comment but 0.2 in code), and a redundant double-clamp on composite_score. |
| crates/core/src/temporal_intent.rs | New module providing parse_temporal_intent for detecting time-window keywords in recall queries. The last_week window is inconsistently defined (starts at 0 h, meaning items created right now qualify) compared to the analogous yesterday and last_night patterns which exclude the most-recent period. |
| crates/db/src/vector_ext.rs | Replaces the sqlite3_auto_extension-based approach with a per-connection load_extension call to fix the libsql threading assertion panic (issue #34). The load_extension_disable is always invoked before the load result is propagated, correctly ensuring extensions are never left enabled on error. |
| crates/db/src/lib.rs | Adds query_recent_high_valence and query_open_todos DB methods along with a shared collect_memory_rows helper. The query_open_todos SQL uses LOWER(tags) LIKE '%...' patterns which perform full table scans, but this is acceptable given the existing schema has no tag index. |
| docs/RECALL_FIXTURES.md | New contributor guide for the recall-quality regression harness. The scoring formula shown is stale: it omits the norm_abs_dscore × w_dscore term added in this PR, and the quick-reference table does not mention that salience and dscore weights are now configurable. |
| .github/workflows/ci.yml | Adds a top-level permissions: contents: read declaration to limit the workflow's default token scope, a straightforward and correct security hardening change. |
| crates/cli/tests/recall_quality.rs | New fixture-driven recall-quality regression harness with 8 deterministic test scenarios covering salience ordering, EV ranking, recency, multilingual queries, tag/type matching, and empty-result safety. Tests correctly force text-only mode for determinism. |
| docs/cli/recall.md | Updated CLI reference documentation for recall. The example JSON output is missing the new dscore key in meta.weights that the code now always emits. |
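The vector_ext.rs row describes an "always disable before propagating the result" pattern. A rough sketch of that control flow follows. The `ExtensionHost` trait and `RecordingConn` mock are hypothetical stand-ins invented for this example; the real libsql connection API may differ in names and signatures.

```rust
/// Hypothetical minimal interface for a connection that can load extensions.
/// This is NOT the libsql API; it only mirrors the three steps in the review.
pub trait ExtensionHost {
    fn load_extension_enable(&mut self) -> Result<(), String>;
    fn load_extension(&mut self, path: &str) -> Result<(), String>;
    fn load_extension_disable(&mut self) -> Result<(), String>;
}

/// Load an extension on one connection, never leaving loading enabled.
pub fn load_vector_ext<C: ExtensionHost>(conn: &mut C, path: &str) -> Result<(), String> {
    conn.load_extension_enable()?;
    // Hold the result instead of using `?` so the disable call always runs.
    let loaded = conn.load_extension(path);
    let disabled = conn.load_extension_disable();
    // Report a load failure first, then any failure to disable.
    loaded?;
    disabled
}

/// Mock connection (illustrative only) that records whether loading is enabled.
#[derive(Default)]
pub struct RecordingConn {
    pub enabled: bool,
    pub fail_load: bool,
}

impl ExtensionHost for RecordingConn {
    fn load_extension_enable(&mut self) -> Result<(), String> {
        self.enabled = true;
        Ok(())
    }
    fn load_extension(&mut self, path: &str) -> Result<(), String> {
        if self.fail_load {
            Err(format!("cannot load {}", path))
        } else {
            Ok(())
        }
    }
    fn load_extension_disable(&mut self) -> Result<(), String> {
        self.enabled = false;
        Ok(())
    }
}
```

The design choice worth noting is that the load result is stored rather than returned early, so a failed load still passes through the disable step.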

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[wagl recall query] --> B[Resolve weights\nw_salience, w_dscore\nCLI > env > default]
    B --> C[Fetch canon items\nall canon:* tags]
    C --> D[Dedup canon\n1 per tag, text-dedup\ncap at max_canon]
    D --> E{Semantic search\navailable?}
    E -- Yes --> F[Vector similarity search\nget semantic_scores]
    E -- No --> G[Text LIKE query]
    F --> H{max_score >=\nfallback_threshold?}
    G --> H
    H -- No / text-only --> I[Keyword fallback\ntokenize query\nper-token LIKE searches]
    H -- Yes --> J[Filter out all canon: items]
    I --> J
    J --> K[parse_temporal_intent\ndetect time-window keywords]
    K --> L{Temporal hint\npresent?}
    L -- Yes --> M[Temporal scoring\nsemantic×0.35 + salience×w_s\n+ recency×0.10 + ev×0.10\n+ dscore×w_d + temporal_boost]
    L -- No --> N[Base scoring\nsemantic×0.5 + salience×w_s\n+ recency×0.15 + ev×0.15\n+ dscore×w_d]
    M --> O[Sort by rank_score DESC\nclamp composite_score to 0–1]
    N --> O
    O --> P[Truncate to\nmax limit, min_relevant]
    P --> Q{--multi-pass?}
    Q -- Yes --> R[Pass 2: high-valence\nquery_recent_high_valence]
    Q -- Yes --> S[Pass 3: open todos\nquery_open_todos]
    Q -- No --> T[Output JSON]
    R --> T
    S --> T
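The "CLI > env > default" precedence in the first flowchart node can be sketched as a small pure function. The function name, the choice to pass the environment value in as a string, and the rule that unparsable or non-finite values fall through to the default are all assumptions; the PR does not show how wagl resolves weights internally.

```rust
/// Resolve a scoring weight with precedence: CLI flag > environment > default.
/// `env_value` is the raw environment string, if any. Values that do not parse
/// as a finite number are ignored (an assumption, not confirmed behavior).
pub fn resolve_weight(cli: Option<f64>, env_value: Option<&str>, default: f64) -> f64 {
    cli.or_else(|| {
        env_value
            .and_then(|raw| raw.trim().parse::<f64>().ok())
            .filter(|w| w.is_finite())
    })
    .unwrap_or(default)
}
```

A caller might use it as `resolve_weight(cli_salience, std::env::var("WAGL_SALIENCE_WEIGHT").ok().as_deref(), 0.2)`, where `cli_salience` and the variable name `WAGL_SALIENCE_WEIGHT` are hypothetical.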
Comments Outside Diff (1)

  1. crates/cli/src/main.rs, lines 1319-1323

    Inaccurate comment: temporal weights and max-score ceiling are both wrong

    The comment states that salience weight is 0.15 in temporal mode and that out-of-window items score at most 0.70, but neither is correct.

    1. The code uses salience * w_salience where w_salience defaults to 0.2 (same as the non-temporal path), not 0.15 as the comment implies.
    2. With the actual default weights in temporal mode (semantic=0.35, salience=0.2, recency=0.10, ev=0.10, dscore=0.1), the maximum score for an out-of-window item is 0.35 + 0.2 + 0.10 + 0.10 + 0.1 = 0.85, not 0.70.

    The 0.70 ceiling only holds when w_salience=0.15 and w_dscore=0.0, which are non-default values. The comment appears to pre-date the addition of the configurable dscore weight and was never updated.
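The ceiling arithmetic above can be checked with a short sketch. The struct and method names are made up for this example; only the weight values are taken from the review comment, and the assumption is that every input signal is normalized to at most 1.0.

```rust
/// Temporal-mode weights as listed in the review comment.
/// The type itself is illustrative, not taken from the wagl source.
#[derive(Debug, Clone, Copy)]
pub struct TemporalWeights {
    pub semantic: f64,
    pub salience: f64,
    pub recency: f64,
    pub ev: f64,
    pub dscore: f64,
}

impl TemporalWeights {
    /// Defaults quoted in the review: 0.35, 0.2, 0.10, 0.10, 0.1.
    pub fn review_defaults() -> Self {
        Self { semantic: 0.35, salience: 0.2, recency: 0.10, ev: 0.10, dscore: 0.1 }
    }

    /// Highest score an out-of-window item can reach (no temporal boost),
    /// assuming each signal is at most 1.0; the sum is clamped to 0..=1.
    pub fn out_of_window_ceiling(&self) -> f64 {
        (self.semantic + self.salience + self.recency + self.ev + self.dscore).clamp(0.0, 1.0)
    }
}
```

With `review_defaults()` the ceiling is about 0.85; setting `salience: 0.15` and `dscore: 0.0` gives about 0.70, matching the stale comment.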

Last reviewed commit: 547085a

Greptile also left 4 inline comments on this PR.

GoZumie and others added 30 commits March 4, 2026 04:05
Bumps the npm_and_yarn group with 1 update in the /docs directory: [svgo](https://github.com/svg/svgo).


Updates `svgo` from 4.0.0 to 4.0.1
- [Release notes](https://github.com/svg/svgo/releases)
- [Commits](svg/svgo@v4.0.0...v4.0.1)

---
updated-dependencies:
- dependency-name: svgo
  dependency-version: 4.0.1
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: GoZumie <258471731+GoZumie@users.noreply.github.com>
- Add `temporal_intent` module to `wagl-core` with:
  - `TemporalHint` struct (window_start_hours, window_end_hours, boost)
  - `parse_temporal_intent()` recognizing: yesterday, today, recent/recently/lately,
    last night, this morning/afternoon/evening, last week (most-specific first,
    case-insensitive)
  - 14 unit tests covering all keywords, boundary values, and specificity ordering

- Update `wagl recall` composite scoring:
  - Without temporal hint: unchanged (semantic*0.5 + salience*0.2 + recency*0.15 + ev*0.15)
  - With temporal hint: semantic*0.35 + salience*0.15 + recency*0.10 + ev*0.10 + boost
    (up to 0.30 for in-window items, 0.0 for out-of-window items, capped at 1.0)
  - Items outside the temporal window score at most 0.70; in-window items score up to 1.00

- Emit `meta.temporal_intent` (null when no hint) and update `meta.weights`

- Add CLI integration tests:
  - yesterday hint makes 36h-old item rank above 30-day-old identical item
  - non-temporal query emits null temporal_intent and uses standard weights

Co-authored-by: GoZumie <258471731+GoZumie@users.noreply.github.com>
@GoZumie GoZumie requested a review from ChrisCompton as a code owner March 6, 2026 18:01
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 547085acaf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

GoZumie commented Mar 6, 2026

Addressed the review findings in commit 75538d4.

Fixed

  • Canonical retrieval now filters by tag prefix in SQL before LIMIT (query_by_tag_prefix), so newer non-canon rows cannot crowd out canon candidates.
  • Recall candidate fetch budgets now scale with effective_limit = max(limit, min_relevant) for semantic and text paths, so --min-relevant is satisfiable for larger values.
  • Removed redundant double clamp on composite_score.
  • Updated temporal parsing for last week to the previous-week window (168–336h), plus matching unit test updates.
  • Updated docs to include the dscore scoring term, configurable weight behavior, and clamping note.
  • Updated recall docs JSON example to include meta.weights.dscore.
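The first fix in the list (filtering by tag prefix before LIMIT) is easiest to see with an in-memory analogy. This is only a sketch of the ordering problem, not the SQL in query_by_tag_prefix; the `Row` type, the comma-separated tag format, and both function names are assumptions.

```rust
/// Illustrative row: an id plus comma-separated tags (format assumed).
pub struct Row {
    pub id: u32,
    pub tags: &'static str,
}

fn has_tag_prefix(row: &Row, prefix: &str) -> bool {
    row.tags.split(',').any(|tag| tag.trim().starts_with(prefix))
}

/// Limit first, then filter: newer non-matching rows can use up the
/// whole limit, so matching rows further down are never seen.
pub fn limit_then_filter(rows: &[Row], prefix: &str, limit: usize) -> Vec<u32> {
    rows.iter()
        .take(limit)
        .filter(|r| has_tag_prefix(r, prefix))
        .map(|r| r.id)
        .collect()
}

/// Filter first, then limit: the limit only counts matching rows.
pub fn filter_then_limit(rows: &[Row], prefix: &str, limit: usize) -> Vec<u32> {
    rows.iter()
        .filter(|r| has_tag_prefix(r, prefix))
        .take(limit)
        .map(|r| r.id)
        .collect()
}
```

Given rows ordered newest first where the only `canon:` row sits behind several non-canon rows, `limit_then_filter` with a small limit can return nothing, while `filter_then_limit` still returns the canon row.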

Added regression tests

  • recall_canon_not_starved_by_text_noise
  • recall_min_relevant_large_budget_is_satisfiable

Local validation

  • cargo test -p wagl --test recall_budget_smoke
  • cargo test -p wagl --test temporal_recall_smoke
  • cargo test -p wagl-core temporal_intent_tests

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75538d4a56

GoZumie commented Mar 6, 2026

Follow-up pass complete for newly surfaced conversations.

Implemented and pushed cd2208e:

  • Added semantic undersize backfill: when semantic candidates are present but below budget, recall now backfills from text query and de-duplicates IDs. This covers partial-embedding databases and keeps min_relevant satisfiable.
  • Added strict --created-at validation/normalization for put (RFC3339 required; normalized to UTC RFC3339 string) to avoid malformed timestamps degrading ordering/recency logic.
  • Added regression test: put_rejects_invalid_created_at_override.
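The semantic undersize backfill in the first bullet can be sketched as follows. This is a guess at the shape of the logic, not the code in commit cd2208e: the function name, the use of string IDs, and the rule that semantic hits keep their order ahead of text hits are assumptions.

```rust
use std::collections::HashSet;

/// Merge semantic candidates with text-query candidates up to `budget`,
/// keeping semantic hits first and dropping IDs that were already seen.
pub fn backfill_candidates<'a>(
    semantic: &[&'a str],
    text: &[&'a str],
    budget: usize,
) -> Vec<&'a str> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut merged = Vec::with_capacity(budget);
    for &id in semantic.iter().chain(text.iter()) {
        if merged.len() >= budget {
            break;
        }
        // `insert` returns false for an ID that is already present.
        if seen.insert(id) {
            merged.push(id);
        }
    }
    merged
}
```

For example, semantic hits `["a", "b"]` backfilled from text hits `["b", "c", "d"]` with a budget of 3 yield `["a", "b", "c"]`.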

Local validation:

  • cargo test -p wagl --test put_area_smoke
  • cargo test -p wagl --test recall_budget_smoke
  • cargo test -p wagl --test temporal_recall_smoke

I reviewed all current PR review conversations and resolved the addressed threads.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd2208e627

GoZumie commented Mar 6, 2026

I reviewed the two newly surfaced P2 findings.

They are valid refinement items but not release blockers for this PR, so I tracked them in backlog issues:

Given current behavior and risk profile, I recommend proceeding with PR approval and handling these in the next refinement pass.

@GoZumie GoZumie merged commit f79d655 into main Mar 6, 2026
6 checks passed

4 participants