fix: extend envelope stripping to broader channel/system markers#481
fix: extend envelope stripping to broader channel/system markers#481jlin53882 wants to merge 6 commits intoCortexReach:masterfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa7f6d8f4b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/smart-extractor.ts
Outdated
|
|
||
| // Pre-filter: skip known envelope patterns before embedding check | ||
| if (texts.length > 0) { | ||
| const nonEnvelopeTexts = texts.filter(t => !ENVELOPE_NOISE_PATTERNS.some(p => p.test(t))); |
There was a problem hiding this comment.
Do not discard mixed-content messages as envelope noise
SmartExtractor.filterNoiseByEmbedding now drops any text where ENVELOPE_NOISE_PATTERNS matches, but this runs before stripEnvelopeMetadata in the auto-capture path, so messages that contain both envelope lines and real user content are removed entirely instead of cleaned. For example, a single text containing System: [...] plus a user fact will be filtered out, which can reduce cleanTexts below minMessages and skip extraction, causing valid memories to be lost.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
P1 addressed — pushed. The pre-filter has been removed from filterNoiseByEmbedding. Reasoning: in the extraction path, stripEnvelopeMetadata runs before filterNoiseByEmbedding, so the pre-filter would fire on already-cleaned text. The ENVELOPE_NOISE_PATTERNS guard is now only used in the memory_store tool path (see below).
src/tools.ts
Outdated
|
|
||
| try { | ||
| // Guard: reject envelope metadata before any processing | ||
| if (ENVELOPE_NOISE_PATTERNS.some(p => p.test(text))) { |
There was a problem hiding this comment.
Avoid rejecting memory_store input on partial envelope match
The new memory_store guard rejects the whole write if any envelope regex matches (ENVELOPE_NOISE_PATTERNS.some(...)), but these patterns are line-level markers and can appear alongside legitimate memory content in the same payload. In that mixed-content case, the tool returns envelope_metadata_rejected and stores nothing, which drops explicit user-provided facts instead of preserving content after metadata stripping.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
P2 addressed — pushed. The memory_store guard now calls stripEnvelopeMetadata first, then checks if the result is empty. Mixed-content messages (envelope markers + real facts) will have their envelope stripped and the facts preserved. Only purely-envelope text gets rejected.
aa7f6d8 to
cb636e2
Compare
…se 2) - Extend stripEnvelopeMetadata() with 8 new patterns: <<<EXTERNAL_UNTRUSTED_CONTENT, <<<END_EXTERNAL_UNTRUSTED_CONTENT, Sender/Conversation info (untrusted metadata), Thread starter, Forwarded message context, [Queued messages while agent was busy] - Add ENVELOPE_NOISE_PATTERNS to noise-filter.ts for pre-embedding guard - Add memory_store tool guard in tools.ts (strip-then-check approach) - Add 8 regression test cases in strip-envelope-metadata.test.mjs - Fix PR CortexReach#444 regex bug: subagent wrapper lines now stripped via entire-line matching (/^\[Subagent Context|Subagent Task\].*$/gm) - P1 fix: remove pre-filter from filterNoiseByEmbedding (runs before stripEnvelopeMetadata in extraction path, would cause false positives) - P2 fix: memory_store guard now strips first then checks if empty, preserving mixed-content messages Fixes CortexReach#446
|
Current head I can reproduce it with:
This needs to be fixed before merge. |
AliceLJY
left a comment
There was a problem hiding this comment.
Review: PR #481 — fix: extend envelope stripping to broader channel/system markers
The envelope stripping patterns and test coverage are good, but there are two code-level issues that need to be fixed before merge.
Codex P1 — "Do not discard mixed-content messages as envelope noise"
Valid concern, and the author has addressed it (confirmed by jlin53882's comment). The pre-filter was removed from filterNoiseByEmbedding, and the memory_store guard now strips first and only rejects if nothing remains. The current diff reflects the fix.
Codex P2 — "Avoid rejecting memory_store input on partial envelope match"
Valid concern, and the author has addressed it. The memory_store guard in tools.ts now calls stripEnvelopeMetadata(text) first, then checks if the stripped result is empty. Mixed-content messages are preserved after stripping. Correct.
Blocking issues
1. Stray closing brace breaks filterNoiseByEmbedding
In src/smart-extractor.ts, the diff shows:
async filterNoiseByEmbedding(texts: string[]): Promise<string[]> {
const noiseBank = this.config.noiseBank;
+
+ }
+
if (!noiseBank || !noiseBank.initialized) return texts;There is a stray } on its own line that prematurely closes the filterNoiseByEmbedding method. The if (!noiseBank ...) line and everything after it would then be outside the method body, causing a compile error. This appears to be a leftover from removing the envelope pre-filter block — the block's closing brace was not removed along with its opening.
2. Character encoding corruption throughout smart-extractor.ts
Multiple comment lines have their em-dash characters (—) replaced with ??. Examples:
"Smart Memory Extractor ??LLM-powered extraction pipeline"(was— LLM-powered)"Pipeline: conversation ??LLM extract ??candidates ??dedup ??persist"(was→)"LLM returned zero candidates ??strongest noise signal"(was—)"Same brand, different item ??should not be deduped"(was→)- Chinese characters also corrupted:
"?�欢麦�??��??�烧鸡腿??"(was喜欢麦当劳的板烧鸡腿堡)
This is a systematic encoding issue — likely the file was saved or committed with an incorrect encoding. All corrupted characters need to be restored to their originals.
Non-blocking notes
ENVELOPE_NOISE_PATTERNSexport innoise-filter.tsis well-structured and the patterns are specific enough to avoid false positives.- The 8 new test cases in
strip-envelope-metadata.test.mjsprovide good regression coverage. tools.tslost its trailing newline (\ No newline at end of file). Minor.
Please fix the stray brace and the encoding corruption, then this is ready for another look.
…ractor.ts P1: Remove stray closing brace in filterNoiseByEmbedding that caused compile error P2: Restore em-dashes (—) and arrows (→) that were corrupted to ?? throughout comments P3: Fix garbled Chinese text in preference-slot guard comment T1: Add missing trailing newline to tools.ts
|
All blocking issues have been fixed:
Please re-review when you have a chance 🙏 |
AliceLJY
left a comment
There was a problem hiding this comment.
Re-reviewed after latest commit. Both blocking issues from my previous review are fixed:
- Stray closing brace — removed,
filterNoiseByEmbeddingmethod body is clean ✅ - Encoding corruption — CJK characters (
喜欢麦当劳的板烧鸡腿堡) and em-dashes (—) are now correct ✅
Code quality:
ENVELOPE_NOISE_PATTERNSexport is well-structuredstripEnvelopeMetadataextensions handle Discord/channel markers properly (per-line, no blanket rejection)tools.tsguard correctly strips first → rejects only if empty (P2 fix)- 8 new test cases provide solid regression coverage
LGTM. @rwmjhb 请 review 后合并。
Summary
Extends envelope metadata stripping (from PR #444) to cover 8 additional channel/system envelope patterns that were still leaking into memory extraction.
Changes
1. src/smart-extractor.ts — stripEnvelopeMetadata()
2. src/noise-filter.ts — ENVELOPE_NOISE_PATTERNS
3. src/tools.ts — registerMemoryStoreTool
4. test/strip-envelope-metadata.test.mjs
Testing
node --test test/strip-envelope-metadata.test.mjs
22/22 pass
References