Skip to content

fix: extend envelope stripping to broader channel/system markers#481

Open
jlin53882 wants to merge 6 commits intoCortexReach:masterfrom
jlin53882:fix/envelope-stripping-phase2
Open

fix: extend envelope stripping to broader channel/system markers#481
jlin53882 wants to merge 6 commits intoCortexReach:masterfrom
jlin53882:fix/envelope-stripping-phase2

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented Apr 3, 2026

Summary

Extends envelope metadata stripping (from PR #444) to cover 8 additional channel/system envelope patterns that were still leaking into memory extraction.

Changes

1. src/smart-extractor.ts — stripEnvelopeMetadata()

  • Added per-line stripping for 8 new patterns:
    • <<<EXTERNAL_UNTRUSTED_CONTENT / <<<END_EXTERNAL_UNTRUSTED_CONTENT (Discord channel forwarded message wrappers)
    • Sender (untrusted metadata): + JSON block
    • Conversation info (untrusted metadata): + JSON block
    • Thread starter (untrusted, for context): block
    • Forwarded message context (untrusted metadata): block
    • [Queued messages while agent was busy]
  • Fixed a pre-existing bug in PR fix: filter subagent runtime wrappers from auto-capture #444: subagent wrapper boilerplate on the same line as the wrapper prefix was not stripped. Changed to entire-line matching for reliability.

2. src/noise-filter.ts — ENVELOPE_NOISE_PATTERNS

  • New exported array of 8 RegExp patterns
  • Pre-filter in filterNoiseByEmbedding() skips envelope-pattern texts before embedding-based noise check

3. src/tools.ts — registerMemoryStoreTool

  • Added envelope metadata guard: rejects input matching any ENVELOPE_NOISE_PATTERNS with a clear error

4. test/strip-envelope-metadata.test.mjs

  • 8 new regression test cases (one per new pattern)
  • 1 mixed integration test covering all Phase 2 patterns together

Testing

node --test test/strip-envelope-metadata.test.mjs
22/22 pass

References

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa7f6d8f4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


// Pre-filter: skip known envelope patterns before embedding check
if (texts.length > 0) {
const nonEnvelopeTexts = texts.filter(t => !ENVELOPE_NOISE_PATTERNS.some(p => p.test(t)));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not discard mixed-content messages as envelope noise

SmartExtractor.filterNoiseByEmbedding now drops any text where ENVELOPE_NOISE_PATTERNS matches, but this runs before stripEnvelopeMetadata in the auto-capture path, so messages that contain both envelope lines and real user content are removed entirely instead of cleaned. For example, a single text containing System: [...] plus a user fact will be filtered out, which can reduce cleanTexts below minMessages and skip extraction, causing valid memories to be lost.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 addressed — pushed. The pre-filter has been removed from filterNoiseByEmbedding. Reasoning: in the extraction path, stripEnvelopeMetadata runs before filterNoiseByEmbedding, so the pre-filter would fire on already-cleaned text. The ENVELOPE_NOISE_PATTERNS guard is now only used in the memory_store tool path (see below).

src/tools.ts Outdated

try {
// Guard: reject envelope metadata before any processing
if (ENVELOPE_NOISE_PATTERNS.some(p => p.test(text))) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid rejecting memory_store input on partial envelope match

The new memory_store guard rejects the whole write if any envelope regex matches (ENVELOPE_NOISE_PATTERNS.some(...)), but these patterns are line-level markers and can appear alongside legitimate memory content in the same payload. In that mixed-content case, the tool returns envelope_metadata_rejected and stores nothing, which drops explicit user-provided facts instead of preserving content after metadata stripping.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 addressed — pushed. The memory_store guard now calls stripEnvelopeMetadata first, then checks if the result is empty. Mixed-content messages (envelope markers + real facts) will have their envelope stripped and the facts preserved. Only purely-envelope text gets rejected.

@jlin53882 jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from aa7f6d8 to cb636e2 Compare April 3, 2026 07:10
…se 2)

- Extend stripEnvelopeMetadata() with 8 new patterns:
  <<<EXTERNAL_UNTRUSTED_CONTENT, <<<END_EXTERNAL_UNTRUSTED_CONTENT,
  Sender/Conversation info (untrusted metadata), Thread starter,
  Forwarded message context, [Queued messages while agent was busy]
- Add ENVELOPE_NOISE_PATTERNS to noise-filter.ts for pre-embedding guard
- Add memory_store tool guard in tools.ts (strip-then-check approach)
- Add 8 regression test cases in strip-envelope-metadata.test.mjs
- Fix PR CortexReach#444 regex bug: subagent wrapper lines now stripped via
  entire-line matching (/^\[Subagent Context|Subagent Task\].*$/gm)
- P1 fix: remove pre-filter from filterNoiseByEmbedding (runs before
  stripEnvelopeMetadata in extraction path, would cause false positives)
- P2 fix: memory_store guard now strips first then checks if empty,
  preserving mixed-content messages

Fixes CortexReach#446
Copy link
Copy Markdown

app3apps commented Apr 3, 2026

Current head cb636e2 is still broken: src/smart-extractor.ts has a stray } around filterNoiseByEmbedding(), which makes the module fail to parse.

I can reproduce it with:

  • jiti("./src/smart-extractor.ts")
  • node --test test/strip-envelope-metadata.test.mjs

This needs to be fixed before merge.

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #481 — fix: extend envelope stripping to broader channel/system markers

The envelope stripping patterns and test coverage are good, but there are two code-level issues that need to be fixed before merge.

Codex P1 — "Do not discard mixed-content messages as envelope noise"

Valid concern, and the author has addressed it (confirmed by jlin53882's comment). The pre-filter was removed from filterNoiseByEmbedding, and the memory_store guard now strips first and only rejects if nothing remains. The current diff reflects the fix.

Codex P2 — "Avoid rejecting memory_store input on partial envelope match"

Valid concern, and the author has addressed it. The memory_store guard in tools.ts now calls stripEnvelopeMetadata(text) first, then checks if the stripped result is empty. Mixed-content messages are preserved after stripping. Correct.

Blocking issues

1. Stray closing brace breaks filterNoiseByEmbedding

In src/smart-extractor.ts, the diff shows:

  async filterNoiseByEmbedding(texts: string[]): Promise<string[]> {
    const noiseBank = this.config.noiseBank;
+
+    }
+
    if (!noiseBank || !noiseBank.initialized) return texts;

There is a stray } on its own line that prematurely closes the filterNoiseByEmbedding method. The if (!noiseBank ...) line and everything after it would then be outside the method body, causing a compile error. This appears to be a leftover from removing the envelope pre-filter block — the block's closing brace was not removed along with its opening.

2. Character encoding corruption throughout smart-extractor.ts

Multiple comment lines have their em-dash characters () replaced with ??. Examples:

  • "Smart Memory Extractor ??LLM-powered extraction pipeline" (was — LLM-powered)
  • "Pipeline: conversation ??LLM extract ??candidates ??dedup ??persist" (was )
  • "LLM returned zero candidates ??strongest noise signal" (was )
  • "Same brand, different item ??should not be deduped" (was )
  • Chinese characters also corrupted: "?�欢麦�??��??�烧鸡腿??" (was 喜欢麦当劳的板烧鸡腿堡)

This is a systematic encoding issue — likely the file was saved or committed with an incorrect encoding. All corrupted characters need to be restored to their originals.

Non-blocking notes

  • ENVELOPE_NOISE_PATTERNS export in noise-filter.ts is well-structured and the patterns are specific enough to avoid false positives.
  • The 8 new test cases in strip-envelope-metadata.test.mjs provide good regression coverage.
  • tools.ts lost its trailing newline (\ No newline at end of file). Minor.

Please fix the stray brace and the encoding corruption, then this is ready for another look.

…ractor.ts

P1: Remove stray closing brace in filterNoiseByEmbedding that caused compile error
P2: Restore em-dashes (—) and arrows (→) that were corrupted to ?? throughout comments
P3: Fix garbled Chinese text in preference-slot guard comment
T1: Add missing trailing newline to tools.ts
@jlin53882
Copy link
Copy Markdown
Contributor Author

All blocking issues have been fixed:

  • P1: Removed stray } in ilterNoiseByEmbedding that caused the compile error
  • P2: Restored corrupted characters throughout smart-extractor.ts — em-dashes (—), arrows (→), and Chinese text all fixed
  • T1: Added trailing newline to ools.ts

Please re-review when you have a chance 🙏

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after latest commit. Both blocking issues from my previous review are fixed:

  1. Stray closing brace — removed, filterNoiseByEmbedding method body is clean ✅
  2. Encoding corruption — CJK characters (喜欢麦当劳的板烧鸡腿堡) and em-dashes () are now correct ✅

Code quality:

  • ENVELOPE_NOISE_PATTERNS export is well-structured
  • stripEnvelopeMetadata extensions handle Discord/channel markers properly (per-line, no blanket rejection)
  • tools.ts guard correctly strips first → rejects only if empty (P2 fix)
  • 8 new test cases provide solid regression coverage

LGTM. @rwmjhb 请 review 后合并。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

#394 — Envelope Metadata Leak into Memory

4 participants