Skip to content
This repository was archived by the owner on Mar 2, 2026. It is now read-only.

fix: deduplicate on compile — update existing articles instead of creating duplicates (#73)#582

Merged
zonk1024 merged 2 commits intomainfrom
fix/73-compile-dedup
Feb 28, 2026
Merged

fix: deduplicate on compile — update existing articles instead of creating duplicates (#73)#582
zonk1024 merged 2 commits intomainfrom
fix/73-compile-dedup

Conversation

@zonk1024
Copy link
Contributor

Summary

Implements dedup-on-compile (issue #73): when compile_article() produces content that is highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.

Changes

src/valence/core/compilation.py

  • DEDUP_SIMILARITY_THRESHOLD = 0.90 — module-level constant (configurable via the threshold parameter)
  • _find_similar_article(content, threshold) — async helper that:
    • Embeds the compiled content using generate_embedding() (OpenAI provider)
    • Queries articles table with 1 - (embedding <=> %s::vector) > threshold (pgvector cosine similarity)
    • Returns the best-matching article dict, or None
    • Gracefully degrades on embedding or DB failure (logs warning, returns None)
  • _update_existing_article(...) — async helper that:
    • UPDATEs the existing article with new content, title, confidence, epistemic_type, version++
    • Additively links new sources via _link_sources() (does not clear existing links)
    • Records an 'updated' mutation with a dedup summary
    • Queues a split if the updated article exceeds max_tokens
  • compile_article() — after LLM produces article_content / article_title and before _create_article_row, runs the dedup check and short-circuits if a match is found

tests/core/test_compilation_dedup.py (new, 10 tests)

  • TestDedupConstant — threshold is 0.90
  • TestFindSimilarArticle — returns None on no match, returns dict on match, graceful degradation on embedding failure, graceful degradation on DB failure, custom threshold is forwarded
  • TestCompileArticleDedupPath — dedup fires and updates existing article, novel content creates new article, embedding failure causes graceful fallback to new article, dedup logs info message with article ID

Test Results

Closes ourochronos/tracking#73

…ating duplicates (#73)

- Add DEDUP_SIMILARITY_THRESHOLD = 0.90 constant
- Add _find_similar_article() helper: embeds compiled content and queries
  articles table using pgvector cosine similarity (1 - embedding <=> vector)
- Add _update_existing_article() helper: updates article in-place on dedup hit,
  additively links new sources, records 'updated' mutation
- Modify compile_article() to run dedup check after LLM produces content but
  before _create_article_row — returns updated article if similar one exists
- Graceful degradation: embedding failures are caught and logged; compilation
  proceeds normally creating a new article
- Tests: test_compilation_dedup.py with 10 tests covering dedup fires/updates,
  novel content creates new article, embedding failure graceful degradation,
  and info logging when dedup triggers

Closes ourochronos/tracking#73
Copilot AI review requested due to automatic review settings February 28, 2026 19:16
@zonk1024 zonk1024 merged commit c577f67 into main Feb 28, 2026
5 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements deduplication-on-compile (referenced as tracking issue #73): when compile_article() produces content highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.

Changes:

  • Adds _find_similar_article() helper that embeds compiled content and queries pgvector for cosine similarity above a configurable threshold
  • Adds _update_existing_article() helper that updates an existing article's content, increments the version, additively links new sources, and queues a split mutation if needed
  • Hooks the dedup check into compile_article() after LLM compilation and before new article creation
  • Adds a new test file test_compilation_dedup.py with 10 tests covering the constant, the similarity helper, and three integration paths

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/valence/core/compilation.py Adds DEDUP_SIMILARITY_THRESHOLD, _find_similar_article(), _update_existing_article(), and inserts the dedup check into compile_article()
tests/core/test_compilation_dedup.py New test file with unit tests for the threshold constant and _find_similar_article(), plus integration tests for compile_article() dedup paths

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,306 @@
"""Tests for dedup-on-compile feature (issue #73).
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says "Closes ourochronos/tracking#73" (a separate tracking repository), but the in-code comment here says (issue #73). In this repository, issue #73 is "Implement MLS group creation for federations" — an entirely unrelated feature. Without the full repository qualifier, the reference is misleading to anyone browsing the source. The comment should use the full reference (ourochronos/tracking#73) to make the provenance clear.

Copilot uses AI. Check for mistakes.
relationships,
epistemic_type,
conf,
)
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the dedup check fires (line 660-676), the code immediately returns the result of _update_existing_article for the first article, but completely skips creating any extra articles that the LLM may have returned in articles_data[1:]. In the non-dedup path (line 709), _create_extra_articles is called to handle those additional articles, but the dedup early-return bypasses this entirely.

If the LLM produces multiple articles (e.g., covering distinct topics) and the first one matches an existing article, all the additional articles are silently dropped. To fix this, the dedup branch should either also process the remaining articles from articles_data[1:] after updating the existing one, or at minimum log a warning that extra articles were discarded.

Suggested change
)
)
# Note: if the LLM produced multiple articles, the additional ones are currently
# not created in the dedup path. Log this explicitly so it is visible.
try:
extra_count = len(articles_data) - 1 # type: ignore[name-defined]
except Exception:
extra_count = 0
if extra_count > 0:
logger.warning(
"Dedup: updated existing article %s; discarding %d additional compiled article(s) "
"from this compilation run",
existing["id"],
extra_count,
)

Copilot uses AI. Check for mistakes.
Comment on lines 107 to +111
# Right-sizing config
# ---------------------------------------------------------------------------

DEDUP_SIMILARITY_THRESHOLD = 0.90

Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DEDUP_SIMILARITY_THRESHOLD constant is placed inside the "Right-sizing config" section (between _ensure_degraded_column and DEFAULT_RIGHT_SIZING). This is misleading because it belongs logically to the dedup feature, not to right-sizing. The dedup helpers section already exists at line 168 as a clearly delineated block — the constant should be placed there, above _find_similar_article, to keep the dedup logic self-contained and avoid confusion with unrelated sizing constants.

Suggested change
# Right-sizing config
# ---------------------------------------------------------------------------
DEDUP_SIMILARITY_THRESHOLD = 0.90
# Dedup config
# ---------------------------------------------------------------------------
DEDUP_SIMILARITY_THRESHOLD = 0.90
# ---------------------------------------------------------------------------
# Right-sizing config
# ---------------------------------------------------------------------------

Copilot uses AI. Check for mistakes.
Comment on lines +190 to +204
cur.execute(
"""
SELECT id, title, content, embedding
FROM articles
WHERE status = 'active'
AND embedding IS NOT NULL
AND 1 - (embedding <=> %s::vector) > %s
ORDER BY embedding <=> %s::vector
LIMIT 1
""",
(vec_str, threshold, vec_str),
)
row = cur.fetchone()
if row:
return dict(row)
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SELECT query fetches content and embedding columns from the articles table, but the return value of _find_similar_article is only used in compile_article to access existing["id"] and existing.get("title"). The content and embedding fields are never used, yet embedding in particular can be a very large high-dimensional float array. This adds unnecessary data transfer overhead on every dedup check.

Removing content and embedding from the SELECT list (keeping just id, title) would reduce the data transferred per query.

Copilot uses AI. Check for mistakes.


# ---------------------------------------------------------------------------
# Dedup helpers (issue #73)
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code comment on this line references issue #73 which, in this repository, corresponds to "Implement MLS group creation for federations" — an entirely unrelated feature. The PR description clarifies it closes ourochronos/tracking#73 (a separate tracking repository), but the bare (issue #73) in the source code comment is ambiguous and misleading. The comment should either be removed or include the full qualified reference, e.g., (ourochronos/tracking#73), to avoid confusion with the local issue #73.

Suggested change
# Dedup helpers (issue #73)
# Dedup helpers (ourochronos/tracking#73)

Copilot uses AI. Check for mistakes.
Comment on lines +660 to +676
existing = await _find_similar_article(article_content)
if existing:
logger.info(
"Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating",
existing["id"],
existing.get("title"),
)
result = await _update_existing_article(
existing["id"],
article_content,
article_title,
sources,
relationships,
epistemic_type,
conf,
)
return result
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dedup check runs regardless of whether is_degraded is True (line 660). When the LLM is unavailable and fallback concatenation is used, is_degraded = True and the fallback content is compared for similarity. If dedup fires on degraded content, _update_existing_article is called without setting the degraded flag or queuing a recompile_degraded mutation — both of which the non-dedup path handles at lines 692–704.

This means a well-compiled existing article could be silently overwritten with raw concatenated fallback content, with no recompile queued to fix it later. The dedup check should be skipped (or guarded) when is_degraded is True, similar to how the code intentionally degrades gracefully in other scenarios.

Suggested change
existing = await _find_similar_article(article_content)
if existing:
logger.info(
"Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating",
existing["id"],
existing.get("title"),
)
result = await _update_existing_article(
existing["id"],
article_content,
article_title,
sources,
relationships,
epistemic_type,
conf,
)
return result
# Skip dedup when compilation is degraded (e.g., fallback concatenation),
# so we don't overwrite a good article with degraded content without
# marking it degraded and queuing a recompile.
if not is_degraded:
existing = await _find_similar_article(article_content)
if existing:
logger.info(
"Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating",
existing["id"],
existing.get("title"),
)
result = await _update_existing_article(
existing["id"],
article_content,
article_title,
sources,
relationships,
epistemic_type,
conf,
)
return result

Copilot uses AI. Check for mistakes.
zonk1024 added a commit that referenced this pull request Feb 28, 2026
PR #582 added _find_similar_article() inside compile_article(), which
calls generate_embedding() and makes a real API call.  Add an autouse
fixture that patches _find_similar_article to return None so existing
tests keep testing compilation logic without a live embedding API.

Dedup behaviour is exercised by test_compilation_dedup.py.
zonk1024 added a commit that referenced this pull request Feb 28, 2026
PR #582 added _find_similar_article() inside compile_article(), which
calls generate_embedding() and makes a real API call.  Add an autouse
fixture that patches _find_similar_article to return None so existing
tests keep testing compilation logic without a live embedding API.

Dedup behaviour is exercised by test_compilation_dedup.py.
zonk1024 added a commit that referenced this pull request Feb 28, 2026
PR #582 added _find_similar_article() inside compile_article(), which
calls generate_embedding() and makes a real API call.  Add an autouse
fixture that patches _find_similar_article to return None so existing
tests keep testing compilation logic without a live embedding API.

Dedup behaviour is exercised by test_compilation_dedup.py.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants