fix: deduplicate on compile — update existing articles instead of creating duplicates (#73)#582
fix: deduplicate on compile — update existing articles instead of creating duplicates (#73)#582
Conversation
…ating duplicates (#73) - Add DEDUP_SIMILARITY_THRESHOLD = 0.90 constant - Add _find_similar_article() helper: embeds compiled content and queries articles table using pgvector cosine similarity (1 - embedding <=> vector) - Add _update_existing_article() helper: updates article in-place on dedup hit, additively links new sources, records 'updated' mutation - Modify compile_article() to run dedup check after LLM produces content but before _create_article_row — returns updated article if similar one exists - Graceful degradation: embedding failures are caught and logged; compilation proceeds normally creating a new article - Tests: test_compilation_dedup.py with 10 tests covering dedup fires/updates, novel content creates new article, embedding failure graceful degradation, and info logging when dedup triggers Closes ourochronos/tracking#73
There was a problem hiding this comment.
Pull request overview
This PR implements deduplication-on-compile (referenced as tracking issue #73): when compile_article() produces content highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.
Changes:
- Adds
_find_similar_article()helper that embeds compiled content and queries pgvector for cosine similarity above a configurable threshold - Adds
_update_existing_article()helper that updates an existing article's content, increments the version, additively links new sources, and queues a split mutation if needed - Hooks the dedup check into
compile_article()after LLM compilation and before new article creation - Adds a new test file
test_compilation_dedup.pywith 10 tests covering the constant, the similarity helper, and three integration paths
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/valence/core/compilation.py |
Adds DEDUP_SIMILARITY_THRESHOLD, _find_similar_article(), _update_existing_article(), and inserts the dedup check into compile_article() |
tests/core/test_compilation_dedup.py |
New test file with unit tests for the threshold constant and _find_similar_article(), plus integration tests for compile_article() dedup paths |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,306 @@ | |||
| """Tests for dedup-on-compile feature (issue #73). | |||
There was a problem hiding this comment.
The PR description says "Closes ourochronos/tracking#73" (a separate tracking repository), but the in-code comment here says (issue #73). In this repository, issue #73 is "Implement MLS group creation for federations" — an entirely unrelated feature. Without the full repository qualifier, the reference is misleading to anyone browsing the source. The comment should use the full reference (ourochronos/tracking#73) to make the provenance clear.
| relationships, | ||
| epistemic_type, | ||
| conf, | ||
| ) |
There was a problem hiding this comment.
When the dedup check fires (line 660-676), the code immediately returns the result of _update_existing_article for the first article, but completely skips creating any extra articles that the LLM may have returned in articles_data[1:]. In the non-dedup path (line 709), _create_extra_articles is called to handle those additional articles, but the dedup early-return bypasses this entirely.
If the LLM produces multiple articles (e.g., covering distinct topics) and the first one matches an existing article, all the additional articles are silently dropped. To fix this, the dedup branch should either also process the remaining articles from articles_data[1:] after updating the existing one, or at minimum log a warning that extra articles were discarded.
| ) | |
| ) | |
| # Note: if the LLM produced multiple articles, the additional ones are currently | |
| # not created in the dedup path. Log this explicitly so it is visible. | |
| try: | |
| extra_count = len(articles_data) - 1 # type: ignore[name-defined] | |
| except Exception: | |
| extra_count = 0 | |
| if extra_count > 0: | |
| logger.warning( | |
| "Dedup: updated existing article %s; discarding %d additional compiled article(s) " | |
| "from this compilation run", | |
| existing["id"], | |
| extra_count, | |
| ) |
| # Right-sizing config | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
| DEDUP_SIMILARITY_THRESHOLD = 0.90 | ||
|
|
There was a problem hiding this comment.
The DEDUP_SIMILARITY_THRESHOLD constant is placed inside the "Right-sizing config" section (between _ensure_degraded_column and DEFAULT_RIGHT_SIZING). This is misleading because it belongs logically to the dedup feature, not to right-sizing. The dedup helpers section already exists at line 168 as a clearly delineated block — the constant should be placed there, above _find_similar_article, to keep the dedup logic self-contained and avoid confusion with unrelated sizing constants.
| # Right-sizing config | |
| # --------------------------------------------------------------------------- | |
| DEDUP_SIMILARITY_THRESHOLD = 0.90 | |
| # Dedup config | |
| # --------------------------------------------------------------------------- | |
| DEDUP_SIMILARITY_THRESHOLD = 0.90 | |
| # --------------------------------------------------------------------------- | |
| # Right-sizing config | |
| # --------------------------------------------------------------------------- |
| cur.execute( | ||
| """ | ||
| SELECT id, title, content, embedding | ||
| FROM articles | ||
| WHERE status = 'active' | ||
| AND embedding IS NOT NULL | ||
| AND 1 - (embedding <=> %s::vector) > %s | ||
| ORDER BY embedding <=> %s::vector | ||
| LIMIT 1 | ||
| """, | ||
| (vec_str, threshold, vec_str), | ||
| ) | ||
| row = cur.fetchone() | ||
| if row: | ||
| return dict(row) |
There was a problem hiding this comment.
The SELECT query fetches content and embedding columns from the articles table, but the return value of _find_similar_article is only used in compile_article to access existing["id"] and existing.get("title"). The content and embedding fields are never used, yet embedding in particular can be a very large high-dimensional float array. This adds unnecessary data transfer overhead on every dedup check.
Removing content and embedding from the SELECT list (keeping just id, title) would reduce the data transferred per query.
|
|
||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # Dedup helpers (issue #73) |
There was a problem hiding this comment.
The code comment on this line references issue #73 which, in this repository, corresponds to "Implement MLS group creation for federations" — an entirely unrelated feature. The PR description clarifies it closes ourochronos/tracking#73 (a separate tracking repository), but the bare (issue #73) in the source code comment is ambiguous and misleading. The comment should either be removed or include the full qualified reference, e.g., (ourochronos/tracking#73), to avoid confusion with the local issue #73.
| # Dedup helpers (issue #73) | |
| # Dedup helpers (ourochronos/tracking#73) |
| existing = await _find_similar_article(article_content) | ||
| if existing: | ||
| logger.info( | ||
| "Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating", | ||
| existing["id"], | ||
| existing.get("title"), | ||
| ) | ||
| result = await _update_existing_article( | ||
| existing["id"], | ||
| article_content, | ||
| article_title, | ||
| sources, | ||
| relationships, | ||
| epistemic_type, | ||
| conf, | ||
| ) | ||
| return result |
There was a problem hiding this comment.
The dedup check runs regardless of whether is_degraded is True (line 660). When the LLM is unavailable and fallback concatenation is used, is_degraded = True and the fallback content is compared for similarity. If dedup fires on degraded content, _update_existing_article is called without setting the degraded flag or queuing a recompile_degraded mutation — both of which the non-dedup path handles at lines 692–704.
This means a well-compiled existing article could be silently overwritten with raw concatenated fallback content, with no recompile queued to fix it later. The dedup check should be skipped (or guarded) when is_degraded is True, similar to how the code intentionally degrades gracefully in other scenarios.
| existing = await _find_similar_article(article_content) | |
| if existing: | |
| logger.info( | |
| "Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating", | |
| existing["id"], | |
| existing.get("title"), | |
| ) | |
| result = await _update_existing_article( | |
| existing["id"], | |
| article_content, | |
| article_title, | |
| sources, | |
| relationships, | |
| epistemic_type, | |
| conf, | |
| ) | |
| return result | |
| # Skip dedup when compilation is degraded (e.g., fallback concatenation), | |
| # so we don't overwrite a good article with degraded content without | |
| # marking it degraded and queuing a recompile. | |
| if not is_degraded: | |
| existing = await _find_similar_article(article_content) | |
| if existing: | |
| logger.info( | |
| "Dedup: found existing article %s ('%s') similar to new compilation, updating instead of creating", | |
| existing["id"], | |
| existing.get("title"), | |
| ) | |
| result = await _update_existing_article( | |
| existing["id"], | |
| article_content, | |
| article_title, | |
| sources, | |
| relationships, | |
| epistemic_type, | |
| conf, | |
| ) | |
| return result |
PR #582 added _find_similar_article() inside compile_article(), which calls generate_embedding() and makes a real API call. Add an autouse fixture that patches _find_similar_article to return None so existing tests keep testing compilation logic without a live embedding API. Dedup behaviour is exercised by test_compilation_dedup.py.
PR #582 added _find_similar_article() inside compile_article(), which calls generate_embedding() and makes a real API call. Add an autouse fixture that patches _find_similar_article to return None so existing tests keep testing compilation logic without a live embedding API. Dedup behaviour is exercised by test_compilation_dedup.py.
PR #582 added _find_similar_article() inside compile_article(), which calls generate_embedding() and makes a real API call. Add an autouse fixture that patches _find_similar_article to return None so existing tests keep testing compilation logic without a live embedding API. Dedup behaviour is exercised by test_compilation_dedup.py.
Summary
Implements dedup-on-compile (issue #73): when
compile_article()produces content that is highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.Changes
src/valence/core/compilation.pyDEDUP_SIMILARITY_THRESHOLD = 0.90— module-level constant (configurable via the threshold parameter)_find_similar_article(content, threshold)— async helper that:generate_embedding()(OpenAI provider)articlestable with1 - (embedding <=> %s::vector) > threshold(pgvector cosine similarity)NoneNone)_update_existing_article(...)— async helper that:_link_sources()(does not clear existing links)'updated'mutation with a dedup summarymax_tokenscompile_article()— after LLM producesarticle_content/article_titleand before_create_article_row, runs the dedup check and short-circuits if a match is foundtests/core/test_compilation_dedup.py(new, 10 tests)TestDedupConstant— threshold is 0.90TestFindSimilarArticle— returns None on no match, returns dict on match, graceful degradation on embedding failure, graceful degradation on DB failure, custom threshold is forwardedTestCompileArticleDedupPath— dedup fires and updates existing article, novel content creates new article, embedding failure causes graceful fallback to new article, dedup logs info message with article IDTest Results
Closes ourochronos/tracking#73