fix: deduplicate on compile — update existing articles instead of creating duplicates (#73) by zonk1024 · Pull Request #582 · ourochronos/valence

zonk1024 · 2026-02-28T19:16:36Z

Summary

Implements dedup-on-compile (issue #73): when compile_article() produces content that is highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.

Changes

`src/valence/core/compilation.py`

- `DEDUP_SIMILARITY_THRESHOLD = 0.90`: module-level default (can be overridden per call via the `threshold` parameter)
- `_find_similar_article(content, threshold)`: async helper that:
  - Embeds the compiled content using `generate_embedding()` (OpenAI provider)
  - Queries the `articles` table with `1 - (embedding <=> %s::vector) > threshold` (pgvector cosine similarity)
  - Returns the best-matching article dict, or `None`
  - Degrades gracefully on embedding or DB failure (logs a warning, returns `None`)
- `_update_existing_article(...)`: async helper that:
  - UPDATEs the existing article with the new content, title, confidence, and epistemic_type, and increments the version
  - Additively links new sources via `_link_sources()` (does not clear existing links)
  - Records an 'updated' mutation with a dedup summary
  - Queues a split if the updated article exceeds `max_tokens`
- `compile_article()`: after the LLM produces `article_content` / `article_title` and before `_create_article_row`, runs the dedup check and short-circuits if a match is found
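The lookup described above can be illustrated with a minimal in-memory sketch. It is not the PR's code: the real helper queries Postgres through pgvector, where `<=>` is cosine distance and `1 - distance` is cosine similarity. Here `find_similar_article`, the `embed` callable, and the `articles` list of dicts are hypothetical stand-ins chosen so the example runs without a database or an embedding provider.

```python
import asyncio
import logging
import math
from typing import Awaitable, Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)

DEDUP_SIMILARITY_THRESHOLD = 0.90  # value stated in the PR description


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Same quantity as pgvector's 1 - (a <=> b), for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


async def find_similar_article(
    content: str,
    articles: List[dict],  # hypothetical in-memory stand-in for the articles table
    embed: Callable[[str], Awaitable[List[float]]],  # hypothetical embedding callable
    threshold: float = DEDUP_SIMILARITY_THRESHOLD,
) -> Optional[dict]:
    """Return the best match strictly above the threshold, or None."""
    try:
        query = await embed(content)
    except Exception as exc:
        # Degrade gracefully: log a warning and report "no match".
        logger.warning("dedup embedding failed: %s", exc)
        return None
    scored = [(cosine_similarity(query, a["embedding"]), a) for a in articles]
    matches = [(score, a) for score, a in scored if score > threshold]
    if not matches:
        return None
    return max(matches, key=lambda pair: pair[0])[1]
```

Returning `None` on both "no match" and "lookup failed" means the caller needs only one branch: anything other than a match falls through to normal article creation.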

`tests/core/test_compilation_dedup.py` (new, 10 tests)

- `TestDedupConstant`: threshold is 0.90
- `TestFindSimilarArticle`: returns None on no match, returns dict on match, graceful degradation on embedding failure, graceful degradation on DB failure, custom threshold is forwarded
- `TestCompileArticleDedupPath`: dedup fires and updates existing article, novel content creates new article, embedding failure causes graceful fallback to new article, dedup logs info message with article ID

Test Results

Closes ourochronos/tracking#73

…ating duplicates (#73)

- Add DEDUP_SIMILARITY_THRESHOLD = 0.90 constant
- Add _find_similar_article() helper: embeds compiled content and queries articles table using pgvector cosine similarity (1 - (embedding <=> vector))
- Add _update_existing_article() helper: updates article in-place on dedup hit, additively links new sources, records 'updated' mutation
- Modify compile_article() to run dedup check after LLM produces content but before _create_article_row; returns updated article if a similar one exists
- Graceful degradation: embedding failures are caught and logged; compilation proceeds normally, creating a new article
- Tests: test_compilation_dedup.py with 10 tests covering dedup fires/updates, novel content creates new article, embedding failure graceful degradation, and info logging when dedup triggers

Closes ourochronos/tracking#73

Copilot

Pull request overview

This PR implements deduplication-on-compile (referenced as tracking issue #73): when compile_article() produces content highly similar to an existing active article, it updates that article in-place instead of creating a duplicate.

Changes:

- Adds `_find_similar_article()` helper that embeds compiled content and queries pgvector for cosine similarity above a configurable threshold
- Adds `_update_existing_article()` helper that updates an existing article's content, increments the version, additively links new sources, and queues a split mutation if needed
- Hooks the dedup check into `compile_article()` after LLM compilation and before new article creation
- Adds a new test file `test_compilation_dedup.py` with 10 tests covering the constant, the similarity helper, and three integration paths
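As a rough illustration of the hook and of how its paths might be exercised with mocks, the sketch below injects the three collaborators as callables. `compile_with_dedup` and its parameters are hypothetical names for this example only; the PR's `compile_article()` has its own signature and also handles LLM compilation, source linking, and mutations, none of which is modeled here.

```python
import asyncio
from unittest.mock import AsyncMock


async def compile_with_dedup(content, title, find_similar, update_existing, create_new):
    """Hypothetical control flow: update on a dedup hit, otherwise create."""
    existing = await find_similar(content)
    if existing is not None:
        # Short-circuit: no new article row is created.
        return await update_existing(existing["id"], content, title)
    return await create_new(content, title)


async def run_paths():
    update = AsyncMock(return_value={"id": "art-1", "version": 2})
    create = AsyncMock(return_value={"id": "art-2", "version": 1})

    # Path 1: a similar article exists, so it is updated in place.
    hit = await compile_with_dedup(
        "body", "Title", AsyncMock(return_value={"id": "art-1"}), update, create
    )
    create.assert_not_awaited()

    # Path 2: no match (or a degraded lookup returning None) creates a new article.
    miss = await compile_with_dedup(
        "body", "Title", AsyncMock(return_value=None), update, create
    )
    update.assert_awaited_once()
    return hit, miss


hit, miss = asyncio.run(run_paths())
```

Because a failed lookup is reported as `None`, the fallback path after an embedding failure is the same branch as the novel-content path.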

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `src/valence/core/compilation.py` | Adds `DEDUP_SIMILARITY_THRESHOLD`, `_find_similar_article()`, `_update_existing_article()`, and inserts the dedup check into `compile_article()` |
| `tests/core/test_compilation_dedup.py` | New test file with unit tests for the threshold constant and `_find_similar_article()`, plus integration tests for `compile_article()` dedup paths |


Copilot · 2026-02-28T19:22:49Z

tests/core/test_compilation_dedup.py

@@ -0,0 +1,306 @@
+"""Tests for dedup-on-compile feature (issue #73).


The PR description says "Closes ourochronos/tracking#73" (a separate tracking repository), but the in-code comment here says (issue #73). In this repository, issue #73 is "Implement MLS group creation for federations" — an entirely unrelated feature. Without the full repository qualifier, the reference is misleading to anyone browsing the source. The comment should use the full reference (ourochronos/tracking#73) to make the provenance clear.

Copilot · 2026-02-28T19:22:50Z