fix: hash multi-document files in change detection by octo-patch · Pull Request #299 · yichuan-w/LEANN

octo-patch · 2026-04-03T02:57:37Z

Fixes #290

Problem

FileSynchronizer.generate_file_hashes() silently skipped any file that SimpleDirectoryReader returns as multiple documents (e.g., multi-page PDFs). The old code:

if len(file) > 1:
    continue  # SimpleDirectoryReader can load more than 1 documents for weird file types e.g. PDFs

When there is no existing snapshot (self.tree is None), detect_changes() reports every key in file_hashes as "new". If every file was skipped, file_hashes is empty, so detect_changes() returns ([], [], []), meaning no changes were detected. The CLI then prints "Index up to date." and exits without building the index.

This is the root cause of the first-build failure reported in #290: if the source directory contains only multi-document file types (e.g., PDFs), the index is never built unless --force is used.
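
For reviewers who want to see the failure in isolation, here is a minimal sketch of the old behaviour. It does not use the LEANN code: Doc, old_generate_file_hashes, and detect_changes_first_build are hypothetical stand-ins, hash_data is assumed to be a plain SHA-256 helper, and the (new, modified, removed) tuple order is also an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (only .text is used)."""
    text: str


def hash_data(text: str) -> str:
    # Assumed helper: the real hash_data may use a different algorithm.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def old_generate_file_hashes(files: dict[str, list[Doc]]) -> dict[str, str]:
    """Mimics the old behaviour: files with more than one document are skipped."""
    file_hashes = {}
    for file_path, docs in files.items():
        if len(docs) > 1:
            continue  # the skip that this PR removes
        file_hashes[file_path] = hash_data(docs[0].text)
    return file_hashes


def detect_changes_first_build(file_hashes: dict[str, str]):
    """With no snapshot, every key counts as new; nothing is modified or removed."""
    return list(file_hashes), [], []


# A directory that contains only a two-page PDF.
pdf_only = {"report.pdf": [Doc("page 1"), Doc("page 2")]}
first_scan = detect_changes_first_build(old_generate_file_hashes(pdf_only))
```

Because the only file is skipped, first_scan contains three empty lists, which is the state that leads to "Index up to date." on a first build.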

Solution

Instead of skipping multi-document files, combine the text of all documents produced for a single file before hashing. This correctly tracks every file, regardless of how many documents SimpleDirectoryReader produces from it.

combined_text = "".join(doc.text for doc in file)
file_hash = hash_data(combined_text)
file_hashes[file_path] = file_hash

The combined hash changes whenever any page/section of the file changes, so modification detection remains accurate.
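
A small sketch of why the combined hash still detects edits to a single page. It assumes hash_data behaves like a SHA-256 helper and uses SimpleNamespace objects as hypothetical stand-ins for the documents returned by SimpleDirectoryReader; combined_hash is an illustrative name, not a function in the codebase.

```python
import hashlib
from types import SimpleNamespace


def hash_data(text):
    # Assumed stand-in for the project's hash_data helper.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def combined_hash(docs):
    """Hash all documents of one file as a single string, as in the fix."""
    return hash_data("".join(doc.text for doc in docs))


before = [SimpleNamespace(text="page 1"), SimpleNamespace(text="page 2")]
after = [SimpleNamespace(text="page 1"), SimpleNamespace(text="page 2, edited")]

hash_before = combined_hash(before)
hash_after = combined_hash(after)

# A single-document file hashes exactly as it did before the change.
hash_single = combined_hash([SimpleNamespace(text="page 1")])
```

Editing the second page changes the joined text, so hash_before and hash_after differ, while a single-document file yields the same value as hashing its text directly.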

Testing

Directories containing only PDFs (or other multi-document file types) now produce non-empty file_hashes on the first scan, causing detect_changes to correctly report all files as new and trigger a full build.
Single-document files (.txt, .md, etc.) are unaffected — behaviour is identical to before.
Subsequent builds correctly detect modifications to multi-document files because the combined hash changes when any part of the file changes.
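
The checks above could be expressed roughly as follows. This is only a sketch under stated assumptions: generate_file_hashes here takes an already-loaded mapping of path to documents, diff_hashes is a hypothetical flat comparison standing in for detect_changes (the real implementation compares against self.tree), and the (added, modified, removed) order is assumed.

```python
import hashlib
from types import SimpleNamespace


def hash_data(text):
    # Assumed stand-in for the project's hash_data helper.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def generate_file_hashes(files):
    """Post-fix behaviour: every file gets one hash over all of its documents."""
    return {
        path: hash_data("".join(d.text for d in docs))
        for path, docs in files.items()
    }


def diff_hashes(old, new):
    """Hypothetical flat comparison returning (added, modified, removed) paths."""
    added = sorted(p for p in new if p not in old)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    removed = sorted(p for p in old if p not in new)
    return added, modified, removed


def doc(text):
    # Minimal document stand-in exposing only a .text attribute.
    return SimpleNamespace(text=text)


scan1 = generate_file_hashes(
    {"a.pdf": [doc("p1"), doc("p2")], "notes.md": [doc("hello")]}
)
scan2 = generate_file_hashes(
    {"a.pdf": [doc("p1"), doc("p2 changed")], "notes.md": [doc("hello")]}
)

first_build = diff_hashes({}, scan1)      # no snapshot yet
second_build = diff_hashes(scan1, scan2)  # one PDF page edited
```

On the first scan both files are reported as added; on the second scan only a.pdf is reported as modified and notes.md is left alone.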

SimpleDirectoryReader can return multiple documents for a single file (e.g., multi-page PDFs). The previous code skipped such files with `continue`, causing them to be absent from file_hashes. On first build with no existing snapshot, detect_changes returns all keys in file_hashes as 'new'; if file_hashes is empty (all files skipped), it returns no changes and prints 'Index up to date.' without building the index. Fix by combining the text of all documents for a given file before hashing, so multi-document files are correctly tracked in change detection.

octo-patch wants to merge 1 commit into yichuan-w:main from octo-patch:fix/issue-290-multi-doc-file-hashing
