
fix: hash multi-document files in change detection #299

Open
octo-patch wants to merge 1 commit into yichuan-w:main from octo-patch:fix/issue-290-multi-doc-file-hashing


@octo-patch

Fixes #290

Problem

FileSynchronizer.generate_file_hashes() silently skipped any file that SimpleDirectoryReader loads as more than one document (e.g., multi-page PDFs). The old code:

if len(file) > 1:
    continue  # SimpleDirectoryReader can load more than 1 documents for weird file types e.g. PDFs

When there is no existing snapshot (self.tree is None), detect_changes() reports every key in file_hashes as "new". If every file was skipped, file_hashes is empty, so detect_changes() returns ([], [], []), i.e. no changes detected. The CLI then prints "Index up to date." and exits without building the index.

This is the root cause of the first-build failure reported in #290: if the source directory contains only multi-document file types (e.g., PDFs), the index is never built unless --force is used.
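The failure path can be reproduced in isolation. The following is a minimal sketch, not the project's code: `Doc`, `old_generate_file_hashes`, and `detect_changes_first_build` are hypothetical stand-ins, `hash_data` is assumed to be a plain SHA-256 of the text, and the input is a hand-built mapping from file path to loaded documents rather than real SimpleDirectoryReader output.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document; only `text` matters here."""
    text: str


def hash_data(data: str) -> str:
    # Assumed stand-in for the project's hash_data helper.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def old_generate_file_hashes(loaded: dict[str, list[Doc]]) -> dict[str, str]:
    file_hashes = {}
    for file_path, docs in loaded.items():
        if len(docs) > 1:
            continue  # old behaviour: multi-document files are dropped
        file_hashes[file_path] = hash_data(docs[0].text)
    return file_hashes


def detect_changes_first_build(file_hashes: dict[str, str]):
    # With no snapshot, every hashed file counts as new.
    # The (new, modified, deleted) tuple order is an assumption.
    return list(file_hashes), [], []


# A directory that contains only a two-page PDF.
pdf_only = {"report.pdf": [Doc("page 1"), Doc("page 2")]}
hashes = old_generate_file_hashes(pdf_only)
new, modified, deleted = detect_changes_first_build(hashes)
```

Here `hashes` ends up empty, so `new`, `modified`, and `deleted` are all empty lists, which is the state that leads to "Index up to date." on a first build.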

Solution

Instead of skipping multi-document files, combine the text of all documents produced for a single file before hashing. This correctly tracks every file, regardless of how many documents SimpleDirectoryReader produces from it.

combined_text = "".join(doc.text for doc in file)
file_hash = hash_data(combined_text)
file_hashes[file_path] = file_hash

The combined hash changes whenever the concatenated text of the file's documents changes, so an edit to any page or section is detected as a modification.
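To illustrate the combined hash, here is a small self-contained sketch. It assumes `hash_data` behaves like a SHA-256 over the UTF-8 text (the project's helper may differ) and uses `SimpleNamespace` objects as hypothetical documents exposing a `.text` attribute; `combined_hash` is an illustrative name, not a function in the PR.

```python
import hashlib
from types import SimpleNamespace


def hash_data(data: str) -> str:
    # Assumed stand-in; the project's real hash_data may differ.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def combined_hash(docs) -> str:
    # Same approach as the fix: concatenate every document's text, then hash.
    return hash_data("".join(doc.text for doc in docs))


original = [SimpleNamespace(text="page 1"), SimpleNamespace(text="page 2")]
edited = [SimpleNamespace(text="page 1"), SimpleNamespace(text="page 2, revised")]
single = [SimpleNamespace(text="plain text file")]

hash_original = combined_hash(original)  # stable across repeated scans
hash_edited = combined_hash(edited)      # differs because page 2 changed
hash_single = combined_hash(single)      # equals hash_data of the lone text
```

For a single-document file the joined string is just that document's text, so the result matches hashing the text directly. One property of joining with an empty separator is worth keeping in mind: text that moves across a document boundary without otherwise changing produces the same concatenated string, and therefore the same hash.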

Testing

  • Directories containing only PDFs (or other multi-document file types) now produce non-empty file_hashes on the first scan, causing detect_changes to correctly report all files as new and trigger a full build.
  • Single-document files (.txt, .md, etc.) are unaffected — behaviour is identical to before.
  • Subsequent builds correctly detect modifications to multi-document files because the combined hash changes when any part of the file changes.
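The scenarios listed above can be walked through with a simplified model. This is only a sketch under stated assumptions: the real change detection compares against `self.tree`, whereas this version compares two plain dictionaries; `generate_file_hashes` and `detect_changes` here are free functions with hypothetical signatures, the `(new, modified, deleted)` order is assumed, and `hash_data` is assumed to be SHA-256.

```python
import hashlib


def hash_data(data: str) -> str:
    # Assumed stand-in for the project's hash_data helper.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def generate_file_hashes(loaded):
    # loaded maps file path -> list of document texts (one entry per document).
    return {path: hash_data("".join(texts)) for path, texts in loaded.items()}


def detect_changes(previous, current):
    # previous is None on the first build, mirroring `self.tree is None`.
    if previous is None:
        return sorted(current), [], []
    new = sorted(p for p in current if p not in previous)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    deleted = sorted(p for p in previous if p not in current)
    return new, modified, deleted


# First scan: a two-page PDF and a single-document text file.
first_scan = generate_file_hashes({"a.pdf": ["p1", "p2"], "notes.txt": ["hello"]})
first_result = detect_changes(None, first_scan)

# Second scan: only the second page of the PDF has changed.
second_scan = generate_file_hashes(
    {"a.pdf": ["p1", "p2 edited"], "notes.txt": ["hello"]}
)
second_result = detect_changes(first_scan, second_scan)
```

In this model the first scan reports both files as new, and the second scan reports only `a.pdf` as modified while `notes.txt` is left alone.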

SimpleDirectoryReader can return multiple documents for a single file
(e.g., multi-page PDFs). The previous code skipped such files with
`continue`, causing them to be absent from file_hashes. On first build
with no existing snapshot, detect_changes returns all keys in file_hashes
as 'new'; if file_hashes is empty (all files skipped), it returns no
changes and prints 'Index up to date.' without building the index.

Fix by combining the text of all documents for a given file before
hashing, so multi-document files are correctly tracked in change
detection.

Development

Successfully merging this pull request may close these issues.

The index is not being built without using --force
