fix: hash multi-document files in change detection #299
Open
octo-patch wants to merge 1 commit into yichuan-w:main from
SimpleDirectoryReader can return multiple documents for a single file (e.g., multi-page PDFs). The previous code skipped such files with `continue`, causing them to be absent from file_hashes. On first build with no existing snapshot, detect_changes returns all keys in file_hashes as 'new'; if file_hashes is empty (all files skipped), it returns no changes and prints 'Index up to date.' without building the index. Fix by combining the text of all documents for a given file before hashing, so multi-document files are correctly tracked in change detection.
Fixes #290
Problem
`FileSynchronizer.generate_file_hashes()` silently skipped any file that `SimpleDirectoryReader` returns as multiple documents (e.g., multi-page PDFs). The old code skipped such files with `continue`, so they never appeared in `file_hashes`.
Since `detect_changes()` returns all keys in `file_hashes` as "new" when there is no existing snapshot (`self.tree is None`), an empty `file_hashes` causes it to return `([], [], [])` (no changes detected), and the CLI prints "Index up to date." and exits without building the index.
This is the root cause of the first-build failure reported in #290: if the source directory contains only multi-document file types (e.g., PDFs), the index is never built unless `--force` is used.
Solution
Instead of skipping multi-document files, combine the text of all documents produced for a single file before hashing. This correctly tracks every file, regardless of how many documents `SimpleDirectoryReader` produces from it.
The combined hash changes whenever any page/section of the file changes, so modification detection remains accurate.
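
The combining step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Doc` class is a hypothetical stand-in for whatever the reader returns, and the use of SHA-256 and of the file path as the grouping key are assumptions.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: source path plus text."""
    file_path: str
    text: str


def generate_file_hashes(documents):
    """Map each file path to one digest computed over all of its documents."""
    texts_by_file = defaultdict(list)
    for doc in documents:
        # Group instead of skipping: a multi-page PDF contributes several entries.
        texts_by_file[doc.file_path].append(doc.text)

    file_hashes = {}
    for path, texts in texts_by_file.items():
        hasher = hashlib.sha256()  # assumed algorithm, chosen for illustration
        for text in texts:
            hasher.update(text.encode("utf-8"))
        file_hashes[path] = hasher.hexdigest()
    return file_hashes


docs = [
    Doc("report.pdf", "page one"),
    Doc("report.pdf", "page two"),
    Doc("notes.txt", "hello"),
]
hashes = generate_file_hashes(docs)
```

Here `hashes` has one entry for `report.pdf` and one for `notes.txt`, and editing either page of the PDF changes its digest. Feeding the texts to the hasher in order makes the result depend on document order, which is fine as long as the reader returns pages in a stable order.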
Testing
[…] `file_hashes` on the first scan, causing `detect_changes` to correctly report all files as new and trigger a full build.
[…] (`.txt`, `.md`, etc.) are unaffected; behaviour is identical to before.
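
The first-build scenario can be reproduced with a small sketch. The function below is a hypothetical simplification of the behaviour described in this PR, not the project's `detect_changes`: it takes the snapshot as a plain dict of path to hash, which is an assumption made for illustration.

```python
def detect_changes(file_hashes, snapshot=None):
    """Return (new, modified, deleted) paths for a scan against a snapshot."""
    if snapshot is None:
        # No snapshot yet: every hashed file is reported as new.
        return sorted(file_hashes), [], []
    new = sorted(p for p in file_hashes if p not in snapshot)
    modified = sorted(
        p for p in file_hashes if p in snapshot and snapshot[p] != file_hashes[p]
    )
    deleted = sorted(p for p in snapshot if p not in file_hashes)
    return new, modified, deleted


# Before the fix: every PDF was skipped, so the first scan saw no files.
before_fix = detect_changes({})

# After the fix: the PDF has a combined hash and is reported as new.
after_fix = detect_changes({"report.pdf": "abc123"})
```

With an empty `file_hashes`, `before_fix` is `([], [], [])`, which is the case that printed "Index up to date." With the PDF hashed, `after_fix` is `(["report.pdf"], [], [])`, so a full build is triggered.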