Fix missing pages in search indexes #212
Merged
Girgias merged 5 commits into php:master on Nov 4, 2025
…ue to duplicated ids)
…uires changes on web-php to use new indexes)
Girgias (Member) reviewed on Nov 3, 2025 and left a comment:
I am happy to just merge this as-is, because PhD is complicated, we don't really have tests, and I trust that you actually figured it out. Worst case, we just revert.
Fixes #211
I noticed there were significant pages missing from the web search.
After some investigation I found there were 2 circumstances where entries were missing.
In the first case, some entries were being excluded by the `!$index['chunk']` check. I found these could easily be re-included by skipping this check for specific `element` values. You can see the list of these by running the following query on the `output/index.sqlite` file generated by a phd run:

Additionally, entries were excluded where they all had the same docbook_id, because of the way the `indexes` array is indexed by this id. Because the existing search indexes rely on this id, it's not easy to resolve.

However, the current web search actually reworks these indexes to combine them anyway, and I'd created a pre-combined version of these indexes for #204 that does not rely on indexing by the docbook_id.
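To illustrate the two failure modes described above, here is a minimal, self-contained sketch. It is not PhD's actual code: the rows, the `name` field, the `keepRow()` helper, and the `$alwaysInclude` allow-list are all hypothetical, and only `chunk`, `element`, and `docbook_id` are names taken from this description. The sketch assumes that a later row overwrites an earlier one with the same id; which entry actually wins in PhD may differ.

```php
<?php
// Hypothetical index rows (values invented for illustration only).
$rows = [
    ['docbook_id' => 'function.example-a',   'element' => 'refentry', 'chunk' => 1, 'name' => 'example_a'],
    ['docbook_id' => 'class.example.method', 'element' => 'refentry', 'chunk' => 1, 'name' => 'Example::method'],
    ['docbook_id' => 'class.example.method', 'element' => 'refentry', 'chunk' => 1, 'name' => 'example_method'],
    ['docbook_id' => 'example.section',      'element' => 'section',  'chunk' => 0, 'name' => 'Example section'],
];

// Hypothetical allow-list of element values that bypass the chunk check.
$alwaysInclude = ['section'];

// Decide whether a row is kept: allow-listed elements always pass,
// everything else must be a chunk (the equivalent of !$index['chunk'] => skip).
function keepRow(array $index, array $alwaysInclude): bool
{
    if (in_array($index['element'], $alwaysInclude, true)) {
        return true;
    }
    return (bool) $index['chunk'];
}

$keyed = []; // keyed by docbook_id: duplicate ids overwrite each other
$list  = []; // plain list: every kept row survives

foreach ($rows as $index) {
    if (!keepRow($index, $alwaysInclude)) {
        continue;
    }
    $keyed[$index['docbook_id']] = $index;
    $list[] = $index;
}

echo count($keyed), " entries when keyed by docbook_id\n";
echo count($list), " entries as a plain list\n";
```

In this sketch the keyed array ends up with fewer entries than the plain list, because the two rows sharing one docbook_id collapse into a single slot. With an empty allow-list, the non-chunk `section` row would be dropped as well.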
I've copied these changes to this PR and then modified them to pull the index list from the database without the deduplication. These will require further changes to web-php's `js/search.js` and `js/search-index.php`. I've already made the changes to `search.js` for #204, and the change to `search-index.php` is just having it output the new -combined index file without any of the manipulation it currently does.

You can see the list of affected docbook_ids with the following query:
Generally these are cases where there are both procedural and OOP interfaces. I'd guess which one makes it into the indexes is determined by the order in which the `<refname>` values appear.

There are some other cases, such as stream wrappers (e.g. bzip2:// and zlib://).