
Fix missing pages in search indexes #212

Merged
Girgias merged 5 commits into php:master from AllenJB:search-missing-pages
Nov 4, 2025

Conversation

@AllenJB (Contributor) commented Nov 3, 2025

Fixes #211

I noticed there were significant pages missing from the web search.

After some investigation I found there were two circumstances in which entries were missing.

In the first case, some entries were being excluded by the !$index['chunk'] check. I found these could easily be re-included by skipping this check for specific element values. You can list the affected entries by running the following query on the output/index.sqlite file generated by a phd run:

SELECT *
FROM ids
WHERE chunk = 0
	AND element IN (
		'refentry',
		'stream_wrapper',
		'phpdoc:classref',
		'phpdoc:exceptionref',
		'phpdoc:varentry'
	)
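
As a rough illustration of the first case, the sketch below builds a tiny stand-in for the ids table in memory and runs the query above against it. The three-column schema and the sample rows are assumptions made for the example; the real output/index.sqlite has its own layout. The should_index helper is hypothetical and only mirrors the idea of skipping the chunk check for specific element values; it is not PhD's actual code.

```python
import sqlite3

# Element values for which the chunk check is skipped (taken from the query above).
ALWAYS_INDEXED = {
    "refentry",
    "stream_wrapper",
    "phpdoc:classref",
    "phpdoc:exceptionref",
    "phpdoc:varentry",
}


def should_index(chunk: int, element: str) -> bool:
    """Hypothetical helper: keep chunked entries plus the allow-listed elements."""
    return bool(chunk) or element in ALWAYS_INDEXED


conn = sqlite3.connect(":memory:")
# Assumed minimal schema, not the real table definition.
conn.execute("CREATE TABLE ids (docbook_id TEXT, element TEXT, chunk INTEGER)")
conn.executemany(
    "INSERT INTO ids VALUES (?, ?, ?)",
    [
        ("example.chunked", "refentry", 1),    # made-up ids
        ("example.unchunked", "refentry", 0),
        ("example.section", "section", 0),
    ],
)

rows = conn.execute(
    """
    SELECT docbook_id, element, chunk
    FROM ids
    WHERE chunk = 0
      AND element IN (
        'refentry', 'stream_wrapper', 'phpdoc:classref',
        'phpdoc:exceptionref', 'phpdoc:varentry'
      )
    """
).fetchall()

# Entries that a plain chunk check drops but the relaxed check keeps.
recovered = [row[0] for row in rows if should_index(row[2], row[1])]
print(recovered)
```

To try it against a real build, replace ":memory:" with the path to output/index.sqlite and drop the CREATE TABLE and INSERT statements.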

Additionally, entries were excluded where several of them shared the same docbook_id, because the indexes array is keyed by this id and so can hold only one entry per id. Because the existing search indexes rely on this id, it's not easy to resolve.
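
A minimal sketch of why keying by id loses entries, written in Python rather than PhD's PHP. The entry names and ids are made up, and in this sketch the last entry with a given id wins; which one survives in PhD itself is not established here.

```python
# Made-up entries: two pages that share one docbook_id.
entries = [
    {"docbook_id": "example.shared", "name": "Example::method"},
    {"docbook_id": "example.shared", "name": "example_function"},
    {"docbook_id": "example.unique", "name": "other_function"},
]

# Keyed by docbook_id: a later entry silently replaces an earlier one with the same id.
keyed = {}
for entry in entries:
    keyed[entry["docbook_id"]] = entry

# Plain list: every entry survives, duplicates included.
combined = list(entries)

print(len(keyed), len(combined))
```

A combined index that is a flat list, not a map keyed by docbook_id, avoids the collision, which is the direction the pre-combined indexes take.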

However, the current web search actually reworks these indexes to combine them anyway, and I'd created a pre-combined version of these indexes for #204 that does not rely on indexing by the docbook_id.

I've copied these changes to this PR and then modified them to pull the index list from the database without the deduplication. These will require further changes to php-web js/search.js and js/search-index.php. I've already made the changes to search.js for #204, and the change to search-index.php is just having it output the new -combined index file without any of the manipulation it currently does.

You can see the list of affected docbook_ids with the following query:

SELECT docbook_id
FROM ids
WHERE chunk = 1
	OR element IN (
		'refentry',
		'stream_wrapper',
		'phpdoc:classref',
		'phpdoc:exceptionref',
		'phpdoc:varentry'
	)
GROUP BY docbook_id
HAVING COUNT(*) > 1
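
For readers who want to see what the query above returns, here is a small sketch that runs it against an in-memory table. The schema and rows are assumptions for illustration only. Ids that repeat but match neither the chunk condition nor the element list are filtered out before grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema, not the real table definition.
conn.execute("CREATE TABLE ids (docbook_id TEXT, element TEXT, chunk INTEGER)")
conn.executemany(
    "INSERT INTO ids VALUES (?, ?, ?)",
    [
        ("example.shared", "refentry", 1),   # same id twice: reported
        ("example.shared", "refentry", 1),
        ("example.unique", "refentry", 1),   # appears once: not reported
        ("example.section", "section", 0),   # repeated, but excluded by WHERE
        ("example.section", "section", 0),
    ],
)

duplicated = [
    row[0]
    for row in conn.execute(
        """
        SELECT docbook_id
        FROM ids
        WHERE chunk = 1
          OR element IN (
            'refentry', 'stream_wrapper', 'phpdoc:classref',
            'phpdoc:exceptionref', 'phpdoc:varentry'
          )
        GROUP BY docbook_id
        HAVING COUNT(*) > 1
        """
    )
]
print(duplicated)
```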

Generally these are cases where there are both procedural and OOP interfaces. I'd guess which one makes it into the indexes is determined by the order in which the <refname> values appear.

There are some other cases, such as stream wrappers (e.g. bzip2:// and zlib://).

@Girgias (Member) left a comment

I am happy to just merge this as-is: PhD is complicated, we don't really have tests, and I trust that you actually figured it out. Worst case, we just revert.

@Girgias Girgias merged commit 264c65b into php:master Nov 4, 2025
7 checks passed
@AllenJB AllenJB deleted the search-missing-pages branch November 4, 2025 17:07
Development

Successfully merging this pull request may close these issues.

Missing search (index) entries

2 participants