[HTML search] Bug: 'indexentries' section missing results from search index following non-fresh project rebuild. #12599

Open
jayaddison opened this issue Jul 16, 2024 · 1 comment
Labels
html search, python, type:bug

Comments

@jayaddison
Contributor

Describe the bug

During some resource consumption profiling, I noticed that the baseline searchindex.js size I was comparing against dropped significantly (~30%) when the project was rebuilt. This occurs both on the development branch that I was using and on v7.4.4 mainline.

How to Reproduce

$ sphinx-build -b html doc _build_baseline  # built once
$ sphinx-build -b html doc _build_rebuilt
$ sphinx-build -b html doc _build_rebuilt   # built twice
$ find _build* -type f -name 'searchindex.js' -exec ls -sh {} +
476K _build_baseline/searchindex.js  352K _build_rebuilt/searchindex.js

Comparing the contents of the two searchindex.js files, the difference appears to be in the indexentries section: many entries are missing from the rebuilt copy. The first missing item is the key named --author.
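
For anyone who wants to repeat the comparison, here is a minimal sketch of one way to list the missing keys. It assumes that searchindex.js has the form Search.setIndex({...}) with a plain JSON object between the outermost braces; the helper names load_search_index and missing_index_entries are hypothetical and are not part of Sphinx.

```python
import json
from pathlib import Path


def load_search_index(path):
    """Parse a searchindex.js file into a dict.

    Assumption: the file looks like ``Search.setIndex({...})`` and the text
    between the first '{' and the last '}' is valid JSON.
    """
    text = Path(path).read_text(encoding="utf-8")
    start, end = text.index("{"), text.rindex("}") + 1
    return json.loads(text[start:end])


def missing_index_entries(baseline, rebuilt):
    """Return sorted 'indexentries' keys found in baseline but not in rebuilt."""
    before = set(baseline.get("indexentries", {}))
    after = set(rebuilt.get("indexentries", {}))
    return sorted(before - after)
```

Passing the parsed _build_baseline/searchindex.js and _build_rebuilt/searchindex.js to missing_index_entries should list the keys that disappeared after the second build.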

Environment Information

Platform:              linux; (Linux-6.9.8-arm64-aarch64-with-glibc2.38)
Python version:        3.12.4 (main, Jun 12 2024, 19:06:53) [GCC 13.2.0])
Python implementation: CPython
Sphinx version:        7.4.4
Docutils version:      0.21.2
Jinja2 version:        3.1.4
Pygments version:      2.18.0

Sphinx extensions

N/A

Additional context

Discovered during work on #12596.

@jayaddison
Contributor Author

I think that part of a fix could involve making sure that we load indexentries from the frozen index representation here:

def load(self, stream: IO, format: Any) -> None:
    """Reconstruct from frozen data."""
    if format == "jsdump":
        warnings.warn("format=jsdump is deprecated, use json instead",
                      RemovedInSphinx70Warning, stacklevel=2)
        format = self.formats["json"]
    elif isinstance(format, str):
        format = self.formats[format]
    frozen = format.load(stream)
    # if an old index is present, we treat it as not existing.
    if not isinstance(frozen, dict) or \
       frozen.get('envversion') != self.env.version:
        raise ValueError('old format')
    index2fn = frozen['docnames']
    self._filenames = dict(zip(index2fn, frozen['filenames']))
    self._titles = dict(zip(index2fn, frozen['titles']))
    self._all_titles = {}

    for title, doc_tuples in frozen['alltitles'].items():
        for doc, titleid in doc_tuples:
            self._all_titles.setdefault(index2fn[doc], []).append((title, titleid))

    def load_terms(mapping: Dict[str, Any]) -> Dict[str, Set[str]]:
        rv = {}
        for k, v in mapping.items():
            if isinstance(v, int):
                rv[k] = {index2fn[v]}
            else:
                rv[k] = {index2fn[i] for i in v}
        return rv

    self._mapping = load_terms(frozen['terms'])
    self._title_mapping = load_terms(frozen['titleterms'])
    # no need to load keywords/objtypes

...because during a non-fresh rebuild (especially a no-change rebuild) we don't re-scan all documents, so we are unlikely to rediscover all of the index directives/entries. The entries from documents that were not re-read are therefore missing from the searchindex.js output.
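
As a rough illustration of that idea, here is a standalone sketch that inverts a frozen indexentries mapping back into a per-document structure, in the same way the load method above handles alltitles. The data shapes are assumptions, not confirmed against the Sphinx source: the frozen form is taken to be {entry: [(doc_index, entry_id, is_main), ...]} and the in-memory form {docname: [(entry, entry_id, main), ...]}. The function name load_index_entries is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def load_index_entries(
    frozen_entries: Dict[str, Sequence[Sequence]],
    index2fn: Sequence[str],
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Rebuild a per-document index-entry mapping from frozen data.

    Assumed frozen shape:    {entry: [(doc_index, entry_id, is_main), ...]}
    Assumed in-memory shape: {docname: [(entry, entry_id, main), ...]}
    """
    restored: Dict[str, List[Tuple[str, str, str]]] = {}
    for entry, doc_tuples in frozen_entries.items():
        for doc, entry_id, is_main in doc_tuples:
            # Map the boolean flag back to a "main"/"" marker (assumed convention).
            main = "main" if is_main else ""
            restored.setdefault(index2fn[doc], []).append((entry, entry_id, main))
    return restored
```

If the frozen keys are normalised when the index is written (for example lowercased), this round trip would not recover the original entry text exactly, so a real fix may need to account for that.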

cc @AA-Turner - I think this bug relates to #10819.