[HTML search] Bug: 'indexentries' section missing results from search index following non-fresh project rebuild. #12599

jayaddison · 2024-07-16T18:19:21Z

Describe the bug

During some resource-consumption profiling, I noticed that the baseline searchindex.js size I was comparing against dropped noticeably (~30%) when a project is rebuilt. This occurs both on the development branch that I was using and on v7.4.4 mainline.

How to Reproduce

$ sphinx-build -b html doc _build_baseline  # built once
$ sphinx-build -b html doc _build_rebuilt
$ sphinx-build -b html doc _build_rebuilt   # built twice
$ find _build* -type f -name 'searchindex.js' -exec ls -sh {} +
476K _build_baseline/searchindex.js  352K _build_rebuilt/searchindex.js

Comparing the contents of the searchindex.js files, the difference appears to be that the indexentries section of the file contains different results; many entries are missing from the rebuilt copy. The first missing item is the key named --author.
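For anyone wanting to repeat the comparison, here is a minimal sketch of how the missing keys could be listed. It assumes that searchindex.js consists of a single JSON object wrapped in a `Search.setIndex(...)` call; the helper names `parse_searchindex` and `missing_index_entries` are made up for illustration and are not part of Sphinx.

```python
import json

# Assumed wrapper around the JSON payload in searchindex.js.
PREFIX = "Search.setIndex("
SUFFIX = ")"


def parse_searchindex(text: str) -> dict:
    """Strip the JavaScript wrapper and parse the JSON payload."""
    text = text.strip().rstrip(";")  # tolerate a trailing semicolon
    if not (text.startswith(PREFIX) and text.endswith(SUFFIX)):
        raise ValueError("unexpected searchindex.js wrapper")
    return json.loads(text[len(PREFIX):-len(SUFFIX)])


def missing_index_entries(baseline_text: str, rebuilt_text: str) -> list[str]:
    """Return indexentries keys present in the baseline but not in the rebuild."""
    baseline = parse_searchindex(baseline_text).get("indexentries", {})
    rebuilt = parse_searchindex(rebuilt_text).get("indexentries", {})
    return sorted(set(baseline) - set(rebuilt))
```

With the build directories from the reproduction steps, this could be called as `missing_index_entries(open("_build_baseline/searchindex.js").read(), open("_build_rebuilt/searchindex.js").read())`.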

Environment Information

Platform:              linux; (Linux-6.9.8-arm64-aarch64-with-glibc2.38)
Python version:        3.12.4 (main, Jun 12 2024, 19:06:53) [GCC 13.2.0])
Python implementation: CPython
Sphinx version:        7.4.4
Docutils version:      0.21.2
Jinja2 version:        3.1.4
Pygments version:      2.18.0

Sphinx extensions

N/A

Additional context

Discovered during work on #12596.


jayaddison · 2024-07-17T11:11:30Z

I think that part of a fix could involve making sure that we load indexentries from the frozen index representation here:

sphinx/sphinx/search/__init__.py

Lines 275 to 308 in 8ae8183

    def load(self, stream: IO, format: Any) -> None:
        """Reconstruct from frozen data."""
        if format == "jsdump":
            warnings.warn("format=jsdump is deprecated, use json instead",
                          RemovedInSphinx70Warning, stacklevel=2)
            format = self.formats["json"]
        elif isinstance(format, str):
            format = self.formats[format]
        frozen = format.load(stream)
        # if an old index is present, we treat it as not existing.
        if not isinstance(frozen, dict) or \
           frozen.get('envversion') != self.env.version:
            raise ValueError('old format')
        index2fn = frozen['docnames']
        self._filenames = dict(zip(index2fn, frozen['filenames']))
        self._titles = dict(zip(index2fn, frozen['titles']))
        self._all_titles = {}

        for title, doc_tuples in frozen['alltitles'].items():
            for doc, titleid in doc_tuples:
                self._all_titles.setdefault(index2fn[doc], []).append((title, titleid))

        def load_terms(mapping: Dict[str, Any]) -> Dict[str, Set[str]]:
            rv = {}
            for k, v in mapping.items():
                if isinstance(v, int):
                    rv[k] = {index2fn[v]}
                else:
                    rv[k] = {index2fn[i] for i in v}
            return rv

        self._mapping = load_terms(frozen['terms'])
        self._title_mapping = load_terms(frozen['titleterms'])
        # no need to load keywords/objtypes

...because during a non-fresh rebuild -- especially a no-change rebuild -- we don't re-scan all documents, so we are unlikely to re-discover all of the required index directives/entries (and therefore they are missing from the searchindex.js output).
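As a rough illustration of the idea, the sketch below restores per-document index entries from a frozen index, mirroring the `alltitles` loop in `load` above. It is not the actual patch: the standalone function `restore_index_entries` is hypothetical, and the shape of `frozen['indexentries']` (entry text mapped to a list of `(doc index, entry id, main flag)` tuples) is an assumption.

```python
from typing import Any


def restore_index_entries(
    frozen: dict[str, Any], index2fn: list[str]
) -> dict[str, list[tuple[str, str, bool]]]:
    """Rebuild a docname -> index entries mapping from frozen data.

    Assumes frozen['indexentries'] maps each entry's text to a list of
    (document index, entry id, main-entry flag) tuples.
    """
    restored: dict[str, list[tuple[str, str, bool]]] = {}
    for entry, doc_tuples in frozen.get("indexentries", {}).items():
        for doc, entry_id, is_main in doc_tuples:
            # Translate the numeric document index back to its docname.
            restored.setdefault(index2fn[doc], []).append((entry, entry_id, is_main))
    return restored
```

Inside `load`, the result would presumably be assigned to whichever attribute the builder uses to collect index entries, so that documents skipped during an incremental build keep their previously discovered entries.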

cc @AA-Turner - I think this bug relates to #10819.

jayaddison added the labels type:bug, html search, and python (Pull requests that update Python code) on Jul 16, 2024.

jayaddison mentioned this issue on Jul 16, 2024, in "HTML search: Introduce ngram-based partial-match searching" #12596 (closed, 3 tasks).

jayaddison mentioned this issue on Jul 17, 2024, in "[HTML search] Optimization: write zero/one instead (one character) instead of true/false (four/five characters) for bool flags." #12605 (closed).
