
HTML search: Introduce ngram-based partial-match searching #12596

Conversation

@jayaddison jayaddison commented Jul 16, 2024

Feature or Bugfix

  • Feature

Purpose

  • Replace the brute-force partial-match search algorithm that iterates over all indexed terms with a directed search based on an n-gram index.

Detail

  • 🐍 Python: after the terms index is built, derive a further index of the n-grams contained within those terms, then compress and minify it into a trie data structure for handover to JavaScript (a sketch of the idea follows this list).
  • 🕸️ JavaScript: use the n-gram index to look up relevant terms during partial matching, instead of scanning all terms.
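
For illustration, here is a minimal Python sketch of the approach -- not the PR's actual implementation, and the names (`ngrams`, `build_ngram_trie`) are hypothetical:

```python
# Minimal sketch: derive the distinct n-grams of each indexed term, then
# fold them into a nested-dict trie whose leaves list the matching terms.
# (Illustrative only; the PR's real encoding and names differ.)

def ngrams(term: str, n: int = 3) -> set[str]:
    """Return the distinct n-grams contained in a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def build_ngram_trie(terms: list[str], n: int = 3) -> dict:
    """Index each n-gram as a character path through a trie."""
    trie: dict = {}
    for term in terms:
        for gram in ngrams(term, n):
            node = trie
            for ch in gram:
                node = node.setdefault(ch, {})
            node.setdefault('', []).append(term)  # '' key holds the postings list
    return trie

trie = build_ngram_trie(['search', 'searching', 'research'])
print(trie['e']['a']['r'][''])  # ['search', 'searching', 'research']
```

At query time, each n-gram of the query walks the trie in a handful of steps rather than scanning every indexed term.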

Todo / open questions

  • Implement one further (and potentially significant) space-saving optimization: refer to terms from trie nodes by integer identifier(s) rather than storing their entire string value (sketched after this list).
  • Check the impact on index size - the current draft at ea5b2da introduces a 3x bloat on the test JS indexes, but this may not be representative (better or worse) of larger documentation projects.
  • Benchmark the performance results to confirm whether this noticeably improves client-side search efficiency for partial-match queries.
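
A sketch of the integer-identifier idea from the first Todo item above; the table/ID scheme here is hypothetical:

```python
# Store each term string once in a shared table; trie leaves then hold
# small integers instead of repeating the full strings.
# (Hypothetical encoding; the draft commits may differ.)

terms = ['search', 'searching', 'research']           # shared string table
term_ids = {term: i for i, term in enumerate(terms)}  # string -> int

# A leaf that previously held ['search', 'searching'] now holds [0, 1];
# the client resolves IDs back through the shared table for display.
leaf = [term_ids['search'], term_ids['searching']]
assert [terms[i] for i in leaf] == ['search', 'searching']
```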

Relates

Edit: add JavaScript implementation details to this description.

@jayaddison jayaddison added the type:proposal, html search, javascript, type:performance, priority:low, and python labels Jul 16, 2024
jayaddison commented Jul 16, 2024

At d15a3ae (the current latest work-in-progress commit), the index size bloat from these changes remains significant; the self-built Sphinx documentation index grows from under 500K to almost 800K (a ~67% increase relative to the baseline).

$ find _build* -type f -name 'searchindex.js' -exec ls -sh {} +
476K _build_baseline/searchindex.js  796K _build/searchindex.js

Although I discovered a bug (#12599) related to missing entries in searchindex.js after non-fresh project rebuilds, the figures above were based on fresh self-builds of Sphinx and are not affected by it (nitpick: or both are equally affected by it). In other words, I believe those statistics remain valid and accurate for comparison purposes.

Edit: add clarification/nitpick.

@jayaddison

One more idea: we have some redundancy between the ngram index and the existing terms index -- in particular, prefix matching could use the terms index directly. So, with some client-side work, we could omit word-start edge ngrams (see the sketch below).
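
A hedged sketch of that idea -- assuming the client can answer prefix queries from the existing terms index, word-start n-grams become redundant:

```python
# Sketch: emit only the "interior" n-grams of each term, on the assumption
# that prefix matches are served by the existing terms index instead.
# (Illustrative; not code from this branch.)

def interior_ngrams(term: str, n: int = 3) -> set[str]:
    """All n-grams except the one anchored at the start of the term."""
    return {term[i:i + n] for i in range(1, len(term) - n + 1)}

# 'search' normally yields {'sea', 'ear', 'arc', 'rch'}; omitting the
# word-start gram 'sea' still serves infix queries, while a query that is
# a prefix ('sea...') falls back to a lookup in the terms index.
print(interior_ngrams('search'))  # {'ear', 'arc', 'rch'}
```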

jayaddison commented Jul 16, 2024

Stop press -- the quote below needs some amendment:

This implementation was enjoyable, but I think I have to admit defeat: the self-built Sphinx documentation creates a searchindex.js file that is approximately double the uncompressed size of the baseline/current v7.4.4 equivalent -- and also the performance appears to be significantly slower for partial-match querying:

It turns out that much of this was due to making far too many calls to Object.keys (ref: 1271159). With a fix for that in place, here's take two:

Before (v7.4.4 baseline)
(screenshot: performance trace)
Traced running time: 45.4ms

After (this branch at commit 1271159)
(screenshot: performance trace)
Traced running time: 20.0ms

There is definitely some variance in trace times on my machine, so I wouldn't claim a definite performance win yet. However, we do at least appear to be back at parity with mainline.
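
For context, the class of problem fixed in 1271159 looks roughly like this -- a Python analogue only; the actual fix is in the JavaScript search code:

```python
# Anti-pattern analogous to the excessive Object.keys calls: materializing
# a key list inside a hot loop, versus testing membership on the dict itself.

index = {f'term{i}': i for i in range(10_000)}
queries = ['term1', 'missing'] * 1_000

# Slow: rebuilds a 10,000-element list on every iteration.
hits = sum(1 for q in queries if q in list(index.keys()))

# Fast: hashed membership test, no per-iteration allocation.
hits = sum(1 for q in queries if q in index)
```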

Edit: fix before-image display.

@jayaddison jayaddison changed the title Draft: [HTML search] Introduce ngram-based partial-match searching. [HTML search] Introduce ngram-based partial-match searching. Jul 16, 2024
@jayaddison

Arguable benefits of ngram-based search:

  • ➕ I think this would provide more consistent client-side search performance on large documentation sets.
  • ➕ Having ngrams available in the client could support some autosuggest-related features.

Drawbacks:

  • ➖ This feature does increase searchindex.js file size, seemingly significantly (based on the Sphinx docs self-build).
  • ➖ It's additional work at build time - I don't have a way to measure this overhead locally, I'm afraid, but I would be curious to find out.

Any ideas for how to test/optimize/debate this further are welcome!

@jayaddison jayaddison marked this pull request as ready for review July 16, 2024 22:03
@jayaddison jayaddison requested a review from wlach July 17, 2024 16:06
Review thread on sphinx/search/__init__.py (outdated; resolved)

@jayaddison
Copy link
Contributor Author

jayaddison commented Jul 18, 2024

No further changes planned on this branch, pausing for now pending review/feedback.

(Sorry - I spotted an edge case that I felt would be worth handling; pausing again now.)

Edit: add hyperlink reference to the edge case + test coverage commit.

jayaddison commented Jul 19, 2024

Todo: consider edge cases related to document terms and/or query terms that contain repeated characters. For example, `aaaaaaaab` produces only two distinct trigrams (demonstrated below). What does that imply for indexing and query behaviour, and are there any undesirable side-effects?
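
A quick demonstration of the collapse:

```python
# A long run of repeated characters collapses to very few distinct trigrams,
# so such terms contribute little selectivity to the index.

def ngrams(term: str, n: int = 3) -> set[str]:
    return {term[i:i + n] for i in range(len(term) - n + 1)}

print(ngrams('aaaaaaaab'))  # {'aaa', 'aab'} -- only two distinct trigrams
```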

Edit: clarify that two distinct trigrams are emitted; ngrams can be of arbitrary length.

@jayaddison jayaddison marked this pull request as draft July 19, 2024 16:20
Commits:

  • … set-comparison operations are complete. (This means that set-comparison operations occur using integer values instead of string values.)
  • … a JavaScript `Set` during collection and filtering of candidate terms.

(The filtering step these commits describe is sketched below.)
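
Roughly, that filtering step looks like this -- a Python stand-in for the JavaScript `Set` logic, with hypothetical names and data:

```python
# Candidate terms must contain every n-gram of the query; with integer term
# IDs, the repeated set intersections compare ints rather than strings.

ngram_to_term_ids = {          # hypothetical slice of the index
    'sea': {0, 1},             # 0='search', 1='searching'
    'ear': {0, 1, 2},          # 2='research'
    'rch': {0, 1, 2},
}

def candidate_term_ids(query_grams: list[str]) -> set[int]:
    sets = [ngram_to_term_ids.get(g, set()) for g in query_grams]
    return set.intersection(*sets) if sets else set()

print(candidate_term_ids(['sea', 'ear']))  # {0, 1}
```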
@jayaddison

Ok; I've applied a few more tweaks, and I think that the performance results are fairly positive on another sample query, `phin`, using the self-built Sphinx docs at commit 6d885d6:

(screenshot: performance trace)
Traced running time: 16.0ms

I've confirmed that the result count is the same too. Document scores should be unaffected, although the ordering of results that share the same score may differ from the mainline/baseline algorithm.

@jayaddison jayaddison marked this pull request as ready for review July 19, 2024 17:28
wlach commented Jul 20, 2024

Hey @jayaddison, I'll try to find some time to take a look at this next week.

FWIW, my understanding is that <100ms is at the edge of human-perceptible latency, so I wouldn't be too concerned about micro-optimizations here -- though they don't hurt, so long as the code remains understandable/readable.

https://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay


@wlach wlach left a comment

Hey, I had a look through, and though this looks cool, I'm just not sure it's worth the extra space and, more importantly, the code complexity as things stand. I'm not seeing much sign that the existing brute-force search approach is actually causing problems for anyone. It would be a different matter if it took several seconds and was blocking the main thread, but as mentioned, it seems like a search completes well below the threshold of human perception.

I'm still not sure it'd be worth it, but one alternative to building this index Python-side might be to construct it on initialization of the search code. I bet it'd still be quite fast, and it should reduce concerns about index bloat.

@jayaddison jayaddison changed the title [HTML search] Introduce ngram-based partial-match searching. [HTML search] Introduce ngram-based partial-match searching Jul 23, 2024
jayaddison commented Jul 24, 2024

Hey, I had a look through, and though this looks cool, I'm just not sure it's worth the extra space and, more importantly, the code complexity as things stand.

That's valid criticism - I'd add data structure complexity to that too; most of the other data in searchindex.js is somewhat human-readable, but the termsngrams data isn't, and providing tooling to make it so would add yet another layer of complexity.

I'm not seeing much sign that the existing brute force search approach is actually causing problems for anyone. It would be a different matter if it took several seconds and was blocking the main thread, but as mentioned it seems like a search is happening well below the threshold of human perception.

This is reasonable too. I've tried disabling html_show_search_summary and limiting the number of displayed results to 10 locally, to determine whether the runtime performance difference became perceptible -- it didn't on the Sphinx documentation itself, but I'll admit that I haven't performed the same evaluation on a very large documentation set like Python's (yet?).

I'm still not sure if it'd be worth it, but one alternative to building this index python-side might be to construct it upon initialization of the search code. I bet it'd still be quite fast and should reduce concerns about index bloat.

That's a very interesting idea, and if time allows I'll evaluate it too. My intuition is that the cost of the ngrams has to go somewhere, and that paying it in bandwidth consumption (the largest overall increase in a many-clients situation like this one) is better than paying it in client compute: we can lean heavily on caching of static data, and the incremental transfer/decompression times would, I think, be lower than the client compute duration -- and would also scale up at a lower rate as the documentation set size increases. That's not clear without benchmarking, though.

A couple of non-directly relevant thoughts:

  • I think ngrams could unlock efficient typo-tolerant searches (but again, to your point about needs: I haven't noticed people asking for that).
  • I'm worried about the HTTP traffic from exact-match phrase searches in pessimal cases for #12552 (New exact phrase searching feature (for HTML)) -- in particular, the search experience if I send you a hyperlink for a Sphinx phrase query where the terms individually appear in many documents, but never together. That would cause a large number of requests but show no results. Again, no one is necessarily asking for this, but I'd considered it, and with a small extension, ngrams can rule out phrases that will never match anywhere in the collection.

I think I'm tending towards closing this, but may do a bit more exploration. Thanks for the feedback!

Edit: fixup for link to v2 of exact-match phrase query support pull request.

@jayaddison

I'm still not sure if it'd be worth it, but one alternative to building this index python-side might be to construct it upon initialization of the search code. I bet it'd still be quite fast and should reduce concerns about index bloat.

There is one potentially significant limitation to building the ngrams client-side: they can only be derived from information available to the client at searchtools.js initialization time. In contrast, ngram construction during the HTML builder phase can include information derived from anywhere in the project sources.

That's not a problem for this feature/pull-request in isolation, but it wouldn't be compatible with the elimination of non-existent phrase queries; that follow-up requires information about which words are adjacent to each other in documents -- information that is not currently available to the searchtools.js code (a sketch of what it might need follows below).
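
As a sketch of what that follow-up would need -- hypothetical, since no such structure exists in searchtools.js today -- build-time adjacency data could look like:

```python
# Hypothetical build-time structure: the set of adjacent word pairs across
# the whole collection, letting the client reject never-occurring phrases
# before issuing any per-document HTTP requests.

documents = {
    'doc1': ['partial', 'match', 'search'],
    'doc2': ['exact', 'phrase', 'search'],
}

adjacent_pairs = {
    (words[i], words[i + 1])
    for words in documents.values()
    for i in range(len(words) - 1)
}

def phrase_may_exist(phrase: list[str]) -> bool:
    """False means the phrase provably occurs nowhere in the collection."""
    return all(pair in adjacent_pairs for pair in zip(phrase, phrase[1:]))

print(phrase_may_exist(['partial', 'match']))   # True
print(phrase_may_exist(['partial', 'phrase']))  # False -> skip the lookups
```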

@jayaddison

Another thing to consider evaluating could be replacement of the terms index by an ngram-based index. With a few extensions to this code, I think that would be possible, and in a way that would support elimination of known-absent phrases during phrase queries on a per-document basis (compared to the approach I've been working on so far, where only phrases that are absent collection-wide can be filtered out at search time). A speculative sketch follows below.

Doing so would increase the size of the ngram index, but perhaps complete removal of the terms index would make the overall resulting index size difference more manageable.
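
A speculative sketch of the per-document variant (not part of this PR; names and data are illustrative):

```python
# If each n-gram maps to the documents containing it, a phrase query can be
# narrowed to documents where all of its n-grams co-occur, before any
# per-document fetching or scoring.

ngram_to_docs = {
    'sea': {'doc1', 'doc3'},
    'ear': {'doc1', 'doc2', 'doc3'},
    'rch': {'doc1', 'doc3'},
}

def docs_possibly_containing(query_grams: list[str]) -> set[str]:
    sets = [ngram_to_docs.get(g, set()) for g in query_grams]
    return set.intersection(*sets) if sets else set()

print(sorted(docs_possibly_containing(['sea', 'ear', 'rch'])))  # ['doc1', 'doc3']
```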

@jayaddison jayaddison changed the title [HTML search] Introduce ngram-based partial-match searching HTML search: Introduce ngram-based partial-match searching Aug 5, 2024
@jayaddison jayaddison closed this Aug 11, 2024
@jayaddison jayaddison deleted the issue-12045/partial-search-ngrams branch August 11, 2024 14:28
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 11, 2024
Successfully merging this pull request may close these issues.

[HTML search] optimization: don't loop over all document terms and title terms during partial-matching.