
HTML search: Introduce ngram-based partial-match searching #12596

Conversation

@jayaddison jayaddison commented Jul 16, 2024

Feature or Bugfix

  • Feature

Purpose

  • Replace the brute-force partial-match search algorithm that iterates over all indexed terms with a directed search based on an n-gram index.

Detail

  • 🐍 Python: after the terms index is built, derive a further index of the n-grams contained within those terms, then compress and minify it into a trie data structure for handover to JavaScript (a sketch of the idea follows this list).
  • 🕸️ JavaScript: use the n-gram index to look up relevant terms during partial matching, instead of scanning all terms.
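
For illustration, here is a minimal Python sketch of the approach -- not the PR's actual implementation, and the names (`ngrams`, `build_ngram_trie`) are hypothetical:

```python
# Minimal sketch: derive the distinct n-grams of each indexed term, then
# fold them into a nested-dict trie whose leaves list the matching terms.
# (Illustrative only; the PR's real encoding and names differ.)

def ngrams(term: str, n: int = 3) -> set[str]:
    """Return the distinct n-grams contained in a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def build_ngram_trie(terms: list[str], n: int = 3) -> dict:
    """Index each n-gram as a character path through a trie."""
    trie: dict = {}
    for term in terms:
        for gram in ngrams(term, n):
            node = trie
            for ch in gram:
                node = node.setdefault(ch, {})
            node.setdefault('', []).append(term)  # '' key holds the postings list
    return trie

trie = build_ngram_trie(['search', 'searching', 'research'])
print(trie['e']['a']['r'][''])  # ['search', 'searching', 'research']
```

At query time, each n-gram of the query walks the trie in a handful of steps rather than scanning every indexed term.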

Todo / open questions

  • Implement one further (and potentially significant) space-saving optimization: refer to terms from trie nodes by integer identifier(s) rather than storing their entire string value (sketched after this list).
  • Check the impact on index size - the current draft at ea5b2da introduces a 3x bloat on the test JS indexes, but this may not be representative (better or worse) of larger documentation projects.
  • Benchmark the performance results to confirm whether this noticeably improves client-side search efficiency for partial-match queries.
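
A sketch of the integer-identifier idea from the first Todo item above; the table/ID scheme here is hypothetical:

```python
# Store each term string once in a shared table; trie leaves then hold
# small integers instead of repeating the full strings.
# (Hypothetical encoding; the draft commits may differ.)

terms = ['search', 'searching', 'research']           # shared string table
term_ids = {term: i for i, term in enumerate(terms)}  # string -> int

# A leaf that previously held ['search', 'searching'] now holds [0, 1];
# the client resolves IDs back through the shared table for display.
leaf = [term_ids['search'], term_ids['searching']]
assert [terms[i] for i in leaf] == ['search', 'searching']
```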

Relates

Edit: add JavaScript implementation details to this description.

@jayaddison jayaddison added the type:proposal, html search, javascript, type:performance, priority:low, and python labels Jul 16, 2024
jayaddison commented Jul 16, 2024

At d15a3ae (the current latest work-in-progress commit), the index size bloat from these changes remains significant; the self-built Sphinx documentation index grows from under 500K to almost 800K (a ~67% increase relative to the baseline).

$ find _build* -type f -name 'searchindex.js' -exec ls -sh {} +
476K _build_baseline/searchindex.js  796K _build/searchindex.js

Although I discovered a bug (#12599) related to missing entries in searchindex.js after non-fresh project rebuilds, the figures above were based on fresh self-builds of Sphinx and are not affected by it (nitpick: or both are equally affected by it). In other words, I believe those statistics remain valid and accurate for comparison purposes.

Edit: add clarification/nitpick.

@jayaddison

One more idea: we have some redundancy between the ngram index and the existing terms index -- in particular, prefix matching could use the terms index directly. So, with some client-side work, we could omit word-start edge ngrams (see the sketch below).
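
A hedged sketch of that idea -- assuming the client can answer prefix queries from the existing terms index, word-start n-grams become redundant:

```python
# Sketch: emit only the "interior" n-grams of each term, on the assumption
# that prefix matches are served by the existing terms index instead.
# (Illustrative; not code from this branch.)

def interior_ngrams(term: str, n: int = 3) -> set[str]:
    """All n-grams except the one anchored at the start of the term."""
    return {term[i:i + n] for i in range(1, len(term) - n + 1)}

# 'search' normally yields {'sea', 'ear', 'arc', 'rch'}; omitting the
# word-start gram 'sea' still serves infix queries, while a query that is
# a prefix ('sea...') falls back to a lookup in the terms index.
print(interior_ngrams('search'))  # {'ear', 'arc', 'rch'}
```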

jayaddison commented Jul 16, 2024

Stop press -- the quote below needs some amendment:

This implementation was enjoyable, but I think I have to admit defeat: the self-built Sphinx documentation creates a searchindex.js file that is approximately double the uncompressed size of the baseline/current v7.4.4 equivalent -- and also the performance appears to be significantly slower for partial-match querying:

It turns out that much of this was due to making far too many calls to Object.keys (ref: 1271159). With a fix for that in place, here's take two:

Before (v7.4.4 baseline)
(screenshot: performance trace)
Traced running time: 45.4ms

After (this branch at commit 1271159)
(screenshot: performance trace)
Traced running time: 20.0ms

There is definitely some variance in trace times on my machine, so I wouldn't claim a definite performance win yet. However, we do at least appear to be back at parity with mainline.
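
For context, the class of problem fixed in 1271159 looks roughly like this -- a Python analogue only; the actual fix is in the JavaScript search code:

```python
# Anti-pattern analogous to the excessive Object.keys calls: materializing
# a key list inside a hot loop, versus testing membership on the dict itself.

index = {f'term{i}': i for i in range(10_000)}
queries = ['term1', 'missing'] * 1_000

# Slow: rebuilds a 10,000-element list on every iteration.
hits = sum(1 for q in queries if q in list(index.keys()))

# Fast: hashed membership test, no per-iteration allocation.
hits = sum(1 for q in queries if q in index)
```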

Edit: fix before-image display.

@jayaddison jayaddison changed the title Draft: [HTML search] Introduce ngram-based partial-match searching. [HTML search] Introduce ngram-based partial-match searching. Jul 16, 2024
@jayaddison

Arguable benefits of ngram-based search:

  • ➕ I think this would provide more consistent client-side search performance on large documentation sets.
  • ➕ Having ngrams available in the client could support some autosuggest-related features.

Drawbacks:

  • ➖ This feature does increase searchindex.js file size, seemingly significantly (based on the Sphinx docs self-build).
  • ➖ It's additional work at build time - I don't have a way to measure this overhead locally, I'm afraid, but I would be curious to find out.

Any ideas for how to test/optimize/debate this further are welcome!

@jayaddison jayaddison marked this pull request as ready for review July 16, 2024 22:03
@jayaddison jayaddison requested a review from wlach July 17, 2024 16:06
Review thread on sphinx/search/__init__.py (outdated; resolved)

@jayaddison
Copy link
Contributor Author

jayaddison commented Jul 18, 2024

No further changes planned on this branch, pausing for now pending review/feedback.

(Sorry - I spotted an edge case that I felt would be worth handling; pausing again now.)

Edit: add hyperlink reference to the edge case + test coverage commit.

jayaddison commented Jul 19, 2024

Todo: consider edge cases related to document terms and/or query terms that contain repeated characters. For example, `aaaaaaaab` produces only two distinct trigrams (demonstrated below). What does that imply for indexing and query behaviour, and are there any undesirable side-effects?
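
A quick demonstration of the collapse:

```python
# A long run of repeated characters collapses to very few distinct trigrams,
# so such terms contribute little selectivity to the index.

def ngrams(term: str, n: int = 3) -> set[str]:
    return {term[i:i + n] for i in range(len(term) - n + 1)}

print(ngrams('aaaaaaaab'))  # {'aaa', 'aab'} -- only two distinct trigrams
```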

Edit: clarify that two distinct trigrams are emitted; ngrams can be of arbitrary length.

@jayaddison jayaddison marked this pull request as draft July 19, 2024 16:20
Commits:

  • … set-comparison operations are complete. (This means that set-comparison operations occur using integer values instead of string values.)
  • … a JavaScript `Set` during collection and filtering of candidate terms.

(The filtering step these commits describe is sketched below.)
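
Roughly, that filtering step looks like this -- a Python stand-in for the JavaScript `Set` logic, with hypothetical names and data:

```python
# Candidate terms must contain every n-gram of the query; with integer term
# IDs, the repeated set intersections compare ints rather than strings.

ngram_to_term_ids = {          # hypothetical slice of the index
    'sea': {0, 1},             # 0='search', 1='searching'
    'ear': {0, 1, 2},          # 2='research'
    'rch': {0, 1, 2},
}

def candidate_term_ids(query_grams: list[str]) -> set[int]:
    sets = [ngram_to_term_ids.get(g, set()) for g in query_grams]
    return set.intersection(*sets) if sets else set()

print(candidate_term_ids(['sea', 'ear']))  # {0, 1}
```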
@jayaddison

Ok; I've applied a few more tweaks, and I think that the performance results are fairly positive on another sample query, `phin`, using the self-built Sphinx docs at commit 6d885d6:

(screenshot: performance trace)
Traced running time: 16.0ms

I've confirmed that the result count is the same too. Document scores should be unaffected, although the ordering of results that share the same score may differ from the mainline/baseline algorithm.

@jayaddison jayaddison marked this pull request as ready for review July 19, 2024 17:28
wlach commented Jul 20, 2024

Hey @jayaddison, I'll try to find some time to take a look at this next week.

FWIW, my understanding is that <100ms is at the edge of human-perceptible latency, so I wouldn't be too concerned about micro-optimizations here -- though they don't hurt, so long as the code remains understandable/readable.

https://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay


@wlach wlach left a comment

Hey, I had a look through, and though this looks cool, I'm just not sure it's worth the extra space and, more importantly, the code complexity as things stand. I'm not seeing much sign that the existing brute-force search approach is actually causing problems for anyone. It would be a different matter if it took several seconds and was blocking the main thread, but as mentioned, it seems like a search completes well below the threshold of human perception.

I'm still not sure it'd be worth it, but one alternative to building this index Python-side might be to construct it on initialization of the search code. I bet it'd still be quite fast, and it should reduce concerns about index bloat.

@jayaddison jayaddison changed the title [HTML search] Introduce ngram-based partial-match searching. [HTML search] Introduce ngram-based partial-match searching Jul 23, 2024
jayaddison commented Jul 24, 2024

Hey, I had a look through, and though this looks cool, I'm just not sure it's worth the extra space and, more importantly, the code complexity as things stand.

That's valid criticism - I'd add data structure complexity to that too; most of the other data in searchindex.js is somewhat human-readable, but the termsngrams data isn't, and providing tooling to make it so would add yet another layer of complexity.

I'm not seeing much sign that the existing brute force search approach is actually causing problems for anyone. It would be a different matter if it took several seconds and was blocking the main thread, but as mentioned it seems like a search is happening well below the threshold of human perception.

This is reasonable too. I've tried disabling html_show_search_summary and limiting the number of displayed results to 10 locally, to determine whether the runtime performance difference became perceptible -- it didn't on the Sphinx documentation itself, but I'll admit that I haven't performed the same evaluation on a very large documentation set like Python's (yet?).

I'm still not sure if it'd be worth it, but one alternative to building this index python-side might be to construct it upon initialization of the search code. I bet it'd still be quite fast and should reduce concerns about index bloat.

That's a very interesting idea, and if time allows I'll evaluate it too. My intuition is that the cost of the ngrams has to go somewhere, and that paying it in bandwidth consumption (the largest overall increase in a many-clients situation like this one) is better than paying it in client compute: we can lean heavily on caching of static data, and the incremental transfer/decompression times would, I think, be lower than the client compute duration -- and would also scale up at a lower rate as the documentation set size increases. That's not clear without benchmarking, though.

A couple of non-directly relevant thoughts:

  • I think ngrams could unlock efficient typo-tolerant searches (but again, to your point about needs: I haven't noticed people asking for that).
  • I'm worried about the HTTP traffic from exact-match phrase searches in pessimal cases for #12552 (New exact phrase searching feature (for HTML)) -- in particular, the search experience if I send you a hyperlink for a Sphinx phrase query where the terms individually appear in many documents, but never together. That would cause a large number of requests but show no results. Again, no one is necessarily asking for this, but I'd considered it, and with a small extension, ngrams can rule out phrases that will never match anywhere in the collection.

I think I'm tending towards closing this, but may do a bit more exploration. Thanks for the feedback!

Edit: fixup for link to v2 of exact-match phrase query support pull request.

@jayaddison

I'm still not sure if it'd be worth it, but one alternative to building this index python-side might be to construct it upon initialization of the search code. I bet it'd still be quite fast and should reduce concerns about index bloat.

There is one potentially significant limitation to building the ngrams client-side: they can only be derived from information available to the client at searchtools.js initialization time. In contrast, ngram construction during the HTML builder phase can include information derived from anywhere in the project sources.

That's not a problem for this feature/pull-request in isolation, but it wouldn't be compatible with the elimination of non-existent phrase queries; that follow-up requires information about which words are adjacent to each other in documents -- information that is not currently available to the searchtools.js code (a sketch of what it might need follows below).
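
As a sketch of what that follow-up would need -- hypothetical, since no such structure exists in searchtools.js today -- build-time adjacency data could look like:

```python
# Hypothetical build-time structure: the set of adjacent word pairs across
# the whole collection, letting the client reject never-occurring phrases
# before issuing any per-document HTTP requests.

documents = {
    'doc1': ['partial', 'match', 'search'],
    'doc2': ['exact', 'phrase', 'search'],
}

adjacent_pairs = {
    (words[i], words[i + 1])
    for words in documents.values()
    for i in range(len(words) - 1)
}

def phrase_may_exist(phrase: list[str]) -> bool:
    """False means the phrase provably occurs nowhere in the collection."""
    return all(pair in adjacent_pairs for pair in zip(phrase, phrase[1:]))

print(phrase_may_exist(['partial', 'match']))   # True
print(phrase_may_exist(['partial', 'phrase']))  # False -> skip the lookups
```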

@jayaddison

Another thing to consider evaluating could be replacement of the terms index by an ngram-based index. With a few extensions to this code, I think that would be possible, and in a way that would support elimination of known-absent phrases during phrase queries on a per-document basis (compared to the approach I've been working on so far, where only phrases that are absent collection-wide can be filtered out at search time). A speculative sketch follows below.

Doing so would increase the size of the ngram index, but perhaps complete removal of the terms index would make the overall resulting index size difference more manageable.
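
A speculative sketch of the per-document variant (not part of this PR; names and data are illustrative):

```python
# If each n-gram maps to the documents containing it, a phrase query can be
# narrowed to documents where all of its n-grams co-occur, before any
# per-document fetching or scoring.

ngram_to_docs = {
    'sea': {'doc1', 'doc3'},
    'ear': {'doc1', 'doc2', 'doc3'},
    'rch': {'doc1', 'doc3'},
}

def docs_possibly_containing(query_grams: list[str]) -> set[str]:
    sets = [ngram_to_docs.get(g, set()) for g in query_grams]
    return set.intersection(*sets) if sets else set()

print(sorted(docs_possibly_containing(['sea', 'ear', 'rch'])))  # ['doc1', 'doc3']
```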

@jayaddison jayaddison changed the title [HTML search] Introduce ngram-based partial-match searching HTML search: Introduce ngram-based partial-match searching Aug 5, 2024
@jayaddison jayaddison closed this Aug 11, 2024
@jayaddison jayaddison deleted the issue-12045/partial-search-ngrams branch August 11, 2024 14:28
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 11, 2024
Successfully merging this pull request may close these issues.

[HTML search] optimization: don't loop over all document terms and title terms during partial-matching.