HTML search: Introduce ngram-based partial-match searching #12596
Conversation
At d15a3ae (the current latest work-in-progress commit), the index size bloat from these changes remains significant; the index for the Sphinx self-built documentation grows from under 500K to almost 800K (an increase of roughly 60% relative to the original).
I also discovered a bug (#12599) related to missing entries in …. Edit: add clarification/nitpick.
One more idea: we have some redundancy between the ngram index and the existing `terms` index.
Stop press; this quote below needs some amendment:
It turns out that much of this was due to making far too many calls to …. [Profile traces: before (v7.4.4 baseline) vs. after (this branch at commit 1271159).] There definitely is some variance in trace times on my machine, so I wouldn't claim that this is a definite performance win yet. However, we do at least appear to be back on parity with mainline. Edit: fix before-image display.
Arguable benefits of ngram-based search:
- Partial-match queries no longer require a brute-force scan across every term in the index at query time (see the sketch below).
Drawbacks:
- Significant growth in the size of the search index.
- Additional code and datastructure complexity.
Any ideas for how to test/optimize/debate this further are welcome!
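To make that trade-off concrete, here is a minimal hypothetical sketch (names and data shapes are illustrative, not the PR's actual code) contrasting the existing brute-force scan with an ngram-index lookup:

```python
# A minimal, hypothetical sketch (not the PR's actual code) contrasting the
# existing brute-force partial matching with an ngram-index lookup.

from collections import defaultdict

def trigrams(term):
    """The distinct trigrams of a term."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def build_ngram_index(terms):
    """Map each trigram to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index

def brute_force_search(terms, query):
    # Existing approach: scan every indexed term on every query.
    return {term for term in terms if query in term}

def ngram_search(index, query):
    # Candidates must contain all of the query's trigrams, so intersect
    # the (usually small) posting sets instead of scanning all terms.
    grams = trigrams(query)
    candidates = set.intersection(*(index[g] for g in grams)) if grams else set()
    # Shared trigrams don't guarantee a contiguous match, so verify candidates.
    return {term for term in candidates if query in term}

terms = ["search", "searching", "researcher", "sphinx"]
index = build_ngram_index(terms)
assert ngram_search(index, "earch") == brute_force_search(terms, "earch")
```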
…l-search refactoring. (Branch force-pushed from ce424ea to 70788b6.)
No further changes planned on this branch; pausing for now pending review/feedback.
(Sorry; I spotted an edge case that I felt would be worth handling. Pausing again now.) Edit: add hyperlink reference to the edge case + test coverage commit.
Conflicts: CHANGES.rst (manually adjusted)
Todo: consider edge cases related to document terms and/or query terms that contain repeated characters. For example (see the sketch below): …. Edit: clarify that two distinct trigrams are emitted; ngrams can be arbitrary-length.
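Since the original example did not survive here, the following hypothetical sketch illustrates the kind of repeated-character case in question, using `banana` as the document term:

```python
# Hypothetical illustration of the repeated-character edge case (not the
# original example): "banana" repeats the trigram "ana".

def trigram_list(term):
    """All trigrams of a term, in order, including repeats."""
    return [term[i:i + 3] for i in range(len(term) - 2)]

print(trigram_list("banana"))       # ['ban', 'ana', 'nan', 'ana']
print(set(trigram_list("banana")))  # three distinct trigrams; repeats collapse

# A query term such as "anana" emits two distinct trigrams ('ana' and 'nan'),
# both of which appear in the document term's trigram set, so it matches;
# but any logic relying on trigram *counts* would need the list form instead.
```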
…time during ngram search.
…efore partial term match result collection.
… set-comparison operations are complete. This means that set-comparison operations occur using integer values instead of string values.
… a JavaScript `Set` during collection and filtering of candidate terms.
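The commit notes above mention performing the set comparisons on integer values instead of strings; a minimal sketch of that idea, with hypothetical data:

```python
# Sketch of the integer-identifier idea described above (hypothetical data):
# assign each candidate term a small integer, and intersect sets of those
# integers rather than sets of strings.

term_ids = {"search": 0, "searching": 1, "researcher": 2}
id_to_term = {i: t for t, i in term_ids.items()}

# Posting sets keyed by trigram, storing integer IDs instead of the strings.
postings = {
    "ear": {0, 1, 2},
    "rch": {0, 1, 2},
    "hin": {1},
}

# Intersections now compare fixed-size integers, not variable-length strings.
candidates = postings["ear"] & postings["hin"]  # {1}
print([id_to_term[i] for i in candidates])      # ['searching']
```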
Ok; I've applied a few more tweaks, and I think that the performance results are fairly positive on another sample query for …. I've confirmed that the result count is the same too. Document scores should be unaffected, although the ordering of results that share the same score may differ from the mainline/baseline algorithm.
Hey @jayaddison, I'll try to find some time to take a look at this next week. FWIW, my understanding is that <100ms is on the edge of human-perceptible latency, so I wouldn't be too concerned about micro-optimizations here, though they don't hurt so long as the code is still understandable/readable.
Hey, I had a look through and, though this looks cool, I'm just not sure it's worth the extra space and, more importantly, the code complexity as things stand. I'm not seeing much sign that the existing brute-force search approach is actually causing problems for anyone. It would be a different matter if it took several seconds and was blocking the main thread, but as mentioned, searches seem to complete well below the threshold of human perception.
I'm still not sure if it'd be worth it, but one alternative to building this index Python-side might be to construct it upon initialization of the search code. I bet it'd still be quite fast, and it should reduce concerns about index bloat.
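A rough sketch of that suggestion follows, written in Python for brevity although the real construction would live in the JavaScript search-initialization code (names are hypothetical):

```python
# Rough, hypothetical sketch: derive the ngram index on the client, at load
# time, from the terms data that the client already downloads. (Python for
# brevity; the real code would live in the JavaScript search initialization.)

from collections import defaultdict

def build_index_at_init(terms):
    """Build a trigram -> matching-terms mapping once, at startup."""
    index = defaultdict(set)
    for term in terms:
        for i in range(len(term) - 2):
            index[term[i:i + 3]].add(term)
    return index

# Runs once per page load: costs some client CPU time up front, but adds
# nothing to the size of the downloaded search index.
index = build_index_at_init(["search", "searching", "researcher"])
print(sorted(index["ear"]))  # ['researcher', 'search', 'searching']
```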
That's valid criticism; I'd add datastructure complexity to that too. Most of the other data in …
This is reasonable too. I've attempted disabling …
That's a very interesting idea, and if time allows I'll evaluate that too. My intuitive sense is that placing the functionality cost of the ngrams (it has to go somewhere) into bandwidth consumption (the largest overall increase in a many-clients situation like this one) is better than client-side compute here, because we can lean heavily on caching of static data, and I think the incremental transfer/decompression times would be lower than the client-side compute duration, and would also scale up at a lower rate as the documentation set grows. That's not clear without benchmarking, though. A couple of not-directly-relevant thoughts: …
I think I'm tending towards closing this, but may do a bit more exploration. Thanks for the feedback! Edit: fixup for link to v2 of the exact-match phrase query support pull request.
There is one potentially-significant limitation of building the ngrams client-side: they can only be derived from information available to the client at …. That's not a problem for this feature/pull-request in isolation, but it wouldn't be compatible with the elimination of non-existent phrase queries; that follow-up requires information about which words are adjacent to each other in documents, information that is not available to ….
Another thing to consider evaluating could be replacement of the `terms` index by the ngram index itself. Doing so would increase the size of the ngram index, but perhaps the complete removal of the `terms` index would compensate for that.
Feature or Bugfix
Purpose
Detail
After the `terms` index is built, derive another index of the n-grams contained within those terms. Compress and minify this into a trie datastructure for handover to JavaScript (an illustrative sketch follows at the end of this description).
Todo / open questions
Refer to entries in `terms` not by storing their entire string value in trie nodes, but by using integer identifier(s).
Relates
…
Edit: add JavaScript implementation details to this description.
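As referenced in the Detail section above, here is an illustrative sketch (a hypothetical structure, not the PR's actual encoding) of packing ngram postings into a trie whose leaves hold integer term identifiers, per the Todo item:

```python
# Illustrative only: a nested-dict trie over ngrams, with integer term
# identifiers at the leaves. Shared prefixes ("ear"/"eat" share "ea") are
# stored once, which is where the compression comes from.

def build_trie(ngram_postings):
    """ngram_postings maps each ngram to a set of integer term IDs."""
    trie = {}
    for gram, term_ids in ngram_postings.items():
        node = trie
        for char in gram:
            node = node.setdefault(char, {})
        node[""] = sorted(term_ids)  # leaf marker: IDs of terms with this gram
    return trie

def lookup(trie, gram):
    """Walk the trie; return the term IDs stored at the gram's leaf."""
    node = trie
    for char in gram:
        node = node.get(char)
        if node is None:
            return []
    return node.get("", [])

trie = build_trie({"ear": {0, 1}, "eat": {2}})
assert lookup(trie, "ear") == [0, 1]
assert lookup(trie, "xyz") == []
```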