New exact phrase searching feature (for HTML) #12552

Open
wants to merge 3 commits into base: master

AA-Turner
Member

@AA-Turner AA-Turner commented Jul 13, 2024

Re-opened version of #4254. (See #4254 (comment))

New exact phrase searching feature (for HTML)

I've just rebased the old PR and updated it. However, I'm not sure that this is still the best implementation, given that we have since split "display" logic from "search" logic -- so if it is best to close this PR and start anew, I won't object.

Closes #3301

A

@AA-Turner AA-Turner added the html search and javascript labels Jul 13, 2024
@AA-Turner AA-Turner requested a review from wlach July 13, 2024 06:14
@AA-Turner AA-Turner requested a review from jayaddison July 13, 2024 06:15
Member

@picnixz picnixz left a comment

Shouldn't we have some tests?

sphinx/themes/basic/static/searchtools.js (outdated)
@jayaddison
Contributor

jayaddison commented Jul 13, 2024

This is a clever way to implement the feature without having to change the format of the search index (searchindex.js) file! My main concern is that it relies on the search summary functionality (HTTP GET of the complete content of each result), meaning that the retrieval behaviour of some search queries becomes entangled with what is otherwise mainly a result-formatting config setting (html_show_search_summary - on by default, but even so, as a user/project maintainer I would not expect that setting to alter search query capabilities).

An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Also agreed with @picnixz that some test coverage would be good if+when we add this.

Edit: rephrase; I shouldn't have suggested that this is an incomplete approach.

@jayaddison
Contributor

jayaddison commented Jul 13, 2024

> An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Note: stopwords (the, a, it...) are a challenge with this approach, because their positions aren't stored. The trick is to remove them from each phrase in the input query too.

Then a hypothetical document 15, with the contents "The example on page A is a useful example!", might tokenize to _ example _ page _ _ _ useful example, giving the document term positions example: {15: [2, 9]}, page: {15: [4]}, useful: {15: [8]}...

...and a query for "example on page" could tokenize to example _ page, giving the query term positions {example: 1}, {<ANY>: 2}, {page: 3}. We then need to match documents where both example and page appear, and filter those results to cases where each matched term from the tokenized phrase has a corresponding next-match (allowing the ANY wildcard) at the same relative offset. A final exact-match phase then eliminates incorrect wildcard matches (for instance "example code page" in an unrelated document 14).
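The positional approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list, `tokenize`, `buildPositionalIndex` and `phraseSearch` are hypothetical names, and nothing here reflects the actual searchindex.js format.

```javascript
// Hypothetical stopword list, for illustration only.
const STOPWORDS = new Set(["the", "a", "an", "on", "is", "it"]);

const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Build term -> (docId -> [1-based positions]). Stopwords are not stored,
// but they still advance the position counter so offsets are preserved.
function buildPositionalIndex(docs) {
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    tokenize(text).forEach((token, i) => {
      if (STOPWORDS.has(token)) return;
      if (!index.has(token)) index.set(token, new Map());
      const postings = index.get(token);
      if (!postings.has(docId)) postings.set(docId, []);
      postings.get(docId).push(i + 1);
    });
  }
  return index;
}

function phraseSearch(index, docs, phrase) {
  // Drop stopwords from the phrase, remembering each kept term's offset;
  // the gaps they leave behave like the <ANY> wildcard.
  const terms = tokenize(phrase)
    .map((term, offset) => ({ term, offset }))
    .filter(({ term }) => !STOPWORDS.has(term));
  if (terms.length === 0) return [];

  const [first, ...rest] = terms;
  const results = [];
  for (const [docId, starts] of index.get(first.term) ?? new Map()) {
    // Every remaining term must occur at the same relative offset.
    const positional = starts.some((start) =>
      rest.every(({ term, offset }) =>
        index.get(term)?.get(docId)?.includes(start + offset - first.offset)
      )
    );
    // Exact-match post-filter removes incorrect wildcard matches.
    if (positional && docs[docId].toLowerCase().includes(phrase.toLowerCase())) {
      results.push(docId);
    }
  }
  return results;
}

const docs = {
  14: "Put the example code page here.",
  15: "The example on page A is a useful example!",
};
const index = buildPositionalIndex(docs);
console.log(phraseSearch(index, docs, "example on page")); // ["15"]
```

Document 14 passes the positional check (example at 3, page at 5) and is only rejected by the exact-match step, which here still needs the document text; a real implementation would have to decide where that text comes from.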

Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

It's doable and probably quite a challenging and satisfying implementation, but there would be quirks and details.

Edit: add exact-match post-filter step
Edit: don't imply in the description that we'll definitely implement this

@picnixz
Member

picnixz commented Jul 13, 2024

> The trick is to remove them from each phrase in the input query too.

This may lead to a lot of false positives.

> Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

One idea is to use n-gram predictions but I'm not sure if it will be sufficient (and efficient).

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Contributor

@wlach wlach left a comment

I'm a bit on the fence about this one. The implementation isn't that bad, but on the other hand I'm not sure how useful it is and whether it's worth the additional surface area to support (especially if we might want to refactor the search internals in the future, as several people have discussed).

I would definitely want to see tests before it goes in.

sphinx/themes/basic/static/searchtools.js (outdated)
sphinx/themes/basic/static/searchtools.js (outdated)
Co-authored-by: Will Lachance <wrlach@gmail.com>
@jayaddison
Contributor

Despite my initial flip-out about an implementation that isn't index-driven, I would note that this is the most-requested search-related feature in the bugtracker. That's bringing me more towards acceptance of it.

if (data) {
  const lowercaseData = data.toLowerCase();
  const mismatch = (s) => !lowercaseData.includes(s);
  if (exactSearchPhrases.some(mismatch)) return;
Contributor

Perhaps every would be better than some here?

If searching for two phrases, it could be frustrating not to find any results at all, despite the fact that some pages do include one of the phrases.
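To make the difference concrete, here is a small standalone sketch comparing the two predicates. The function names are hypothetical and the page text is made up; only `some` and `every` themselves come from the code under review.

```javascript
// Each function returns true when a page should be dropped from the results.

// Strict: drop the page if ANY quoted phrase is missing.
const excludeIfAnyMissing = (text, phrases) =>
  phrases.some((p) => !text.includes(p));

// Lenient: drop the page only if ALL quoted phrases are missing.
// The length guard matters: [].every(...) is true, so without it every
// page would be dropped when the query contains no quoted phrases.
const excludeIfAllMissing = (text, phrases) =>
  phrases.length > 0 && phrases.every((p) => !text.includes(p));

const page = "a golang code example using channels";
const phrases = ["code example", "worker pool"];

console.log(excludeIfAnyMissing(page, phrases)); // true: "worker pool" is absent
console.log(excludeIfAllMissing(page, phrases)); // false: "code example" is present
```

Note that a plain swap of `some` for `every` in the reviewed line would need the empty-list guard shown here.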

@jayaddison
Contributor

If a user runs a query for golang "code example" and no pages contain the phrase 'code example', but pages do contain 'golang', do we have a preferred outcome? (zero results, or include the non-phrase results)
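For reference, splitting such a query into plain terms and quoted phrases might look like the sketch below. `parseQuery` is a hypothetical helper written for this discussion, not the parsing used in searchtools.js.

```javascript
// Split a raw query into lowercase plain terms and quoted exact phrases.
function parseQuery(query) {
  const phrases = [];
  // Pull out each "quoted phrase" and leave a space in its place.
  const remainder = query.replace(/"([^"]+)"/g, (_match, phrase) => {
    phrases.push(phrase.toLowerCase().trim());
    return " ";
  });
  const terms = remainder.toLowerCase().split(/\s+/).filter(Boolean);
  return { terms, phrases };
}

const parsed = parseQuery('golang "code example"');
console.log(parsed.terms);   // ["golang"]
console.log(parsed.phrases); // ["code example"]
```

With the two lists separated, either outcome from the question above is a small policy decision applied after the index lookup.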

@picnixz
Member

picnixz commented Jul 17, 2024

For me, quotes should be what I want to search, even if it contains the rest. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?
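A minimal sketch of that behaviour, assuming each result already carries its page text: quoted phrases are enforced strictly, and a hint is produced when they eliminate every result. `applyPhraseFilter`, the `text` field and the hint wording are all hypothetical.

```javascript
// Keep only results containing every quoted phrase; if that removes all
// results, return a hint suggesting the user drop the quotes.
function applyPhraseFilter(results, phrases) {
  const matchesAll = (text) =>
    phrases.every((p) => text.toLowerCase().includes(p));
  const kept = results.filter((r) => matchesAll(r.text));
  const hint =
    phrases.length > 0 && kept.length === 0 && results.length > 0
      ? "No pages contain the quoted phrase(s); try removing the quotes."
      : null;
  return { results: kept, hint };
}

const outcome = applyPhraseFilter(
  [{ title: "Intro", text: "Golang basics" }],
  ["code example"]
);
console.log(outcome.results.length); // 0
console.log(outcome.hint !== null);  // true
```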

@AA-Turner
Member Author

Google presents the following ideas:

image

image

A

@jayaddison
Contributor

jayaddison commented Jul 17, 2024

Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

@jayaddison
Contributor

> Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

Hm. Fortunately not, thanks to both of those being EN-language stopwords.

@jayaddison
Contributor

> For me, quotes should be what I want to search, even if it contains the rest. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?

That seems simple. As a user, if I've intentionally used quotes to get exact-match results, then I can probably figure out that all of them must match when I use multiple quoted phrases in my query.

I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to intra-term ngrams in #12596).

@jayaddison
Contributor

> I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to intra-term ngrams in #12596).

Idea: when indexing tri-grams in the manner proposed in #12596, add the following handling:

  • During indexing, keep track of the word before/preceding each term, provided that it is part of the same block of text (paragraph/sentence).
  • If the trigram being created is at the start/prefix of a word, then include the trigram of the suffix of the previous word in the term list. So, when indexing the phrase "context matters" (term offsets zero and one respectively), the trigram for ext would point to term-offset zero, the trigram for ers would point to term-offset one, and the trigram for mat -- seemingly oddly -- would point to both term-offsets zero and one.
  • During phrase queries, we would begin by collecting all of the starting-edge trigrams from the query phrase. If we query again for "context matters", this returns [0], [0,1] -- and we can observe that there is a valid, ordered path along the ordered terms. If we queried for "context indeed matters", then we would retrieve [0], [2], [0, 1] or something along those lines -- the second result ([2]) doesn't contain a path back to the first result ([0]), and so this pair of terms does not exist in the document collection.

The flaw in all of the above reasoning is that it is global across the entire document collection. It may be preferable to have per-document filtering, because otherwise query performance may be inconsistent (some very fast queries where the phrase is known not to exist at all -- but then slow queries where we still have to check every document).
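One possible reading of the adjacency bookkeeping in the bullets above, reduced to its core: each word-start trigram records the id of the term it begins plus the id of the term immediately preceding it in the same text block. This is a simplified, assumption-laden sketch (it omits the suffix trigrams, and `buildEdgeTrigramIndex` and `phraseMayExist` are hypothetical names), not the design from #12596.

```javascript
const words = (text) => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// blocks: array of text blocks (paragraphs/sentences).
// Returns start-trigram -> Set of term ids (offsets in the global term list).
function buildEdgeTrigramIndex(blocks) {
  const termIds = new Map();
  const trigrams = new Map();
  const idOf = (term) => {
    if (!termIds.has(term)) termIds.set(term, termIds.size);
    return termIds.get(term);
  };
  for (const block of blocks) {
    let previous = null; // reset so adjacency never crosses a block boundary
    for (const word of words(block)) {
      const id = idOf(word);
      const key = word.slice(0, 3);
      if (!trigrams.has(key)) trigrams.set(key, new Set());
      if (previous !== null) trigrams.get(key).add(previous);
      trigrams.get(key).add(id);
      previous = id;
    }
  }
  return trigrams;
}

// Cheap pre-check: consecutive query words must share at least one term id.
function phraseMayExist(trigrams, phrase) {
  const lists = words(phrase).map(
    (w) => trigrams.get(w.slice(0, 3)) ?? new Set()
  );
  return lists.every((list, i) =>
    i === 0 ? list.size > 0 : [...list].some((id) => lists[i - 1].has(id))
  );
}

const trigramIndex = buildEdgeTrigramIndex(["context matters", "indeed"]);
console.log(phraseMayExist(trigramIndex, "context matters"));        // true
console.log(phraseMayExist(trigramIndex, "context indeed matters")); // false
```

Because each list merges a term's own id with its predecessor's, the check in this sketch is symmetric: "matters context" also passes. It can therefore only rule phrases out, and, as noted above, it works across the whole collection rather than per document.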

// exclude results that don't contain exact phrases if we are searching for them
if (data) {
  const lowercaseData = data.toLowerCase();
  const mismatch = (s) => !lowercaseData.includes(s);
Contributor

Suggestion: perhaps we could/should add word boundaries around the match?

Suggested change
const mismatch = (s) => !lowercaseData.includes(s);
const mismatch = (s) => !new RegExp(`\\b${s}\\b`).test(lowercaseData);

Reasoning:

  • Could make it easier to exact-search for strings that are substrings of other phrases/words.
  • Although regex usage can introduce some overhead, there's also optimization opportunity if the matching can skip over non-word boundary match positions.
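A minimal standalone sketch of word-boundary matching, under a few assumptions: `escapeRegExp` and `containsPhrase` are hypothetical helpers, the pattern is built from the phrase and tested against the page text, and `\b` has to be written as `\\b` inside a template literal (a bare `\b` there is a backspace character).

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that user input such as "a.b" or "(x)" is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// True when `phrase` occurs in `text` delimited by word boundaries.
function containsPhrase(text, phrase) {
  const pattern = new RegExp(`\\b${escapeRegExp(phrase.toLowerCase())}\\b`);
  return pattern.test(text.toLowerCase());
}

const sample = "Use the context manager";
console.log(sample.toLowerCase().includes("text man")); // true (substring hit)
console.log(containsPhrase(sample, "text man"));        // false
console.log(containsPhrase(sample, "context manager")); // true
```

One caveat: `\b` is defined in terms of ASCII word characters, so phrases that begin or end with punctuation (for example "c++") or with non-ASCII letters may not match as a user would expect.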

Labels
html search, javascript

Successfully merging this pull request may close these issues.

Exact search in Sphinx
5 participants