Make the search REST API work in large wikis

Hi everyone,

Page pickers like the one used in the Ctrl + G popup, or macro parameters like the included page in the include macro, are currently based on the search REST API /wikis/{wikiName}/search. This REST API performs a query that contains the where-clause upper(doc.name) like :keywords, where :keywords is set to %USER_INPUT%, with USER_INPUT being the user’s input. From what I understand from @lucaa, for example, it is important for her, and maybe also for other users, that this query really performs an exact substring match on the page name. Unfortunately, this kind of SQL query cannot be optimized with an index and is thus not scalable. We’ve recently experienced that this leads to severe performance problems on an instance with a million documents, making the pickers useless and impacting the stability of the instance.

I’ve done a bit of research and what I found is that the best option might be to move this query to Solr with fields indexed with an N-Gram filter. Simply put, with this filter, Solr would create an inverted index for short substrings of the indexed content. For example, for N-Grams of size 3 to 10 with a lowercase filter and input “XWiki”, it would index “xwi”, “wik”, “iki”, “xwik”, “wiki”, and “xwiki”. That way, we can get fast exact substring matches up to a certain size. There are several open questions, though:

  • How do we configure the N-Gram filter: do we apply it only to tokens, i.e., after a tokenizer, or directly to the original lowercased text? I’m actually not sure whether the latter is possible; all examples seem to use a tokenizer. There is also the option to use just prefixes and suffixes if we don’t care about matches in the middle of a word.
  • What sizes of N-Grams do we want to have? It probably makes sense to use a minimum of 2 or 3 characters. For the maximum, I would suggest performing experiments on actual content to see how it influences the index size; I think we should have at least 5 characters, I would want to try 10, but I wouldn’t go beyond 20.
  • I would suggest combining whatever we do with a regular full-text search (BM25) on the respective fields. The question is how we should configure the ranking. One option would be, for example, to give the regular full-text search a higher weight, which should rank matches of full words above partial matches.

We currently support searching in the following scopes: title, name, content, spaces, objects. I’m pretty sure that we’re using title and name for page pickers and spaces for space pickers, but I couldn’t find any uses of the content or objects scope. I would suggest indexing N-Grams at least for title, name and spaces, but I’m not convinced we should add an N-Gram index for the page and object content - or if we do, we might want to configure it differently, with smaller N-Grams, to avoid excessive index size.

Do you have any further suggestions on what we should do? Do you agree that we should change the wiki (and space) search REST API to use Solr even if the behavior would be slightly different? Should we offer a switch back to the old solution for small wikis that care about the exact previous behavior?

Thank you very much for your feedback and inputs!

Hi,

thanks for this investigation!

Just to be sure: are the tokens you’re mentioning here in general words separated by spaces, or are we talking about another form of tokenization?
It feels like it would give better results to apply it after the tokenizer, no?

This question seems to depend highly on how the page pickers are used. I don’t think it would make sense to present results to users before they have typed 2 or 3 characters, and 2 characters might even be too low. And IMO users would rarely go as far as typing 10 characters to obtain a result. Maybe @tkrieck would have some insight on this.
In any case, this needs to be configurable.

This might also depend on how we present the results. We could imagine that the API returns 20 results from the regular search and 20 results from the n-gram filter search, and we handle the combination at the very end. And then that should probably also be configurable.

+1, at least for a first version.

Do you have any figures, even estimations, on the space needed for different numbers of documents, e.g. with 100, 1000, 10000 docs?

It depends; there are different tokenizers in Solr. But in general, yes, that’s the idea.

And now that I saw that page, I also realized that there is an N-Gram Tokenizer, so we can indeed use it directly without first splitting into words. The advantage of not using a tokenizer before the n-gram generation is that you would be able to more easily search for things like a-b or a@b, which are normally split apart by the tokenizer. I doubt that such queries would work otherwise if we only index n-grams starting with a minimum size of 2.
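
For illustration, such a field type could look roughly like this (just a sketch: the field type name and the gram sizes are made up and not tested yet):

    <fieldType name="ngrams_untokenized" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <!-- Slice the whole input into substrings without splitting it into words first. -->
            <tokenizer name="nGram" minGramSize="3" maxGramSize="10"/>
            <filter name="lowercase"/>
        </analyzer>
        <analyzer type="query">
            <!-- Keep the query as a single lowercased token so short inputs match the indexed n-grams directly. -->
            <tokenizer name="keyword"/>
            <filter name="lowercase"/>
        </analyzer>
    </fieldType>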

The problem is that you need to re-index the whole wiki to change this, as this needs to be configured in the Solr schema. However, this also means that you can configure it: just edit the Solr schema and re-index.

No, I first wanted to get a consensus that this is the way we want to go, and then I wanted to perform some experiments to see how this influences the index size.

I think there are a few factors that influence the index size:

  1. Small n-grams might match a huge number of documents each, so their posting lists might blow up the index a bit.
  2. There are more distinct n-grams without tokenization than with tokenization.
  3. There are more distinct larger n-grams (in particular without prior tokenization).

I agree that a minimum of 3 characters is preferable; the results wouldn’t be useful with less than that. Regarding the maximum, it depends on whether the search input is the only way of filtering results. With a large dataset and only an input for filtering, the number of characters typed can get pretty high. On the main XWiki search it’s not a problem because there’s the advanced (i.e., regular) search.

As far as I understand (but I haven’t tried it), there is no real problem with larger inputs: there will still be results, it’s just that we can’t get precise substring matches, at least not as fast. I’m certain we can still get regular keyword search results for larger queries, and either automatically or with a preprocessing step we can also still get precise substring matches, e.g., by searching only for the last 10 characters of the query, or by splitting the query into parts. I believe Solr already does similar things, such as running an imprecise query for parts and then filtering the results later, e.g., when you search for a phrase by putting the search string in quotes. The query might just be less optimized.

I’ve performed some experiments. I’ve defined three kinds of fields in a test Solr instance:

  1. Just n-gram tokenization on the full text. This allows for exact substring matches up to 10 characters (and we also add the full text as a token, so an exact query of the whole input matches, too):
    <fieldType name="ngrams_full" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer name="keyword"/>
            <filter name="lowercase"/>
            <filter name="nGram" minGramSize="1" maxGramSize="10" preserveOriginal="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer name="keyword"/>
            <filter name="lowercase"/>
        </analyzer>
    </fieldType>
  2. N-gram tokenization on whitespace-separated tokens. This allows for exact substring matches in several words in any order with space-separated words of up to 10 characters each (and the whole words are also added as tokens):
    <fieldType name="ngrams_tokens" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer name="whitespace"/>
            <filter name="lowercase"/>
            <filter name="nGram" minGramSize="1" maxGramSize="10" preserveOriginal="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer name="whitespace"/>
            <filter name="lowercase"/>
        </analyzer>
    </fieldType>
  3. Regular tokenization as a comparison:
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

In the first and the second case, I’ve configured the query analyzer to perform the same analysis without n-gram generation. The reason is that for the query, it doesn’t make sense to search for all possible substrings. Instead, I think the XWiki code that uses these fields should generate sensible queries: words/whole search strings of up to 10 characters are passed as-is, and for search strings/words (in the tokenized variant) longer than 10 characters, we generate n-grams of length 10 to approximate an exact match.

To get somewhat realistic input data, I’ve indexed the first 3 million titles of the full English Wikipedia title dump of February 20, 2025.

For the regular tokenization, I get an index of 100 MB.

With a minimum n-gram size of 1, I get the following sizes:

  • N-grams of the full title: 1445 MB
  • N-grams of tokens: 540 MB

With a minimum n-gram size of 2, I get:

  • N-grams of the full title: 1409 MB
  • N-grams of tokens: 490 MB

If you’re interested in particular configurations, I can try them, too.

I’ve tested various queries, and they seem to work as expected. With the n-grams of the full title we can match exactly across words (but only up to 10 characters) while with the n-grams of tokens the order of the space-separated words doesn’t matter, so “wiki search” is the same as “search wiki”. In both cases, numbers and special characters are indexed, too, so in contrast to fields using the standard tokenizer, you can easily search for things like “a(b)”.

What’s not so expected for me is the ranking: I don’t seem to be able to boost a full match of a word or a match of the whole title over a substring match with the n-gram based index. However, with the regular index, the boost for a phrase match with the DisMax query parser works as expected. As we have the regular index anyway, I think we could construct a query that boosts phrase matches via the regular index (I couldn’t test that, as I put each field type in a separate collection to be able to compare their sizes).
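
To sketch what I mean (the handler name and the field names are invented, and I haven’t tested this exact combination): with the eDisMax parser, the pf parameter boosts documents where the regular field matches the query as a phrase, while qf still matches substrings through the n-gram field, e.g. in solrconfig.xml:

    <requestHandler name="/pagepicker" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="defType">edismax</str>
            <!-- Match substrings via the n-gram field and full words via the regular field. -->
            <str name="qf">title_ngrams title_text^2</str>
            <!-- Strongly boost documents where the regular field matches the query as a phrase. -->
            <str name="pf">title_text^10</str>
        </lst>
    </requestHandler>

The same parameters could of course also be passed per request instead of being configured in a request handler.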

Based on the database queries that we currently have, I think we would need the following fields for the different search scopes we support:

  • name: this searches in the page name for terminal pages and in the last space for non-terminal pages. I think for this we should introduce a new field that contains the name for terminal pages and the last space for non-terminal pages.
  • title: this searches in the (raw) page title. I don’t think we should search in the raw title, but instead we should use the existing rendered title, just indexed with the chosen index format.
  • content: this searches for substrings in the content. As mentioned before, I don’t think we should use the expensive n-gram index for the whole content; we should just use the existing content index instead.
  • spaces: this is used for the space picker. Here, we currently do a prefix match on the full reference and a substring match on the last space. I think we could emulate this with the document index in Solr if we a) create an edge-n-gram index of the full space reference (so we index all prefixes of it) and b) add an n-gram index for the individual spaces themselves (so for A.B.C, we would add A, B and C as tokens). We could then search in those two fields and request facet results for the existing space prefixes (which contain the full space hierarchy of the document; for A.B.C, this contains A, A.B, and A.B.C), which should give us unique space values. The reason why I think it’s not enough to index just the last space for the second index is that there might be no page in the intermediate spaces. Further, we should probably include hidden pages for that query. A rough schema sketch for this follows after this list.
  • objects: this currently searches in the content of short and large string properties. The simplest approach would be to search in the propertyvalue field of the indexed properties (type: OBJECT_PROPERTY). As the property values could be huge, it might not be a good idea to index n-grams for them. The ordering could be challenging, though, as we don’t have all document properties like the title available there, for example. We could also search in objcontent in the document, but the problem there is that objcontent contains not only the values but also the keys, which might lead to undesired matches. Alternatively, we could index just the property values separately in the document (again, without n-grams).
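
To make this a bit more concrete, here is a rough sketch of what additional schema entries could look like (all field and type names are invented, the edge-n-gram sizes are placeholders, and ngrams_tokens refers to the field type from my experiments above):

    <!-- Prefix matches on the full space reference (all prefixes of the lowercased reference are indexed). -->
    <fieldType name="edge_ngrams" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer name="keyword"/>
            <filter name="lowercase"/>
            <filter name="edgeNGram" minGramSize="1" maxGramSize="30" preserveOriginal="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer name="keyword"/>
            <filter name="lowercase"/>
        </analyzer>
    </fieldType>

    <!-- Page name for terminal pages or last space for non-terminal pages. -->
    <field name="name_or_last_space" type="ngrams_tokens" indexed="true" stored="false"/>
    <!-- Full space reference for prefix matches (scope "spaces"). -->
    <field name="space_reference_prefix" type="edge_ngrams" indexed="true" stored="false"/>
    <!-- Individual spaces (A, B and C for A.B.C) for substring matches on any space. -->
    <field name="spaces_ngrams" type="ngrams_tokens" indexed="true" stored="false" multiValued="true"/>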

Do you have any opinions on:

  1. The configuration of the n-gram generation.
  2. The fields to index and search for the different scopes.

Thank you very much for your input! Also, let me know if you need any clarifications or if I should verify any examples.

I’ve got some more thoughts:

  • Do we really need/want exact substring matches? It’s what we currently have, but it might not actually be needed (or even desired).
  • Even with the splitting at whitespace, there might be issues with certain kinds of page names. For example, the page name strategy could be configured to not use any whitespace, or it could be random UUIDs - I expect that this would be the worst case in terms of index size.
  • Still, I think it would be good to be able to match substrings like 17.10-RC1 exactly and fast.

Some ideas on how to tackle this:

  • Use a more limited range of n-gram sizes, like two (or three) to five characters to avoid producing a huge number of different n-grams when the input cannot be split into small tokens.
  • Discard the idea of being able to search for arbitrary substrings and just use a standard tokenizer as a first step, which would result in smaller and less diverse inputs to the n-gram generation. Solr might still be able to boost matches that are close together.
  • Develop a custom tokenization strategy, e.g., based on a regular expression.
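
To illustrate the first and the last idea combined (again just a sketch: the type name, the separator pattern and the sizes are made up and untested): a pattern-based tokenizer that only splits on whitespace and slashes, so tokens like 17.10-RC1 stay intact, followed by a small n-gram range:

    <fieldType name="ngrams_custom_tokens" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <!-- Split only on whitespace and slashes; the regex is just an example. -->
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s/\\]+"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- Small n-grams plus the original token, so both "0-rc" and "17.10-rc1" would match. -->
            <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5" preserveOriginal="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s/\\]+"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>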

Do you have any opinions on what and how we should be able to match?

I’ve done some more experiments, and I’m wondering if we really need n-grams. It seems to me that a query like *INPUT* (with a leading wildcard) performs badly in Solr, but INPUT* (a prefix query) seems to cause only a relatively small performance hit - having a more complex query with two more conditions seemed to cause a bigger hit in my local test with 3 million indexed documents.

Considering that, there might not really be a need for that n-gram index unless we really want to be able to have matches starting in the middle of a word.

What seems to be a problem with the current search index is that we don’t currently have the last space indexed separately, which would be important for matching the name of non-terminal documents. We could instead match all spaces, which would basically re-open XWIKI-20632. We could make that a bit less severe by giving a higher weight to the title when both title and name are to be searched. Further, the title search would be more effective in Solr, as we have the rendered title there.

I’m also not sure that the current index is really enough for searching spaces. While I outlined a possible strategy above, I’m not sure whether it is good enough. However, searching for spaces could also be treated as a separate topic that we solve on its own, as I think it is used much less frequently and the code is already quite separate.

Do you have any opinions on how we should continue? I think it might be a promising move to develop a first version of the search REST API that works with the existing search index, as it would be a lot simpler to deploy.

I would also suggest offering both database and Solr implementations in a first version, with a configuration option to switch between them, and with the idea of removing the database version once the Solr version has proven to be good enough in real use.