Hi everyone,
Page pickers like the one used in the Ctrl + G popup, or macro parameters like the included page in the `include` macro, are currently based on the search REST API `/wikis/{wikiName}/search`. This REST API performs a query containing the where-clause `upper(doc.name) like :keywords`, where `:keywords` is set to `%USER_INPUT%` with `USER_INPUT` being the user's input. From what I understand from @lucaa, for example, it is important for her, and maybe also for other users, that this query really performs an exact substring match on the page name. Unfortunately, this kind of SQL query cannot be optimized with an index (the leading wildcard defeats a regular B-tree index) and is thus not scalable. We’ve recently experienced that this leads to severe performance problems on an instance with a million documents, making the pickers useless and impacting the stability of the instance.
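To illustrate, the generated query has roughly the following shape; the select list is my assumption, only the where-clause is taken from the actual implementation:

```sql
-- Hypothetical sketch of the current query; only the where-clause is
-- taken from the actual implementation.
SELECT doc.fullName
FROM XWikiDocument doc
WHERE upper(doc.name) LIKE :keywords
-- :keywords is bound to '%USER_INPUT%'. The leading '%' prevents the
-- database from using a regular B-tree index on doc.name, so it has
-- to scan and upper-case the name of every document.
```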
I’ve done a bit of research, and what I found is that the best option might be to move this query to Solr, with fields indexed using an N-Gram filter. Simply put, with this filter, Solr creates an inverted index of short substrings of the indexed content. For example, for N-Grams of size 3 to 10 with a lowercase filter and the input “XWiki”, it would index “xwi”, “wik”, “iki”, “xwik”, “wiki”, and “xwiki”. That way, we get fast exact substring matches for search terms up to the maximum N-Gram size. There are several open questions, though:
- How do we configure the N-Gram filter: do we apply it just to tokens, i.e., after a tokenizer, or directly to the original lowercased text? I’m actually not sure whether the latter is possible; all examples I’ve found apply the filter after a tokenizer (see the schema sketch after this list). There is also the option to index just prefixes and suffixes if we don’t care about matches in the middle of a word.
- What sizes of N-Grams do we want? It probably makes sense to use a minimum of at least 2 or 3 characters. For the maximum, I would suggest running experiments on actual content to see how it influences the index size; I think we should support at least 5 characters and I would want to try 10, but I wouldn’t go beyond 20.
- I would suggest combining whatever we do with a regular full-text search (BM25) on the respective fields. The question is how we should configure the ranking. One option, for example, is to give the regular full-text search a higher weight, which should rank matches of full words above partial matches (see the request handler sketch further below).
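To make the first two points more concrete, here is a rough sketch of what such a field type could look like in the Solr schema. The type and field names, the tokenizer choice, and the gram sizes are just assumptions to start experimenting with, not a final proposal:

```xml
<!-- Hypothetical field type for substring matching; all names and
     sizes here are assumptions, not a final proposal. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Tokenize, lowercase, then expand each token into N-Grams.
         Using solr.KeywordTokenizerFactory instead would apply the
         N-Gram filter to the whole input without word splitting. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- No N-Gram expansion at query time: the lowercased query terms
         are matched directly against the indexed grams. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The N-Gram variants would be filled from the existing fields. -->
<field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
<copyField source="title" dest="title_ngram"/>
```

For the prefix-only variant mentioned above, `solr.EdgeNGramFilterFactory` could be used in place of `solr.NGramFilterFactory`.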
We currently support searching in the following scopes: `title`, `name`, `content`, `spaces`, `objects`. I’m pretty sure that we’re using `title` and `name` for page pickers and `spaces` for space pickers, but I couldn’t find any uses of the `content` or `objects` scopes. I would suggest indexing N-Grams at least for `title`, `name`, and `spaces`, but I’m not convinced we should add an N-Gram index for the page and object content. If we do, we might want to configure it differently, with smaller N-Grams, to avoid excessive index size.
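Regarding the ranking question from the list above, here is a sketch of how the regular and N-Gram fields could be combined with different weights, e.g. via an edismax handler. All field names and boost values are made up and would need tuning:

```xml
<!-- Hypothetical request handler: whole-word (BM25) matches on the
     regular fields get a higher boost than substring matches on the
     N-Gram variants. Field names and boosts are made up. -->
<requestHandler name="/pagepicker" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^10 name^10 spaces^5 title_ngram^2 name_ngram^2 spaces_ngram^1</str>
  </lst>
</requestHandler>
```

The same `qf` could of course also be passed per request instead of being baked into a dedicated handler.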
Do you have any further suggestions on what we should do? Do you agree that we should change the wiki (and space) search REST API to use Solr, even if the behavior will be slightly different? Should we offer a switch back to the old implementation for small wikis that care about the exact previous behavior?
Thank you very much for your feedback and inputs!