Make the search REST API work in large wikis

MichaelHamann · March 7, 2025, 1:25pm

I’ve done some more experiments, and I’m wondering if we really need n-grams. It seems to me that a query like *INPUT* performs badly in Solr, but INPUT* seems to cause only a relatively small performance hit. In my local test with 3 million indexed documents, making the query more complex by adding two more conditions seemed to cause a bigger hit than the trailing wildcard did.

Considering that, there might not be a need for the n-gram index unless we really want to support matches that start in the middle of a word.
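To make the two query shapes concrete, here is a minimal sketch of how they could be built from user input. The field name passed in and the helper names (`escape_term`, `prefix_query`, `infix_query`) are made up for illustration and are not taken from the XWiki code. The cost comments describe the usual behaviour of an index without a reversed-wildcard or n-gram field and are not a measurement.

```python
import re

# Characters with special meaning in the Solr/Lucene classic query syntax,
# plus whitespace, which would otherwise split the input into several clauses.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def escape_term(text: str) -> str:
    """Backslash-escape query syntax characters and whitespace in user input."""
    return _SPECIAL.sub(r"\\\1", text)


def prefix_query(field: str, user_input: str) -> str:
    """INPUT*: trailing wildcard only, so matching terms share a common prefix."""
    return f"{field}:{escape_term(user_input)}*"


def infix_query(field: str, user_input: str) -> str:
    """*INPUT*: leading wildcard, which typically means checking far more terms."""
    return f"{field}:*{escape_term(user_input)}*"
```

For example, `prefix_query("title", "Main")` yields `title:Main*`, while `infix_query("title", "Main")` yields `title:*Main*`. Only the second form would need something like an n-gram index to be fast.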

One problem with the current search index is that the last space is not indexed separately, which would be important for matching the name of non-terminal documents. We could instead match all spaces, which would basically re-open XWIKI-20632. We could make that a bit less severe by giving a higher weight to the title when both title and name are searched. Further, searching the title would be more effective in Solr because the rendered title is available there.
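As a rough illustration of the weighting idea, the sketch below builds one query string that searches both fields with a prefix match and boosts the title clause using the standard `^` boost syntax. The field names `title` and `name`, the helper name, and the boost value of 3 are placeholders I picked for the example; the real schema fields and a sensible boost would have to be checked and tuned.

```python
import re

# Same escaping idea as for any user-supplied term: neutralise query syntax
# characters and whitespace before appending the trailing wildcard.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def weighted_prefix_query(user_input: str, title_boost: float = 3.0) -> str:
    """Prefix-match title and name, ranking title matches higher.

    'title' and 'name' are hypothetical field names, not the actual schema.
    """
    term = _SPECIAL.sub(r"\\\1", user_input)
    return f"(title:{term}*)^{title_boost:g} OR (name:{term}*)"
```

With the default boost, `weighted_prefix_query("Sand")` produces `(title:Sand*)^3 OR (name:Sand*)`, so a document whose title matches should rank above one where only a space name matches.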

I’m also not sure that the current index is really enough for searching spaces. While I outlined a possible strategy above, I’m not sure whether it is good enough. However, searching for spaces could also be treated as a separate topic that we solve later, as I think it is used much less frequently and the code is already quite separate.

Do you have any opinions on how we should continue? I think it might be a promising move to develop a first version of the search REST API that works with the existing search index, as it would be a lot simpler to deploy.

I would also suggest offering both database and Solr implementations with a configuration option to switch them in a first version, with the idea of removing the database version once the Solr version has been proven to be good enough in real use.
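The switch could look roughly like the sketch below: both implementations sit behind the same call signature and a single configuration value selects one. The property name `search.rest.backend`, the function names, and the choice of the database as the default are all assumptions for the example, not existing XWiki configuration.

```python
from typing import Callable, Dict, List, Mapping

# A backend takes the search text and returns matching document references.
SearchBackend = Callable[[str], List[str]]


def search_with_database(text: str) -> List[str]:
    # Stand-in for the existing database-based implementation.
    return [f"database:{text}"]


def search_with_solr(text: str) -> List[str]:
    # Stand-in for the new Solr-based implementation.
    return [f"solr:{text}"]


BACKENDS: Dict[str, SearchBackend] = {
    "database": search_with_database,
    "solr": search_with_solr,
}


def select_backend(config: Mapping[str, str]) -> SearchBackend:
    """Pick the implementation named by the (hypothetical) configuration key."""
    name = config.get("search.rest.backend", "database")
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown search backend: {name!r}") from None
```

Keeping the selection in one place would make it easy to drop the database entry later, once the Solr version has proven itself.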