Improve Solr indexing speed by committing less frequently

Hi everyone,

I noticed that Solr indexing is dominated by Solr committing the changes to the index (more than 90% of the indexing time is spent there). There are currently two parameters that control how frequently we commit indexed changes: solr.indexer.batch.size and solr.indexer.batch.maxLength. The first triggers a commit after a certain number of indexed documents, the second after a certain number of indexed characters. The default values are 50 documents and 10k characters. Additionally, we also commit whenever the indexing queue is empty. In regular usage, this means that when a single document is changed, we index that document and then immediately commit the index.
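For illustration, this is roughly how these settings look in xwiki.properties (a sketch based on the property names and defaults above; the commented-out values are the larger ones proposed in option 1 below):

```properties
# Commit after this many indexed documents (current default).
solr.indexer.batch.size=50
# Commit after this many indexed characters (current default, 10k).
solr.indexer.batch.maxLength=10000
# Larger values proposed in option 1 (not rigorously benchmarked):
# solr.indexer.batch.size=1000
# solr.indexer.batch.maxLength=10000000
```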

There are several ways to improve this situation:

  1. Just increase the default values. I had good results with solr.indexer.batch.size=1000 and solr.indexer.batch.maxLength=10000000, but I didn’t really benchmark these values; they are simply much larger than the current defaults. We could perform a more controlled study to see which of the two values is the critical one. With these values, the indexing time for an empty wiki with the standard flavor went down from several minutes to seconds on my laptop. I also added slightly smaller values (200 instead of 1000 documents) to this integration test, as it made the test significantly faster for me.
  2. Only perform a soft commit at the configured intervals (a soft commit resets caches and makes changes visible to search, but doesn’t flush them to disk) and enable auto commits in Solr so that the index is hard-committed after a certain time; see the sketch below. Using auto commits “is preferable to sending explicit commits from the indexing client as it offers much more control over your commit strategy”.
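To sketch the Solr side of option 2: in solrconfig.xml, a time-based hard commit could look like the snippet below. The 60-second interval is an illustrative value, not a benchmarked recommendation:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Flush pending changes to disk at most every 60 seconds.
       With openSearcher=false this flush alone doesn't make the
       changes visible to search; visibility would come from the
       soft commits sent by the indexing client. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```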

We could also combine both options. I think option 1 should be enough, so +1 for that and +0 for option 2 as it is more complex.

Thank you very much for your opinions!

+1 for 1 as it’s easy (but it’s less easy to find the right default value)

not a fan of 2 unless absolutely necessary as it feels a bit more error-prone

AFAIU the tradeoff is:

frequent commits will make new content searchable more quickly, improving the freshness of search results, but indexing performance may suffer because of the frequent commits. Less frequent commits may improve performance, but it will take longer for updates to show up in queries.

I’m trying to think of a downside of your proposed changes. On an XWiki instance where many users edit different documents at the same time, their changes will appear in the search only after all the modified documents are indexed (or the batch size / length limits are reached). This means that if you have a constant flux of wiki page updates, e.g., if you constantly import mails or issues from an issue tracker into your wiki, there is a big chance that the search results will lag behind. But maybe this is not a problem, because the indexing will be much faster and so the queue will become empty before other documents are updated.

+1 to try (1) and see how it goes.

Thanks,
Marius

To work around this problem, we could additionally configure Solr to automatically commit the index after a certain time if there are any uncommitted changes. That way, if the manual commit takes too long, e.g., because some documents take a lot of time to index, the automatic commit will make everything that has been indexed so far available in search. A sketch of what this could look like is below.
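In Solr configuration terms, one way to do this could be an auto soft commit, again in solrconfig.xml (the 10-second interval is only illustrative):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Open a new searcher at most 10 seconds after changes arrive,
       making indexed documents visible to search even if no explicit
       commit was sent yet. A soft commit doesn't flush to disk. -->
  <autoSoftCommit>
    <maxTime>10000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

A hard auto commit with openSearcher=true would also make the changes visible, at the cost of flushing them to disk each time.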
