Hi everyone,
I noticed that Solr indexing time is dominated by Solr committing the changes to the index (more than 90% of the indexing time is spent there). There are currently two parameters that control how frequently we commit indexed changes: solr.indexer.batch.size and solr.indexer.batch.maxLength. The first sets the number of indexed documents after which we commit, and the second sets the number of indexed characters after which we commit. The default values are 50 documents and 10k characters. Additionally, we commit whenever the indexing queue is empty. In regular usage, this means that when a single document is changed, we index that document and then immediately commit the index.
There are several ways to improve this situation:
- Just increase the default values. I had good results with solr.indexer.batch.size=1000 and solr.indexer.batch.maxLength=10000000, but I didn’t really benchmark these values. They are much larger than the current values. We could try performing a more controlled study to see which of the two values is actually the critical one. With these values, the indexing time went down from several minutes to seconds for an empty wiki with standard flavor on my laptop. I also added slightly smaller values (200 instead of 1000 documents) to this integration test as it made the test significantly faster for me.
- Perform only a soft commit at the configured intervals, which resets caches and makes changes available to the search but doesn’t flush changes to disk, and enable auto-commits in Solr so that the index is committed to disk after a certain time. Using auto-commits “is preferable to sending explicit commits from the indexing client as it offers much more control over your commit strategy”.
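
To give a rough feeling for how the two batch limits affect the number of commits, here is a minimal Python sketch of the commit policy as I described it above (commit when the document limit or the character limit is reached, or when the queue is empty). This is a simplified model and not the actual indexer code: the function name count_commits, the use of >= for the thresholds, and the workload of 5000 documents with 2000 characters each are assumptions made for illustration only.

```python
from collections import deque


def count_commits(doc_lengths, batch_size, batch_max_length):
    """Count commits for a queue that is completely filled up front.

    Simplified model (assumption): a commit happens as soon as the number
    of documents or the number of characters in the current batch reaches
    its limit, or when the queue has been drained.
    """
    queue = deque(doc_lengths)
    commits = 0
    docs_in_batch = 0
    chars_in_batch = 0
    while queue:
        chars_in_batch += queue.popleft()
        docs_in_batch += 1
        # Commit when either limit is reached or nothing is left to index.
        if (docs_in_batch >= batch_size
                or chars_in_batch >= batch_max_length
                or not queue):
            commits += 1
            docs_in_batch = 0
            chars_in_batch = 0
    return commits


# Hypothetical workload: 5000 documents of 2000 characters each.
workload = [2000] * 5000

# Current defaults: 50 documents / 10k characters.
default_commits = count_commits(workload, batch_size=50, batch_max_length=10_000)
# Values from the first option: 1000 documents / 10M characters.
larger_commits = count_commits(workload, batch_size=1000, batch_max_length=10_000_000)

print(default_commits, larger_commits)
```

In this made-up workload it is the character limit, not the document limit, that triggers the commits with the default values (every 5 documents), so the result depends heavily on the document sizes that are assumed. A real benchmark would still be needed to tell which of the two values matters in practice.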
We could also combine both options. I think option 1 should be enough, so +1 for that and +0 for option 2 as it is more complex.
Thank you very much for your opinions!