Hello everyone,
I’m having a problem with indexing large PDF files in XWiki 17.8.0 running on Debian Linux with Tomcat 10 and MariaDB.
I have several PDF files uploaded via the File Manager (plain text, ~300 pages each). I know it’s a lot, but I have 16 GB of RAM allocated to this VM.
Only the first ~50 pages are indexed properly, while the rest of the document is not indexed at all. Searching for content from later pages returns no results.
I checked the logs but cannot find any errors related to Solr, Tika, or PDF extraction. Re-indexing and rebuilding the Solr indexes did not change anything.
I also tried adjusting the following parameters in xwiki.properties, but with no effect:
solr.type=embedded
solr.indexer.indexAttachments=true
solr.indexer.attachment.extractContentInline=true
solr.indexer.attachment.maxContentLength=-1
solr.indexer.attachment.maxTextSize=-1
textExtractor.limit=-1
textExtractor.tika.writeLimit=-1
solr.indexer.batch.size=200
solr.indexer.batch.maxLength=10000000
solr.indexer.queue.capacity=20000
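To try to narrow this down myself, my next idea (just an assumption about how to isolate the problem, not something from the XWiki docs) is to run text extraction on the same file outside XWiki and compare output sizes; "large.pdf" below is a placeholder for one of my attachments, and tika-app.jar is the standalone Apache Tika CLI:

```shell
# Extract text from the same PDF outside XWiki and count characters,
# to see whether extraction itself stops around page 50.
# "large.pdf" is a placeholder for one of my attachments.

# pdftotext comes from the poppler-utils package on Debian.
pdftotext large.pdf - | wc -c

# Apache Tika standalone CLI (--text prints extracted plain text);
# XWiki uses Tika internally, so this should be a close comparison.
java -jar tika-app.jar --text large.pdf | wc -c
```

My thinking is: if both commands return text covering all ~300 pages, the truncation would point at the indexer configuration rather than at extraction; if Tika itself stops early, that would narrow it down to the PDF or the extractor. Does that sound like a sensible way to isolate it?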
Unfortunately, the issue persists and I am running out of ideas.
Has anyone encountered similar behavior or knows what might cause partial PDF indexing?
Any suggestions, tips, or debugging guidance would be greatly appreciated.
Thank you in advance for your help!