XWiki – issue with indexing large PDF files uploaded via File Manager

Hello everyone,

I’m having a problem with indexing large PDF files in XWiki 17.8.0 running on Debian Linux + Tomcat 10 + MariaDB.

I have several PDF files uploaded via the File Manager (plain text, ~300 pages each). I know it’s a lot, but I have 16 GB of RAM allocated to this VM.
Only the first ~50 pages are indexed properly, while the rest of the document is not indexed at all. Searching for content from later pages returns no results.

I checked the logs but I cannot find any errors related to Solr, Tika, or PDF extraction. Re-indexing and rebuilding Solr indexes did not change anything.

I also tried adjusting the following parameters in xwiki.properties, but with no effect:

solr.type=embedded
solr.indexer.indexAttachments=true
solr.indexer.attachment.extractContentInline=true
solr.indexer.attachment.maxContentLength=-1

solr.indexer.attachment.maxTextSize=-1
textExtractor.limit=-1
textExtractor.tika.writeLimit=-1

solr.indexer.batch.size=200
solr.indexer.batch.maxLength=10000000
solr.indexer.queue.capacity=20000

Unfortunately, the issue persists and I am running out of ideas.

Has anyone encountered similar behavior or knows what might cause partial PDF indexing?
Any suggestions, tips, or debugging guidance would be much appreciated.

Thank you in advance for your help!

The method we use to extract content from the PDF only returns the first 100k characters to prevent memory issues. I don’t believe there is currently any option to adjust this limit; feel free to report an issue requesting one.
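A quick back-of-the-envelope check shows why roughly the first 50 pages are searchable. The 100k-character figure is the limit described above; the characters-per-page density is my own rough assumption for dense plain text, not something from XWiki:

```python
# Rough sketch: how much of a large plain-text PDF survives a fixed
# extraction limit. 100,000 characters is the limit described above;
# ~2,000 characters per page is an assumed density for dense plain text.
EXTRACTION_LIMIT = 100_000
CHARS_PER_PAGE = 2_000

def pages_indexed(total_pages: int, chars_per_page: int = CHARS_PER_PAGE,
                  limit: int = EXTRACTION_LIMIT) -> int:
    """Number of full pages whose extracted text fits under the limit."""
    return min(total_pages, limit // chars_per_page)

print(pages_indexed(300))  # 50: consistent with the ~50 indexed pages observed
```

With those assumptions, a 300-page document yields ~600,000 characters of text, so only about the first 50 pages fit under the limit, which matches the behavior reported above.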

Hi Michael,
thanks for the clarification about the 100k‑character PDF extraction limit.

I have two short follow‑up questions:

  1. Are there any practical workarounds?

For example, do people typically split large PDFs, extract the text externally and add it to the page somehow, or use a different file type… is there another common solution?

  2. Would an external Solr instance help?

Before I set one up: does using standalone Solr change anything here, or is the limit fully inside XWiki’s extraction code and independent of Solr?

Thanks in advance for any tips!

I’m not aware of any frequently applied workarounds. You could run a Groovy macro with something like org.xwiki.tika.internal.TikaUtils.getTika().setMaxStringLength(<desiredLimit>), where <desiredLimit> is the limit you want to set (untested, but I don’t see why it wouldn’t work). However, this would need to be re-run after every restart of XWiki, before any affected page is indexed.
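For the record, wrapped in a wiki page that suggestion might look like the sketch below. It is untested, as noted above; the 1000000 value is only an example limit, and running it requires programming rights on the wiki:

```
{{groovy}}
// Untested sketch of the workaround described above: raise Tika's string
// extraction limit. Must be re-run after every XWiki restart, before the
// affected pages are (re)indexed. 1000000 is an example value.
org.xwiki.tika.internal.TikaUtils.getTika().setMaxStringLength(1000000)
{{/groovy}}
```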

No, this is independent of Solr.