Xwiki 8.4.5 - Your document contained more than 100000 characters

Hi there!

I'm using XWiki 8.4.5, and when I upload a PDF file with a large amount of text, this exception is shown:

Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How can I increase the character limit?

Any help will be appreciated.
Thanks!

Maycon

From what I can see in the code, there should not actually be any limit. You should report this issue at http://jira.xwiki.org and ideally provide a file that reproduces the error when attached.

Does that still apply? I get a similar error in XWiki 13.10.5 and just opened [XWIKI-20139] Some PDF attachements are not indexed - XWiki.org JIRA.

I don’t know Tika that well, but it’s good that you provided a document. We don’t set any kind of limit on the XWiki side, but it’s possible that Tika has a hardcoded (or just default) one.

Thanks for the issue and the file to reproduce it, so that we can debug and see whether there is something we can do on the XWiki side to work around that or whether we need to report it upstream to Tika.

I added some quick research results on the topic and would gladly try to solve the issue, but I need assistance as stated in my comment linked above.

Hi @ClemensRobbenhaar, would it be possible for us to work on fixing this issue together?

Sorry, so far I did not look into that issue. I just tried to reproduce with a development installation (XWiki 13.10.3 jetty/hsql package) and oddly could not reproduce the issue.

Does that mean that in your test environment the example file from XWIKI-20139 was indexed correctly, i.e. the last words of the content were part of the index and no error message was thrown after the upload?

Oh, I did not check the results, sorry.
First, indeed no error is shown in the logs for me, but then anything from somewhere in the middle of page 26 down is indeed missing from the search index. So the results are now cut short in a more “silent” manner.

I looked a bit more into the problem, and it seems there are several issues:

  • It seems text in italics is not indexed at all. So you have to do without “Maxwell”. Bummer!
    This seems to be a limitation/bug with Tika
  • When using a hard-wired tika.setMaxStringLength(-1); in the TikaUtils class, I can find e.g. the text “Geschwindigkeitsgradienten” from the last page of the document, but only if I delete the attachment and upload it again. Uploading a new (identical) version does not work for me, despite the fact that the CPU is quite busy for some time after the upload. That seems strange; maybe I have done something wrong.
  • Finally, to make the maxStringLength configurable, it is not possible to add a configuration source to the TikaUtils class, as it is a static class and one cannot inject the configuration there. Instead one needs to use one of the components calling the TikaUtils class (e.g. the AttachmentSolrMetadataExtractor) and inject the configuration source there. (At least I think so.) This looks somewhat odd, as the setting there will have global side effects.
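
To illustrate the last point, here is a minimal sketch of one way around the global side effect: the calling component resolves the limit from its configuration and passes it to the static helper as a parameter. This is not XWiki code. The class name, the configuration key and the plain-string "extraction" are all hypothetical stand-ins so the example runs without Tika on the classpath; in the real code the resolved value would presumably end up in tika.setMaxStringLength(...). The default of 100000 matches the limit in the exception message, and -1 is treated as "no limit", as in the hard-wired test above.

```java
import java.util.Map;

// Hypothetical sketch, not part of XWiki or Tika.
final class ExtractLimitSketch {
    // Same value as the limit reported in the exception message.
    static final int DEFAULT_LIMIT = 100_000;

    // Hypothetical configuration key, invented for this example.
    static final String KEY = "solr.attachment.maxStringLength";

    private ExtractLimitSketch() {
    }

    // What a component with an injected configuration source could do:
    // read the limit, fall back to the default if it is missing or invalid.
    static int resolveLimit(Map<String, String> config) {
        String raw = config.get(KEY);
        if (raw == null) {
            return DEFAULT_LIMIT;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_LIMIT;
        }
    }

    // Stand-in for the static helper: the limit is an argument, so no shared
    // static state changes. A negative limit means "no limit".
    static String extract(String fullText, int maxLength) {
        if (maxLength < 0 || fullText.length() <= maxLength) {
            return fullText;
        }
        return fullText.substring(0, maxLength);
    }
}
```

With this shape, each caller decides on its own limit, e.g. ExtractLimitSketch.extract(text, ExtractLimitSketch.resolveLimit(config)), and other users of the helper keep the default.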