Processing PDF attachments / PDF uploads as input with Tika

NorSch · April 15, 2023, 1:33pm

To automatically process PDF files that are uploaded as attachments, we have written two Groovy functions (see below) based on Apache Tika, which is integrated into XWiki.

Automatic processing is particularly useful when it works on the data contained in PDF forms.

Since development of these routines is still at an early stage, my first question is: Will Apache Tika remain a component of XWiki in the long run?

My second point is an idea / wish:

XWiki could provide certain standard functions for attachments, depending on the MIME type. For PDF files, these would be the routines mentioned above for querying the text content or the form data; for other types, such as image files, they could return the various metadata.
I have no problem getting the information myself (as a last resort via a shell call from within the wiki); it is more a question of XWiki's development policy, namely whether the project wants to burden the system with additional functions that then have to be maintained.

Now here is the sample code:

{{groovy}}
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.Parser
import org.apache.tika.sax.ToXMLContentHandler
import org.xml.sax.ContentHandler

// Parses the stream with Tika and returns the content as an XHTML string,
// or an error message if the content could not be extracted.
def getContentFromPDF(inputByteStream) {
  Metadata metadata = new Metadata()
  Parser parser = new AutoDetectParser()
  ContentHandler handler = new ToXMLContentHandler()
  ParseContext context = new ParseContext()
  try {
    parser.parse(inputByteStream, handler, metadata, context)
    return handler.toString()
  } catch (Exception e) {
    return "ERROR - PDF content could not be extracted"
  }
}


// Parses the stream with Tika and returns the PDF form fields as a map of name -> value.
// On failure the map contains a single "ERROR" entry with the exception message.
def getFormFromPDF(inputByteStream) {
  def result = [:]
  Metadata metadata = new Metadata()
  Parser parser = new AutoDetectParser()
  ContentHandler handler = new ToXMLContentHandler()
  ParseContext context = new ParseContext()
  try {
    parser.parse(inputByteStream, handler, metadata, context)
    // Keep only the list items between <div class="acroform"><ol> and the closing </ol>
    def formPart = handler.toString().replaceAll(/(?s)^.*?<div class="acroform"><ol>/, "")
    formPart = formPart.replaceAll(/(?s)<\/ol>.*$/, "")
    // Entries with an altName attribute: <li altName="...">name: value</li>
    def m = formPart =~ /<li altName=.*?>(.*?): (.*?)<\/li>/
    m.each { result[it[1]] = it[2] }
    // Plain entries: <li>name: value</li>
    m = formPart =~ /<li>(\S+): (.*?)<\/li>/
    m.each { result[it[1]] = it[2] }
    return result
  } catch (Exception e) {
    result["ERROR"] = e.getMessage()
    return result
  }
}


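// Example 1: render the text content of a PDF attachment

  docName="<FULL NAME OF THE DOCUMENT>"
  attName="<NAME OF PDF-ATTACHMENT>"

  dataStream1=xwiki.getDocument(docName).getAttachment(attName).getContentInputStream()

  // withStream closes the input stream once the closure has returned
  htmlLikeContent = dataStream1.withStream { getContentFromPDF(it) }

  // Output only what lies between <body> and </body>, wrapped in an HTML macro
  println "{{html clean='false'}}"+htmlLikeContent.replaceAll(/(?s)^.*?<body>/,"").replaceAll(/(?s)<\/body>.*/,"")+"{{/html}}"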

// Example 2: extract the form data and show it as a table

  docName="<FULL NAME OF THE DOCUMENT>"
  attName="<NAME OF PDF-ATTACHMENT>"

  dataStream2=xwiki.getDocument(docName).getAttachment(attName).getContentInputStream()

  // withStream closes the input stream once the closure has returned
  theForm = dataStream2.withStream { getFormFromPDF(it) }

  println "\n(%style='width:auto' %)\n|=Key|=Value"
  theForm.each { x,y -> println "|{{code}}${x}{{/code}}|{{code}}${y}{{/code}}"}

{{/groovy}}
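
To illustrate what `getFormFromPDF` does with the XHTML it gets back, here is a minimal, self-contained sketch that applies the same regex steps to a hand-written sample string, without Tika. The sample markup and the field names (`firstName`, `city`) are assumptions made up for this illustration; the real output depends on the Tika version and on the PDF.

```groovy
// Hand-written sample that imitates the form section of the XHTML output.
// This is an assumption for illustration, not captured Tika output.
def xhtml = '''<html><body><div class="page"><p>Some text</p></div>
<div class="acroform"><ol>
<li altName="First name">firstName: Ada</li>
<li>city: London</li>
</ol></div></body></html>'''

// Step 1: keep only what lies between <div class="acroform"><ol> and the closing </ol>.
def formPart = xhtml.replaceAll(/(?s)^.*?<div class="acroform"><ol>/, '')
formPart = formPart.replaceAll(/(?s)<\/ol>.*$/, '')

// Step 2: collect "key: value" pairs from both kinds of <li> entries.
def result = [:]
def m = formPart =~ /<li altName=.*?>(.*?): (.*?)<\/li>/
m.each { result[it[1]] = it[2] }   // it[0] is the whole match, it[1] and it[2] are the groups
m = formPart =~ /<li>(\S+): (.*?)<\/li>/
m.each { result[it[1]] = it[2] }

assert result == [firstName: 'Ada', city: 'London']
```

Note that `.` does not match line breaks in the two `<li>` patterns (they have no `(?s)` flag), so a field whose value spans several lines would be skipped by this approach.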

vmassol · May 2, 2023, 1:11pm

This cannot be guaranteed (it's something internal), but for the moment I don't know of any plan to move away from it.

Even if it were removed, you could still have a dependency on Tika in your extensions.