To process PDF files uploaded as attachments automatically, we have written two macros based on Apache Tika, which is integrated into XWiki (see below).
Automatic processing is particularly useful when it draws on the data contained in PDF forms.
Since development of these routines is still at an early stage, my first question is: will Apache Tika remain a component of XWiki in the long run?
My second point is more of an idea / wish:
XWiki could provide certain standard functions for attachments, depending on the MIME type. For PDF files, these would be the routines mentioned above for querying the text content or the form data; for other types, such as image files, they could return the various metadata.
I have no problem getting at the information myself (as a last resort via a shell call). It is more a question of XWiki's development policy: whether one wants to burden the system with such additional functions, which then have to be maintained.
Now here is the sample code:
{{groovy}}
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.Parser
import org.apache.tika.sax.ToXMLContentHandler
import org.xml.sax.ContentHandler

// Parse a PDF input stream with Tika and return its content as an XHTML string.
def getContentFromPDF(inputByteStream) {
  Metadata metadata = new Metadata()
  Parser parser = new AutoDetectParser()
  // ToXMLContentHandler serializes Tika's SAX events into XHTML markup.
  ContentHandler handler = new ToXMLContentHandler()
  ParseContext thisContext = new ParseContext()
  try {
    parser.parse(inputByteStream, handler, metadata, thisContext)
    return handler.toString()
  } catch (Exception e) {
    return "ERROR - PDF-content could not be extracted"
  }
}

// Parse a PDF input stream and return its form fields as a map of field name to value.
// On failure the map contains a single "ERROR" entry with the exception message.
def getFormFromPDF(inputByteStream) {
  def result = [:]
  Metadata metadata = new Metadata()
  Parser parser = new AutoDetectParser()
  ContentHandler handler = new ToXMLContentHandler()
  ParseContext thisContext = new ParseContext()
  try {
    parser.parse(inputByteStream, handler, metadata, thisContext)
    // Keep only the list items inside <div class="acroform"><ol> ... </ol>.
    def textContent = handler.toString()
    textContent = textContent.replaceAll(/(?s)^.*?<div class="acroform"><ol>/, "").replaceAll(/(?s)<\/ol>.*$/, "")
    // Items that carry an altName attribute: <li altName="...">key: value</li>
    def m = textContent =~ /<li altName=.*?>(.*?): (.*?)<\/li>/
    m.each { result[it[1]] = it[2] }
    // Items without attributes: <li>key: value</li>
    m = textContent =~ /<li>(\S+): (.*?)<\/li>/
    m.each { result[it[1]] = it[2] }
    return result
  } catch (Exception e) {
    result["ERROR"] = e.getMessage()
    return result
  }
}

// Example 1: show the text content of a PDF attachment
docName = "<FULL NAME OF THE DOCUMENT>"
attName = "<NAME OF PDF-ATTACHMENT>"
// withStream closes the attachment stream once parsing is done
htmlLikeContent = xwiki.getDocument(docName).getAttachment(attName).getContentInputStream().withStream { getContentFromPDF(it) }
// Pass only what is inside <body> ... </body> to the HTML macro
println "{{html clean='false'}}" + htmlLikeContent.replaceAll(/(?s)^.*?<body>/, "").replaceAll(/(?s)<\/body>.*/, "") + "{{/html}}"

// Example 2: extract the form data and show it as a key/value table
docName = "<FULL NAME OF THE DOCUMENT>"
attName = "<NAME OF PDF-ATTACHMENT>"
theForm = xwiki.getDocument(docName).getAttachment(attName).getContentInputStream().withStream { getFormFromPDF(it) }
println "\n(%style='width:auto' %)\n|=Key|=Value"
theForm.each { key, value -> println "|{{code}}${key}{{/code}}|{{code}}${value}{{/code}}" }
{{/groovy}}
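
To make the form extraction easier to follow, here is a minimal sketch of the regular-expression step on its own, without Tika. The sample XHTML is an assumption: it is only shaped after what the expressions in the macro expect, not copied from real Tika output. The helper name `parseAcroformItems` is hypothetical. Unlike the macro, this variant uses a single pattern for list items with and without attributes, so the key is not restricted to non-whitespace characters.

```groovy
// Hypothetical helper (not part of the macro above): collects "key: value"
// pairs from the list items of an acroform block in one pass.
def parseAcroformItems(String xhtml) {
  def fields = [:]
  // Keep only what lies between <div class="acroform"><ol> and the first </ol>.
  def inner = xhtml.replaceAll(/(?s)^.*?<div class="acroform"><ol>/, "").replaceAll(/(?s)<\/ol>.*$/, "")
  // One pattern for <li> with or without attributes; group 1 is the key, group 2 the value.
  def m = inner =~ /<li(?:\s[^>]*)?>(.*?): (.*?)<\/li>/
  m.each { fields[it[1]] = it[2] }
  return fields
}

// Assumed sample input, for illustration only.
def sample = '''<body><div class="acroform"><ol>
<li altName="Full name">name: Jane Doe</li>
<li>city: Berlin</li>
</ol></div></body>'''

println parseAcroformItems(sample)
```

Field values that themselves contain `</li>` or markup would need a real XML parser instead; the sketch only covers the simple case that the macro also handles.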