I’m proud to announce the release of the first stable version of the Tesseract OCR Application. This extension provides an easy way to import graphical documents, such as images or scanned PDFs, into wiki pages using Optical Character Recognition (OCR) and the FOSS Tesseract library.
I would also like to thank @tmortagne for his help on the extension design.
If you try the extension, any kind of feedback would be great! The application is compatible with the latest LTS and should also work with XWiki versions above 8.4.4.
A little suggestion, if I may (and apologies if this functionality already exists). It would be great if this extension could automatically extract text from images in wiki pages on save and add that text to the image meta tags. This would add extra value to pages and potentially allow users to find and reuse images.
I’ve run a couple of upload tests and I’m not getting very good results: almost every word has multiple errors.
That’s quite strange. Well, I guess it could be explained if you are using handwritten documents. FTR, I ran my tests with LaTeX documents and this image; most of the time I get an error rate of about 10 to 20%, but not more.
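To compare results on the same footing, the error rate can be measured rather than estimated by eye. Here is a minimal sketch, assuming you have a hand-typed reference transcription of the document and the text produced by the import. The function names and sample strings are hypothetical and are not part of the extension; this is one common way to compute a word error rate, not necessarily how the figures above were obtained.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, recognized):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, recognized.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical sample: one wrong word out of five.
    reference = "the quick brown fox jumps"
    recognized = "the quick brown fax jumps"
    print(f"{word_error_rate(reference, recognized):.0%}")
```

Passing the two strings directly to `edit_distance` (instead of their word lists) gives a character-level count, which is usually more forgiving than the word-level rate.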
If you get poor results, that might be due to Tesseract not being able to properly use its training data files. The screenshot you gave of the empty data store is really strange too; could you open an issue with some information about your setup so that I can try to reproduce the bug? Thanks.