Tesseract OCR Application v1.1 released

caubin · February 18, 2018, 2:17pm

Hello everyone!

I’m proud to announce the release of the first stable version of the Tesseract OCR Application. This extension is meant to provide an easy way to import graphical documents, such as images or scanned PDFs, into wiki pages using Optical Character Recognition (OCR) and the FOSS Tesseract library.

I would also like to thank @tmortagne for his help with the extension design.

If you try the extension, any kind of feedback would be great! The application is compatible with the latest LTS and should also work with XWiki versions above 8.4.4.

For more information on how to use the extension, how to report an issue, or how to contribute to its development or translations, please check out the extension page on extensions.xwiki.org: http://extensions.xwiki.org/xwiki/bin/view/Extension/Tesseract%20OCR%20Application

vmassol · February 18, 2018, 5:00pm

Awesome work, @caubin!

If I may suggest something: you could include some screenshots of this extension in action on http://extensions.xwiki.org/xwiki/bin/view/Extension/Tesseract%20OCR%20Application. That would make the page and the extension even more appealing to try out!

Thanks again for your contribution!

caubin · February 19, 2018, 12:19pm

Indeed, @vmassol; I’ll do it in the next few days!

ben.megson · February 20, 2018, 8:56am

@caubin,

I really like this capability! Good work.

A little suggestion if I may (and apologies if this is already existing functionality). It would be great if this extension could automatically extract text from images in wiki pages on save and add the content to image metatags. This would add extra value to pages and potentially allow users to find and recycle images.

Thanks again,
Ben

caubin · February 22, 2018, 11:16am

Thanks for the suggestion @ben.megson! I’ve created OCR-16 for your feature proposal.

ben.megson · February 22, 2018, 2:39pm

@caubin,

I’ve run a couple of upload tests and I’m not getting very good results: almost every word has multiple errors.

I used very clear pictures of digital word documents, so I would expect reasonably good accuracy.

What’s your experience?

ben.megson · February 22, 2018, 2:47pm

Also,

I can’t see any sources in the Data Store panel; however, the eng.traineddata file does seem to be present on my server.

[Screenshot: Capture111]

XWiki 9.5.1, MySQL, Tomcat

Any suggestions?
Ben

caubin · February 23, 2018, 4:39pm

“I’ve run a couple of upload tests and I’m not getting very good results. I.e. almost every word has multiple errors.”

That’s quite strange. Well, I guess it could be explained if you were using handwritten documents. For the record, I ran my tests with LaTeX documents and this image; most of the time, I get an error rate of about 10 to 20%, but no more.

If you get poor results, that might be due to Tesseract not being able to use its training data files properly. The screenshot you posted of the empty data store is really strange too; could you open an issue with some information about your setup so that I can try to reproduce the bug? Thanks.
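
For anyone who wants to put a number on OCR quality when reporting results, here is a minimal sketch of a word error rate calculation in Python. It is only an illustration: the function name `word_error_rate` and the sample strings are hypothetical, the code is not part of the extension, and it is not necessarily how the 10 to 20% figure above was measured.

```python
def word_error_rate(reference: str, recognized: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the known-correct text; `recognized` is the OCR output.
    Both names are illustrative and do not come from the extension.
    """
    ref = reference.split()
    hyp = recognized.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # previous[j] = edit distance between the reference words processed
    # so far and the first j recognized words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from the output
                current[j - 1] + 1,      # extra word in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    expected = "the quick brown fox jumps"
    ocr_output = "the quiok brown fox jumps"
    print(word_error_rate(expected, ocr_output))  # prints 0.2
```

One misread word out of five gives 0.2, i.e. 20%. Comparing whole words is a deliberate simplification; a character-level variant of the same idea would be less harsh on words with a single wrong letter.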

ben.megson · February 28, 2018, 9:03am

Thanks @caubin. Will do.

I’ll also try your image a little later.

ben.megson · March 3, 2018, 1:12pm

OK, I have performed some more tests and Tesseract performs well. The problems I described earlier were probably a fluke or caused by the font type.