Tesseract OCR Application v1.1 released

Hello everyone!

I’m proud to announce the release of the first stable version of the Tesseract OCR Application. This extension provides an easy-to-use way to import graphical documents, such as images or scanned PDFs, into wiki pages using Optical Character Recognition (OCR) and the FOSS Tesseract library.

I would also like to thank @tmortagne for his help on the extension design :wink: .

If you try the extension, any kind of feedback would be great! The application is compatible with the latest LTS and should also work with XWiki versions above 8.4.4.

For more information on how to use the extension, report an issue, or contribute to its development or translations, please check out the extension page on extensions.xwiki.org: http://extensions.xwiki.org/xwiki/bin/view/Extension/Tesseract%20OCR%20Application

Awesome work @caubin ! :slight_smile:

If I may suggest something: you could include some screenshots of this extension in action on http://extensions.xwiki.org/xwiki/bin/view/Extension/Tesseract%20OCR%20Application That would make the page and the extension even more appealing to try out!

Thanks again for your contribution!

Indeed @vmassol; I’ll do it in the next few days!

@caubin,

I really like this capability! Good work.

A little suggestion if I may (and apologies if this is already existing functionality). It would be great if this extension could automatically extract text from images in wiki pages on save and add the content to image metatags. This would add extra value to pages and potentially allow users to find and recycle images.

Thanks again,
Ben

Thanks for the suggestion @ben.megson! I’ve created OCR-16 for your feature proposal.

@caubin,

I’ve run a couple of upload tests and I’m not getting very good results. I.e. almost every word has multiple errors.

I used very clear pictures of digital Word documents, so I would expect reasonably good accuracy.

What’s your experience?

Also,

I can’t see any sources in the Data Store panel; however, the eng.traineddata file does seem to be present on my server.

Capture111

XWiki 9.5.1, MySQL, Tomcat

Any suggestions?
Ben

> I’ve run a couple of upload tests and I’m not getting very good results. I.e. almost every word has multiple errors.

That’s quite strange. Well, I guess it could be explained if you were using handwritten documents. FTR, I ran my tests with LaTeX documents and this image; most of the time I get an error rate of about 10 to 20%, but not more.
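
To put a number on "almost every word has multiple errors" versus "10 to 20%", one option is to compare the OCR output against a known-correct transcription. The snippet below is only a minimal sketch: it assumes the error figure is counted per word (the thread does not say whether words or characters were measured), and the function names and sample strings are made up for illustration. It is not part of the extension.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of insertions, deletions and substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, ocr_output):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical sample: one misread word out of ten.
reference = "the quick brown fox jumps over the lazy dog again"
ocr_output = "the quick brovvn fox jumps over the lazy dog again"
print(f"{word_error_rate(reference, ocr_output):.0%}")  # prints "10%"
```

Passing strings instead of word lists to `edit_distance` gives a character-level count, which is usually more forgiving than the word-level rate when only one or two letters per word are misread.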

If you get poor results, that might be due to Tesseract not being able to properly use its training data files. The screenshot you posted of the empty data store is really strange too; could you open an issue on http://jira.xwiki.org/browse/OCR with some information about your setup so that I can try to reproduce the bug? Thanks :slight_smile:
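
When gathering setup details for such an issue, it may help to list which training data files are actually present and non-empty on the server. The following is a rough sketch, not something the extension provides: the `installed_languages` helper is hypothetical, the directory path is a placeholder, and treating a zero-byte file as unusable is an assumption (for example, after an interrupted download).

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes of non-empty *.traineddata files
    found directly inside tessdata_dir (empty list if it is not a directory)."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(
        path.stem  # "eng.traineddata" -> "eng"
        for path in directory.glob("*.traineddata")
        if path.is_file() and path.stat().st_size > 0
    )


# Placeholder path: replace with the directory where the files are stored.
print(installed_languages("/path/to/tessdata"))
```

If `eng` is missing from the result even though the file appears to exist, its size and read permissions are worth mentioning in the issue report.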

Thanks @caubin. Will do.

I’ll also try your image a little later.

OK, I have performed some more tests and Tesseract performs well. The previously described problems were probably a fluke or caused by the font type.