OCR With Special Characters
Training Tesseract to recognise new characters and fonts
I was recently faced with the challenge of using Optical Character Recognition (OCR) to convert an old book to an editable text format. This is not difficult with English and other major languages, but the challenge comes when the document uses a minority language with a different alphabet to those recognised by OCR programs. Take the "ŋ" character in the image in this article - with English-based OCR, "ŋ" may come out as "n" or "g" or "u" or some other character, but never as "ŋ", as that character does not exist in the language file. The OCR gets completely lost and the resulting text requires extensive editing.
The easiest solution is to find a language in the OCR program that uses these characters - Portuguese is good for ô and ê, but it is not easy to find one that also includes ŋ, which is extensively used in some minority languages. A better solution is to train the OCR program to recognise the new characters. The training process takes some time, but improves the results many times over.
Tesseract is one of the more accurate and adaptable OCR systems. It is also open source and free to use. It was originally developed by HP and more recently improved, with significant input from Google. It can be used directly from the command line, or with user-friendly front ends.
To test the training with Tesseract, I created training data using a single scanned page from the book on which I need to run the OCR. As the training data only used this single source, we wouldn't expect perfect results with source documents that use a different font. But it worked quite well for this single case.
OCR recognises the character shapes and is assisted by a language dictionary, if one is present. Its graphical recognition is only as good as the range of font types used to train the program. When training Tesseract for new characters, additional training texts, one for each font, can be generated by a word processing program. These are then combined in the training process - see the Tesseract training documentation for details.
Recognition of English and other major language texts is excellent, as good as the original image allows. This is a key point in OCR - first make sure you have a good image (tiff, png, jpg, bmp, pdf) with clear text, 10-12px, and a scan/print resolution of 300dpi.
With a good image, OCR can be near perfect. The following illustration from "gImageReader" (in Ubuntu) shows a 300dpi source image in the centre and the recognised text on the right hand side. This example used the Portuguese language (without a dictionary) plus the special characters used in the Jabêm language, added as a result of this Tesseract training test.
The following steps outline how I trained Tesseract (on Ubuntu) using a single image/font to generate training data. See the Tesseract wiki for the full details on training Tesseract.
Each Linux command is typed on a single line; the "mv" commands in step 9 are entered separately.

1. Create the source image with the training data.
   Scan and process the page, then rename the image.
   Output: jab.gentium.exp0.jpg

2. Create the "box" file (a text image map with the recognised characters).
   Command: tesseract -l por jab.gentium.exp0.jpg jab.gentium.exp0 batch.nochop makebox
   Output: jab.gentium.exp0.box

3. Edit the box file to correct the character recognition.
   Command: python moshpytt.py
   Check and correct the character recognition, then resave the box file.

4. Create the base training file from the image and box files. If errors show, go back to the previous step.
   Command: tesseract jab.gentium.exp0.jpg jab.gentium.box nobatch box.train
   Output: jab.gentium.box.tr

5. Extract the list of UTF-8 characters in the training sample.
   Command: unicharset_extractor jab.gentium.exp0.box
   Output: unicharset

6. Create the list of fonts and their properties.
   Command: echo jab.gentium.box 0 0 0 1 0 > font_properties
   Output: font_properties

7. Run mftraining.
   Command: mftraining -F font_properties -U unicharset -O jab.unicharset jab.gentium.box.tr
   Output: shapetable, inttemp, pffmtable, jab.unicharset

8. Run cntraining to output the character normalisation sensitivity prototypes file.
   Command: cntraining *.tr
   Output: normproto

9. Rename the output files to add the language prefix "jab." (the dot is needed).
   mv inttemp jab.inttemp
   mv normproto jab.normproto
   mv pffmtable jab.pffmtable
   mv shapetable jab.shapetable
   Output: jab.inttemp, jab.normproto, jab.pffmtable, jab.shapetable

10. Create the consolidated training data file.
    Command: combine_tessdata jab.
    Output: jab.traineddata

11. Move the training data file to the "tessdata" folder under the "tesseract-ocr" installation folder.
    Command: sudo mv jab.traineddata /usr/share/tesseract-ocr/tessdata/
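For reference, each line of the box file produced in step 2 describes one character and its position in the image, in the form "character left bottom right top page", with pixel coordinates measured from the bottom left. The snippet below is only a sketch with made-up coordinates and file names, but it shows the format and a quick way to spot malformed lines before training:

```shell
# Each box-file line: <character> <left> <bottom> <right> <top> <page>
# Coordinates are in pixels, measured from the bottom-left of the image.
# (sample.box and the coordinates are illustrative, not real training data)
cat > sample.box <<'EOF'
ŋ 12 30 25 48 0
ê 28 30 40 52 0
m 42 30 60 48 0
EOF

# Sanity check: every line should have exactly six fields
awk 'NF != 6 { bad++ } END { print (bad ? bad : 0) " malformed lines" }' sample.box
# prints "0 malformed lines"
```

A line with the wrong number of fields usually means a character was mis-typed while editing the box file, and will cause errors in the box.train step.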
Notes
- It is best to keep all files in a single folder
- Image file naming convention: [language name].[font name].exp[number].[file extension]
- Do not mix fonts in the training file - use separate images for different fonts and for italic, bold etc. It is best if more than ten examples of each character are included in the image.
- For this image, I used "tesseract -l por" when creating the initial box file. This tells the Tesseract command line to use the Portuguese language - used in this case because it is better at recognising ê and ô, and so requires less correction in the next step.
- Use ASCII or UTF-8 encoding without a BOM, with "LF" (not "CR/LF") as the end-of-line marker, for all text files that you are using. Note that Windows uses "CR/LF" by default.
- Font properties file format (one line per font):

  fontname         italic  bold  monospace  serif  fraktur
  jab.gentium.box  0       0     0          1      0
- Moshpytt is a Python-based box-file editor that helps correct the character recognition. Download it from http://code.google.com/p/moshpytt/ Other graphical box-file editors are also available; see the Tesseract wiki for details.
This screenshot shows Moshpytt in action. As you move down the list on the right (the contents of the box file), each character is highlighted in the image and Tesseract's interpretation of it is shown in red. Wrong characters can be corrected in the same table.
- While I have used Linux here, Tesseract OCR and training can also be done on Windows.
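The image-naming convention in the notes can be checked with a simple shell test. This is only a sketch - "check_name" is an illustrative helper, not a Tesseract tool:

```shell
# Check a file name against [language].[font].exp[number].[extension]
# (check_name is an illustrative helper, not part of Tesseract)
check_name() {
  case "$1" in
    *.*.exp[0-9]*.*) echo "ok: $1" ;;
    *)               echo "bad: $1" ;;
  esac
}

check_name jab.gentium.exp0.jpg   # prints "ok: jab.gentium.exp0.jpg"
check_name jab-gentium-0.jpg      # prints "bad: jab-gentium-0.jpg"
```

Getting the name right matters because the training tools parse the language and font name out of the file name itself.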
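If a text file has picked up a UTF-8 BOM or Windows CR/LF line endings (for example after editing on Windows), it can be cleaned up with standard command-line tools. A sketch, assuming a font_properties file as used in the steps above - the first printf only simulates the damaged file:

```shell
# Simulate a file saved on Windows: a UTF-8 BOM (EF BB BF) plus CR/LF endings
printf '\357\273\277jab.gentium.box 0 0 0 1 0\r\n' > font_properties

# Remove the BOM if the file starts with one (drop the first three bytes)
if [ "$(head -c 3 font_properties | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
  tail -c +4 font_properties > font_properties.tmp
  mv font_properties.tmp font_properties
fi

# Convert CR/LF to LF by deleting the CR characters
tr -d '\r' < font_properties > font_properties.tmp
mv font_properties.tmp font_properties

cat font_properties   # prints "jab.gentium.box 0 0 0 1 0"
```

The same cleanup works for the box file or any other text file in the training folder.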