I was recently faced with the challenge of using Optical Character Recognition (OCR) to convert an old book to an editable text format. This is not difficult with English and other major languages, but the challenge comes when the document uses a minority language with an alphabet different from those recognised by OCR programs. Take the "ŋ" character in the image in this article - with English-based OCR, "ŋ" may show as "n", "g", "u" or something else, but never as "ŋ", because that character does not exist in the language file. The OCR gets completely lost and the resulting text requires much editing.
The easiest solution is to find a language in the OCR program that uses these characters - Portuguese is good for ô and ê, but it is not easy to find one that also includes ŋ, which is used extensively in some minority languages. A better solution is to train the OCR program to recognise the new characters. The training process takes some time, but improves the results many times over.
Tesseract is one of the more accurate and adaptable OCR systems. It is also open source and free to use. It was originally developed by HP and has more recently been improved with significant input from Google. It can be used directly from the command line, or with user-friendly front ends.
To test the training with Tesseract, I created training data from a single scanned page of the book on which I need to run the OCR. As the training data came from this single source, we wouldn't expect perfect results on source documents that use a different font, but it worked quite well for this single case.
OCR recognises the character graphics and is assisted by a language dictionary, if one is present. Its graphical recognition is only as good as the range of font types used to train the program. When training Tesseract for new characters, additional training texts, one for each font, can be generated with a word processing program. These are then combined in the training process - see the Tesseract training documentation for details.
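Each extra font gets its own training image, named with the convention described in the notes below, and its own box file. As a sketch, the per-font "makebox" commands could be generated in a loop - the font names other than gentium are placeholders here, not fonts I actually trained:

```shell
# Dry run: print the per-font "makebox" commands rather than executing them,
# since each font needs its own scanned or printed training image.
# Font names other than gentium are placeholders.
cmds=$(for font in gentium charis doulos; do
    echo "tesseract jab.$font.exp0.jpg jab.$font.exp0 batch.nochop makebox"
done)
echo "$cmds"
```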
Recognition of English and other major language texts is excellent - as good as the original image allows. This is a key point in OCR: first make sure you have a good image (tiff, png, jpg, bmp, pdf) with clear text, around 10 - 12px, scanned or printed at 300dpi.
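A rough way to check that a scan really is 300dpi is to compare its pixel dimensions against the physical page size. For example, an A4 page at 300dpi works out as follows:

```shell
# Expected pixel size of an A4 page (8.27in x 11.69in) scanned at 300dpi.
# Page dimensions are held in hundredths of an inch for integer arithmetic.
dpi=300
width_px=$(( 827 * dpi / 100 ))    # 8.27in  -> 2481px
height_px=$(( 1169 * dpi / 100 ))  # 11.69in -> 3507px
echo "expect roughly ${width_px} x ${height_px} pixels"
```

If the scanned image is much smaller than this, it was probably scanned at a lower resolution and recognition will suffer.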
With a good image, OCR can be near perfect. The following illustration from "gImageReader" (in Ubuntu) shows a 300dpi source image in the centre and the recognised text on the right-hand side. This example used the Portuguese language (without a dictionary) plus the special characters used in the Jabêm language, created as a result of this Tesseract training test.
The following steps outline the process I used to train Tesseract (on Ubuntu), using a single image / font to generate training data. See the Tesseract wiki for the full details on training Tesseract. Each Linux command is typed on a single line (the "mv" commands in step 9 are separate commands).

1. Create source image with training data
Command: scan and process the page, then rename the image file
Output: jab.gentium.exp0.jpg

2. Create "box" file (a text image map with recognised characters)
Command: tesseract -l por jab.gentium.exp0.jpg jab.gentium.exp0 batch.nochop makebox
Output: jab.gentium.exp0.box

3. Edit box file to correct character recognition
Command: check and correct the character recognition, then resave the box file

4. Create base training file from image and box files (if errors show, go back to the previous step)
Command: tesseract jab.gentium.exp0.jpg jab.gentium.box nobatch box.train
Output: jab.gentium.box.tr

5. Extract list of UTF characters in the training sample
Command: unicharset_extractor jab.gentium.exp0.box
Output: unicharset

6. Create list of fonts and their properties
Command: echo jab.gentium.box 0 0 0 1 0 > font_properties
Output: font_properties

7. Run mftraining to cluster the character shapes and features
Command: mftraining -F font_properties -U unicharset -O jab.unicharset jab.gentium.box.tr
Output: shapetable, inttemp, pffmtable, jab.unicharset

8. Run cntraining to output the character normalization sensitivity prototypes file
Command: cntraining jab.gentium.box.tr
Output: normproto

9. Rename the output files to add the language prefix "jab." The dot is needed.
Commands:
mv inttemp jab.inttemp
mv normproto jab.normproto
mv pffmtable jab.pffmtable
mv shapetable jab.shapetable
Output: jab.inttemp, jab.normproto, jab.pffmtable, jab.shapetable

10. Create the consolidated training data file
Command: combine_tessdata jab.
Output: jab.traineddata

11. Move the training data file to the "tessdata" folder under the "tesseract-ocr" installation folder
Command: sudo mv jab.traineddata /usr/share/tesseract-ocr/tessdata/
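The steps above can be collected into one script. This is only a sketch of my particular run - the filenames and the "por" starting language are specific to this test - and it is a dry run by default, printing each command rather than executing it:

```shell
# Sketch of the whole training sequence, mirroring the steps above.
# Dry run by default: commands are printed, not executed. Set RUN=1 to run
# them for real (tesseract and its training tools must be installed).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run tesseract -l por jab.gentium.exp0.jpg jab.gentium.exp0 batch.nochop makebox
# ...check and correct jab.gentium.exp0.box by hand at this point...
run tesseract jab.gentium.exp0.jpg jab.gentium.box nobatch box.train
run unicharset_extractor jab.gentium.exp0.box
run sh -c 'echo "jab.gentium.box 0 0 0 1 0" > font_properties'
run mftraining -F font_properties -U unicharset -O jab.unicharset jab.gentium.box.tr
run cntraining jab.gentium.box.tr
for f in inttemp normproto pffmtable shapetable; do
    run mv "$f" "jab.$f"
done
run combine_tessdata jab.
run sudo mv jab.traineddata /usr/share/tesseract-ocr/tessdata/
```

Once jab.traineddata is in the tessdata folder, the new language can be selected in the usual way, e.g. tesseract -l jab page.jpg output.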
- Best to keep all files in a single folder
- Image file naming convention: [language name].[font name].exp[number].[file extension]
- Do not mix fonts in the training file - use separate images for different fonts and for italic, bold etc. Best if more than ten of each character are included in the image.
- For this image, I used "tesseract -l por" when creating the initial "box" file. This tells the Tesseract command line to use the Portuguese language - used in this case because it is better at recognising ê and ô, and so requires less correction in the next step.
- Use ASCII or UTF-8 encoding without a BOM, with "LF" rather than "CR/LF" line endings, for all text files in the training process. Note that Windows uses "CR/LF" by default.
- Font properties: each line of the font_properties file gives the font name followed by five 0/1 flags - italic, bold, fixed-pitch, serif and fraktur - so "jab.gentium.box 0 0 0 1 0" marks the font as a serif face.
- Moshpytt is a Python-based "box file" editor that helps with correcting the character recognition. Download it from http://code.google.com/p/moshpytt/ Other graphical box-file editors are also available; see the Tesseract wiki for details.
This screenshot shows Moshpytt in action. As you move down the list on the right (the contents of the "box" file), each character is highlighted in the image and Tesseract's interpretation of it is shown in red. Wrong characters can be corrected in the table, and the whole file needs to be checked and corrected wherever errors show.
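Whether you edit the box file with Moshpytt or a plain text editor, each line describes one glyph: the character itself, then the left, bottom, right and top pixel coordinates of its bounding box, then the page number. A quick sanity check that no line has been mangled while editing:

```shell
# Write a two-line sample box file (coordinates here are made up) and check
# that every line has the six expected fields:
# glyph, left, bottom, right, top, page.
printf '%s\n' 'ŋ 120 40 150 80 0' 'ê 160 40 188 86 0' > sample.box
if awk 'NF != 6 { bad = 1 } END { exit bad }' sample.box; then
    echo "box file looks well formed"
else
    echo "box file has a malformed line"
fi
```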
- While I have used Linux here, Tesseract OCR and training can also be done in Windows.
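On the line-endings note above: a box file that has been edited and saved on Windows can be normalised back to BOM-free, LF-only form. This sketch assumes GNU sed (as on Ubuntu):

```shell
# Simulate a box file saved on Windows: UTF-8 BOM plus CR/LF line endings
# (\xc5\x8b is the UTF-8 encoding of the ŋ character).
printf '\xef\xbb\xbf\xc5\x8b 120 40 150 80 0\r\n' > win.box

sed -i '1s/^\xef\xbb\xbf//' win.box   # strip the UTF-8 BOM (GNU sed syntax)
sed -i 's/\r$//' win.box              # CR/LF -> LF
```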