OCR With Special Characters

Training Tesseract to recognise new characters and fonts


Training Tesseract

I was recently faced with the challenge of using Optical Character Recognition (OCR) to convert an old book to an editable text format. This is not difficult with English and other major languages, but the challenge comes when the document uses a minority language whose alphabet differs from those recognised by OCR programs. Take the "ŋ" character in the image in this article: with English-based OCR, "ŋ" can show as "n", "g", "u" or something else, but never as "ŋ", because that character does not exist in the language file. The OCR gets completely lost and the resulting text requires much editing.

 

The easiest solution is to find a language in the OCR program that uses these characters. Portuguese is good for ô and ê, but it is not easy to find one that also includes ŋ, which is used extensively in some minority languages. A better solution is to train the OCR program to recognise the new characters. The training process takes some time, but improves the results many times over.

Tesseract is one of the more accurate and adaptable OCR systems. It is also open source and free to use. It was originally developed by HP and has more recently been improved, with significant input from Google. It can be used directly from the command line, or with user-friendly front ends.

To test the training with Tesseract, I created training data using a single scanned page from the book on which I need to run the OCR. As the training data only used this single source, we wouldn't expect perfect results with source documents that use a different font. But it worked quite well for this single case.

OCR recognises the character graphics and is assisted by a language dictionary, if present. Its graphical recognition is only as good as the range of font types used to train the program. When training Tesseract for new characters, additional training texts, one for each font, can be generated by a word processing program. These are then combined in the training process - see the Tesseract training documentation for details.

Recognition of English and other major language texts is excellent, as good as the original image allows. This is a key point in OCR: first make sure you have a good image (tiff, png, jpg, bmp, pdf) with clear text of about 10-12 pt, scanned or printed at a resolution of 300 dpi.
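As a rough check on what these numbers mean in practice, the short sketch below converts a text size to an approximate pixel height at a given scan resolution. It assumes the text size is measured in typographic points (1 pt = 1/72 inch); the function name is only illustrative.

```python
def text_height_px(size_pt: float, dpi: int = 300) -> float:
    """Approximate pixel height of text of a given point size at a scan resolution."""
    # Assumption: one typographic point is 1/72 of an inch.
    return size_pt * dpi / 72


for size in (10, 12):
    print(f"{size} pt at 300 dpi is about {text_height_px(size):.0f} px tall")
```

Under that assumption, the same text scanned at 150 dpi would be only half as tall in pixels, which is one reason low-resolution scans are harder to recognise.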

With a good image, OCR can be near perfect. The following illustration from "gImageReader" (in Ubuntu) shows a 300dpi source image in the centre and the recognised text on the right hand side. This example used the Portuguese language (without a dictionary) plus the added special characters used in the Jabêm language, created as a result of this Tesseract training test.

gImageReader

The following steps outline how I trained Tesseract (on Ubuntu) using a single image / font to generate training data. Each command is typed on a single line, except for the "mv" commands, which are entered separately. See the Tesseract wiki for the full details on training Tesseract.

Action: Create source image with training data
Linux command: (none) Scan and process, rename image
Result (files output): jab.gentium.exp0.jpg

Action: Create "box" file (a text image map with recognised characters)
Linux command: tesseract -l por jab.gentium.exp0.jpg jab.gentium.exp0 batch.nochop makebox
Result (files output): jab.gentium.exp0.box

Action: Edit box file to correct character recognition
Linux command: python moshpytt.py
Result (files output): Check and correct character recognition, then resave box file

Action: Create base training file from image and box files. If errors show, go back to previous step
Linux command: tesseract jab.gentium.exp0.jpg jab.gentium.box nobatch box.train
Result (files output): jab.gentium.box.tr

Action: Extract list of UTF characters in the training sample
Linux command: unicharset_extractor jab.gentium.exp0.box
Result (files output): unicharset

Action: Create list of fonts and their properties
Linux command: echo jab.gentium.box 0 0 0 1 0 > font_properties
Result (files output): font_properties

Action: mftraining
Linux command: mftraining -F font_properties -U unicharset -O jab.unicharset jab.gentium.box.tr
Result (files output): shapetable, inttemp, pffmtable, jab.unicharset

Action: cntraining - output the character normalization sensitivity prototypes file
Linux command: cntraining *.tr
Result (files output): normproto

Action: Rename output files to add language prefix "jab." The dot is needed.
Linux commands:
mv inttemp jab.inttemp
mv normproto jab.normproto
mv pffmtable jab.pffmtable
mv shapetable jab.shapetable
Result (files output): jab.inttemp, jab.normproto, jab.pffmtable, jab.shapetable

Action: Create the consolidated training data file
Linux command: combine_tessdata jab.
Result (files output): jab.traineddata

Action: Move the training data file to the "tessdata" folder under the "tesseract-ocr" installation folder
Linux command: sudo mv jab.traineddata /usr/share/tesseract-ocr/tessdata/
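The command sequence above can also be written down as a small Python sketch, which is handy when repeating the process for another language or font. This is only an illustration: the function names and parameters (training_commands, font_properties_line, base_lang and so on) are my own and not part of Tesseract, and the sketch only builds and prints the commands rather than running them. Because no shell is involved, the "*.tr" wildcard is replaced by the explicit .tr file name.

```python
import shlex


def font_properties_line(fontname, italic=False, bold=False,
                         monospace=False, serif=False, fraktur=False):
    """Build one line for the font_properties file: name followed by five 0/1 flags."""
    flags = (italic, bold, monospace, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def training_commands(lang="jab", font="gentium", exp=0,
                      image_ext="jpg", base_lang="por"):
    """Return the training steps, in order, as lists of command-line arguments."""
    stem = f"{lang}.{font}.exp{exp}"      # e.g. jab.gentium.exp0
    image = f"{stem}.{image_ext}"         # e.g. jab.gentium.exp0.jpg
    tr_base = f"{lang}.{font}.box"        # output base used for the .tr file
    cmds = [
        ["tesseract", "-l", base_lang, image, stem, "batch.nochop", "makebox"],
        # The box file is corrected by hand between these two steps.
        ["tesseract", image, tr_base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{stem}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{tr_base}.tr"],
        ["cntraining", f"{tr_base}.tr"],
    ]
    # Add the language prefix to each output file.
    for name in ("inttemp", "normproto", "pffmtable", "shapetable"):
        cmds.append(["mv", name, f"{lang}.{name}"])
    cmds.append(["combine_tessdata", f"{lang}."])
    return cmds


print("font_properties:", font_properties_line("jab.gentium.box", serif=True))
for cmd in training_commands():
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True) from the folder holding the image, pausing after the first command to correct the box file and after writing the font_properties file.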

 

Notes

  • Best to keep all files in a single folder
  • Image file naming convention: [language name].[font name].exp[number].[file extension]
  • Do not mix fonts in the training file - use separate images for different fonts and for italic, bold etc. Best if more than ten of each character are included in the image.
  • For this image, I used "tesseract -l por" when creating the initial "box file". This tells the Tesseract command line to use the Portuguese language, used in this case because it is better at recognising ê and ô and so requires less processing in the next step.
  • All text files that you use should be ASCII or UTF-8 encoded, without a BOM, and should use "LF" rather than "CR/LF" as the end-of-line marker. Note that Windows uses "CR/LF" by default.
  • Font properties

    fontname          italic   bold   monospace   serif   fraktur
    jab.gentium.box   0        0      0           1       0

  • Moshpytt is a Python-based "box file editor" that helps to correct the character recognition. Download from http://code.google.com/p/moshpytt/   Other graphical box file editors are also available. See the Tesseract wiki for details.

    This screenshot shows Moshpytt in action. As you move down the list on the right (contents of the "box" file), the character is highlighted and the Tesseract interpretation of the character shown in red. Wrong characters can be corrected in the table on the right. The box file will need to be checked and corrected again if errors show when the training file is created.

moshPyTT box file editor

  • While I have used Linux here, Tesseract OCR and training can also be done in Windows.  
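Several of the notes above (file naming, encoding and line endings, and having more than ten samples of each character) can be checked automatically before training. The following is a minimal sketch, not part of Tesseract: all function names are my own, and it assumes that each non-empty line of a box file starts with the character followed by space-separated numbers.

```python
import re
from collections import Counter

# [language name].[font name].exp[number].[file extension]
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


def parse_training_name(filename):
    """Split a name such as 'jab.gentium.exp0.jpg' into parts, or return None."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def text_file_problems(data: bytes):
    """List encoding problems found in the raw bytes of a text file."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("starts with a UTF-8 BOM")
    if b"\r" in data:
        problems.append("contains CR characters (Windows line endings)")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("is not valid UTF-8")
    return problems


def rare_characters(box_text: str, threshold: int = 10):
    """Return characters that do not occur more than `threshold` times."""
    # Assumption: the character is the first space-separated field on each line.
    counts = Counter(
        line.split(" ", 1)[0] for line in box_text.split("\n") if line.strip()
    )
    return {char: n for char, n in counts.items() if n <= threshold}


sample_box = "ŋ 10 20 30 40 0\na 35 20 50 40 0\nŋ 55 20 75 40 0\n"
print(parse_training_name("jab.gentium.exp0.jpg"))
print(text_file_problems(sample_box.encode("utf-8")))
print(rare_characters(sample_box))
```

For a real box file, the bytes would be read with open(path, "rb").read() and decoded as UTF-8 before being passed to rare_characters. Any character reported there is a candidate for adding more samples to the training image.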

