Combined orientation and script detection using the Tesseract OCR engine

From AcaWiki


Citation: Ranjith Unnikrishnan, Ray Smith (2009) Combined orientation and script detection using the Tesseract OCR engine. MOCR '09: Proceedings of the International Workshop on Multilingual OCR

doi: http://doi.acm.org/10.1145/1577802.1577809

Download: http://www.google.de/research/pubs/archive/35506.pdf

Tagged: Computer Science, OCR, Script detection, Page orientation detection, Tesseract


Summary:

The method uses a shape classifier trained on connected components of the scripts to be detected, and runs it over each of four orientations of the input image. The orientation with the highest cumulative confidence score is selected. The dominant script is then the one to which the most characters are assigned at the selected orientation.
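
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the two-step decision, not the Tesseract implementation: the function name, the input layout (a mapping from orientation in degrees to a list of per-character (script, confidence) pairs) and the sample scores are all assumptions made for the example, and the shape classifier itself is taken as given.

```python
from collections import Counter

def detect_orientation_and_script(results):
    """Pick an orientation and a dominant script from classifier output.

    `results` is a hypothetical structure: a dict mapping an orientation
    (in degrees) to a list of (script, confidence) pairs, one pair per
    classified connected component at that orientation.
    """
    # Step 1: sum the classifier confidences for each orientation.
    totals = {
        orientation: sum(conf for _, conf in hits)
        for orientation, hits in results.items()
    }
    # The orientation with the highest cumulative confidence wins.
    best_orientation = max(totals, key=totals.get)

    # Step 2: count characters per script at the winning orientation only.
    counts = Counter(script for script, _ in results[best_orientation])
    dominant_script = counts.most_common(1)[0][0] if counts else None
    return best_orientation, dominant_script


# Made-up scores for a page that is upright and mostly Latin.
sample = {
    0:   [("Latin", 0.9), ("Latin", 0.8), ("Cyrillic", 0.6)],
    90:  [("Latin", 0.2), ("Han", 0.3), ("Han", 0.1)],
    180: [("Cyrillic", 0.4), ("Latin", 0.3), ("Latin", 0.2)],
    270: [("Han", 0.2), ("Latin", 0.1), ("Han", 0.2)],
}
orientation, script = detect_orientation_and_script(sample)
```

Note that the script vote uses character counts rather than summed confidences, following the summary: confidence decides the orientation, and a simple majority decides the script.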

Theoretical and practical relevance:

Identifying the script first removes or reduces the problem of language identification in OCR. An open-source implementation of this work will be available in version 3.01 of Tesseract.


