Combined orientation and script detection using the Tesseract OCR engine
Citation: Ranjith Unnikrishnan, Ray Smith (2009) Combined orientation and script detection using the Tesseract OCR engine. MOCR '09: Proceedings of the International Workshop on Multilingual OCR
DOI (original publisher): http://doi.acm.org/10.1145/1577802.1577809
Download: http://www.google.de/research/pubs/archive/35506.pdf
Tagged: Computer Science, OCR, Script detection, Page orientation detection, Tesseract
Summary
The paper uses a shape classifier trained on connected components from the scripts to be detected. The classifier is run over each of 4 orientations of the input image, and the orientation with the highest cumulative confidence score is selected. For that orientation, the script that accounts for the largest number of characters is chosen as the dominant script.
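The selection rule described in the summary can be sketched in a few lines of Python. This is only an illustration of the rule as summarized here, not the Tesseract implementation: the `classify` callback, the `(script, confidence)` result pairs, and the choice of 0, 90, 180, and 270 degrees as the four orientations are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical classifier result for one connected component:
# (script name, confidence score). Not a Tesseract data structure.
Classification = Tuple[str, float]


def detect_orientation_and_script(
    classify: Callable[[int], Iterable[Classification]],
    orientations: Iterable[int] = (0, 90, 180, 270),  # assumed angles
) -> Tuple[int, str]:
    """Return (orientation, dominant script) using the summarized rule."""
    # Run the (hypothetical) classifier once per candidate orientation.
    results: Dict[int, List[Classification]] = {
        angle: list(classify(angle)) for angle in orientations
    }
    if not results:
        raise ValueError("at least one orientation is required")

    # Orientation with the highest cumulative confidence score wins.
    best_angle = max(
        results, key=lambda angle: sum(conf for _, conf in results[angle])
    )

    # Within that orientation, the script with the most characters wins.
    script_counts = Counter(script for script, _ in results[best_angle])
    if not script_counts:
        raise ValueError("no characters were classified")
    best_script = script_counts.most_common(1)[0][0]
    return best_angle, best_script


# Usage with made-up classifier output for each orientation.
fake_output = {
    0: [("Latin", 0.9), ("Latin", 0.8), ("Cyrillic", 0.7)],
    90: [("Latin", 0.2), ("Han", 0.3)],
    180: [("Latin", 0.4), ("Cyrillic", 0.5), ("Cyrillic", 0.1)],
    270: [("Han", 0.1)],
}
print(detect_orientation_and_script(lambda angle: fake_output[angle]))
# prints (0, 'Latin')
```

Note that the two decisions use different statistics: orientation is chosen by summed confidence, while the script is chosen by a plain character count at the winning orientation.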
Theoretical and Practical Relevance
Script identification removes or reduces the problem of language identification in OCR, since knowing the script narrows the set of candidate languages. An open-source implementation of this work will be available in version 3.01 of Tesseract.