Combined orientation and script detection using the Tesseract OCR engine

From AcaWiki

Citation: Ranjith Unnikrishnan, Ray Smith (2009) Combined orientation and script detection using the Tesseract OCR engine. MOCR '09: Proceedings of the International Workshop on Multilingual OCR
DOI (original publisher): http://doi.acm.org/10.1145/1577802.1577809
Semantic Scholar (metadata): http://doi.acm.org/10.1145/1577802.1577809
Sci-Hub (fulltext): http://doi.acm.org/10.1145/1577802.1577809
Internet Archive Scholar (search for fulltext): Combined orientation and script detection using the Tesseract OCR engine
Download: http://www.google.de/research/pubs/archive/35506.pdf
Tagged: Computer Science, OCR, Script detection, Page orientation detection, Tesseract

Summary

The method runs a shape classifier, trained on connected components of the scripts to be detected, over each of four orientations of the input image. The orientation with the highest cumulative confidence score is selected. The dominant script is then the one to which the most characters are assigned at the selected orientation.
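
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the voting logic, not the Tesseract implementation: the `classify` callback, the `Guess` type, and the choice of 0, 90, 180 and 270 degrees as the four orientations are assumptions made for the example, and the real classifier and its confidence scores are not reproduced here.

```python
from collections import Counter
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Guess(NamedTuple):
    """Hypothetical classifier output for one connected component."""
    script: str        # script label assigned to the component
    confidence: float  # classifier confidence for that label


def detect_orientation_and_script(
    components: Sequence[object],
    classify: Callable[[object, int], Guess],
    orientations: Tuple[int, ...] = (0, 90, 180, 270),  # assumed angles
) -> Tuple[int, Optional[str]]:
    """Pick the orientation with the highest summed confidence, then the
    script with the most components at that orientation."""
    totals = {}
    script_counts = {}
    for angle in orientations:
        totals[angle] = 0.0
        script_counts[angle] = Counter()
        for comp in components:
            guess = classify(comp, angle)  # hypothetical shape classifier
            totals[angle] += guess.confidence
            script_counts[angle][guess.script] += 1

    # Orientation: highest cumulative confidence (ties go to the first angle).
    best_angle = max(orientations, key=lambda a: totals[a])

    # Script: most frequent label at the chosen orientation.
    counts = script_counts[best_angle]
    best_script = counts.most_common(1)[0][0] if counts else None
    return best_angle, best_script
```

In this sketch each connected component contributes one vote to the script count, which is a simplification of "number of characters"; a caller would supply its own `classify` function wrapping whatever shape classifier is available.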

Theoretical and Practical Relevance

Script identification removes or reduces the problem of language identification in OCR. An open-source implementation of this work will be available in version 3.01 of Tesseract.