Citation: Ranjith Unnikrishnan, Ray Smith (2009) Combined orientation and script detection using the Tesseract OCR engine. MOCR '09: Proceedings of the International Workshop on Multilingual OCR
DOI (original publisher): http://doi.acm.org/10.1145/1577802.1577809
Semantic Scholar (metadata): http://doi.acm.org/10.1145/1577802.1577809
Sci-Hub (fulltext): http://doi.acm.org/10.1145/1577802.1577809
Internet Archive Scholar (search for fulltext): Combined orientation and script detection using the Tesseract OCR engine
Download: http://www.google.de/research/pubs/archive/35506.pdf
Tagged: Computer Science, OCR, Script detection, Page orientation detection, Tesseract

Summary

The method uses a shape classifier trained on connected components of the scripts to be detected. The classifier is run over each of four orientations of the input image. The orientation with the highest cumulative confidence score is selected, and the script that accounts for the largest number of characters in that orientation is chosen as the dominant script.
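
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the two-step decision (orientation first, then script), not the Tesseract implementation: the function name `select_orientation_and_script`, the input layout, the orientation keys in degrees, and the sample scores are all assumptions made for the example.

```python
from collections import Counter


def select_orientation_and_script(detections):
    """Pick an orientation and a dominant script from classifier output.

    `detections` is a hypothetical structure: it maps each candidate
    orientation to a list of (script, confidence) pairs, one pair per
    classified connected component.
    """
    # Step 1: sum the confidences per orientation and keep the largest total.
    totals = {
        orientation: sum(conf for _, conf in components)
        for orientation, components in detections.items()
    }
    best_orientation = max(totals, key=totals.get)

    # Step 2: within the chosen orientation, count characters per script
    # and keep the most frequent one.
    counts = Counter(script for script, _ in detections[best_orientation])
    if not counts:
        return best_orientation, None
    best_script = counts.most_common(1)[0][0]
    return best_orientation, best_script


# Made-up scores for illustration only.
example = {
    0: [("Latin", 0.9), ("Latin", 0.8), ("Cyrillic", 0.7)],
    90: [("Latin", 0.2), ("Han", 0.3)],
    180: [("Latin", 0.4), ("Cyrillic", 0.3), ("Cyrillic", 0.2)],
    270: [("Han", 0.1)],
}

orientation, script = select_orientation_and_script(example)
```

Note that the script is chosen by character count rather than by summed confidence, which follows the wording of the summary; how ties are broken is not stated there and is left to `Counter.most_common` in this sketch.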

Theoretical and Practical Relevance

Script identification removes or reduces the problem of language identification in OCR. An open-source implementation of this work will be available in version 3.01 of Tesseract.
