Combined orientation and script detection using the Tesseract OCR engine
From AcaWiki
Citation: Ranjith Unnikrishnan, Ray Smith (2009) Combined orientation and script detection using the Tesseract OCR engine. MOCR '09: Proceedings of the International Workshop on Multilingual OCR
doi: http://doi.acm.org/10.1145/1577802.1577809
Download: http://www.google.de/research/pubs/archive/35506.pdf
Tagged: Computer Science, OCR, Script detection, Page orientation detection, Tesseract
Summary:
The method uses a shape classifier trained on connected components of the scripts to be detected. The classifier is run over each of four orientations of the input image, and the orientation with the highest cumulative confidence score is selected. Within that orientation, the script assigned to the largest number of characters is selected as the dominant script.
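The selection step described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not the paper's or Tesseract's implementation: the function name `detect_orientation_and_script`, the input layout (one `(script, confidence)` pair per classified component), the angle labels 0/90/180/270, and the sample scores are all assumptions made for the example.

```python
from collections import Counter


def detect_orientation_and_script(results):
    """Pick an orientation and dominant script from classifier output.

    `results` is a hypothetical structure: it maps an orientation label
    (here, degrees) to a list of (script, confidence) pairs, one pair per
    connected component classified at that orientation.
    """
    # Cumulative classifier confidence for each orientation.
    totals = {
        angle: sum(conf for _, conf in components)
        for angle, components in results.items()
    }
    # The orientation with the highest cumulative confidence wins.
    best_angle = max(totals, key=totals.get)

    # Count characters per script at the winning orientation only.
    counts = Counter(script for script, _ in results[best_angle])
    best_script = counts.most_common(1)[0][0] if counts else None
    return best_angle, best_script


# Made-up scores for a page that is upright and mostly Latin.
results = {
    0: [("Latin", 0.9), ("Latin", 0.8), ("Cyrillic", 0.7)],
    90: [("Latin", 0.3), ("Han", 0.2), ("Han", 0.1)],
    180: [("Latin", 0.4), ("Cyrillic", 0.3), ("Cyrillic", 0.2)],
    270: [("Han", 0.2), ("Han", 0.2), ("Latin", 0.1)],
}
print(detect_orientation_and_script(results))  # prints (0, 'Latin')
```

Note that the two decisions use different statistics: orientation is chosen by summed confidence, while script is chosen by a simple character count within the chosen orientation.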
Theoretical and practical relevance:
Script identification removes or reduces the problem of language identification in OCR, since knowing the script narrows the candidate languages to those written in it. An open-source implementation of this work will be available in version 3.01 of Tesseract.