Name | Modified | Size | Downloads / Week |
---|---|---|---|
Documentation and papers | 2013-10-02 | ||
Data For Training Testing | 2013-10-02 | ||
README.txt | 2013-10-02 | 1.4 kB | |
tool_document_samples.zip | 2013-09-09 | 50.3 MB | |
Totals: 4 Items | | 50.3 MB | 0 |
The documentation contains a few related papers and instructions on using the tool. Please cite this work using the following articles:

"Devanagari OCR using a recognition driven segmentation framework and stochastic language models", Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju. IJDAR, 2009, Volume 12, pp. 123-138.

"Design and Comparison of Segmentation Driven and Recognition Driven Devanagari OCR", Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju. International Workshop on Document Image Analysis and Libraries, 2006, pp. 96-102.

"A Framework for Creation of Multi-Lingual OCR Datasets", Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju, Ramanaprasad Vemulapati. Symposium on Document Image Understanding Technology, 2003, pp. 189-196.

The folder "Data For Training Testing" contains character images. Each image is annotated with the decimal value of its Unicode code point. For instance, images of the vowel "a", which has the Unicode code point 0905 (hexadecimal), are annotated as 2309 in the files/folders. Similarly, the consonant "ka" has the Unicode code point 0915 and is annotated as 2325. The relevant Unicode chart is located here: www.unicode.org/charts/PDF/U0900.pdf

tool_document_samples.zip: This contains a few grayscale images scanned at 300 dpi. Each TIFF image has a corresponding XML ground-truth file. The file contains the bounding box of each word, together with the ITRANS transliteration and the Unicode representation of the word.
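The mapping between Unicode code points and the decimal annotations can be checked with a few lines of Python. This is only a minimal sketch: the function names `annotation_label` and `label_to_char` are hypothetical and not part of the tool, and the sketch covers only the number conversion, not how the files and folders are actually laid out.

```python
import unicodedata


def annotation_label(codepoint_hex: str) -> int:
    """Convert a hexadecimal Unicode code point (e.g. "0905") to its decimal label."""
    return int(codepoint_hex, 16)


def label_to_char(label) -> str:
    """Convert a decimal label (int or numeric string, e.g. 2309) back to its character."""
    return chr(int(label))


# The two examples given in the README text.
print(annotation_label("0905"))  # 2309
print(annotation_label("0915"))  # 2325

# Going the other way: look up the Unicode character name for a label.
print(unicodedata.name(label_to_char(2309)))  # DEVANAGARI LETTER A
print(unicodedata.name(label_to_char(2325)))  # DEVANAGARI LETTER KA
```

A helper like `label_to_char` may be handy when turning an annotated file or folder name back into the character it represents, provided the name holds just the decimal number.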