An impact of linguistic features on automated classification of OCR texts

Gudila Paul Moshi, Lazaro S.P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

1 Citation (Scopus)


Optical character reader (OCR) systems can be used to digitize print documents, a process that generates OCR texts. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, OCR technology cannot currently recognize all characters with 100% accuracy. Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper, we experimentally evaluated POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS improved the classification performance of OCR texts.
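To make the contrast between a plain bag-of-words representation and a POS-restricted feature set concrete, the following is a minimal sketch only. It is not the authors' implementation: the toy lexicon `TOY_POS_LEXICON`, the tag names, the sample sentence, and the function names are all hypothetical and chosen for illustration, and a real system would use a trained POS tagger instead of a lookup table. The abstract does not state which POS categories were selected, so the choice of nouns here is an assumption.

```python
from collections import Counter

# Hypothetical toy POS lexicon (assumption for illustration only);
# a real pipeline would obtain tags from a trained POS tagger.
TOY_POS_LEXICON = {
    "bank": "NOUN",
    "loan": "NOUN",
    "rates": "NOUN",
    "raises": "VERB",
    "the": "DET",
    "on": "ADP",
    "quickly": "ADV",
}


def bag_of_words(text):
    """Conventional representation: count every whitespace-separated token."""
    return Counter(text.lower().split())


def pos_filtered_bag_of_words(text, keep_tags=("NOUN",)):
    """Count only tokens whose toy POS tag is in keep_tags.

    Tokens missing from the lexicon (for example OCR-corrupted words)
    receive no tag and are therefore dropped.
    """
    tokens = text.lower().split()
    kept = [t for t in tokens if TOY_POS_LEXICON.get(t) in keep_tags]
    return Counter(kept)


# "1oan" imitates an OCR misrecognition of "loan" (hypothetical sample).
ocr_text = "The bank raises rates on the 1oan quickly"

all_features = bag_of_words(ocr_text)
noun_features = pos_filtered_bag_of_words(ocr_text)
noun_verb_features = pos_filtered_bag_of_words(ocr_text, keep_tags=("NOUN", "VERB"))
```

In this sketch, the unfiltered counts keep function words and the corrupted token, while the POS-restricted counts keep only the words tagged with the selected categories. Which categories actually help classification is an empirical question of the kind the paper investigates.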

Original language: English
Title of host publication: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10
Number of pages: 6
Publication status: Published - 2010
Event: 2010 IAPR Workshop on Document Analysis Systems, DAS 2010 - Boston, MA, United States
Duration: Jun 9 2010 to Jun 11 2010

Publication series

Name: ACM International Conference Proceeding Series


Other: 2010 IAPR Workshop on Document Analysis Systems, DAS 2010
Country/Territory: United States
City: Boston, MA

All Science Journal Classification (ASJC) codes

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

