TY - GEN
T1 - An impact of linguistic features on automated classification of OCR texts
AU - Moshi, Gudila Paul
AU - Busagala, Lazaro S.P.
AU - Ohyama, Wataru
AU - Wakabayashi, Tetsushi
AU - Kimura, Fumitaka
N1 - Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
AB - Optical character reader (OCR) systems can be used to digitize print documents, and OCR texts are generated in the process. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, it is currently impossible for OCR technology to recognize all characters with 100% accuracy. Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper, we experimentally evaluated POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS improved the classification performance of OCR texts.
UR - http://www.scopus.com/inward/record.url?scp=77954979705&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954979705&partnerID=8YFLogxK
U2 - 10.1145/1815330.1815367
DO - 10.1145/1815330.1815367
M3 - Conference contribution
AN - SCOPUS:77954979705
SN - 9781605587738
T3 - ACM International Conference Proceeding Series
SP - 287
EP - 292
BT - Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10
T2 - 2010 IAPR Workshop on Document Analysis Systems, DAS 2010
Y2 - 9 June 2010 through 11 June 2010
ER -