The impact of OCR accuracy and feature transformation on automatic text classification

Mayo Murata, Lazaro S.P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Contribution to book/report › Conference contribution

8 Citations (Scopus)

Abstract

Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated texts contain errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR texts by absolute word frequency has limitations, such as dependency on text length and word recognition rate, which lead to lower classification performance because of higher within-class variance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
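The dependency on text length and word recognition rate described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific feature transformations, so the relative-frequency normalization and the power transform below are assumptions chosen for illustration, not necessarily the authors' method. The function names, the toy vocabulary, and the sample documents are hypothetical, and the degraded document assumes OCR errors are spread evenly across words.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total number of in-vocabulary tokens.

    Illustrative assumption: this is one possible transformation that
    removes the dependency on text length and recognition rate.
    """
    counts = absolute_frequency(tokens, vocabulary)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


def power_transform(vector, exponent=0.5):
    """Raise each component to a power (assumed variance-reducing step)."""
    return [value ** exponent for value in vector]


# Hypothetical toy data.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "text", "error"]
long_doc = short_doc * 3  # same content, three times the length

# Simulated OCR output of long_doc: one third of the words are misrecognized
# and fall outside the vocabulary (errors assumed evenly distributed).
ocr_doc = short_doc * 2 + ["0cr", "tcxt", "tcxt", "err0r"]

for name, doc in [("short", short_doc), ("long", long_doc), ("ocr", ocr_doc)]:
    print(name,
          absolute_frequency(doc, vocabulary),
          relative_frequency(doc, vocabulary))
```

In this sketch the absolute counts differ across the three documents even though they belong to the same class, while the relative frequencies are identical. That is the kind of within-class variance reduction the abstract attributes to feature transformation.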

Original language: English
Title of host publication: Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings
Pages: 506-517
Number of pages: 12
Publication status: Published - 2006
Externally published: Yes
Event: 7th International Workshop on Document Analysis Systems, DAS 2006 - Nelson, New Zealand
Duration: Feb 13 2006 to Feb 15 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3872 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Workshop on Document Analysis Systems, DAS 2006
Country/Territory: New Zealand
City: Nelson
Period: 2/13/06 to 2/15/06

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
