Finding characteristic substrings from compressed texts

Shunsuke Inenaga, Hideo Bannai

Research output: Chapter in Book/Report/Conference proceedingConference contribution

5 Citations (Scopus)

Abstract

Text mining from large scaled data is of great importance in computer sci- ence. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeat- ing substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equiva- lence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compres- sion, and therefore our algorithms are more efficient than any algorithms that work on uncompressed strings.

Original languageEnglish
Title of host publicationProceedings of the Prague Stringology Conference 2009
Pages40-54
Number of pages15
Publication statusPublished - Dec 1 2009
EventPrague Stringology Conference 2009, PSC 2009 - Prague, Czech Republic
Duration: Aug 31 2009Sept 2 2009

Publication series

NameProceedings of the Prague Stringology Conference 2009

Other

OtherPrague Stringology Conference 2009, PSC 2009
Country/TerritoryCzech Republic
CityPrague
Period8/31/099/2/09

All Science Journal Classification (ASJC) codes

  • Mathematics(all)

Fingerprint

Dive into the research topics of 'Finding characteristic substrings from compressed texts'. Together they form a unique fingerprint.

Cite this