TY - GEN
T1 - Finding characteristic substrings from compressed texts
AU - Inenaga, Shunsuke
AU - Bannai, Hideo
PY - 2009/12/1
Y1 - 2009/12/1
N2 - Text mining from large scaled data is of great importance in computer sci- ence. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeat- ing substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equiva- lence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compres- sion, and therefore our algorithms are more efficient than any algorithms that work on uncompressed strings.
AB - Text mining from large scaled data is of great importance in computer sci- ence. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeat- ing substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equiva- lence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compres- sion, and therefore our algorithms are more efficient than any algorithms that work on uncompressed strings.
UR - http://www.scopus.com/inward/record.url?scp=80054005774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80054005774&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80054005774
SN - 9788001044032
T3 - Proceedings of the Prague Stringology Conference 2009
SP - 40
EP - 54
BT - Proceedings of the Prague Stringology Conference 2009
T2 - Prague Stringology Conference 2009, PSC 2009
Y2 - 31 August 2009 through 2 September 2009
ER -