We analyze the grammar generation algorithm of the RePair compression algorithm and show the relation between a grammar generated by RePair and maximal repeats. We reveal that RePair replaces step by step the most frequent pairs within the corresponding most frequent maximal repeats. Then, we design a novel variant of RePair, called MR-RePair, which substitutes the most frequent maximal repeats at once instead of substituting the most frequent pairs consecutively. We implemented MR-RePair and compared the size of the grammar generated by MR-RePair to that by RePair on several text corpora. Our experiments show that MR-RePair generates more compact grammars than RePair does, especially for highly repetitive texts.
|Title of host publication
|Proceedings - DCC 2019
|Subtitle of host publication
|2019 Data Compression Conference
|Joan Serra-Sagrista, Ali Bilgin, Michael W. Marcellin, James A. Storer
|Institute of Electrical and Electronics Engineers Inc.
|Number of pages
|Published - May 10 2019
|2019 Data Compression Conference, DCC 2019 - Snowbird, United States
Duration: Mar 26 2019 → Mar 29 2019
|Data Compression Conference Proceedings
|2019 Data Compression Conference, DCC 2019
|3/26/19 → 3/29/19
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications