Sensitivity of string compressors and repetitiveness measures

Tooru Akagi, Mitsuru Funakoshi, Shunsuke Inenaga

Research output: Contribution to journalArticlepeer-review

5 Citations (Scopus)

Abstract

The sensitivity of a string compression algorithm C asks how much the output size C(T) for an input string T can increase when a single character edit operation is performed on T. This notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by maxT∈Σn⁡{C(T)/C(T):ed(T,T)=1}, where ed(T,T) denotes the edit distance between T and T. In particular, for the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is only a small constant (2 or 3, depending on the version of the Lempel-Ziv 77 and the edit operation type), i.e., the size of the Lempel-Ziv 77 factorizations can be larger by only a small constant factor. We strengthen our upper bound results by presenting matching lower bounds on the worst-case sensitivity for all these major versions of the Lempel-Ziv 77 factorizations. We generalize these results to the smallest bidirectional scheme b. In addition, we show that the sensitivity of a grammar-based compressor called GCIS (Grammar Compression by Induced Sorting) is also a small constant. Further, we extend the notion of the worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size γ and the substring complexity δ, and show that the worst-case sensitivity of δ is also a small constant. These results contrast with the previously known related results such that the size z78 of the Lempel-Ziv 78 factorization can increase by a factor of Ω(n1/4) (shown by Lagarde and Perifel), and the number r of runs in the Burrows-Wheeler transform can increase by a factor of Ω(log⁡n) (shown by Giuliani et al.) when a character is prepended to an input string of length n. By applying our sensitivity bounds of δ or the smallest grammar to known results (cf. Navarro's survey) some non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including γ, r, LZ-End, RePair, LongestMatch, and AVL-grammar, are derived. We also exhibit the worst-case additive sensitivity maxT∈Σn⁡{C(T)−C(T):ed(T,T)=1}, which allows one to observe more details in the changes of the output sizes.

Original languageEnglish
Article number104999
JournalInformation and Computation
Volume291
DOIs
Publication statusPublished - Mar 2023

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Fingerprint

Dive into the research topics of 'Sensitivity of string compressors and repetitiveness measures'. Together they form a unique fingerprint.

Cite this