TY - JOUR

T1 - Finding optimal degenerate patterns in DNA sequences

AU - Shinozaki, Daisuke

AU - Akutsu, Tatsuya

AU - Maruyama, Osamu

N1 - Funding Information:
The authors would like to thank Ayumi Shinohara and Hideo Bannai for fruitful discussions. The authors would also like to thank anonymous referees for valuable comments. This work was supported in part by Research for the Future Program of JSPS and Grant-in-Aid for Scientific Research on Priority Areas (C) of MEXT.

PY - 2003

Y1 - 2003

N2 - Motivation: The problem of finding transcription factor binding sites in the upstream regions of given genes is algorithmically an interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet ∑ is a sequence of subsets of ∑. A string over IUPAC nucleic acid codes is also a degenerate pattern over ∑ = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function. Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with most all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets. Availability: The python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.

AB - Motivation: The problem of finding transcription factor binding sites in the upstream regions of given genes is algorithmically an interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet ∑ is a sequence of subsets of ∑. A string over IUPAC nucleic acid codes is also a degenerate pattern over ∑ = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function. Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with most all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets. Availability: The python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.

UR - http://www.scopus.com/inward/record.url?scp=14744270169&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=14744270169&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg1079

DO - 10.1093/bioinformatics/btg1079

M3 - Article

C2 - 14534191

AN - SCOPUS:14744270169

SN - 1367-4803

VL - 19

SP - ii206-ii214

JO - Bioinformatics

JF - Bioinformatics

IS - SUPPL. 2

ER -