TY - JOUR
T1 - Finding optimal degenerate patterns in DNA sequences
AU - Shinozaki, Daisuke
AU - Akutsu, Tatsuya
AU - Maruyama, Osamu
N1 - Funding Information:
The authors would like to thank Ayumi Shinohara and Hideo Bannai for fruitful discussions. The authors would also like to thank anonymous referees for valuable comments. This work was supported in part by Research for the Future Program of JSPS and Grant-in-Aid for Scientific Research on Priority Areas (C) of MEXT.
PY - 2003
Y1 - 2003
N2 - Motivation: The problem of finding transcription factor binding sites in the upstream regions of given genes is algorithmically an interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet ∑ is a sequence of subsets of ∑. A string over IUPAC nucleic acid codes is also a degenerate pattern over ∑ = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function. Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with most all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets. Availability: The python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.
AB - Motivation: The problem of finding transcription factor binding sites in the upstream regions of given genes is algorithmically an interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet ∑ is a sequence of subsets of ∑. A string over IUPAC nucleic acid codes is also a degenerate pattern over ∑ = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function. Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with most all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets. Availability: The python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.
UR - http://www.scopus.com/inward/record.url?scp=14744270169&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=14744270169&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btg1079
DO - 10.1093/bioinformatics/btg1079
M3 - Article
C2 - 14534191
AN - SCOPUS:14744270169
SN - 1367-4803
VL - 19
SP - ii206-ii214
JO - Bioinformatics
JF - Bioinformatics
IS - SUPPL. 2
ER -