TY - GEN
T1 - Unsupervised spam detection by document complexity estimation
AU - Uemura, Takashi
AU - Ikeda, Daisuke
AU - Arimura, Hiroki
PY - 2008
Y1 - 2008
N2 - In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
AB - In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
UR - http://www.scopus.com/inward/record.url?scp=56749179442&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=56749179442&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-88411-8_30
DO - 10.1007/978-3-540-88411-8_30
M3 - Conference contribution
AN - SCOPUS:56749179442
SN - 3540884106
SN - 9783540884101
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 319
EP - 331
BT - Discovery Science - 11th International Conference, DS 2008, Proceedings
PB - Springer Verlag
T2 - 11th International Conference on Discovery Science, DS 2008
Y2 - 13 October 2008 through 16 October 2008
ER -