Unsupervised spam detection by document complexity estimation

Takashi Uemura, Daisuke Ikeda, Hiroki Arimura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Citations (Scopus)


In this paper, we study content-based spam detection for a specific type of spam: blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that our algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
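
The abstract does not define the document complexity, so the following is only a rough illustration of what an unsupervised, entropy-like per-document score can look like. It is not the DCE algorithm: it substitutes plain character-level Shannon entropy for the paper's measure and uses simple counting instead of suffix trees. The function names `char_entropy` and `rank_by_entropy` are hypothetical and chosen for this sketch.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Empirical Shannon entropy of the character distribution, in bits per character.

    Assumption: this is a stand-in for the paper's document complexity,
    which the abstract does not specify.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = -sum_c p(c) * log2 p(c), with p(c) estimated from the document itself.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_by_entropy(docs):
    """Score every document without labels and return (index, score) pairs,
    sorted from lowest to highest entropy."""
    scores = [(i, char_entropy(d)) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    corpus = ["abcd", "aaaa", "abab"]
    for index, score in rank_by_entropy(corpus):
        print(index, round(score, 3))
```

Like the algorithm described in the abstract, this sketch needs no labeled training data and runs in time linear in the total input length. It says nothing about which end of the ranking corresponds to spam, and it does not reproduce the paper's experimental behavior.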

Original language: English
Title of host publication: Discovery Science - 11th International Conference, DS 2008, Proceedings
Publisher: Springer Verlag
Number of pages: 13
ISBN (Print): 3540884106, 9783540884101
Publication status: Published - 2008
Event: 11th International Conference on Discovery Science, DS 2008 - Budapest, Hungary
Duration: Oct 13 2008 → Oct 16 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5255 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 11th International Conference on Discovery Science, DS 2008

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

