Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier

John Graham-Cumming, POPFile Project

The POPFile program has proved highly accurate at spam detection with a low false positive rate, yet uses an unmodified Naive Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words have brought POPFile to over 99.8% accuracy. Pseudo-words are non-words fed into the classifier that indicate particular email features, e.g. obfuscation of a spammy word such as Viagra. This paper will detail every POPFile pseudo-word, explain how each is created from spam and ham messages, and give empirical data on their importance when scored against a large corpus of spam and ham messages.
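
To make the pseudo-word idea concrete, here is a minimal Python sketch; it is not POPFile's code. The pseudo-word names (`trick:spacedword`, `trick:htmlcomment`), the two detection patterns, and the add-one smoothing are assumptions chosen for illustration, since the abstract does not list the actual pseudo-words or the parsing rules. The classifier itself is left generic: the parser simply emits extra tokens, and the classifier counts them like any other word.

```python
import math
import re
from collections import Counter

# Hypothetical pseudo-word names; the abstract does not list POPFile's own.
SPACED_WORD = "trick:spacedword"
HTML_COMMENT = "trick:htmlcomment"


def tokenize(text):
    """Return pseudo-words for detected features, then lowercase words."""
    tokens = []

    # An HTML comment inside the text can be used to split a word in two.
    if re.search(r"<!--.*?-->", text, flags=re.S):
        tokens.append(HTML_COMMENT)
        text = re.sub(r"<!--.*?-->", "", text, flags=re.S)

    # Single letters separated by spaces or dots, e.g. "V.i.a.g.r.a".
    def rejoin(match):
        tokens.append(SPACED_WORD)
        return re.sub(r"[ .]", "", match.group(0))

    text = re.sub(r"\b(?:[A-Za-z][ .]){3,}[A-Za-z]\b", rejoin, text)

    tokens.extend(re.findall(r"[a-z]+", text.lower()))
    return tokens


class NaiveBayes:
    """Plain multinomial Naive Bayes with add-one smoothing (an assumption)."""

    def __init__(self):
        self.counts = {}       # bucket -> Counter of token occurrences
        self.docs = Counter()  # bucket -> number of training messages

    def train(self, bucket, text):
        self.counts.setdefault(bucket, Counter()).update(tokenize(text))
        self.docs[bucket] += 1

    def classify(self, text):
        vocab = set().union(*self.counts.values())
        total_docs = sum(self.docs.values())
        tokens = tokenize(text)
        best, best_score = None, -math.inf
        for bucket, counter in self.counts.items():
            score = math.log(self.docs[bucket] / total_docs)
            denom = sum(counter.values()) + len(vocab)
            for tok in tokens:
                score += math.log((counter[tok] + 1) / denom)
            if score > best_score:
                best, best_score = bucket, score
        return best


nb = NaiveBayes()
nb.train("spam", "Buy V.i.a.g.r.a now")
nb.train("spam", "cheap V i a g r a here")
nb.train("ham", "meeting at noon tomorrow")
nb.train("ham", "lunch report attached")
result = nb.classify("P.i.l.l.s for you")
```

In this toy run, none of the ordinary words in the test message were seen in training, so the only evidence available to the classifier is the pseudo-word emitted for the spaced-out spelling. Because the pseudo-word names contain a colon and the word pattern matches letters only, a pseudo-word cannot collide with a real word from a message.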
