Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier

John Graham-Cumming, POPFile Project


The POPFile program has proved highly accurate at spam detection with a low false positive rate, yet it uses an unmodified Naïve Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words have brought it to over 99.8% accuracy. Pseudo-words are non-words fed into the classifier to indicate particular email features, for example the obfuscation of a spammy word such as Viagra. This paper details every POPFile pseudo-word, explains how each is created from spam and ham messages, and gives empirical data on their importance when scored against a large corpus of spam and ham messages.
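To make the idea of a pseudo-word concrete, here is a minimal sketch in Python. It is not POPFile's code: the pseudo-word name `pseudo:obfuscatedword`, the obfuscation regex, the three-letter minimum token length, and the add-one smoothing are all assumptions chosen for illustration. The point it shows is that the parser, not the classifier, does the extra work: the tokenizer emits an additional token when it spots a feature, and the Naive Bayes scoring treats that token like any other word.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pseudo-word name; POPFile's real pseudo-words may differ.
OBFUSCATED = "pseudo:obfuscatedword"

# Assumed heuristic: single letters separated by punctuation, e.g. "V.i.a.g.r.a".
OBFUSCATION_RE = re.compile(r"\b(?:[A-Za-z][.\-_*]){3,}[A-Za-z]\b")


def tokenize(text):
    """Return lowercase words of 3+ letters, plus pseudo-words for features."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    if OBFUSCATION_RE.search(text):
        tokens.append(OBFUSCATED)  # fed to the classifier like a normal word
    return tokens


class NaiveBayes:
    """Plain multinomial Naive Bayes; it has no special case for pseudo-words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> token counts
        self.doc_counts = Counter()              # bucket -> messages seen

    def train(self, bucket, text):
        self.word_counts[bucket].update(tokenize(text))
        self.doc_counts[bucket] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for bucket, counts in self.word_counts.items():
            total = sum(counts.values())
            # Log prior plus log likelihoods, with add-one smoothing
            # (an assumption made so unseen words do not zero the score).
            score = math.log(self.doc_counts[bucket] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
            scores[bucket] = score
        return max(scores, key=scores.get)


clf = NaiveBayes()
clf.train("spam", "Buy V.i.a.g.r.a cheap")
clf.train("spam", "C.i.a.l.i.s discount offer")
clf.train("ham", "Meeting agenda for Monday")
clf.train("ham", "Lunch on Friday?")

# None of the ordinary words in this message were seen in training; only the
# pseudo-word carries evidence about which bucket it belongs to.
print(clf.classify("Get L.e.v.i.t.r.a today"))
```

In this toy corpus the obfuscated spelling differs in every message, so no ordinary token repeats between the training spam and the new message. The pseudo-word is the only shared signal, which is the role the abstract describes for it.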
