Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier

John Graham-Cumming, POPFile Project


The POPFile program has proved highly accurate at spam detection, with a low false positive rate, yet it uses an unmodified Naïve Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words have brought it to over 99.8% accuracy. Pseudo-words are non-words fed into the classifier to indicate particular email features, for example the obfuscation of a spammy word such as Viagra. This paper details every POPFile pseudo-word, explains how each is created from spam and ham messages, and gives empirical data on their importance when scored against a large corpus of spam and ham messages.
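The idea of a pseudo-word can be illustrated with a minimal sketch. It is not POPFile's code: the pseudo-word names (`pseudo:obfuscated-word`, `pseudo:html-comment`), the two detection rules, and the add-one smoothing are all assumptions chosen for illustration. The point it shows is that the parser emits extra tokens for email features, while the classifier treats them like any other word and needs no modification.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pseudo-word names, chosen for this sketch (not POPFile's own).
OBFUSCATED = "pseudo:obfuscated-word"
HTML_COMMENT = "pseudo:html-comment"

COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
# Single letters separated by punctuation, e.g. "V.i.a.g.r.a" or "m-e-d-s".
SPACED_RE = re.compile(r"\b(?:[A-Za-z][.\-*]){3,}[A-Za-z]\b")


def tokenize(text):
    """Return ordinary word tokens plus pseudo-words for detected features."""
    tokens = []
    # An HTML comment can be used to split a word; record that as a feature.
    if COMMENT_RE.search(text):
        tokens.append(HTML_COMMENT)
    text = COMMENT_RE.sub("", text)
    # Record each punctuation-separated word and recover the hidden word.
    for match in SPACED_RE.finditer(text):
        tokens.append(OBFUSCATED)
        tokens.append(re.sub(r"[.\-*]", "", match.group()).lower())
    text = SPACED_RE.sub(" ", text)
    tokens.extend(word.lower() for word in re.findall(r"[A-Za-z']+", text))
    return tokens


class NaiveBayes:
    """Plain multinomial Naive Bayes; pseudo-words are just more tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, label, text):
        self.doc_counts[label] += 1
        for token in tokenize(text):
            self.word_counts[label][token] += 1
            self.vocab.add(token)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(docs / total_docs)
            for token in tokens:
                # Add-one smoothing is an assumption of this sketch.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("spam", "Buy V.i.a.g.r.a now")
    nb.train("spam", "Cheap m-e-d-s online")
    nb.train("ham", "Meeting notes attached")
    nb.train("ham", "Lunch at noon tomorrow")
    print(tokenize("Get c*i*a*l*i*s today"))
    print(nb.classify("Get c*i*a*l*i*s today"))
```

In this toy run, none of the ordinary words in the test message appear in the training data, so the pseudo-word is the only evidence that separates the two classes. A real corpus would be needed to measure how much any given pseudo-word contributes.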

