Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier

John Graham-Cumming, POPFile Project

  Technical stream: Friday 7 October 2005, 11:20 - 12:00.


The POPFile program has proved highly accurate at spam detection, with a low false positive rate, yet it uses an unmodified Naïve Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words have brought it to over 99.8% accuracy. Pseudo-words are non-words fed into the classifier to indicate particular email features, e.g. obfuscation of a spammy word such as Viagra. This paper will detail every POPFile pseudo-word, explain how each is created from spam and ham messages, and give empirical data on their importance when scored against a large corpus of spam and ham messages.
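To make the idea of a pseudo-word concrete, the sketch below shows one way a parser could emit extra tokens for email features and pass them to a plain Naive Bayes classifier alongside ordinary words. It is an illustration only, not POPFile's code: the pseudo-word names `trick:splitword` and `html:comment`, the regular expressions, and the use of add-one smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pattern: single letters separated by punctuation, e.g. "V.i.a.g.r.a".
SPLIT_WORD = re.compile(r"\b(?:[A-Za-z][.\-_*]){3,}[A-Za-z]\b")
HTML_COMMENT = re.compile(r"<!--.*?-->", re.S)


def tokenize(text):
    """Return ordinary word tokens plus (hypothetical) pseudo-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for match in SPLIT_WORD.findall(text):
        # Pseudo-word marking the obfuscation itself (name is an assumption).
        tokens.append("trick:splitword")
        # Also emit the de-obfuscated word so it can be scored normally.
        tokens.append(re.sub(r"[^A-Za-z]", "", match).lower())
    if HTML_COMMENT.search(text):
        # Pseudo-word marking the presence of an HTML comment (name is an assumption).
        tokens.append("html:comment")
    return tokens


class NaiveBayes:
    """Plain multinomial Naive Bayes; pseudo-words are treated like any other token."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def classify(self, text):
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                # Add-one smoothing (an assumption of this sketch) avoids log(0).
                score += math.log((counts[tok] + 1) / (total + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)
```

The classifier itself has no special handling for the pseudo-words; all feature knowledge lives in the tokenizer, which is the separation the abstract describes.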

