Continual feature selection: a cost effective method for enhancing the capabilities of enterprise spam solutions

Vipul Sharma, John Gardiner Myers, Steve Lewis (Proofpoint)

The effectiveness of a content-based spam filter is directly related to the quality of the features used in its classification model. Features are the specific attributes of a message that the filter examines. Highly effective filters may employ an extremely large number of such features (on the order of hundreds of thousands), which can consume a significant amount of both storage space and classification time. In the ongoing battle between spammers and spam filter developers, both sides continually introduce new techniques and technologies. As a result, the number and importance of the features needed to classify spam accurately are subject to continual change. A given feature might be very important at one point in time but become irrelevant a few months later as spam campaigns and their associated techniques change. Regularly discarding features that have become ineffective ('bad features') benefits the spam filter through reduced processing time (both model training time and email delivery time), reduced storage requirements, increased spam detection accuracy and a reduced risk of over-fitting the model.

In this paper we discuss and benchmark several statistical methods for feature selection in spam filtering. We also discuss the properties of good and bad features in spam filtering. We report a significant improvement in both the filter's performance and effectiveness.
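The abstract does not name the statistical methods that are benchmarked. As a rough illustration of the general idea only, the sketch below scores binary features by information gain, one commonly used feature selection statistic, and keeps those above a threshold. The function names, the message representation (a set of feature names plus a spam flag) and the threshold are assumptions made for this example, not the authors' method.

```python
from math import log2


def entropy(pos, neg):
    """Binary entropy (in bits) of a pos/neg count split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(messages, feature):
    """Information gain of a binary feature with respect to the spam label.

    messages: list of (set_of_feature_names, is_spam) pairs (assumed format).
    """
    n = len(messages)
    if n == 0:
        return 0.0
    spam_with = sum(1 for feats, spam in messages if feature in feats and spam)
    ham_with = sum(1 for feats, spam in messages if feature in feats and not spam)
    spam_without = sum(1 for feats, spam in messages if feature not in feats and spam)
    ham_without = sum(1 for feats, spam in messages if feature not in feats and not spam)

    # Entropy of the label before looking at the feature.
    base = entropy(spam_with + spam_without, ham_with + ham_without)
    # Expected entropy of the label after splitting on the feature.
    conditional = (
        (spam_with + ham_with) / n * entropy(spam_with, ham_with)
        + (spam_without + ham_without) / n * entropy(spam_without, ham_without)
    )
    return base - conditional


def select_features(messages, threshold):
    """Keep only the features whose information gain reaches the threshold."""
    vocabulary = set().union(*(feats for feats, _ in messages))
    return {f for f in vocabulary if information_gain(messages, f) >= threshold}
```

Re-running a selection step of this kind on recent mail at regular intervals is one way features that have stopped discriminating between spam and legitimate mail could be identified and dropped from the model.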
