Non-English spam: a case study

Vipul Sharma, Yanyan Yang and Jason Wallace, Proofpoint

There has been a significant increase in the volume and sophistication of non-English spam over the last few years. We see spam in many languages, including Russian, German, Japanese, Chinese, Dutch, Spanish, French, Norwegian, Finnish, Italian, Danish, Swedish, Greek and Thai, and the list keeps growing. As the number of Internet users across the globe increases, we expect not only an increase in the volume of non-English spam but also an increase in the number of languages used. Filtering such spam efficiently comes with its own challenges.

In this paper we discuss some of these challenges, including language detection, the implications of various character sets, and a model for a language-independent spam filter. We discuss some intrinsic differences between the structure and techniques used in English spam and non-English spam, as well as the properties that should and should not be used for efficient spam detection. We also share some insights into the volume, rate of increase and types of languages seen, and into our effectiveness against such spam. We compare our language identification algorithm with other language detection algorithms. Finally, we discuss the benefits of using a hybrid model of sender reputation and text classification in dealing with this spam.
