How wise are crowds when assessing phishing websites?

2008-04-01

Tyler Moore

University of Cambridge, UK

Editor: Helen Martin

Abstract

Tyler Moore considers the effectiveness of web-based participation in the implementation of anti-phishing mechanisms.

Table of contents

Data collection and analysis

Testing the accuracy of PhishTank’s decisions

Miscategorization in PhishTank
Experience influences user accuracy
Users with bad voting records vote together

Disrupting PhishTank’s verification system

Attacks and countermeasures
Lessons for secure crowd-sourcing

Comparing open and closed phishing feeds

Conclusion

Phishing is the process of enticing people to visit fraudulent websites and persuading them to enter personal information such as usernames and passwords. The information is harvested and used to impersonate the victims in order to empty their bank accounts, run fraudulent auctions, launder money, and so on. New fraudulent websites are set up as quickly as the existing ones are removed.

Maintaining an updated feed of new phishing websites requires constant vigilance and demands significant resources. Most banks and specialist take-down companies maintain their own feeds. One group, called PhishTank [1], has created an open-source list of phishing URLs powered by end-user participation. Users can contribute in two ways. First, they submit reports of suspected phishing websites. Second, they examine suspected websites and vote on whether or not they believe them to be phishing sites. Figure 1 shows a screenshot of PhishTank’s online voting interface. PhishTank relies on the so-called ‘wisdom of crowds’ [2] to pick out incorrect reports (perhaps pointing to a legitimate bank) and confirm correct reports of malicious websites. Each report is only confirmed (and subsequently disseminated to anti-phishing mechanisms) following the vote of a number of registered users.

Figure 1. PhishTank user interface.

PhishTank is part of a growing trend in which web-based participation plays a part in the implementation of security mechanisms, from aggregating spam to tracking malware. Together with my colleague Richard Clayton, I have studied participation in PhishTank in order to gain a better understanding of the effectiveness of crowd-based security (a complete technical paper is available [3]).

We have identified several problems with PhishTank which leave the system vulnerable to manipulation. Unfortunately, these weaknesses are not limited to PhishTank, but reflect fundamental difficulties that can arise whenever security decisions are taken as a result of mass participation.

Data collection and analysis

We examined completed reports from 176,366 phishing URLs submitted to PhishTank between February and September 2007. A total of 3,798 users participated by submitting reports and/or voting. In all, 881,511 votes were cast, suggesting an average of 53 submissions and 232 votes per user. In reality, however, a small number of users are responsible for the majority of submissions and votes. The top two submitters, adding 93,588 and 31,910 phishing records respectively, are actually two anti-phishing organizations that have contributed their own, unverified, feeds of suspect websites. The top verifiers have voted over 100,000 times, while most users vote only a few times.

Many of the leading verifiers have been invited to serve on PhishTank’s panel of 25 moderators. Moderators are assigned additional responsibilities such as cleaning up malformed URLs from submissions. Collectively, moderators cast 652,625 votes, or 74% of the total. So while the moderators are doing the majority of the work, a significant contribution is made by the large number of ‘regular’ users.
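The arithmetic behind these figures can be checked with a short sketch. The counts are the ones quoted above; the split between moderators and 'regular' users assumes that all 25 moderators are included in the 3,798 participants, which the text does not state explicitly.

```python
# Figures quoted in the article.
total_users = 3_798
total_votes = 881_511
moderator_votes = 652_625
moderators = 25

votes_per_user = total_votes / total_users
moderator_share = moderator_votes / total_votes
votes_per_moderator = moderator_votes / moderators

# Assumption: the 25 moderators are counted among the 3,798 users.
regular_users = total_users - moderators
votes_per_regular_user = (total_votes - moderator_votes) / regular_users

print(f"mean votes per user:         {votes_per_user:.0f}")
print(f"moderator share of votes:    {moderator_share:.0%}")
print(f"mean votes per moderator:    {votes_per_moderator:.0f}")
print(f"mean votes per regular user: {votes_per_regular_user:.1f}")
```

Under that assumption, the overall mean of 232 votes per user hides a gap of several hundred-fold between the average moderator and the average regular user.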

In fact, the distributions of user submissions and votes in PhishTank are each characterized by a power law. Power-law distributions appear in many real-world contexts, from the distribution of city populations to the number of academic citations to BGP routing topologies. Power-law distributions have highly skewed populations with ‘long tails’ – that is, a limited number of large values appear several orders of magnitude beyond the much smaller median value. In the case of PhishTank, while most users submit and vote only a handful of times, a few users participate many thousands of times.

The intuitive argument put forth in favour of the robustness of ‘crowd-sourced’ applications like PhishTank is that the opinions of many users can outweigh the occasional mistake, or even the views of a malicious user. However, when the rate of participation follows a power-law distribution, the actions of a single, highly active user can greatly affect a system’s overall accuracy – one subversive participant might undermine the system. This brittleness can lead to big problems if phishers decide to manipulate PhishTank.
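A small simulation illustrates how concentrated participation becomes under a heavy-tailed distribution. This is a minimal sketch and not a fit to PhishTank's data: the tail exponent `ALPHA` is a hypothetical value chosen for illustration, and only the number of users comes from the figures above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

ALPHA = 1.1    # hypothetical tail exponent, NOT fitted to PhishTank data
USERS = 3_798  # number of participants reported in the article

# paretovariate() draws heavy-tailed values >= 1; treat each draw as one
# user's level of activity, sorted from most to least active.
activity = sorted((rng.paretovariate(ALPHA) for _ in range(USERS)), reverse=True)

mean_activity = statistics.mean(activity)
median_activity = statistics.median(activity)

# Share of all activity contributed by the most active 1% of users.
top_1_percent = activity[: USERS // 100]
top_share = sum(top_1_percent) / sum(activity)

print(f"mean activity:   {mean_activity:.1f}")
print(f"median activity: {median_activity:.1f}")
print(f"share of activity from top 1% of users: {top_share:.0%}")
```

The mean sits well above the median, and a handful of simulated users account for a large fraction of all activity, which is the property that makes a single subverted heavy participant so damaging.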

PhishTank asks its users to vote on every unique URL submitted, which imposes a very large and unnecessary burden on its volunteers. The ‘rock-phish’ gang is a group of criminals who perpetrate phishing attacks on a massive scale [4]. Instead of compromising machines for hosting fake HTML in an ad-hoc manner, the gang first purchases a number of domains with meaningless names like ‘lof80.info’. They then send email spam containing a long URL of the form ‘http://www.bank.com.id123.lof80.info/vr’. This URL includes a unique identifier; all variants are resolved to a particular IP address using wild-card DNS. Up to 25 banks are impersonated within each domain. For a more complete description of rock-phish attacks see [5].

Transmitting unique URLs trips up spam filters looking for repeated links, and also fools collators like PhishTank into recording duplicate entries. Consequently, voting on rock-phish attacks becomes very repetitive. We observed 3,260 unique rock-phish domains in PhishTank. These domains appeared in 120,662 submissions, 60% of the overall total. Furthermore, 893 users voted a total of 550,851 times on these domains! This is a dreadfully inefficient allocation of user resources, which could instead be directed to speeding up verification times, for example.
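One way to reduce this repetition is to group submissions by domain before asking anyone to vote. The sketch below is an illustration under stated assumptions, not a description of how PhishTank works: `registered_domain` and `group_by_domain` are hypothetical helpers, and taking the last two hostname labels is a naive rule that fails for suffixes such as `co.uk` (a real system would need a public-suffix list). The first URL is the example given above; the other two are made-up variants.

```python
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Naively reduce a URL to the last two labels of its hostname."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def group_by_domain(urls):
    """Map each (naively derived) domain to the URLs reported under it."""
    groups = {}
    for url in urls:
        groups.setdefault(registered_domain(url), []).append(url)
    return groups


reports = [
    "http://www.bank.com.id123.lof80.info/vr",       # example from the article
    "http://www.bank.com.id456.lof80.info/vr",       # hypothetical variant
    "http://www.otherbank.com.id789.lof80.info/vr",  # hypothetical variant
]

groups = group_by_domain(reports)
for domain, urls in groups.items():
    print(f"{domain}: {len(urls)} reports, one vote needed")
```

Grouping in this way would also merge unrelated pages that legitimately share a domain, such as sites on free web-hosting services, so it is best suited to domains already recognized as rock-phish.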

Testing the accuracy of PhishTank’s decisions

We now examine the correctness of PhishTank users’ contributions. We first describe common causes of inaccuracy and discuss their prevalence. We then demonstrate that inexperienced users are far more likely to make mistakes than experienced ones. Finally, we show that users with bad voting records ‘cluster’ by often voting together.

Miscategorization in PhishTank

The vast majority of submissions to PhishTank are indeed phishing URLs. Of 176,654 verified submissions, just 5,295, or 3%, are voted down as invalid. Most appear to be honest mistakes. Some users submit all URLs from their spam, while others add URLs for other types of malicious websites, such as those involved in advanced fee fraud (419 scams). Sometimes, though, carefully crafted phishing websites and legitimate non-English websites are miscategorized. Most commonly, an obscure credit union or bank that uses a different domain name for its online banking may be marked as a phish. Even moderators make mistakes: 1.2% of their submissions are deemed invalid.

In addition to invalid submissions that are correctly voted down, submissions that are incorrectly classified present a significant worry. Identifying false positives and negatives is hard because PhishTank rewrites history without keeping any public record of changes. By periodically re-checking all PhishTank records for reversals, we identified 39 false positives – legitimate websites incorrectly classified as phishing sites – and three false negatives – phishing websites incorrectly classified as legitimate. Twelve of these classifications were initially agreed upon unanimously.

Of the false positives, 30 were legitimate banks, and the remaining nine were other scams miscategorized as phishing. Several popular websites’ primary domains were voted as phish, including eBay (ebay.com, ebay.de), Fifth Third Bank (53.com) and National City (nationalcity.com). Minimizing these types of false positive is essential for PhishTank because even a small number of false categorizations could undermine its credibility.

Unsurprisingly, there are many more false positives than false negatives since the vast majority of submitted phishes are valid. Most noteworthy was the fact that a URL for the rock-phish domain eportid.ph was incorrectly classified as innocuous. Five other URLs for the same domain were submitted to PhishTank prior to the false negative, with each correctly identified as a phish. Thus, requiring users to vote for the same rock-phish domain many times is not only inefficient, it is unsafe.

Experience influences user accuracy

Where do these mistakes come from? It is reasonable to expect occasional users to commit more errors than those who contribute often. Indeed, we find strong evidence for this in the data. Figure 2 plots the rates of inaccuracy for submissions and votes grouped by user participation rates. For instance, 44% of URLs from users who submit just once are voted down as invalid. Accuracy improves with the number of submissions: 30% of submissions from users who submit between two and 10 URLs are invalid, falling to 17% for users who submit between 11 and 100 URLs, while the top submitters are incorrect just 1.2% of the time.

Figure 2. Inaccuracy of user submissions and votes according to the total number of submissions and votes per user, respectively.

A similar, albeit less drastic, difference can be observed in voting accuracy. Users voting fewer than 100 times are likely to disagree with their peers 14% of the time. This improves steadily for more active users, with the most active voters in conflict just 3.7% of the time, which is in line with the overall average. These results suggest that the views of inexperienced users should perhaps be assigned less weight when compared to highly experienced users.
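As a rough illustration of what assigning less weight to inexperienced users could look like, the sketch below tallies a vote with each voter weighted by the logarithm of their past activity. The weighting function and the sample ballot are hypothetical; nothing here reflects PhishTank's actual scoring.

```python
import math


def vote_weight(past_votes: int) -> float:
    """Hypothetical weight: grows slowly with a voter's past activity."""
    return math.log10(1 + past_votes)


def weighted_verdict(votes) -> str:
    """votes: list of (past_votes, says_phish) pairs for one report."""
    phish = sum(vote_weight(n) for n, says_phish in votes if says_phish)
    legit = sum(vote_weight(n) for n, says_phish in votes if not says_phish)
    return "phish" if phish > legit else "legitimate"


# Hypothetical ballot: three occasional voters against one very active voter.
ballot = [(3, False), (5, False), (8, False), (20_000, True)]

simple_majority = sum(1 for _, says_phish in ballot if says_phish) > len(ballot) / 2
print("unweighted majority says phish:", simple_majority)
print("weighted verdict:", weighted_verdict(ballot))
```

Here the single experienced voter overrides three newcomers. That is the intended effect, but it also concentrates influence further in the hands of the most active users, and activity counts can be inflated by an attacker.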

Users with bad voting records vote together

We also found evidence that bad decisions reinforce themselves. Users with bad voting records are more likely to vote on the same phishing reports than would be expected if their votes were independent. For 186 of the 1,791 users who have voted, over half of their votes were disputed. These high-conflict voters voted on the same phishing URLs approximately one thousand times more frequently than would be the case if there were no connection between how they selected their votes.

What are the implications? While it is possible that these high-conflict users are deliberately voting incorrectly together (or are the same person), the more likely explanation is that incorrect decisions reinforce each other. When well-intentioned users vote incorrectly, they have apparently made the same mistakes.
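The comparison with independent voting can be made concrete. The sketch below shows one simple way to compute such a ratio for a pair of users; it is not necessarily the method used in the study, and the sample data are invented. If two users vote on a and b reports chosen independently and uniformly from N reports, the expected number of reports they both vote on is a × b / N.

```python
def expected_overlap(votes_a: int, votes_b: int, total_reports: int) -> float:
    """Expected shared reports if both users chose reports independently."""
    return votes_a * votes_b / total_reports


def overlap_ratio(reports_a: set, reports_b: set, total_reports: int) -> float:
    """Observed shared reports divided by the independent-choice expectation."""
    observed = len(reports_a & reports_b)
    expected = expected_overlap(len(reports_a), len(reports_b), total_reports)
    return observed / expected


# Hypothetical data: 10,000 reports, two users who each voted on 20 of them.
TOTAL_REPORTS = 10_000
user_a = set(range(0, 20))   # report IDs 0..19
user_b = set(range(10, 30))  # report IDs 10..29, so 10 IDs are shared

ratio = overlap_ratio(user_a, user_b, TOTAL_REPORTS)
print(f"expected overlap: {expected_overlap(20, 20, TOTAL_REPORTS):.2f} reports")
print(f"observed overlap: {len(user_a & user_b)} reports")
print(f"ratio:            {ratio:.0f}x")
```

A ratio far above 1 indicates that the users are not selecting reports independently, although it cannot by itself distinguish collusion from shared honest mistakes.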

Disrupting PhishTank’s verification system

Recently, a number of anti-phishing websites were targeted by a denial-of-service attack, severely hindering their work in removing malicious sites [6]. Hence, there is already evidence that phishers are motivated to disrupt the operations of groups like PhishTank. But even if enough bandwidth is provisioned to counter these attacks, PhishTank remains susceptible to vote rigging, which could undermine its credibility. Any crowd-based decision mechanism is susceptible to manipulation. However, as we will see, certain characteristics of user participation make PhishTank particularly vulnerable.

Attacks and countermeasures

We anticipate three types of attack on PhishTank: (1) the submitting of invalid reports accusing legitimate websites, (2) the voting of legitimate websites as phish, and (3) the voting of malicious websites as legitimate. A selfish attacker seeks to protect their own phishing websites by voting down any accusatory report as invalid (attack type 3). A selfish attacker may be prepared to implicate the websites of other phishers in order to protect their own sites. An undermining attacker takes a wider view by going after the credibility of PhishTank, which is best achieved by combining attacks 1 and 2: submitting URLs for legitimate websites and promptly voting them to be phish. This attacker may also increase confusion by attempting to create false negatives, voting phishing websites as legitimate.

Detecting and defending against these attacks while maintaining an open submission and verification policy is hard. Many of the straightforward countermeasures can be sidestepped by a smart attacker. We consider a number of countermeasures in turn, demonstrating their inadequacy.

One simple countermeasure is to place an upper limit on the number of actions any user can take. This is unworkable for PhishTank due to its power-law distribution: some legitimate users participate many thousands of times. In any case, an enforced even distribution is easily defeated by a Sybil attack [7], where users register many identities. Given that many phishing attackers use botnets, even strict enforcement of ‘one person, one vote’ can probably be overcome.

The next obvious countermeasure is to impose voting requirements. For example, a user must have participated ‘correctly’ n times before their opinion is weighed. This is ineffective for PhishTank, though the developers tell us that they do implement this countermeasure. Since 97% of all submissions are valid, an attacker can quickly boost their reputation by voting for a phish slightly more than n times. A savvy attacker can even minimize their implication of real phishing websites by voting only for rock-phish domains or duplicate URLs. Indeed, the highly stylized format for rock-phish URLs makes it easy to automate correct voting at almost any desired scale.
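The cost of gaming such a requirement is easy to estimate. Assuming 97% of submissions are valid and each submission's validity is independent, an attacker who blindly votes 'phish' needs about n / 0.97 votes on average to accumulate n correct ones. The function name below is hypothetical.

```python
import math

VALID_RATE = 0.97  # share of valid submissions reported in the article


def blind_votes_needed(n_correct: int, valid_rate: float = VALID_RATE) -> int:
    """Expected number of blind 'phish' votes needed for n correct votes,
    rounded up, assuming each submission is valid with probability valid_rate."""
    return math.ceil(n_correct / valid_rate)


for n in (10, 100, 1_000):
    needed = blind_votes_needed(n)
    print(f"n = {n:>5}: about {needed} blind votes ({needed - n} wasted)")
```

Raising n therefore barely slows the attacker down: the overhead of blind voting stays at roughly 3%, whatever threshold is chosen.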

What about ignoring any user with more than n invalid submissions or incorrect votes? After all, a malicious user is unlikely to force through all of his bad submissions and votes. Unfortunately, the power-law distribution of user participation causes another problem. Many active users who do a lot of good also make a lot of mistakes. For instance, the top submitter, antiphishing, is also the user with the highest number of invalid submissions (578). An improvement would be to ban users who are wrong more than x% of the time. Nevertheless, attackers can simply pad their statistics by voting randomly, or by voting for duplicates and rock-phish URLs.
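The difference between the two rules can be sketched as follows. The thresholds and the 'padded attacker' are hypothetical; the top submitter's figures (578 invalid out of 93,588 submissions) are those quoted in this article.

```python
def invalid_rate(invalid: int, total: int) -> float:
    """Fraction of a user's submissions that were voted down as invalid."""
    return invalid / total


def banned_by_count(invalid: int, max_invalid: int) -> bool:
    """Rule 1: ignore users with more than max_invalid invalid submissions."""
    return invalid > max_invalid


def banned_by_rate(invalid: int, total: int, max_rate: float) -> bool:
    """Rule 2: ignore users who are wrong more than max_rate of the time."""
    return invalid_rate(invalid, total) > max_rate


MAX_INVALID = 100  # hypothetical threshold n
MAX_RATE = 0.05    # hypothetical threshold x (5%)

users = {
    "top submitter (article figures)": (578, 93_588),
    "padded attacker (hypothetical)": (200, 10_000),
}

for name, (invalid, total) in users.items():
    print(
        f"{name}: rate {invalid_rate(invalid, total):.2%}, "
        f"count rule bans: {banned_by_count(invalid, MAX_INVALID)}, "
        f"rate rule bans: {banned_by_rate(invalid, total, MAX_RATE)}"
    )
```

The count rule wrongly excludes the most productive contributor, while the rate rule spares both that contributor and an attacker who has padded 200 bad submissions with 9,800 easy ones.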

Moderators already participate in nearly every vote, so it would not be unreasonable to insist that they were the submitter or voted with the majority. However, we know that even moderators make mistakes – over 1% of moderators’ submissions were voted down as invalid. Nonetheless, perhaps the best strategy for PhishTank is to use trusted moderators exclusively if there is any suspicion that the organization is under attack. Given that the 25 moderators already cast 74% of PhishTank’s votes, silencing the whole crowd to root out the attackers may sometimes be wise, even if it contradicts the principles of open participation.

Lessons for secure crowd-sourcing

After examining the PhishTank data we can draw several general lessons about applying the open-participation model to security tools.

Lesson 1: The distribution of user participation matters. There is a natural tendency for highly skewed distributions, even power laws, in user-participation rates. While there may certainly be cases that are not as skewed as PhishTank, security engineers should check the distribution for wide variance when assessing the risk of leveraging user participation.

Skewed distributions can create security problems. First, corruption (or simply the absence) of a few high-value participants can completely undermine the system. Second, because good users can participate extensively, bad users can too. This can frustrate simple rate-limiting countermeasures.

Lesson 2: Crowd-sourced decisions should be difficult to guess. Any decision that can reliably be guessed can be automated and exploited by an attacker. The underlying accuracy of PhishTank’s raw data (97% phish) makes it easy for an attacker to improve their reputation by voting all submissions blindly as phish.

Lesson 3: Do not make users work harder than necessary. Requiring users to vote multiple times for duplicate URLs and rock-phish domains is not only an efficiency issue. It becomes a security liability since it allows an attacker to build up reputation without making a positive contribution.

Comparing open and closed phishing feeds

PhishTank is not the only organization tracking and classifying phishing websites. Other organizations do not follow PhishTank’s open submission and verification policy; instead, they gather their own proprietary lists of suspicious websites and employees determine whether they are phishing. We have obtained a feed from one such company. This has enabled us to compare the feeds for completeness and speed of verification.

We compared the feeds during a four-week period in July and August 2007. We first examined ordinary phishing websites, excluding rock-phish URLs. PhishTank reported 8,296 unique phishing URLs, while the other company identified 8,730. The two feeds shared 5,711 reports in common. For rock-phish URLs the difference is more stark. PhishTank identified 586 rock-phish domains during the sample, while the other company detected 1,003 – nearly twice as many. Furthermore, the other company identified 78% of the rock-phish domains found in PhishTank, along with an additional 544 missed by PhishTank. Venn diagrams for the feeds are presented in Figure 3.

Figure 3. Venn diagram comparing coverage of phishing websites identified by PhishTank and a take-down company.

It is noteworthy that both feeds include many phishing websites which do not appear on the other. This observation supports the case for a universal feed shared between the banks and the various anti-phishing organizations.
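The overlap figures can be worked through directly. The sketch below uses only the counts reported above and derives how much of the combined total each feed covers; the helper name is hypothetical.

```python
def venn(feed_a: int, feed_b: int, shared: int) -> dict:
    """Split two overlapping feed counts into the regions of a Venn diagram."""
    return {
        "a_only": feed_a - shared,
        "b_only": feed_b - shared,
        "shared": shared,
        "union": feed_a + feed_b - shared,
    }


# Ordinary phishing URLs: PhishTank 8,296; other company 8,730; 5,711 shared.
ordinary = venn(8_296, 8_730, 5_711)

# Rock-phish domains: PhishTank 586; other company 1,003, of which 544 were
# missed by PhishTank, so 1,003 - 544 = 459 appear in both feeds.
rock_shared = 1_003 - 544
rock = venn(586, 1_003, rock_shared)

print("ordinary:", ordinary)
print(f"  PhishTank covers {8_296 / ordinary['union']:.0%} of the combined total")
print(f"  other feed covers {8_730 / ordinary['union']:.0%} of the combined total")
print("rock-phish:", rock)
print(f"  other feed found {rock_shared / 586:.0%} of PhishTank's rock-phish domains")
```

Neither feed reaches 80% of the combined set of ordinary phishing URLs, which is the quantitative basis for the argument in favour of sharing.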

Prompt identification and removal of phishing websites is critical, so a feed’s relevance depends upon how quickly it is updated. Requiring several users to vote introduces significant delays. On average, PhishTank submissions take approximately 46 hours to be verified. A few instances take a very long time to be verified, which skews the average. The median, by contrast, is around 15 hours.

We compared the submission and verification times for URLs appearing in both feeds. On average, PhishTank saw the submissions first, by around 11 minutes, but after an average delay of just eight seconds the other company had verified them. PhishTank’s voting-based verification meant that it did not verify the URLs (and therefore did not disseminate them) until 16 hours later. For the rock-phish URLs, we compared the earliest instance of each domain, finding that overlapping domains appeared in PhishTank’s feed 12 hours after they appeared in the other company’s feed, and were not verified for another 12 hours.

Conclusion

End-user participation is an increasingly popular resource for carrying out information security tasks. Having examined one such effort to gather and disseminate phishing information, we conclude that while such open approaches are promising, they are currently less effective overall than the more traditional closed methods. Compared to a data feed collected in a conventional manner, PhishTank is less complete and less timely. On the positive side, PhishTank’s decisions appear mostly accurate: we identified only a few incorrect decisions, all of which were later reversed. However, we found that inexperienced users make many mistakes and that users with bad voting records tend to commit the same errors. So the ‘wisdom’ of crowds sometimes descends into folly.

We also found that user participation varies greatly, raising concerns about the ongoing reliability of PhishTank’s decisions due to the risk of manipulation by small numbers of people. We have described how PhishTank can be undermined by a phishing attacker bent on corrupting its classifications, and furthermore how the power-law distribution of user participation makes attacks simultaneously easier to carry out and harder to defend against.

Despite these problems, we do not advocate against leveraging user participation in the design of all security mechanisms. Rather, we believe that the circumstances must be examined more carefully for each application, and furthermore that threat models must address the potential for manipulation.

Bibliography

[1] PhishTank. http://www.phishtank.com/.

[2] Surowiecki, J. The wisdom of crowds: why the many are smarter than the few. Doubleday, New York (2004).

[3] Moore, T.; Clayton, R. Evaluating the wisdom of crowds in assessing phishing websites. 12th International Financial Cryptography and Data Security Conference (FC). LNCS, to appear. Springer (2008).

[4] McMillan, R. ‘Rock Phish’ blamed for surge in phishing. InfoWorld, 12 Dec 2006. http://www.infoworld.com/article/06/12/12/HNrockphish_1.html.

[5] Moore, T.; Clayton, R. Examining the impact of website take-down on phishing. Anti-Phishing Working Group eCrime Researcher’s Summit, pp.1–13. ACM Press, New York (2007).

[6] Larkin, E. Online thugs assault sites that specialize in security help. PC World, 11 Sep 2007. http://www.pcworld.com/businesscenter/article/137084/online_thugs_assault_sites_that_specialize_in_security_help.html.

[7] Douceur, J.R. The Sybil attack. 1st International Workshop on Peer-to-Peer Systems. Lecture Notes in Computer Science (LNCS), vol. 2429. Springer (2002) 251–260.
