pots using a low-false-negative filter (i.e., DPA bait messages are unlikely to escape).
False positives are unimportant for preliminary classification since the candidate sets
will be analyzed in depth by later heuristics (described below). Collection of candidate
emails from honeypot accounts will employ existing technology for spam filtering; a
common set of keywords and images will be used to screen emails and identify candidate messages.
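As a concrete illustration of this screening step, the following is a minimal sketch assuming a plain keyword match; the keyword list, function name, and hit threshold are hypothetical placeholders rather than the project's actual screening rules.

    # Illustrative preliminary screening of honeypot mail (not the actual filter).
    CANDIDATE_KEYWORDS = [
        "verify your account", "account suspended", "confirm your password",
        "update your billing information", "click the link below",
    ]

    def is_candidate(email_text: str, min_hits: int = 1) -> bool:
        """Flag a honeypot message as a DPA candidate if it matches enough keywords."""
        text = email_text.lower()
        hits = sum(1 for kw in CANDIDATE_KEYWORDS if kw in text)
        return hits >= min_hits

False positives at this stage simply enlarge the candidate set handed to the in-depth analysis described next.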
These candidate messages are forwarded to a phishing detection unit (PDU), which
performs in-depth feature analysis of each candidate message and the links it references
to determine its DPA likelihood. Initially, the PDU will examine email text and referenced
websites. As the system becomes more sophisticated, PDUs will examine images and
embedded scripts.
Text analysis. Phishers cannot use all the tricks of spammers (such as character substitution,
extra spaces, etc.) to defeat keyword filters, because their messages must
appear identical to authentic ones. They can, however, use morphological variations
at the level of HTML markup, for example by introducing invisible tags. Additionally,
a sophisticated attack may achieve a degree of polymorphism by combining pieces of
messages from a pool of candidate components and interspersing randomly selected
text rendered in a way that is invisible to the reader. We therefore need robust techniques
to detect any degree of similarity among the emails used in a DPA. Techniques for detecting
partial matches (e.g., [16, 5, 10]) have been successfully applied to detecting large-scale
polymorphic spam attacks [21]. Partial signatures for such documents collected
in honeypots or reported by users have been successfully coordinated both in commercial
products such as Brightmail and in open-source projects such as Vipul’s Razor [27]
and the Distributed Checksum Clearinghouse [11].
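As a concrete illustration, partial matches between two message bodies can be scored with word-level shingling and Jaccard resemblance; this is a sketch in the spirit of the cited techniques, not their exact algorithms, and the shingle size is an arbitrary illustrative choice.

    # Sketch: word-level k-shingling with Jaccard resemblance.
    def shingles(text, k=4):
        """Return the set of k-word shingles of a message body."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

    def resemblance(a, b, k=4):
        """Jaccard similarity of the two messages' shingle sets."""
        sa, sb = shingles(a, k), shingles(b, k)
        return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

Two DPA variants assembled from a shared pool of components still share many shingles, so their resemblance stays high despite interspersed random text.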
Link and functional analysis. It may be possible to detect a DPA by examining
hosts that are referenced in DPA candidate messages. Pages pointed to by candidate
members of the DPA are likely to exhibit a large degree of similarity as well; therefore,
techniques for detecting mirrors on the Web can be applied to this task [4, 7]. Traversing
links to check for this similarity is inherently unsafe for the end user, since following
a link may trigger undesired side effects. However, such an analysis could be performed
through information sharing between spam and phishing detection components, as
proposed in [12], and automatically by user agents operating within a sandboxed
environment. Once the PDU traverses a link in an email suspected of participating in
a DPA, the resulting Web pages can also be evaluated to determine whether they are
similar to pages already known to be part of a DPA. This resembles the analysis
performed by some phishing toolbars, but with an added degree of scrutiny applied to
suspicious emails.
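A hedged sketch of this step follows, assuming the fetch runs inside the sandboxed environment described above; the bag-of-words similarity measure and the threshold are illustrative assumptions, not the project's actual page-comparison method.

    import urllib.request

    def fetch_page(url, timeout=10):
        """Retrieve the page behind a suspect link (in practice, inside a sandbox)."""
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def page_similarity(a, b):
        """Crude bag-of-words Jaccard similarity between two pages."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa and wb else 0.0

    def matches_known_dpa(url, known_pages, threshold=0.7):
        """Flag the URL if its page closely mirrors a page already attributed to the DPA."""
        content = fetch_page(url)
        return any(page_similarity(content, p) >= threshold for p in known_pages)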
The PDU analysis is resource-intensive and therefore not suitable for real-time
filtering of incoming mail. Its primary goal is instead to build an individualized email
filter for each phishing attack. We thus frame the problem as a large collection of
classification problems, each with a different (and moving) target concept. Each PDU
will construct a filter, in the form of one or more classifiers, reflecting the attack
instances it has analyzed. The labeled examples needed to train each classifier will be
provided by the PDU’s in-depth analysis. Once trained, the classifier will quickly
recognize future instances of the DPA.
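One such per-attack classifier could be trained roughly as in the sketch below; scikit-learn, the TF-IDF features, and the logistic-regression learner are assumptions for illustration, since the approach does not prescribe a particular learning algorithm.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_attack_filter(dpa_emails, benign_emails):
        """Train one classifier for one DPA from examples labeled by the PDU."""
        texts = list(dpa_emails) + list(benign_emails)
        labels = [1] * len(dpa_emails) + [0] * len(benign_emails)
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)
        return clf

At delivery time, applying the trained classifier to an incoming message gives a cheap verdict, in contrast to the resource-intensive PDU analysis that produced its training labels.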
Individual classifiers in an ensemble will be obtained in two ways: by direct training us-