Ian Stuart, Sung-Hyuk Cha, and Charles Tappert





A Neural Network Classifier for Junk E-Mail

  • Ian Stuart, Sung-Hyuk Cha, and Charles Tappert

  • CSIS Student/Faculty Research Day

  • May 7, 2004


Spam, spam, spam, …



Fighting spam

  • Several commercial applications exist

    • Server-side: expensive
    • Client-side: time-consuming
  • No approach is 100% effective

    • Spammers are aggressive and adaptable
    • Best solutions are typically hybrids of different approaches and criteria


Common approaches

  • Simple filters

  • Blacklisting: “just say NO” (if you can)

    • Reject e-mail from known spammers
  • Whitelisting: “friends only, please”

    • Accept e-mail only from known correspondents
  • Classifiers: examine each e-mail and decide

    • Only a few publications on spam classifiers


Naïve Bayesian classifiers

  • Used in commercial classifiers

  • Assumes recognition features are independent

    • Max likelihood = product of likelihoods of features
  • E-mail classifier – examines each word

    • Training assigns a probability to each word
    • Look up each word/probability in a dictionary
    • If the product of the probabilities exceeds a given threshold, it is spam
  • Challenge – creating the “dictionary”

  • We compare our Neural Network against two published Naïve Bayesian classifiers
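The word-probability scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of either published classifier: the add-one smoothing, the use of log-probabilities instead of a raw product, and the default threshold are all assumptions, and the names `train`, `spam_log_odds`, and `is_spam` are hypothetical.

```python
import math
from collections import Counter

def train(messages):
    """Build the word "dictionary" from (list_of_words, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    n_msgs = {True: 0, False: 0}
    for words, spam in messages:
        counts[spam].update(set(words))  # count each word once per message
        n_msgs[spam] += 1
    return counts, n_msgs

def word_prob(word, label, counts, n_msgs):
    # Assumption: add-one smoothing, so unseen words get a nonzero probability.
    return (counts[label][word] + 1) / (n_msgs[label] + 2)

def spam_log_odds(words, counts, n_msgs):
    # Independence assumption: per-word likelihoods multiply, so their logs add.
    score = math.log(n_msgs[True]) - math.log(n_msgs[False])
    for w in set(words):
        score += math.log(word_prob(w, True, counts, n_msgs))
        score -= math.log(word_prob(w, False, counts, n_msgs))
    return score

def is_spam(words, counts, n_msgs, threshold=0.0):
    # Assumption: a threshold of 0 means "more likely spam than legitimate".
    return spam_log_odds(words, counts, n_msgs) > threshold

training = [
    (["free", "money", "now"], True),
    (["cheap", "money"], True),
    (["tennis", "schedule"], False),
    (["meeting", "schedule", "now"], False),
]
counts, n_msgs = train(training)
print(is_spam(["free", "money"], counts, n_msgs))
print(is_spam(["tennis", "schedule"], counts, n_msgs))
```

Raising the threshold trades missed spam for fewer legitimate e-mails flagged as spam.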



Naïve Bayesian classifier issues

  • How many features (words), which ones?

  • How is degradation avoided as spammers’ vocabulary changes?

  • What values are assigned to new words?

  • What are the thresholds?

  • How to avoid “sabotage” of classifier?



Which one isn’t spam? (subject headers)

  • 5 Be a mighty warrior in bed! vcrhwt ygjztyjjh

  • Money Back Guarantee_HGH

  • kindle life pddez liw mzac

  • v a l i u m - D i a z e p a m used to relieve anxiety

  • Fairfield tennis schedule

  • :Dramatic E,nhancement fo=r .Men = f"fumqid

  • ,Refina'nce now. Don't wait




Spammers make patterns

  • The more they try to hide, the easier it is to see them

  • Therefore, we use common spammer patterns (instead of vocabulary) as features for classification

  • Learn these patterns with a Neural Network



Neural Network features

  • Total of 17 features

    • 6 from the subject header
    • 2 from priority and content-type headers
    • 9 from the e-mail body


Features from subject header

  • Number of words with no vowels

  • Number of words with at least two of the letters J, K, Q, X, Z

  • Number of words with at least 15 characters

  • Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word

  • Number of words with all letters in uppercase

  • Binary feature indicating 3 or more repeated characters
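One possible way to compute these six subject-header features is sketched below. The slides do not give the parsing rules, so several details are assumptions: words are whitespace-separated tokens, "y" is not counted as a vowel, "at least two of the letters" counts occurrences (so "jj" qualifies), and "repeated characters" means the same character three times in a row. The function name `subject_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")   # assumption: "y" is not a vowel
RARE = set("jkqxz")

def subject_features(subject):
    words = subject.split()  # assumption: whitespace-separated tokens
    no_vowels = sum(
        1 for w in words
        if any(c.isalpha() for c in w) and not (set(w.lower()) & VOWELS)
    )
    rare = sum(1 for w in words if sum(c in RARE for c in w.lower()) >= 2)
    long_words = sum(1 for w in words if len(w) >= 15)
    # Non-letter or non-ASCII character at the beginning or middle of the word,
    # i.e. anywhere except the final position.
    odd = sum(
        1 for w in words
        if any(not (c.isascii() and c.isalpha()) for c in w[:-1])
    )
    all_upper = sum(1 for w in words if w.isupper())
    repeated = 1 if re.search(r"(.)\1\1", subject) else 0
    return [no_vowels, rare, long_words, odd, all_upper, repeated]

print(subject_features("Fairfield tennis schedule"))
print(subject_features("5 Be a mighty warrior in bed! vcrhwt ygjztyjjh"))
```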



Features from priority and content-type headers

  • Binary feature indicating whether the priority had been set to any level besides normal or medium

  • Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
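A hedged sketch of the two header features, using Python's standard `email` package. The slides do not name the exact priority header or its values, so checking `X-Priority` and `Priority` and treating "normal", "medium", and "3" as the normal levels are assumptions; `header_features` is a hypothetical name.

```python
from email import message_from_string

NORMAL_PRIORITIES = {"normal", "medium", "3"}  # assumption: these mean "not raised"

def header_features(raw_message):
    msg = message_from_string(raw_message)

    # Feature 1: priority set to anything besides normal or medium.
    priority = msg.get("X-Priority") or msg.get("Priority")
    priority_flag = 0
    if priority:
        tokens = priority.strip().lower().split()
        level = tokens[0] if tokens else ""   # handles values like "3 (Normal)"
        if level and level not in NORMAL_PRIORITIES:
            priority_flag = 1

    # Feature 2: mirrors the slide wording literally (header present OR text/html).
    has_content_type = msg.get("Content-Type") is not None
    is_html = msg.get_content_type() == "text/html"
    content_flag = 1 if (has_content_type or is_html) else 0

    return [priority_flag, content_flag]

spam_like = "From: a@example.com\nX-Priority: 1\nContent-Type: text/html\n\n<html></html>"
plain = "From: a@example.com\nSubject: hi\n\nhello"
print(header_features(spam_like), header_features(plain))
```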



Features from message body

  • Proportion of alphabetic words with no vowels and at least 7 characters

  • Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z

  • Proportion of alphabetic words at least 15 characters long

  • Binary feature indicating whether the strings “From:” and “To:” were both present

  • Number of HTML opening comment tags

  • Number of hyperlinks (“href=”)

  • Number of clickable images represented in HTML

  • Binary feature indicating whether a text color was set to white

  • Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
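The nine body features could be extracted along these lines. This is a rough sketch with regular expressions rather than a real HTML parser, and the details are assumptions: alphabetic words are whitespace-separated ASCII-letter tokens, a clickable image is an `<img>` directly inside an `<a>` tag, and white text is detected by `color` set to `white`, `#fff`, or `#ffffff`. The name `body_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")   # assumption: "y" is not a vowel
RARE = set("jkqxz")

def body_features(body):
    words = [w.lower() for w in body.split() if w.isascii() and w.isalpha()]
    n = len(words) or 1  # avoid division by zero on empty bodies

    p_no_vowel = sum(len(w) >= 7 and not (set(w) & VOWELS) for w in words) / n
    p_rare = sum(sum(c in RARE for c in w) >= 2 for w in words) / n
    p_long = sum(len(w) >= 15 for w in words) / n

    from_to = int("From:" in body and "To:" in body)
    comments = body.count("<!--")
    n_links = len(re.findall(r"href\s*=", body, flags=re.I))
    clickable = len(re.findall(r"<a\b[^>]*>\s*<img\b", body, flags=re.I))
    # The lookbehind skips "bgcolor" and "background-color".
    white = int(bool(re.search(
        r'(?<![\w-])color\s*[=:]\s*["\']?\s*(?:white|#fff(?:fff)?)\b',
        body, flags=re.I)))
    urls = re.findall(r'href\s*=\s*["\']?([^"\'\s>]+)', body, flags=re.I)
    odd_urls = sum(bool(re.search(r"[0-9&%@]", u)) for u in urls)

    return [p_no_vowel, p_rare, p_long, from_to, comments,
            n_links, clickable, white, odd_urls]

sample = ('From: x To: y <!-- hidden --> <font color="#FFFFFF">zzqxkrtplm</font> '
          '<a href="http://192.0.2.1/buy?id=5"><img src="a.gif"></a> '
          '<a href="http://example.com/">plain link</a> hello world')
print(body_features(sample))
```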



Neural Network spam classifier

  • 3-layer, feed-forward network (multilayer perceptron)

    • 17 input units, a variable number of hidden-layer units, 1 output unit
  • Data – 1,654 e-mails: 854 spam, 800 legitimate

  • Use half of each (spam/non-spam) for training, the other half for testing

  • Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
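The architecture above (17 inputs, one hidden layer, 1 output) can be sketched in plain Python. The slides do not state the activation function, learning rule, or learning rate, so the sigmoid units, per-example backpropagation on squared error, and `lr` value here are assumptions; `ThreeLayerNet` and `train` are hypothetical names, and the toy data stands in for real feature vectors.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ThreeLayerNet:
    def __init__(self, n_in=17, n_hidden=12, seed=0):
        rng = random.Random(seed)
        # One weight row per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr):
        hidden, out = self.forward(x)
        # Error signal at the output unit (squared error through a sigmoid).
        delta_out = (out - target) * out * (1.0 - out)
        for j, row in enumerate(self.w_hidden):
            # Backpropagate using the output weight before it is updated.
            delta_h = delta_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
            for i, v in enumerate(x):
                row[i] -= lr * delta_h * v
            row[-1] -= lr * delta_h
            self.w_out[j] -= lr * delta_out * hidden[j]
        self.w_out[-1] -= lr * delta_out

def train(net, data, epochs=500, lr=0.5):
    for _ in range(epochs):
        for x, target in data:
            net.train_step(x, target, lr)

# Toy data only: an all-ones vector labeled spam (1), an all-zeros vector labeled legitimate (0).
ones, zeros = [1.0] * 17, [0.0] * 17
net = ThreeLayerNet(n_in=17, n_hidden=12, seed=0)
train(net, [(ones, 1.0), (zeros, 0.0)], epochs=500, lr=0.5)
print(net.forward(ones)[1], net.forward(zeros)[1])
```

An output above 0.5 would be read as spam. With real count features, scaling the inputs to a small range first would likely be needed to keep the sigmoids from saturating.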



Definitions used for classifier success measures

  • nSS = number of spam classified as spam

  • nSL = number of spam classified as legitimate

  • nLL = number of legitimate classified as legitimate

  • nLS = number of legitimate classified as spam



Measure of success: precision

  • Precision: of the e-mails the classifier labels as spam (or as legitimate), the percentage that really are spam (or legitimate)
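The formula for this slide did not survive in the text. Written with the counts defined earlier, precision would presumably be the following; this is a reconstruction from the verbal definition, not a copy of the original slide.

```latex
\[
\text{Spam precision} = \frac{n_{SS}}{n_{SS} + n_{LS}},
\qquad
\text{Legitimate precision} = \frac{n_{LL}}{n_{LL} + n_{SL}}
\]
```

Each denominator is the number of e-mails the classifier assigned to that class, whether correctly or not.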



Measure of success: accuracy

  • Accuracy: of the e-mails that actually are spam (or legitimate), the percentage that the classifier labels correctly
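In terms of the counts defined earlier, this per-class accuracy would presumably be written as follows. It is a reconstruction from the verbal definition; the same quantity is usually called recall elsewhere.

```latex
\[
\text{Spam accuracy} = \frac{n_{SS}}{n_{SS} + n_{SL}},
\qquad
\text{Legitimate accuracy} = \frac{n_{LL}}{n_{LL} + n_{LS}}
\]
```

Each denominator is the number of e-mails that truly belong to that class.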



Neural Network results

  • Best overall results with 12 hidden nodes at 500 epochs

    • Spam Precision: 92.45%
    • Legitimate Precision: 91.32%
    • Spam Accuracy: 91.80%
    • Legitimate Accuracy: 92.00%
  • 35 spam e-mails misclassified as legitimate: 8.20%

  • 32 legitimate e-mails misclassified as spam: 8.00%
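The reported percentages can be checked by arithmetic. The sketch below assumes the test set is exactly half of each class (427 spam and 400 legitimate) and takes precision and accuracy as ratios of the counts nSS, nSL, nLL, and nLS; the function name `success_measures` is hypothetical.

```python
def success_measures(n_ss, n_sl, n_ll, n_ls):
    """Return the four success measures as percentages."""
    return {
        "spam_precision": 100 * n_ss / (n_ss + n_ls),
        "legit_precision": 100 * n_ll / (n_ll + n_sl),
        "spam_accuracy": 100 * n_ss / (n_ss + n_sl),
        "legit_accuracy": 100 * n_ll / (n_ll + n_ls),
    }

# Assumption: test halves of the 854 spam and 800 legitimate e-mails.
n_spam_test, n_legit_test = 854 // 2, 800 // 2
n_sl, n_ls = 35, 32                       # misclassified counts from the slide
measures = success_measures(
    n_ss=n_spam_test - n_sl, n_sl=n_sl,
    n_ll=n_legit_test - n_ls, n_ls=n_ls,
)
for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumptions the four values round to 92.45%, 91.32%, 91.80%, and 92.00%, matching the slide.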



Misclassified e-mails

  • Most spam misclassified as legitimate were short in length, with few hyperlinks

  • Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)



Comparing Neural Network and Naïve Bayesian Classifiers

  • Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers

  • NN classifier required fewer features (17 versus 100 in one study and 500 in another)

  • NN classifier uses descriptive qualities of words and messages similar to those used by human readers



Blacklisting Experiment

  • Manually entered the IP addresses of e-mails misclassified by the NN classifier into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits: http://www.declude.com/junkmail/support/ip4r.htm

    • Entered the first (original) IP address and, when present, the second IP address (e.g., mail server or ISP)
  • Counted an e-mail as spam only when the hit count was greater than one, since single-list hits were considered to be anomalies
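The counting rule can be expressed as a small function. The slides do not say how the two IP addresses of one e-mail were combined, so flagging the e-mail when either address exceeds one hit is an assumption; `blacklist_says_spam` is a hypothetical name and the hit counts below are made up for illustration.

```python
def blacklist_says_spam(hit_counts, min_hits=2):
    """hit_counts: blacklist hits for each IP address taken from one e-mail.

    A single hit is treated as an anomaly, so at least `min_hits` lists
    must report an address before the e-mail counts as spam.
    """
    # Assumption: one sufficiently listed address is enough to flag the e-mail.
    return any(hits >= min_hits for hits in hit_counts)

print(blacklist_says_spam([1]))      # single-list hit: ignored
print(blacklist_says_spam([0, 7]))   # second IP listed on several blacklists
```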



Blacklisting Experimental Results

  • Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam by the blacklists

  • Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam by the blacklists

  • These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate



Conclusions

  • NN is competitive with the Naïve Bayesian studies despite using a much smaller feature set

  • Room for refinement of parsing for features

  • Use of descriptive, more human-like features makes NN less subject to degradation than Naïve Bayesian



Conclusions (cont.)

  • Neural Network approach is useful and accurate, but it misclassifies too many legitimate e-mails as spam (legitimate -> spam)

  • Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam (nLS), increasing spam precision and legitimate accuracy

  • Blacklisting strategy is not very helpful
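A hybrid of the kind suggested above might look like this: the whitelist is consulted first, and only e-mail from unknown senders is passed to the network. This is an illustrative sketch, not something evaluated in the study; the 0.5 threshold, the name `classify`, and the example addresses are assumptions.

```python
def classify(sender, nn_output, whitelist, threshold=0.5):
    """Combine a whitelist with the NN output (a value between 0 and 1)."""
    # Known correspondents are never sent to the spam folder, which removes
    # legitimate -> spam errors (nLS) for those senders.
    if sender.lower() in whitelist:
        return "legitimate"
    return "spam" if nn_output >= threshold else "legitimate"

whitelist = {"coach@example.org"}
print(classify("Coach@example.org", 0.91, whitelist))   # whitelisted sender
print(classify("offers@example.net", 0.91, whitelist))  # unknown sender
```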


