Gordon Cormack and Thomas Lynam



Date: 07.11.2018


A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail

  • Gordon Cormack and Thomas Lynam

  • Presented by Hui Fang


Feel free to interrupt whenever you have a question or comment!






What is Spam?

  • Typical legal definition

    • Unsolicited commercial email from someone without a pre-existing business relationship
  • Most commonly used definition

    • Whatever users consider to be spam


Unofficial Statistics of Spam (Feb. 3 to Feb. 12)



Spam Detection



Text classification alone is not enough

  • Spammers now often try to obscure text.

  • Special features are necessary.

    • E.g. subject line vs. body text
    • E.g. Mail in the middle of the night is more likely to be spam than mail in the middle of the day.
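
The two examples above can be sketched as a small feature extractor. This is only an illustration: the function name `extract_features`, the `subject:`/`body:` prefixes, and the 6 a.m. cutoff for "night" are assumptions made here, not features taken from any particular filter.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def extract_features(raw_message):
    """Return a set of string features for one plain-text e-mail message."""
    msg = message_from_string(raw_message)
    features = set()

    # Keep subject words and body words apart, so that "free" in the
    # subject line can be weighted differently from "free" in the body.
    for word in (msg.get("Subject") or "").lower().split():
        features.add("subject:" + word)
    body = msg.get_payload()
    if isinstance(body, str):  # this sketch skips multipart messages
        for word in body.lower().split():
            features.add("body:" + word)

    # Non-text feature: hour of day taken from the Date header.
    # The cutoff of 6 a.m. is an arbitrary assumption for illustration.
    date_header = msg.get("Date")
    if date_header:
        try:
            hour = parsedate_to_datetime(date_header).hour
            features.add("night" if hour < 6 else "day")
        except (TypeError, ValueError):
            pass  # unparseable date: emit no time feature
    return features
```

A learner can then treat each returned string as one feature, so textual and non-textual evidence share the same representation.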


Weather Report Guy

  • Content in Image



Secret Decoder Ring Dude

  • Another spam that looks easy

  • Is it?



Secret Decoder Ring Dude

  • Character Encoding

  • HTML word breaking
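
A filter can undo some of these tricks before tokenizing. The sketch below is a minimal illustration using only the Python standard library; the function name `normalize` is hypothetical, and the regular expression is a deliberate simplification, not a real HTML parser.

```python
import html
import re

# Simplified tag matcher: anything between "<" and the next ">".
# It does not handle ">" inside comments or attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def normalize(text):
    """Undo two simple obfuscations: HTML word breaking and character references."""
    # Delete tags without inserting a space, so a word split by an empty
    # tag or comment ("di<!-- x -->ploma") is joined back together.
    text = TAG_RE.sub("", text)
    # Decode character references such as "&#118;" (the letter "v").
    text = html.unescape(text)
    return text.lower()
```

For example, `normalize("di<!-- x -->ploma")` gives `"diploma"`. One trade-off of deleting tags outright is that words separated only by a tag such as `<br>` are merged; a fuller version would treat block-level tags as whitespace.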



Diploma Guy

  • Word Obscuring



More of Diploma Guy

  • Diploma Guy is good at what he does



One Solution to Spam Detection

  • Machine Learning

    • Learn to distinguish spam from good mail


Naïve Bayes

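  • Want: $P(\text{spam} \mid \text{words})$

  • Use Bayes Rule: $P(\text{spam} \mid \text{words}) = \dfrac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$

  • Assume independence: probability of each word independent of others, so $P(\text{words} \mid \text{spam}) = \prod_i P(\text{word}_i \mid \text{spam})$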
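
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any filter discussed here: the class name `NaiveBayes`, the add-one smoothing, and the use of log probabilities are choices made for this sketch.

```python
import math
from collections import Counter


class NaiveBayes:
    """Word-count Naive Bayes over two classes, "spam" and "ham"."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.message_counts[label] += 1

    def log_score(self, words, label):
        """log P(label) + sum of log P(word | label), with add-one smoothing.

        Assumes at least one training message per class.
        """
        counts = self.word_counts[label]
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        prior = self.message_counts[label] / sum(self.message_counts.values())
        score = math.log(prior)
        for word in words:
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((counts[word] + 1) / (total + len(vocab) + 1))
        return score

    def p_spam(self, words):
        """P(spam | words); P(words) is obtained by normalizing both classes."""
        s = self.log_score(words, "spam")
        h = self.log_score(words, "ham")
        m = max(s, h)  # subtract the max before exp() to avoid underflow
        es, eh = math.exp(s - m), math.exp(h - m)
        return es / (es + eh)
```

Because the denominator P(words) is the same for both classes, it never has to be estimated directly: normalizing the two class scores gives the posterior.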



A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

  • One of the first papers on using machine learning to combat spam

  • Used Naïve Bayes

  • Feature Space: Words, Phrases, Domain-Specific Features

  • Evaluation Data: ~1700 messages, ~88% spam, from a volunteer’s private e-mail



A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

  • Hand Crafted Features

    • 35 Phrases
      • ‘Free Money’
      • ‘Only $’
      • ‘be over 21’
    • 20 Domain Specific Features
      • Domain type of sender (.edu, .com, etc)
      • Sender name resolutions (internal mail)
      • Has attachments
      • Time received
      • Percent of non-alphanumeric characters in subject
  • Best collection of heuristics discussed in literature

    • Without them: Spam precision 97.1%, Spam recall 94.3%
    • With them: Spam precision 100%, Spam recall 98.3%
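
For reference, spam precision and spam recall can be computed as follows. This is a generic sketch with a hypothetical function name; the labels and the toy data in the usage note are made up and are not from the paper.

```python
def spam_precision_recall(predicted, actual):
    """Return (precision, recall) for the "spam" class.

    precision = spam correctly flagged / all messages flagged as spam
    recall    = spam correctly flagged / all messages that really are spam
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == "spam" and a == "spam")
    fp = sum(1 for p, a in pairs if p == "spam" and a == "ham")
    fn = sum(1 for p, a in pairs if p == "ham" and a == "spam")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With `actual = ["spam", "spam", "spam", "ham", "ham"]` and `predicted = ["spam", "spam", "ham", "spam", "ham"]`, both values are 2/3: one good message was flagged (hurting precision) and one spam was missed (hurting recall).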


A Plan for Spam 2002 – P. Graham

  • Widely cited in the open source community

  • Uses a heavily tuned version of Naïve Bayes

  • Feature Space: Words in header and body

  • Feature Selection: ~23,000 features

    • all that appeared more than 5 times
  • Evaluation Data: ~8000 messages from author; ~50% spam

  • Results: Spam precision 100%, Spam recall 99.5%
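
The feature selection step above (keep every token that appeared more than 5 times) can be sketched as below. The function name `select_features` and the input format (one token list per message) are assumptions for illustration; this is not Graham's code.

```python
from collections import Counter


def select_features(messages, min_count=5):
    """Keep tokens that occur more than `min_count` times across all messages.

    `messages` is an iterable of token lists, one list per message.
    """
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return {token for token, n in counts.items() if n > min_count}
```

Dropping rare tokens keeps the feature set manageable and avoids estimating probabilities from only a handful of occurrences.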



Algorithms Used in Spam Detection



Which Algorithm is Best?

  • Very difficult to tell

    • No consistently-used good data set
    • No standard evaluation measures




Overview of the Paper



Problem: Supervised Spam Detection



Methods

  • Methods in six open-source spam filters

    • SpamAssassin
    • Bogofilter
    • CRM114
    • DSPAM
    • SpamBayes
    • SpamProbe


Data

  • One person’s e-mail over eight months

    • From Aug. 2003 to March 2004
  • Stored in the order received

  • 49,086 messages with judgements

    • 9,038 (18.4%) ham
    • 40,048 (81.6%) spam
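
Because the messages are stored in the order received and each carries a judgement, a filter can be evaluated by replaying the stream: classify each message first, then reveal its true label for training. The sketch below shows one way to do this, under the assumption that a filter exposes `classify` and `train` methods. That interface and the toy `MajorityFilter` are hypothetical, not the API of any of the six filters.

```python
class MajorityFilter:
    """Toy stand-in filter: predicts whichever label it has seen most so far."""

    def __init__(self):
        self.seen = {"ham": 0, "spam": 0}

    def classify(self, text):
        return "spam" if self.seen["spam"] > self.seen["ham"] else "ham"

    def train(self, text, label):
        self.seen[label] += 1


def evaluate_in_order(spam_filter, messages):
    """Replay (text, true_label) pairs in arrival order.

    Returns (ham misclassification rate, spam misclassification rate).
    """
    totals = {"ham": 0, "spam": 0}
    errors = {"ham": 0, "spam": 0}
    for text, truth in messages:
        guess = spam_filter.classify(text)  # classify before seeing the label
        totals[truth] += 1
        if guess != truth:
            errors[truth] += 1
        spam_filter.train(text, truth)      # then learn from the judgement
    ham_rate = errors["ham"] / totals["ham"] if totals["ham"] else 0.0
    spam_rate = errors["spam"] / totals["spam"] if totals["spam"] else 0.0
    return ham_rate, spam_rate
```

Reporting the two error rates separately matters with a corpus like this one: at 81.6% spam, a filter that labels everything as spam would look accurate overall while losing every good message.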


Evaluation Measures (1)



Evaluation Measures (2)

  • Ham/Spam tradeoff curve, i.e. ROC curve
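
A tradeoff curve of this kind can be traced by sweeping a threshold over the filter's spam scores. The following is a generic sketch, assuming each message has a numeric score where higher means more spam-like and that no two scores are tied; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points as (ham misclassified fraction, spam caught fraction).

    `labels` holds True for spam and False for ham. Assumes both classes
    are present and that scores are distinct.
    """
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    points = [(0.0, 0.0)]
    caught = lost = 0
    # Lower the threshold one message at a time, most spam-like first.
    for _, is_spam in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if is_spam:
            caught += 1   # spam correctly filtered
        else:
            lost += 1     # ham wrongly filtered
        points.append((lost / n_ham, caught / n_spam))
    return points


def area_under(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 1.0 means every spam scored higher than every ham; 0.5 is what random scores would give. Unlike a single precision/recall pair, the curve does not depend on where the filter's threshold happens to be set.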



Evaluation Measures (3)



Misclassification by Genre

  • Not all types of ham are equal

    • Some more likely misclassified
    • Some more likely missed if filtered
    • Some more valuable
  • Spam can similarly be classified



Conclusion

  • Presents several possible evaluation measures for spam detection

  • Compares several spam detection methods

  • Provides analysis of the experimental results

  • However, it would be more interesting to compare the performance of different algorithms (e.g. NB vs. SVM).



The End

  • Thank you!


