A Neural Network Classifier for Junk E-Mail

Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
CSIS Student/Faculty Research Day
May 7, 2004

Spam, spam, spam, …

Fighting spam

Several commercial applications exist

Server-side: expensive
Client-side: time-consuming

No approach is 100% effective

Spammers are aggressive and adaptable
Best solutions are typically hybrids of different approaches and criteria

Common approaches

Simple filters

Common words or phrases
Unusual punctuation or capitalization

Blacklisting: “just say NO” (if you can)

Reject e-mail from known spammers

Whitelisting: “friends only, please”

Accept e-mail only from known correspondents

Classifiers: examine each e-mail and decide

Only a few publications on spam classifiers

Naïve Bayesian classifiers

Used in commercial classifiers
Assumes recognition features are independent

Max likelihood = product of likelihoods of features

E-mail classifier – examines each word

Training assigns a probability to each word
Look up each word/probability in a dictionary
If the product of the probabilities exceeds a given threshold, it is spam

Challenge – creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
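
The word-by-word scoring described above can be sketched as follows. This is a minimal illustration, not one of the published classifiers: the dictionary entries, the value given to unknown words, and the threshold are all made-up assumptions, and the product of probabilities is normalized against the corresponding "legitimate" product so that the threshold does not depend on message length.

```python
import math

# Hypothetical per-word spam probabilities; a real dictionary would be
# learned from labeled training mail.
SPAM_PROB = {"guarantee": 0.95, "refinance": 0.97, "tennis": 0.05, "schedule": 0.10}
UNKNOWN_WORD_PROB = 0.4  # assumed value for words missing from the dictionary
THRESHOLD = 0.9          # assumed decision threshold


def spam_score(text: str) -> float:
    """Combine per-word probabilities under the independence assumption."""
    log_spam = 0.0
    log_legit = 0.0
    for word in text.lower().split():
        p = SPAM_PROB.get(word.strip(".,!?:;"), UNKNOWN_WORD_PROB)
        log_spam += math.log(p)         # log of the product of spam likelihoods
        log_legit += math.log(1.0 - p)  # log of the product of legitimate likelihoods
    # Normalize the two products so the score lies between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_legit - log_spam))


def is_spam(text: str) -> bool:
    return spam_score(text) > THRESHOLD
```

Every constant in this sketch corresponds to one of the open questions on the next slide: which words go in the dictionary, what value a new word receives, and where the threshold sits.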

Naïve Bayesian classifier issues

How many features (words), which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of classifier?

Which one isn’t spam? (subject headers)

5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
Money Back Guarantee_HGH
kindle life pddez liw mzac
v a l i u m - D i a z e p a m used to relieve anxiety
Fairfield tennis schedule
:Dramatic E,nhancement fo=r .Men = f"fumqid
,Refina'nce now. Don't wait

Spammers make patterns

The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network

Neural Network features

Total of 17 features

6 from the subject header
2 from priority and content-type headers
9 from the e-mail body

Features from subject header

Number of words with no vowels
Number of words with at least two of the letters J, K, Q, X, Z
Number of words with at least 15 characters
Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word
Number of words with all letters in uppercase
Binary feature indicating 3 or more repeated characters
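
The six subject-header features might be extracted along these lines. This is a sketch under stated assumptions, not the authors' parser: words are split on whitespace, "at least two of the letters" is read as two or more occurrences, and any non-letter character before a word's last position counts as a special character.

```python
import re

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")


def subject_features(subject: str) -> list[int]:
    words = subject.split()
    # Letters-only view of each word, used for the vowel and case checks.
    letters_only = [re.sub(r"[^A-Za-z]", "", w) for w in words]
    no_vowels = sum(1 for w in letters_only if w and not (set(w.lower()) & VOWELS))
    rare = sum(1 for w in letters_only
               if sum(c in RARE_LETTERS for c in w.lower()) >= 2)
    long_words = sum(1 for w in words if len(w) >= 15)
    # Non-letter at the beginning or middle, i.e. anywhere but the last position.
    odd_chars = sum(1 for w in words if re.search(r"[^A-Za-z]", w[:-1]))
    all_upper = sum(1 for w in letters_only if w and w.isupper())
    # Binary: the same character three or more times in a row.
    repeated = int(re.search(r"(.)\1\1", subject) is not None)
    return [no_vowels, rare, long_words, odd_chars, all_upper, repeated]
```

Applied to the first sample subject from the earlier slide, this counts two vowel-free words ("vcrhwt", "ygjztyjjh") and one word with repeated rare letters. Under this reading an apostrophe inside a word (as in "Don't") also counts as a special character, which a more careful parser might exclude.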

Features from priority and content-type headers

Binary feature indicating whether the priority had been set to any level besides normal or medium
Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
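
A possible way to compute these two header features with Python's standard `email` package is shown below. The header names checked for priority (`X-Priority`, `Priority`) and the spellings treated as "normal or medium" are assumptions; the slides do not say which headers were inspected.

```python
from email import message_from_string
from email.message import Message

# Assumed spellings of a "normal or medium" priority level.
NORMAL_LEVELS = {"3", "normal", "medium"}


def header_features(msg: Message) -> list[int]:
    raw = (msg.get("X-Priority") or msg.get("Priority") or "").strip()
    # Values such as "1 (Highest)" carry the level in the first token.
    level = raw.split()[0].lower() if raw else ""
    priority_flag = int(bool(level) and level not in NORMAL_LEVELS)
    has_header = msg["Content-Type"] is not None
    is_html = msg.get_content_type() == "text/html"
    return [priority_flag, int(has_header or is_html)]


sample = message_from_string(
    "X-Priority: 1 (Highest)\nContent-Type: text/html\n\n<html></html>"
)
print(header_features(sample))
```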

Features from message body

Proportion of alphabetic words with no vowels and at least 7 characters
Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z
Proportion of alphabetic words at least 15 characters long
Binary feature indicating whether the strings “From:” and “To:” were both present
Number of HTML opening comment tags
Number of hyperlinks (“href=”)
Number of clickable images represented in HTML
Binary feature indicating whether a text color was set to white
Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
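
The nine body features could be approximated with a few regular expressions, as in the sketch below. The tokenization (whitespace-separated, purely alphabetic tokens), the pattern for a "clickable image" (an `<img>` directly inside an `<a>` tag), and the spellings of white recognized here are all assumptions made for illustration.

```python
import re

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")
WHITE_COLOR = re.compile(r"color\s*[=:]\s*[\"']?\s*(?:#ffffff|#fff|white)\b", re.I)
CLICKABLE_IMG = re.compile(r"<a\b[^>]*>\s*<img\b", re.I)
HREF_URL = re.compile(r"href\s*=\s*[\"']?([^\"'\s>]+)", re.I)


def body_features(body: str) -> list[float]:
    words = [w.lower() for w in body.split() if w.isalpha()]
    total = len(words) or 1  # avoid division by zero when no alphabetic words exist
    no_vowel_long = sum(1 for w in words if len(w) >= 7 and not (set(w) & VOWELS))
    rare = sum(1 for w in words if sum(c in RARE_LETTERS for c in w) >= 2)
    very_long = sum(1 for w in words if len(w) >= 15)
    urls = HREF_URL.findall(body)
    return [
        no_vowel_long / total,                       # proportion: no vowels, >= 7 chars
        rare / total,                                # proportion: two or more of J, K, Q, X, Z
        very_long / total,                           # proportion: >= 15 chars
        int("From:" in body and "To:" in body),      # both strings present
        body.count("<!--"),                          # HTML opening comment tags
        body.lower().count("href="),                 # hyperlinks
        len(CLICKABLE_IMG.findall(body)),            # clickable images
        int(WHITE_COLOR.search(body) is not None),   # a color set to white
        sum(1 for u in urls if re.search(r"[0-9&%@]", u)),  # suspicious URLs
    ]
```

The white-color pattern here would also match a white background color; distinguishing text color from background color needs real HTML parsing, which is one place where feature parsing could be refined.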

Neural Network spam classifier

3-layer, feed-forward network (multilayer Perceptron)

17 input units, variable # hidden layer units, 1 output unit

Data – 1,654 e-mails: 854 spam, 800 legitimate
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
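
A network of this shape can be written in a few dozen lines of plain Python. The sketch below assumes sigmoid units, squared-error loss, and stochastic gradient descent with a fixed learning rate; the slides specify none of these, only the layer sizes and the ranges of hidden nodes and epochs.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer of sigmoid units and a single sigmoid output."""

    def __init__(self, n_inputs: int = 17, n_hidden: int = 12, seed: int = 0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # zip() stops at len(x), so the trailing bias is added separately.
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def predict(self, x) -> float:
        return self.forward(x)[1]

    def train_step(self, x, target: float, lr: float) -> None:
        hidden, out = self.forward(x)
        # Gradient of the squared error through the output sigmoid.
        delta_out = (out - target) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            # Backpropagate through hidden unit j before changing w_out[j].
            delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
            for i, v in enumerate(x):
                self.w_hidden[j][i] -= lr * delta_h * v
            self.w_hidden[j][-1] -= lr * delta_h
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out


def train(model: TinyMLP, samples, epochs: int = 100, lr: float = 0.5, seed: int = 0):
    """samples: list of (feature_vector, label) pairs, label 1.0 = spam."""
    rng = random.Random(seed)
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            model.train_step(x, y, lr)
```

Varying `n_hidden` from 4 to 14 and `epochs` from 100 to 500 would mirror the experiment grid described above; the output can be thresholded (for example at 0.5, another assumption) to obtain a spam/legitimate decision.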

Definitions used for classifier success measures

nSS = number of spam classified as spam
nSL = number of spam classified as legitimate
nLL = number of legitimate classified as legitimate
nLS = number of legitimate classified as spam

Measure of success: precision

Precision: the percentage of e-mails labeled spam (or legitimate) by the classifier that are correctly classified

Measure of success: accuracy

Accuracy: the percentage of actual spam (or legitimate) e-mails that are correctly classified

Neural Network results

Best overall results with 12 hidden nodes at 500 epochs

Spam Precision: 92.45%
Legitimate Precision: 91.32%
Spam Accuracy: 91.80%
Legitimate Accuracy: 92.00%

35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%
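
The reported figures can be checked against the four counts nSS, nSL, nLL, and nLS. The formulas below are inferred from the verbal definitions of precision and accuracy rather than copied from the slides, and the test-set sizes (427 spam, 400 legitimate) assume an exact half-and-half split of the 854 and 800 e-mails. Under those assumptions the formulas reproduce all four percentages.

```python
def success_measures(n_ss: int, n_sl: int, n_ll: int, n_ls: int) -> dict[str, float]:
    """Percentages computed from the four classification counts."""
    return {
        # Of everything labeled spam, how much really is spam.
        "spam_precision": 100.0 * n_ss / (n_ss + n_ls),
        # Of everything labeled legitimate, how much really is legitimate.
        "legit_precision": 100.0 * n_ll / (n_ll + n_sl),
        # Of the actual spam, how much was caught.
        "spam_accuracy": 100.0 * n_ss / (n_ss + n_sl),
        # Of the actual legitimate mail, how much was passed.
        "legit_accuracy": 100.0 * n_ll / (n_ll + n_ls),
    }


# Test set: half of 854 spam = 427, half of 800 legitimate = 400.
measures = success_measures(n_ss=427 - 35, n_sl=35, n_ll=400 - 32, n_ls=32)
for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
```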

Misclassified e-mails

Most spam misclassified as legitimate were short in length, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)

Comparing Neural Network and Naïve Bayesian Classifiers

Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers

Blacklisting Experiment

Manually entered the IP addresses of e-mails misclassified by the NN classifier into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits: http://www.declude.com/junkmail/support/ip4r.htm

Entered the first (original) IP address and, when present, the second IP address (e.g., mail server or ISP)

Counted only hit counts greater than one as spam, since single-list hits were taken to be anomalies

Blacklisting Experimental Results

Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam by the blacklists
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam by the blacklists
These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate

Conclusions

NN is competitive with the published Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes NN less subject to degradation than Naïve Bayesian

Conclusions (cont.)

Neural Network approach is useful and accurate, but too many legitimate e-mails are classified as spam (legitimate -> spam)
Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam (nLS), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful
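
The suggested combination with a whitelist could look like the sketch below. The function name, the 0.5 threshold, and the idea of matching on the sender address are illustrative assumptions; the slides propose the combination but do not describe an implementation.

```python
def classify(sender: str, nn_score: float, whitelist: set[str],
             threshold: float = 0.5) -> str:
    """Let a whitelist override the network for known correspondents."""
    # Whitelist entries are assumed to be stored in lowercase.
    if sender.lower() in whitelist:
        return "legitimate"
    return "spam" if nn_score > threshold else "legitimate"


friends = {"coach@example.org"}  # hypothetical whitelist entry
print(classify("Coach@example.org", nn_score=0.93, whitelist=friends))
```

Because whitelisted mail is never labeled spam, this arrangement can only lower nLS; it leaves nSL unchanged unless a spammer forges a whitelisted address.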
