Different methods
Non-content-based method: - removes spam messages (e.g., those carrying viruses or worms) before they are read
- leaves some messages unlabeled
Content-based method: - the most widely used approach
- may need many pre-labeled messages
- labels a message based on its content
- Zdziarski [5] argues that it is possible to stop spam, and that content-based filters are the way to do it
Focus: content-based methods
Content-based methods:
- Bayesian-based method [6]
- Centroid-based method [7]
- Machine-learning methods [8]: Latent Semantic Indexing (LSI), Contextual Network Graphs (CNG)
- Rule-based method [9]: RIPPER rules, a list of predefined rules that can be changed by hand
- Memory-based method [10]
Rule-based method: a list of predefined rules that can be changed by hand. If an email matches a rule, its score is increased; after all rules have been applied, if the score is above a certain threshold, the email is classified as spam.
Rule-based method Advantages: - able to employ diverse and specific rules to check for spam, e.g.:
- check the size of the email
- check the number of pictures it contains
- no training messages are needed
Disadvantage: - rules have to be entered and maintained by hand; they cannot be generated automatically
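As a sketch, the scoring scheme described above could look like the following; the rules, weights, and threshold here are illustrative assumptions, not any real filter's configuration:

```python
# Minimal sketch of a rule-based spam scorer. The rules, weights, and
# threshold are illustrative assumptions, not a real filter's rule set.

RULES = [
    # (description, predicate over the message text, score added on match)
    ("contains 'free money'", lambda m: "free money" in m.lower(), 2.0),
    ("excessive exclamation", lambda m: m.count("!") > 3, 1.0),
    ("very short body", lambda m: len(m) < 20, 0.5),
]

THRESHOLD = 2.5  # assumed cutoff

def classify(message: str) -> str:
    """Apply every rule; if the total score exceeds the threshold,
    the message is classified as spam."""
    score = sum(points for _, matches, points in RULES if matches(message))
    return "spam" if score > THRESHOLD else "ham"
```

Because the rules are hand-written, adding a new check means editing this list by hand, which is exactly the maintenance burden noted above.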
Keywords: - important words for text classification
- high-frequency words in a message
- can be used as indicators for the message's class
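A toy sketch of the keyword idea, using plain word frequency; a real filter would also strip stop words and weight terms:

```python
from collections import Counter

def top_keywords(message, n=3):
    """Return the n most frequent words in a message as candidate
    class indicators (a toy sketch; a real filter would also remove
    stop words and weight terms)."""
    words = message.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]
```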
Why LSI? It is based on a nearest-neighbor algorithm.
Latent semantic indexing considers semantic links between words: - keywords are searched over a semantic space
- two words with the same meaning are treated as one word
- this eliminates synonymy
It also considers the overlap between different messages; this overlap may indicate: - polysemy or a stop word
- two messages in the same category
Latent semantic indexing Step 1: build a term-document matrix X from the input documents
Latent semantic indexing Step 2: Singular Value Decomposition (SVD) is performed on matrix X - to extract a set of linearly independent factors that describe the matrix
- terms that have the same meaning are generalized
- three new matrices T, S, D are produced, reducing the vocabulary's size
Latent semantic indexing: two documents can be compared by finding the distance between their document vectors, stored in the rank-reduced matrix. Text classification is done by finding the nearest neighbors and assigning the document to the category with the most documents among them.
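The steps above can be sketched with NumPy; the toy corpus, labels, and the choice of k = 2 factors are illustrative assumptions, and classification here is a simple 1-NN over cosine similarity:

```python
import numpy as np

# Step 1: toy term-document matrix X (rows = terms, columns = documents).
# The corpus, counts, labels, and k = 2 are illustrative assumptions.
terms = ["free", "money", "meeting", "report"]
X = np.array([
    [2, 0, 1, 0],   # "free"
    [1, 0, 2, 0],   # "money"
    [0, 2, 0, 1],   # "meeting"
    [0, 1, 0, 2],   # "report"
], dtype=float)
labels = ["spam", "ham", "spam", "ham"]   # one label per column

# Step 2: SVD factors X into T, S, D; keeping only the k largest
# singular values merges correlated (near-synonymous) terms.
T, S, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
Tk = T[:, :k]
doc_vectors = (Tk.T @ X).T   # each training document in the k-dim factor space

def classify(query_counts):
    """Project a new document's term counts into the factor space and
    return the label of its nearest neighbour by cosine similarity."""
    q = Tk.T @ np.asarray(query_counts, dtype=float)
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return labels[int(np.argmax(sims))]
```

Note that adding a training document changes X, so T, S, D would have to be recomputed, which is the disadvantage discussed next.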
Latent Semantic Indexing Advantages: - the entire training set can be learned at the same time
- no intermediate model needs to be built
- good when the training set is predefined
Disadvantages: - when a new document is added, matrix X changes, and T, S, D need to be re-calculated
- this is time-consuming
- a real classifier needs the ability to change its training set
Contextual Network Graphs: a weighted, bipartite, undirected graph of term and document nodes
When a new document d is added, the weights at node d are energized, and the weights at connected nodes may need to be rebalanced. A document is classified into the class with the maximum average energy (weight).
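A heavily simplified sketch of the energy-spreading idea on a term-document graph (single-hop propagation only; the corpus, edge weights, and spreading rule are illustrative assumptions rather than the full CNG algorithm):

```python
from collections import defaultdict

# A simplified term-document graph: each document node is linked to its
# term nodes by weighted edges (raw counts here). The corpus and weights
# are illustrative assumptions, not the full CNG construction.
docs = {
    "d1": ({"free": 2, "money": 1}, "spam"),
    "d2": ({"meeting": 2, "report": 1}, "ham"),
    "d3": ({"free": 1, "money": 2}, "spam"),
    "d4": ({"meeting": 1, "report": 2}, "ham"),
}

def classify(query_terms):
    """Energise the query's term nodes, let the energy flow one hop
    across the weighted edges to document nodes, and pick the class
    with the highest average document energy."""
    energy = defaultdict(float)
    for doc_id, (weights, _) in docs.items():
        for term, w in weights.items():
            energy[doc_id] += w * query_terms.get(term, 0.0)
    per_class = defaultdict(list)
    for doc_id, (_, label) in docs.items():
        per_class[label].append(energy[doc_id])
    return max(per_class, key=lambda c: sum(per_class[c]) / len(per_class[c]))
```

Unlike the SVD step in LSI, adding a document here only touches the nodes it connects to, which is why CNG copes better with a changing training set.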
Comparison: Bayesian, LSI, CNG, centroid-based, rule-based
Results and conclusion: LSI & CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and false negatives by up to 71%. LSI & CNG show better performance even with a small document set.
Comparison of content-based and non-content-based methods Non-content-based: - Disadvantages:
- depends on special factors such as email addresses, IP addresses, or special protocols
- leaves some messages unclassified
- Advantage: detects spam before the message is read, with high accuracy
Content-based: - Disadvantage: may need many pre-labeled training messages
- Advantage: leaves no message unclassified
Improvements for spam filtering: combine both methods - [1] proposes an email-network-based algorithm that achieves 100% accuracy but leaves 47% of messages unclassified; combining it with a content-based method can improve overall performance.
Build up multiple layers [11] [11] Chris Miller, "A Layered Approach to Enterprise Antispam"
Measurements --- Metrics Accuracy: the percentage of correctly classified messages, correct / (correct + incorrect). False positive: a legitimate (non-spam) message misclassified as spam. - Goals:
- improve accuracy
- prevent false positives
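The two metrics might be computed as follows (the example labels are illustrative; by the convention used here, a false positive is a legitimate "ham" message classified as spam):

```python
# Toy evaluation of the two metrics above. The labels are illustrative;
# a false positive counts a legitimate ("ham") message classified as spam.

def evaluate(true_labels, predicted):
    """Return (accuracy, number of false positives)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    accuracy = correct / len(true_labels)
    false_positives = sum(
        t == "ham" and p == "spam" for t, p in zip(true_labels, predicted))
    return accuracy, false_positives
```

For example, `evaluate(["spam", "ham", "ham", "spam"], ["spam", "spam", "ham", "spam"])` gives accuracy 0.75 with one false positive.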
Data sets for spam Non-content-based: - Email network:
- one author's email corpus of 5,486 messages
- IP address: none
Content-based:
Data sets for spam LSI & CNG: - corpora of varying size (250 – 4,000 messages)
- spam and non-spam emails in equal amounts
Bayesian-based: - corpus of 1,789 emails
- 211 spam, 1,578 non-spam
Centroid-based: - 200 email messages in total
- 90 spam, 110 non-spam
Most recently used benchmarks Reuters: - about 7,700 training and 3,000 test documents, 30,000 terms, 135 categories, 21 MB
- each category has about 57 instances
- a collection of newswire stories
20NG: - about 18,800 documents in total, 94,000 terms, 20 topics, 25 MB
- each category has about 1,000 instances
WebKB: - about 8,300 documents, 7 categories, 26 MB
- each category has about 1,200 instances
- data from 4 university websites
The three benchmarks above are well known in recent IR work; they are small in size and are used to test classification performance and CPU scalability.
Benchmarks OHSUMED: - 348,566 documents, 230,000 terms and 308,511 topics, 400 MB
- each category has about 1 instance
- abstracts from medical journals
Dmoz: - 482 topics, 300 training documents for each topic, 271 MB
- each category has less than 1 instance
- taken from the Dmoz (http://dmoz.org/) topic tree
These large data sets are used to test the memory scalability of a model.
Sources Slide 1, image: http://www.ecommerce-guide.com
Slide 1, image: http://www.email-firewall.jp/products/das.html
References N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, "Anti-Spam Filtering: A Centroid-Based Classification Approach", 2002. E.-H. (Sam) Han and G. Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", 2000. T. Theeramunkong, "Multi-dimensional Text Classification", 2002. T. Theeramunkong and V. Lertnattee, "Improving Centroid-Based Text Classification Using a Term-Distribution-Based Weighting System and Clustering". V. Lertnattee and T. Theeramunkong, "Combining Homogeneous Classifiers for Centroid-Based Text Classification".
References [1] P. Oscar Boykin and Vwani Roychowdhury, "Personal Email Networks: An Effective Anti-Spam Tool", IEEE Computer, volume 38, 2004. [2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos and Mate Uher, "SpamRank – Fully Automatic Link Spam Detection", citeseer.ist.psu.edu/benczur05spamrank.html. [3] R. Dantu, P. Kolan, "Detecting Spam in VoIP Networks", Proceedings of the USENIX SRUTI (Steps for Reducing Unwanted Traffic on the Internet) workshop, July 2005. [4] IP addresses in email clients, http://www.ceas.cc/papers-2004/162.pdf. [5] Paul Graham, "A Plan for Spam", http://www.paulgraham.com/spam.html.
References [6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail", Learning for Text Categorization – Papers from the AAAI Workshop, pages 55–62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05. [7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, "Anti-Spam Filtering: A Centroid-Based Classification Approach", IEEE Proceedings of ICSP '02. [8] "Spam Filtering Using Contextual Network Graphs", www.cs.tcd.ie/courses/csll/dkellehe0304.pdf. [9] W. W. Cohen, "Learning Rules that Classify E-mail", Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996. [10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists", Information Retrieval, 2003.