
MAIL DEFENSE AGAINST SPAM VIA A SCHEME OF DISTRIBUTED MERIT ACCUMULATION

Authors:
Hiep Pham

E-mail: h.pham@uws.edu.au

Telephone: 61-2-9852 5222

Postal address: School of Computing and Information Technology, Parramatta Campus, University of Western Sydney, Locked bag 1797, Australia


Zhuhan Jiang

E-mail: zhuhan@cit.uws.edu.au

Telephone: 61-2-96859336

Postal address: School of Computing and Information Technology, Parramatta Campus, University of Western Sydney, Locked bag 1797, Australia


Abstract

Infringement of privacy and denial of service attacks can take various forms. Forcing unsolicited emails upon an individual or an organisation is one of the less obvious forms. Our approach aims at effectively stopping spam and minimizing false-positives by applying filters at both the sender side and the receiver side on the basis of our proposed portable merit-grading scheme. The merit scheme is designed in such a way that, while the cooperation of the sender and the receiver of an email is voluntary, their active cooperation will reap much greater benefits. As a result, our scheme will increase the accuracy and effectiveness of spam filtering, and normal email traffic will steadily build up and enrich the merits, which leads to smarter email classification and can also propagate across the other participating organizations.


Key words: spam filtering, demerit/merit scheme, portable distributed merit.




1. Introduction
Junk or spam email is unwanted email sent, intentionally or unintentionally, by a wide range of individuals and organizations usually called spammers. Intentional spammers indiscriminately send unsolicited content and advertisements to large numbers of recipients. Unintentional spammers are more innocent in intent and approach: examples are email users who forward a chain letter to multiple recipients, and companies that take advantage of a cost-effective medium to email large lists of potential customers in order to reach a wide market. Spam emails are proliferating due to two main factors: 1) bulk email is very cheap to send, and 2) pseudonyms are inexpensive to obtain (Cranor and LaMacchia, 1998). Although cheap to send, spam emails can cause considerable harm, such as flooding receivers’ email servers and increasing the time individuals spend reading and removing emails.
To counter spam, there are technical and legislative solutions. The technical solutions mainly apply header and content analysis to incoming emails to identify spam signatures such as keywords or the sender’s address. The legislative approach aims at discouraging spammers by allowing people to file civil suits against the senders of unsolicited commercial emails. However, spam is an international problem, and worldwide cooperation is needed to rule out spammers effectively. Even though spam is already widespread for most email providers and users, cooperative approaches towards a more effective and efficient solution still lag far behind.
In this paper we propose a merit/demerit scheme that can be used to establish trusted lists of email users and behaviors. Our approach distinguishes itself by the active cultivation of positive merits and by the portability of merit across all participating parties. The scheme helps reduce false-positive spam classifications and lets sender and receiver coordinate to respond better to spammers. The paper is divided into 5 sections. The next section reviews some existing spam prevention schemes. Section 3 proposes and discusses a merit/demerit accumulation email system, along with its design and analysis. Other factors and implementation issues are mentioned in section 4. Conclusions and ideas for future work close the paper in section 5.
2. Background and related work
There have been a number of commercial anti-spam products and academic research efforts aimed at finding effective ways to reduce spam. The key challenge is to identify and eliminate spam without creating false-positives, i.e. legitimate emails that are erroneously blocked as spam. To put our proposed approach in context, we first give a brief overview of the widely used protocol for Internet email transport, the Simple Mail Transfer Protocol (SMTP), and then review various methods of preventing spam.

Figure 1: SMTP model (components shown: MUA, sender MTA with sender SMTP and file system, relay servers 1 to n, receiver MTA with receiver SMTP and file system, MUA; SMTP commands/replies and mails flow between them)
The SMTP architecture is based on the following model of communication: send-mail requests originate from a mail user agent (MUA), and emails are handed to the local mail transfer agent (MTA) to be delivered to the receiving MTA either directly or through intermediate mail relay servers. If the receiving MTA cannot deliver an email to the specified recipient, it responds with a reply rejecting that recipient.
Email has been designed with the basic requirement that no message should be lost. As a result, if the sending process cannot confirm that a message was delivered, it will repeatedly attempt to deliver the message (Neumann, 1990). This retry behaviour is exploited by several email attacks (Bass et al, 1998). The feedback mechanism of mail systems can also aggravate the seriousness of spam. Spammers typically do not put valid return addresses on their messages. The fake return addresses are often nonexistent addresses at some innocent company’s mail server, which in turn has to suffer the complaints and bounced messages.
Based on the SMTP model described above, we can now address a number of important factors by which spam emails can be categorized and prevented. The analysis of these factors partly underpins the scheme introduced in the next section.

2.1 The legitimate relationship between senders and receivers

Spammers usually obtain mailing lists from intermediary mail servers or use spam programs to scan web pages, newsgroups, and other online resources to collect email addresses in bulk. So if an email sender is unknown to the receiver, the email is more likely to be bulk email. Analysis of the contact lists and recipient lists in an email account, and of message content, can confirm or rule out a prior relationship. This method is easy to apply and highly effective. However, it is impractical to expect users to maintain complete lists of the contacts from whom they accept email, and users will usually not accept any scheme in which automatic elimination could cause the loss of even a single important email. The approach is also limited by the availability of the users’ address books and previous correspondence, and it applies only to users within the service provider.
Hall (1998) and Gabber et al (1998) have developed similar anti-spam methods that filter on recipient email addresses instead of the traditional examination of sender addresses. The idea is to create different email aliases for different purposes while still providing a transparent, user-friendly core email address. Different email extensions represent different channels or communication purposes, by which incoming mail can easily be filtered. Both methods impose a small establishment cost on new senders before they are allowed to send emails to a receiver: a computational cost in Gabber et al (1998) and e-money in Hall (1998). As spammers usually send large amounts of email, this cost discourages them from spreading their messages. Multiple channels or email extensions, however, require complex email management. Moreover, it is rather inconvenient for senders to memorise multiple email aliases with lengthy and somewhat cryptic extensions tied to functional purposes, unless they keep the receiver’s extended e-mail address in their address book. It is also troublesome for legitimate bulk senders, such as mailing lists and market survey companies, who must go through a time-consuming process to obtain a valid e-mail extension or channel.
The Private Email system (P-Mail) developed by Reticular Systems (http://www.reticular.com) uses a real-time messaging approach to protect email privacy and eliminate spam. P-Mail is a peer-to-peer messaging system: the message moves from the sender to the receiver without being stored on any intermediate machine. The weakness of this approach is that P-Mail can only send and receive emails when both the sending and receiving email agents are online. It thus forfeits the store-and-forward ability possessed by most current email systems.
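As a minimal illustrative sketch of recipient-side channel filtering, the following Python fragment splits an extended address into its core address and channel and accepts mail only on channels the receiver has opened. The `user+channel@domain` syntax and the names `split_channel` and `accept` are assumptions made for illustration; they are not the address formats or interfaces of Hall (1998) or Gabber et al (1998), and the establishment cost is not modelled.

```python
def split_channel(address: str):
    """Split 'user+channel@domain' into (core address, channel or None).

    The '+' separator is an assumed convention for this sketch only.
    """
    local, _, domain = address.rpartition("@")
    user, sep, channel = local.partition("+")
    return f"{user}@{domain}", (channel if sep else None)


def accept(address: str, open_channels: set) -> bool:
    """Accept mail only if it was addressed to a channel the user has opened."""
    _, channel = split_channel(address)
    return channel is not None and channel in open_channels


# Hypothetical channels opened by the receiver.
channels = {"shop42", "friends"}
print(split_channel("alice+shop42@example.org"))
print(accept("alice+shop42@example.org", channels))
print(accept("alice@example.org", channels))
```

Mail sent to the bare core address carries no channel and is rejected here; a real deployment would need a policy for such mail.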

2.2 The method of spam delivery

One of the most common techniques that spammers employ to distribute their messages is unauthorised mail relay. Open mail relay occurs when a mail server processes a mail message in which neither the sender nor the receiver is a local user, so that the mail server is unrelated to the mail exchange between the users. Unauthorised use of mail relay by spammers not only makes it difficult and time-consuming to trace the source of spam but also costs the organizations that operate the relay servers reputation, staff time and computer resources. Reconfiguring the mail server can prevent open mail relay by applying relay filters that allow relaying only for certain IP address ranges. Third-party systems such as Real-Time Blackhole List (RBL), Open Relay Behavior-modification System (http://www.ordb.org) or Relay Spam Stopper (RSS) provide subscribed email providers with a list of identified rogue mail servers and ISPs that facilitate open mail relay, so that a provider can check whether an incoming email originated from the list and either classify it as junk or subject it to further filtering. However, this approach can lead to an innocent mail server being blacklisted and its outgoing emails being classified as spam.
Spammers aim essentially the same message at a large volume of recipients. At the user level, it is difficult to tell whether one is looking at the only copy of a message. However, at the level of one or more networks, a large number of message copies with similar content and headers reliably signals a spam attack. Given this characteristic of bulk delivery, spam traps can be set up to attract potential spam. The Probe Network by Brightmail (http://www.brightmail.com) is a good example. It contains a large collection of email accounts called probe accounts, which are placed at locations where spammers often collect email addresses. Based on the collective information retrieved from this large volume of accounts, the system can help classify whether a message is spam.
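The relay filter described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the configuration of any particular mail server: the network ranges, the domain `example.org` and the function `may_relay` are hypothetical, and the rule simply follows the definition above (relaying is refused when the client is outside the allowed IP ranges and neither sender nor receiver is local).

```python
import ipaddress

# Hypothetical configuration: networks allowed to relay, and local domains.
ALLOWED_RELAY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("10.0.0.0/8"),
]
LOCAL_DOMAINS = {"example.org"}


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def may_relay(client_ip: str, sender: str, recipient: str) -> bool:
    """Refuse mail where neither party is local and the client is not trusted."""
    client = ipaddress.ip_address(client_ip)
    if any(client in network for network in ALLOWED_RELAY_NETWORKS):
        return True
    return domain_of(sender) in LOCAL_DOMAINS or domain_of(recipient) in LOCAL_DOMAINS


print(may_relay("203.0.113.9", "x@else.net", "y@other.net"))
print(may_relay("203.0.113.9", "x@else.net", "bob@example.org"))
```

Because sender addresses can be forged, a stricter filter would ignore the sender's domain for untrusted clients; the sketch keeps the looser rule only to mirror the definition in the text.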
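To illustrate how collective information from probe accounts can expose bulk delivery, the sketch below fingerprints each message body and flags fingerprints that arrive at several distinct accounts. It is not Brightmail's method: the normalisation, the SHA-256 fingerprint, the threshold `min_accounts` and all names are assumptions, and exact hashing only catches copies that are identical after normalisation.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash the body after lower-casing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk(messages, min_accounts=3):
    """Return fingerprints seen at min_accounts or more distinct probe accounts.

    messages is an iterable of (probe_account, body) pairs.
    """
    accounts_by_fingerprint = defaultdict(set)
    for account, body in messages:
        accounts_by_fingerprint[fingerprint(body)].add(account)
    return {
        fp for fp, accounts in accounts_by_fingerprint.items()
        if len(accounts) >= min_accounts
    }


probe_mail = [
    ("probe1@example.org", "Buy   now"),
    ("probe2@example.org", "buy now"),
    ("probe3@example.org", "BUY NOW"),
    ("probe1@example.org", "Minutes of Monday's meeting"),
]
print(len(find_bulk(probe_mail)))
```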
2.3 The email header and content body

Email headers provide tracing information such as the sender of the message, the recipients, and the names of the servers that processed the message along the transmission route. By verifying that all mail headers satisfy the Internet mail standard, one can also eliminate various spams. However, spammers can forge or modify the header information to hide their real identity or to relay spam messages through the open mail server of an unrelated third party. Because of the forged identity, complaints or bounced emails never reach the spammers but go to an unrelated mail server that has been made to look like the origin of the spam. As a result, both the receivers and the relay servers suffer from spam.
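A very small header check can be sketched with Python's standard `email` package. The function `header_problems` is a hypothetical helper that tests only two things: that the From and Date fields are present, and that the From field contains an address with an "@". A full conformance check against the Internet mail standard would cover far more, and a message that passes may still carry forged values.

```python
from email import message_from_string
from email.utils import parseaddr


def header_problems(raw_message: str):
    """Return a list of simple header problems found in a raw message."""
    msg = message_from_string(raw_message)
    problems = []
    for name in ("From", "Date"):
        if msg[name] is None:
            problems.append(f"missing {name}")
    if msg["From"] is not None:
        _, address = parseaddr(msg["From"])
        if "@" not in address:
            problems.append("malformed From address")
    return problems


good = (
    "From: Alice <alice@example.org>\n"
    "Date: Mon, 01 Jan 2024 09:00:00 +0000\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
bad = "Subject: cheap offer\n\nbody\n"
print(header_problems(good))
print(header_problems(bad))
```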


Intelligent mobile agents (Cheng & Weinong, 2002) have been investigated as a potential approach to verifying a sender’s email address. Extending the SMTP architecture to support the operation of mobile agents, the model suggests that the receiver’s MTA, once it has received a “request to send”, will send an agent to the sender’s MTA to audit and filter all mails, allowing “good” mails to be sent to the SMTP receiver and refusing the “bad” ones. The applicability of this approach, however, hinges heavily on the assumption that advances in agent technology and spam analysis algorithms can effectively recognise spam. With traditional spam filtering, where the spam has already arrived at the receiving MTA before any filtering is performed, the user has the option to view filtered emails before deleting them. With filtering at the sender’s MTA, it will be a problem if legitimate emails are discarded without the users’ approval.
Email body content filtering can be considered a particular instance of the Text Categorization (TC) problem, in which each text is assigned to one of two classes: spam or legitimate. As such, some proven TC techniques such as Ensembles of Decision Trees (Weiss et al, 1999), Support Vector Machines (Drucker et al, 1999), and Boosted Decision Trees (Schapire & Singer, 1999) have been utilised to classify and filter emails. Other classification algorithms such as Ripper (Cohen, 1996), Rocchio, Naïve Bayes (Pantel & Lin, 1998; Androutsopoulos et al, 2000; Provost, 1999), and Bayesian (Sahami et al, 1998) have also been experimentally implemented to detect spam. Most of these approaches analyse email content for spam-related key words and the frequency of repeated words to assess spam confidence, and classify emails into respective folders so that users can later either read or delete them.
Content-based filtering is not effective against constantly changing spam styles. It is very difficult to establish static rules that can reliably distinguish bulk mails and market surveys from legitimate messages, and spammers keep changing their wording and formats to evade content filters.
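To make the two-class text categorization view concrete, here is a minimal word-count Naïve Bayes classifier with add-one smoothing. It is a textbook sketch, not a reimplementation of any of the cited systems; the class name `NaiveBayes`, the whitespace tokenisation and the tiny training set are assumptions chosen for brevity.

```python
import math
from collections import Counter


class NaiveBayes:
    """Two-class multinomial Naive Bayes over whitespace-separated words."""

    LABELS = ("spam", "legitimate")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text: str, label: str) -> float:
        """Log prior plus smoothed log likelihood; needs one example per class."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        return score

    def classify(self, text: str) -> str:
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


classifier = NaiveBayes()
classifier.train("cheap pills buy now", "spam")
classifier.train("buy cheap watches now", "spam")
classifier.train("meeting agenda for monday", "legitimate")
classifier.train("please review the project report", "legitimate")
print(classifier.classify("buy cheap pills"))
print(classifier.classify("project meeting monday"))
```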
3. Proposed merit/demerit scheme for spam filtering
In this paper, it is not the intention of the authors to critically evaluate existing spam-filtering mechanisms or to compare the approaches taken by different authors. We will instead propose a new scheme to counter spam.
We recall from the earlier discussion that existing anti-spam approaches raise several major issues. Firstly, each approach can cause a certain number of legitimate emails to be classified as spam, i.e. false positives. Secondly, these approaches have not been efficiently coordinated in responding to spam. Thirdly, most current spam-filtering approaches examine messages that have already arrived at the receiving MTA or MUA, which means that some of the damage, such as flooding mail servers and wasting time on cleaning out unwanted emails, is already done. New approaches that filter or stop spam at the sending mail server are more desirable.

We therefore propose a scheme that aims at effectively stopping spam and minimizing false-positives by applying filters at both the sending and receiving sides, based on a merit/demerit grading scheme and on configuring system features to respond effectively to spammers. Our system is composed of three modules, depicted in Figure 2, Figure 3 and Figure 4. These 3 modules are explained in detail in the following subsections.

3.1 Incoming mail filter module

This module aims to provide quick access for privileged or classified emails while imposing comprehensive content and merit filtering on non-privileged ones. It can be an add-on component to an existing email system to provide finer spam filtering. The module contains 4 main components, which are described in detail in the rest of this section.


3.1.1 Privilege filter component

Incoming emails are first checked against the combined list of local privilege and merit. The privilege list is composed of accepted servers, email addresses of highest local priority, and IP ranges that are explicitly granted free communication with an enterprise and its employees and users. Emails approved by the privilege filter are transferred directly to the automatic classification component. The addresses in the privilege list may also be associated with an expiration date. Email addresses whose authorization has expired have to pass through further filtering.


The privilege list is dynamically maintained on the basis of the users’ address books and sent email correspondence and of the local merit list. Entries can always be added, removed or modified by the system administrator.
As the privilege filter mainly verifies email headers against the privilege list to give quick access to the user mailbox, header analysis is essential to ensure that messages and senders are legitimate. This analysis ranges from simple checking of e-mail header syntax against the Internet mail standard to sender signature verification using public key infrastructure.
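A privilege list with expiry dates might be sketched as follows. This is illustrative only and not the prototype's code: the data layout (a dictionary from address to expiry date), the merit floor of 10, and the names `build_privilege_list` and `is_privileged` are assumptions; server names, IP ranges and signature verification are left out.

```python
from datetime import date


def build_privilege_list(address_book, sent_recipients, local_merit, merit_floor=10):
    """Combine address book, sent-mail recipients and high-merit senders.

    Returns a dict mapping address -> expiry date (None means no expiry).
    The merit_floor value is a hypothetical setting.
    """
    addresses = set(address_book) | set(sent_recipients)
    addresses |= {a for a, merit in local_merit.items() if merit >= merit_floor}
    return {address.lower(): None for address in addresses}


def is_privileged(sender: str, privileges: dict, today: date) -> bool:
    """A sender is privileged if listed and its authorization has not expired."""
    sender = sender.lower()
    if sender not in privileges:
        return False
    expiry = privileges[sender]
    return expiry is None or today <= expiry


privileges = build_privilege_list(
    address_book=["Bob@example.org"],
    sent_recipients=["carol@example.net"],
    local_merit={"dave@example.com": 15, "eve@example.com": 2},
)
# The administrator grants a temporary authorization by hand.
privileges["partner@example.net"] = date(2024, 6, 30)
print(is_privileged("bob@example.org", privileges, date(2024, 7, 1)))
print(is_privileged("partner@example.net", privileges, date(2024, 7, 1)))
```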


Figure 2: Incoming e-mail filtering diagram (components shown: incoming e-mails, external merit/sender verification, caching repository holding spam signatures and blacklist results, merit synthesis, local privilege/merit list, High Priority, Low Priority and Trash folders, manual transfer, sending MTA)
3.1.2 Early spam screening component

Messages not approved by the privilege filter then go through the early spam screening (ESS) stage. ESS comprises (1) source address verification against the local blacklists and third-party systems such as Real-Time Blackhole List (RBL) or Relay Spam Stopper (RSS), and (2) analysis of the e-mail’s body content that takes advantage of the advances in text categorization and classification algorithms discussed in the previous section. Previous spam signatures and blacklisted addresses are kept in a caching repository; message contents can be compared quickly against known spam signatures by using a content hash of the signatures. Cached address verification results are also used to identify spam attacks early and to apply effective response tactics to the senders. By analyzing large numbers of email addresses, initial signs of spam attacks from a single source or multiple sources can be identified, and error messages or complaints can then be sent out selectively for verification or confirmation, instead of being sent indiscriminately to the senders, which usually floods innocent mail relay servers or hijacked mail servers.
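The caching repository can be pictured with the following sketch, which stores content hashes of known spam and remembers blacklist lookups so that a source is queried only once. The class `CachingRepository`, the SHA-256 content hash and the `query_blacklist` callback are assumptions for illustration; the callback stands in for whatever local or third-party blacklist check is actually used.

```python
import hashlib


class CachingRepository:
    """Holds known spam signatures and cached blacklist results."""

    def __init__(self):
        self.spam_signatures = set()
        self.blacklist_results = {}  # source address -> True if blacklisted

    @staticmethod
    def signature(body: str) -> str:
        """Content hash of the body after collapsing case and whitespace."""
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def remember_spam(self, body: str) -> None:
        self.spam_signatures.add(self.signature(body))

    def is_known_spam(self, body: str) -> bool:
        return self.signature(body) in self.spam_signatures

    def lookup_source(self, source: str, query_blacklist) -> bool:
        """Return the cached result, calling query_blacklist only on a miss."""
        if source not in self.blacklist_results:
            self.blacklist_results[source] = query_blacklist(source)
        return self.blacklist_results[source]


calls = []


def fake_blacklist(source: str) -> bool:
    """Stand-in for a real blacklist query; records each call."""
    calls.append(source)
    return source == "198.51.100.7"


repository = CachingRepository()
repository.remember_spam("Win a FREE holiday")
print(repository.is_known_spam("win a free   holiday"))
print(repository.lookup_source("198.51.100.7", fake_blacklist))
print(repository.lookup_source("198.51.100.7", fake_blacklist), len(calls))
```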


ESS analysis may not provide sufficient information for classifying incoming messages; in that case the sender’s authenticity should be verified further. If the sender’s merit information is available, ESS contacts the relevant Merit Management Module (MMM) to retrieve and verify merit weights for use in merit synthesis.
Senders without merit information may need to be authenticated by automatically generated challenges that require a human response. As most spam is sent in bulk by a computer program, a customizable challenge that includes a simple question for a human to answer can be used to grade an email as genuine and give it higher merit.
If a challenge is not answered in due time, the message is given much lower merit and sent to merit synthesis for overall scoring. A message whose challenge is answered is given certain merit points and is likewise sent to merit synthesis.
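The outcome of a challenge can be turned into merit points along the following lines. The deadline of two days and the point values are hypothetical, since the scheme does not fix them, and `challenge_merit` is an illustrative name; generating and delivering the challenge itself is not shown.

```python
from datetime import datetime, timedelta


def challenge_merit(sent_at, answered_at, correct,
                    deadline=timedelta(days=2), reward=5, penalty=-5):
    """Return merit points for a challenge.

    answered_at is None when no answer arrived. A correct answer within
    the deadline earns the reward; anything else earns the penalty.
    """
    if answered_at is not None and correct and answered_at - sent_at <= deadline:
        return reward
    return penalty


sent = datetime(2024, 1, 1, 9, 0)
print(challenge_merit(sent, sent + timedelta(hours=3), correct=True))
print(challenge_merit(sent, None, correct=False))
```

The returned points would then be passed on to merit synthesis together with the other screening results.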
3.1.3 Merit synthesis

Results from ESS are synthesised at the merit synthesis component. Each message is given a merit weight according to its overall merit scoring. The overall merit is then recorded in the local merit list for subsequent e-mail merit verification. The demerit and merit system is not a one-off classification of emails but an accumulated classification. For example, each submission of an email address to the blacklist increases the demerit points of that address, and a demerit point threshold is set above which the address is classified as a spam source. The scheme thus reduces the number of legitimate emails classified as spam. Our merit system is hence used to build up the trustworthiness of email addresses. An account’s merit is portable and can be verified at all scheme-participating organizations. Another benefit of the merit scheme is that comparing the blacklist with the merit list identifies potentially conflicting classifications or false-positive cases. These cases are presented for further analysis or human processing.
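The accumulated nature of the classification can be sketched as a local merit list that sums points per address, applies a demerit threshold, and reports addresses that are blacklisted yet highly merited. The thresholds of -10 and 10, the point values and the class `LocalMeritList` are assumptions for illustration; the scheme itself leaves realistic thresholds open.

```python
class LocalMeritList:
    """Accumulates merit (positive) and demerit (negative) points per address."""

    def __init__(self, spam_threshold=-10, trusted_threshold=10):
        self.scores = {}
        self.spam_threshold = spam_threshold
        self.trusted_threshold = trusted_threshold

    def add(self, address: str, points: int) -> int:
        """Add points to the running score and return the new total."""
        self.scores[address] = self.scores.get(address, 0) + points
        return self.scores[address]

    def is_spam(self, address: str) -> bool:
        """An address counts as a spam source only once demerits accumulate."""
        return self.scores.get(address, 0) <= self.spam_threshold

    def conflicts(self, blacklist) -> list:
        """Blacklisted addresses with high merit: candidates for human review."""
        return sorted(
            a for a in blacklist
            if self.scores.get(a, 0) >= self.trusted_threshold
        )


merits = LocalMeritList()
merits.add("bulk@example.net", -4)          # one blacklist submission
print(merits.is_spam("bulk@example.net"))   # a single report is not enough
merits.add("bulk@example.net", -4)
merits.add("bulk@example.net", -4)
print(merits.is_spam("bulk@example.net"))
merits.add("alice@example.org", 12)
print(merits.conflicts({"alice@example.org", "bulk@example.net"}))
```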


3.1.4 Priority classifier

Incoming messages that have passed through the privilege filter and merit synthesis are then classified into 3 separate folders according to their merit grading: High Priority, Low Priority and Trash. High merit-graded and privileged messages are put into the High Priority folder; low merit-graded messages are placed in the Low Priority folder; spam or unverified-merit messages are placed in the Trash folder.
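The folder rule above can be written as a small function. The cut-off of 10 between high and low merit is a hypothetical value, and `classify_message` with its parameters is an illustrative name, not part of the prototype.

```python
def classify_message(privileged, merit, is_spam, high_threshold=10):
    """Map a message to a folder name.

    merit is None when the sender's merit could not be verified.
    """
    if privileged:
        return "High Priority"
    if is_spam or merit is None:
        return "Trash"
    if merit >= high_threshold:
        return "High Priority"
    return "Low Priority"


print(classify_message(privileged=True, merit=None, is_spam=False))
print(classify_message(privileged=False, merit=3, is_spam=False))
print(classify_message(privileged=False, merit=None, is_spam=False))
```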


Users have the option to override the automatic classification manually and reclassify messages at will, including blocking certain e-mail addresses from delivering to the inbox. Each user’s personal settings are then recorded in the local privilege/merit list so that future filtering follows that user’s personalized setting.
The priority classifier component also filters bounced messages or complaints originating from the Trash folder before they are forwarded to the MTA for sending. This reduces the chance of multiple messages being sent to the same destination, which may be an innocent mail server exploited by spammers to send junk mail.

3.2 Outgoing mail filter module


This module performs pre-sending processing: it appends a pointer to the merit information to the mail header and updates the account’s merit details according to the correspondence history. There are 2 main components in this module: Header Regularization and Merit Processing.
3.2.1 Header regularization

Each message’s header is modified before the message is forwarded to the mail server: the merit URL address is appended to the message header. Merit information comprises the Merit Scheme Version, Merit Score, Category (e.g. normal, error, complaint or authentication type), and Public Key if available. The merit URL provides a verification path to the corresponding merit repository for spam verification.
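Appending the merit URL to an outgoing message could look like the sketch below, which uses Python's standard `email.message.EmailMessage`. The header name `X-Merit-URL` and the repository URL are invented for illustration; the scheme does not specify a header field name or URL layout.

```python
from email.message import EmailMessage

MERIT_HEADER = "X-Merit-URL"  # hypothetical header name


def regularize_header(msg: EmailMessage, merit_url: str) -> EmailMessage:
    """Set exactly one merit URL header on the outgoing message."""
    del msg[MERIT_HEADER]  # no error if the header is absent
    msg[MERIT_HEADER] = merit_url
    return msg


message = EmailMessage()
message["From"] = "alice@example.org"
message["To"] = "bob@example.net"
message["Subject"] = "Quarterly report"
message.set_content("Please find the figures below.")

regularize_header(message, "https://merit.example.org/account/alice")
print(message[MERIT_HEADER])
```

A receiver taking part in the scheme would read this header and query the repository it points to before trusting any merit claim.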


3.2.2 Account merit processing

After each message is successfully sent out, the email account’s merit information will be updated.

Merit information for each account includes:


  • Email address

  • Last updated

  • First created

  • Merit increment history

  • Merit decrement history

  • Any details of its public keys, CA parents

The merit of each account is maintained in a secure merit repository, the merit management module (MMM), which can provide merit authentication services for the stored e-mail accounts. The MMM can be located locally or remotely.


Figure 3: Pre-processing outgoing e-mail header (components shown: email composition, header regularization, auto categorisation of anything special about the email, sending MTA)
Merit processing decrements the merit of an account, or even marks the account as a spam source, if a report of spam is received from authenticated users or mail servers. Merit is incremented according to the history of outgoing mails and of distinct positive received mails (i.e. no spam or complaints reported) for the account. To prevent an account from inflating its own merit, a large volume of correspondence with a single account should not increase merit substantially, whereas correspondence with many distinct accounts should.
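One way to picture the account record and the update rules is the following sketch. The field names follow the list of merit information given earlier in this section, but the class `AccountMerit`, the point values and the rule of crediting each distinct correspondent only once are assumptions; public key and CA details are omitted.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AccountMerit:
    email_address: str
    first_created: datetime
    last_updated: datetime
    increments: list = field(default_factory=list)  # (when, points, correspondent)
    decrements: list = field(default_factory=list)  # (when, points, reporter)
    credited: set = field(default_factory=set)      # correspondents already counted

    @property
    def score(self) -> int:
        gained = sum(points for _, points, _ in self.increments)
        lost = sum(points for _, points, _ in self.decrements)
        return gained - lost

    def record_clean_exchange(self, correspondent, when, points=1) -> bool:
        """Credit merit once per distinct correspondent, however many mails."""
        if correspondent in self.credited:
            return False
        self.credited.add(correspondent)
        self.increments.append((when, points, correspondent))
        self.last_updated = when
        return True

    def record_spam_report(self, reporter, when, points=5) -> None:
        """Apply a demerit; the reporter is assumed to be authenticated already."""
        self.decrements.append((when, points, reporter))
        self.last_updated = when


now = datetime(2024, 1, 1, 9, 0)
account = AccountMerit("alice@example.org", first_created=now, last_updated=now)
account.record_clean_exchange("bob@example.net", now)
account.record_clean_exchange("bob@example.net", now)   # repeat earns nothing
account.record_clean_exchange("carol@example.com", now)
print(account.score)
```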
3.3 Merit management module

The merit management module maintains the merit of all accounts and can be distributed across multiple merit repositories, mainly for performance reasons. The MMM provides secure storage, retrieval, update and modification of merit information and grading.


Incoming merit requests to the MMM are classified as inquiry requests or update requests, the latter covering both merit and demerit updates. For an inquiry request, the email address in question is looked up in the merit repository and its merit information is returned to the requestor. If the inquired email address is not in the repository, an entry is added and marked as “no report”. For an update request, the requestor is verified before any entry is added or modified, to prevent illegal modification of merit. The requestor must authenticate its identity using public key infrastructure.
If merit or demerit is being added for an email address for the first time, the update request is kept in a temporary repository for a pending period before it is added to the merit repository. The pending period is determined by the urgency of the request, the trustworthiness of the requestor and the frequency of reports for that email account. The pending period is aimed at obtaining sufficient evidence about the trustworthiness of that particular email account. As merit and demerit in our scheme are built on accumulated scores, a threshold is set to confirm the status of an email account.
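The request handling of the MMM can be sketched as below. Two simplifications are assumed and should be read as such: authentication is reduced to a boolean flag supplied by the caller instead of a public key check, and the pending period is replaced by a count of distinct requestors (`confirm_after`) rather than a time span derived from urgency, requestor trust and report frequency. The class and method names are hypothetical.

```python
class MeritRepository:
    """Toy merit management module with inquiry and update requests."""

    NO_REPORT = "no report"

    def __init__(self, confirm_after=2):
        self.entries = {}   # address -> accumulated score or NO_REPORT
        self.pending = {}   # address -> list of (requestor, points)
        self.confirm_after = confirm_after

    def inquire(self, address):
        """Return merit information, adding a 'no report' entry if unknown."""
        return self.entries.setdefault(address, self.NO_REPORT)

    def update(self, address, points, requestor, authenticated):
        """Apply a merit (positive) or demerit (negative) update."""
        if not authenticated:
            raise PermissionError("requestor identity not verified")
        current = self.entries.get(address, self.NO_REPORT)
        if current == self.NO_REPORT:
            # First-time addresses wait in the temporary repository.
            reports = self.pending.setdefault(address, [])
            reports.append((requestor, points))
            if len({who for who, _ in reports}) < self.confirm_after:
                return "pending"
            self.entries[address] = sum(p for _, p in reports)
            del self.pending[address]
            return "inserted"
        self.entries[address] = current + points
        return "updated"


mmm = MeritRepository()
print(mmm.inquire("new@example.org"))
print(mmm.update("new@example.org", -3, "mta-a.example.net", authenticated=True))
print(mmm.update("new@example.org", -2, "mta-b.example.net", authenticated=True))
print(mmm.inquire("new@example.org"))
```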

Figure 4: Merit management module (components shown: merit request, merit inquiry, merit update, update merit, insert new merit)
4. Other factors and implementation issues
In this paper we have introduced a spam-filtering scheme based on a merit grading mechanism. The merit scheme achieves the best result when both sender and receiver take part in it. In that case the scheme increases the accuracy of spam filtering and benefits genuine accounts by making merit portable across participating organizations.
A few aspects need further fine-tuning. For instance, there is not yet a formal standard for setting a realistic threshold on demerit and merit scores at which an account is categorised as good or bad. There is also no clear-cut solution to conflict cases in which an address is included in the blacklists and has a high merit score at the same time. Another issue is the processing overhead of operating the system, which can be measured experimentally once the prototype system is implemented.
A prototype of this scheme is currently being investigated and implemented. The modules are built on a Unix platform. For remote merit verification and modification requests, the Simple Object Access Protocol (SOAP) is the preferred protocol. SOAP is simple and suited to structured, strongly typed information exchange in a decentralised, distributed environment (Tsenov, 2002). The merit repository can be stored in a MySQL database and accessed by SOAP-based messages via a web server running IIS 5.0 on Windows .NET server. The 2-tier model in Figure 5 is recommended for communication between clients and the MMM.

Figure 5: 2-tier communication with MMM (components shown: merit client, SOAP, Windows 2000 server running IIS 5.0 and PHP scripts, Merit Management Module)
5. Conclusion
Spam can be prevented in various ways, such as content filtering, removal of open mail relays, dedicated channels for different communication purposes, and sender authentication. Each approach has its own strengths and weaknesses. Our approach uses a multiple-component filtering methodology built on our proposed distributed merit accumulation scheme. The merit scheme enables portability of account merit across all participating organizations. It also reduces the chance of false-positives by considering a wide range of factors before classifying an email as spam. Future work should investigate the application of the merit/demerit scheme to wider business areas such as online business rankings and e-shop classifications for automatic, intelligent agent-based shopping.
References:

  1. Androutsopoulos, I., Koutsias, J., Chandrinos, K., and Spyropoulos, C., An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with personal e-mail messages, Proceedings of the 23rd annual international ACM SIGIR conference, pp.160-167, 2000.

  2. Cheng, Li and Weinong, W., Internet Mail Transfer and check system based on Intelligent Mobile Agents, Proceedings of the 2002 Symposium on Applications and the Internet (SAINT’02).

  3. Cohen, W., Learning rules that classify email, Papers from the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18-25, 1996.

  4. Cranor, L.F and LaMacchia, B.A., Spam!, Communications of the ACM, 41(8): 74-83, 1998.

  5. Drucker, H., Wu, D., and Vapnik, V.N., Support Vector Machines for Spam Categorization, IEEE Transactions on neural networks, Vol.10, No.5, September 1999.

  6. Gabber, E., Jakobsson, M., Matias, Y and Mayer, A.J., Curbing Junk E-Mail via Secure Classification, Financial Cryptography, pp. 198-213, 1998.

  7. Hall, R.J., Channels: Avoiding unwanted electronic mail, Proceedings of the 1996 DIMACS Symposium on Network Threats, DIMACS, 1996.

  8. http://www.ietf.org/rfc/rfc0821.txt

  9. Joachims, T., Text categorization with support vector machines: Learning with many relevant features, Proceedings of the 10th European Conference on Machine Learning, LNCS, Springer Verlag, Heidelberg, DE, 1998.

  10. Pantel, P., and Lin, D., Spamcop: A spam classification and organization program, Proceedings of AAAI-98 Workshop on Learning for Text Categorization, pp. 95-98, 1998.

  11. Provost, J., Naïve-Bayes vs. rule-learning in classification of email, The University of Texas at Austin, Artificial Intelligence Lab, Technical report AI-TR-99-284.

  12. Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E., A bayesian approach to filtering email, AAAI-98 Workshop on Learning for Text Categorization, 1998.

  13. Schapire, R. E., and Singer, Y., Improved Boosting Algorithms using confidence-rated predictions, Machine learning, 37(3): 297-336, 1999.

  14. Tsenov, M., Application of SOAP protocol in E-commerce solution, First International IEEE symposium “Intelligent Systems”, September 2002

  15. Weiss, S.M, Apte, C., Damerau, F.J., Johnson, D.E., Oles, F.J., Goetz, T., and Hampp, T., Maximizing text-mining performance, IEEE Intelligent Systems, 1999.
