Spam Junk Mail Filter
nit_cal
29-10-2009, 03:58 PM



.pdf   Spam Junk Mail Filter.pdf (Size: 154.32 KB / Downloads: 97)
Use the search at http://topicideas.net/search.php wisely to get information about project topics and seminar ideas, with reports and source code along with PDF and PPT presentations.
project topics
22-04-2010, 10:50 PM


.zip   java_mail_filter source code.zip (Size: 25.78 KB / Downloads: 52)
Abstract
The filter blocks spam, also called unsolicited email, using a statistical approach called Bayesian filtering. The program must first be trained on a set of spam and non-spam mails, which are stored in a database. Performance improves with the amount of training the filter receives. When a new mail arrives, it is tokenised and the probability of each word is found by looking it up in the database. The total probability is then computed, and if it is greater than 0.9 the mail is marked as spam. With good training the filter can block 99% of spam mails with 0 false positives.

Presented By:
Binu Ashiq Y1066
National Institute of Technology, Calicut Department of Computer Engineering

1 Introduction
Spam is a growing problem for email users, and many solutions have been proposed, from a postage fee for email to Turing tests to simply not accepting email from people you don't know. Spam filtering is one way to reduce the impact of the problem on the individual user (though it does nothing to reduce the effect of the network traffic generated by spam). In its simplest form, a spam filter is a mechanism for classifying a message as either spam or not spam.
There are many techniques for classifying a message. It can be examined for "spam-markers" such as common spam subjects, known spammer addresses, known mail forwarding machines, or simply common spam phrases. The header and/or the body can be examined for these markers. Another method is to classify all messages not from known addresses as spam. Another is to compare with messages that others have received, and find common spam messages. And another technique, probably the most popular at the moment, is to apply machine learning techniques in an email classifier.
2 Design
The filter uses a method called Bayesian filtering. The project is implemented in C and runs on a Linux platform. An SQLite database is used to store the training data.
2.1 Bayesian Filtering
In a nutshell, the approach is to tokenize a large corpus of spam and a large corpus of non-spam. Certain tokens will be common in spam messages and uncommon in non-spam messages, and certain other tokens will be common in non-spam messages and uncommon in spam messages. When a message is to be classified, we tokenize it and see whether the tokens are more like those of a spam message or those of a non-spam message. How we determine this similarity is what the math is all about. It isn't complicated, but it has a number of variations.
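The tokenizing step can be sketched as follows. This is a minimal illustration, not the project's actual tokenizer: the function name `tokenize`, the `MAX_TOKEN_LEN` limit, and the choice of token characters (letters, digits, `$`, `-` and `'`, in the spirit of reference [1]) are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_TOKEN_LEN 32   /* assumed limit, including the terminating '\0' */

/* Splits text into lowercase tokens made of letters, digits, '$', '-' and '\''.
 * Stores at most max_tokens tokens in out and returns the number stored.
 * Tokens longer than MAX_TOKEN_LEN - 1 characters are truncated. */
size_t tokenize(const char *text, char out[][MAX_TOKEN_LEN], size_t max_tokens)
{
    size_t count = 0;   /* tokens completed so far */
    size_t len = 0;     /* length of the token being built */

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        int is_token_char = c != '\0' &&
            (isalnum(c) || c == '$' || c == '-' || c == '\'');

        if (is_token_char) {
            /* Append only while there is room for the token and the character. */
            if (count < max_tokens && len < MAX_TOKEN_LEN - 1)
                out[count][len++] = (char)tolower(c);
        } else {
            if (len > 0) {              /* a token just ended */
                out[count][len] = '\0';
                count++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}
```

Lowercasing merges "Free" and "free" into one token. Whether that helps or hurts accuracy depends on the corpus, so it is a design choice rather than a requirement of the method.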
2.2 Theory of Operation
Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.
The first assumption is widespread in text classification. Algorithms that use it are called "naive Bayesian."
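Under these two assumptions, Bayes' Rule reduces to the forms below. The notation is introduced here for illustration: S and H stand for "spam" and "non-spam", w for a token, and p_w for the per-token spam probability. This is the common formulation found in the naive Bayesian filtering literature and may differ in detail from the variant the project uses.

```latex
% Per-token spam probability from Bayes' Rule with equal priors P(S) = P(H) = 1/2
p_w = P(S \mid w) = \frac{P(w \mid S)}{P(w \mid S) + P(w \mid H)}

% Combined probability for a message with tokens w_1, \dots, w_n,
% assuming the tokens are independent
P(S \mid w_1, \dots, w_n) =
  \frac{\prod_{i=1}^{n} p_{w_i}}
       {\prod_{i=1}^{n} p_{w_i} + \prod_{i=1}^{n} \left(1 - p_{w_i}\right)}
```

A message is then marked as spam when the combined probability exceeds the 0.9 threshold given in the abstract.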
If spammers become good enough at obscuring tokens that the filter no longer recognizes them, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course, finding words this way that weren't visible in the original text would in itself be evidence of spam.
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add ("xHot nPorn cSite") and omit ("Prn") letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
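A rough sketch of the first part of this idea, stripping separators and then looking for dictionary words in what remains, is shown below. The function names `squeeze` and `count_hidden_words` are hypothetical, and the plain substring search is a deliberate simplification: it does not reconstruct word boundaries and does nothing about added or omitted letters, which is exactly the hard part described above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dst, dropping everything except letters and digits and
 * lowercasing what remains. Writes at most dst_size - 1 characters plus '\0'. */
void squeeze(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return;
    for (; *src != '\0' && n + 1 < dst_size; src++) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c))
            dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
}

/* Returns how many words from dict occur as substrings of squeezed.
 * Substring matching can over-count (a short word may appear inside a
 * longer, innocent one), so treat the result as a hint, not a verdict. */
size_t count_hidden_words(const char *squeezed,
                          const char *const dict[], size_t dict_len)
{
    size_t hits = 0;

    for (size_t i = 0; i < dict_len; i++)
        if (strstr(squeezed, dict[i]) != NULL)
            hits++;
    return hits;
}
```

For example, squeezing "F. R. E. E  m,o,n,e,y" gives "freemoney", in which both "free" and "money" can be found even though neither was a visible token in the original text.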
3 Implementation
The user first trains the filter. The training data is stored in the database. Initially, the database is empty.
For training, non-spam messages are added to a Ham table in the database with the -g option, and spam messages are added to a Spam table with the -b option. Once the filter is in use, the user can also choose to add newly detected spam to the Spam table. Eventually we reach a stage with one corpus of spam and one of non-spam mail.
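The bookkeeping behind training can be sketched as below. To keep the example self-contained, an in-memory array stands in for the SQLite tables; the names `token_row`, `corpus`, `find_or_add` and `train`, and the fixed table sizes, are assumptions made for illustration and are not taken from the project's source.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 1024    /* assumed table capacity */
#define MAX_TOKEN_LEN 32   /* assumed limit, including the terminating '\0' */

/* One row per token, mirroring what a token table in a database might hold. */
struct token_row {
    char token[MAX_TOKEN_LEN];
    unsigned spam_count;   /* occurrences in spam training messages */
    unsigned ham_count;    /* occurrences in non-spam training messages */
};

struct corpus {
    struct token_row rows[MAX_TOKENS];
    size_t n_rows;
    unsigned spam_msgs;    /* number of spam messages trained on */
    unsigned ham_msgs;     /* number of non-spam messages trained on */
};

/* Returns the row for token, creating it if needed; NULL if the table is full.
 * Only the first MAX_TOKEN_LEN - 1 characters of token are significant. */
struct token_row *find_or_add(struct corpus *c, const char *token)
{
    for (size_t i = 0; i < c->n_rows; i++)
        if (strncmp(c->rows[i].token, token, MAX_TOKEN_LEN - 1) == 0)
            return &c->rows[i];
    if (c->n_rows == MAX_TOKENS)
        return NULL;

    struct token_row *r = &c->rows[c->n_rows++];
    strncpy(r->token, token, MAX_TOKEN_LEN - 1);
    r->token[MAX_TOKEN_LEN - 1] = '\0';
    r->spam_count = 0;
    r->ham_count = 0;
    return r;
}

/* Records the tokens of one training message as spam (is_spam != 0) or ham. */
void train(struct corpus *c, const char *const tokens[], size_t n, int is_spam)
{
    for (size_t i = 0; i < n; i++) {
        struct token_row *r = find_or_add(c, tokens[i]);
        if (r == NULL)
            continue;          /* table full: skip the token in this sketch */
        if (is_spam)
            r->spam_count++;
        else
            r->ham_count++;
    }
    if (is_spam)
        c->spam_msgs++;
    else
        c->ham_msgs++;
}
```

A `struct corpus` must be zero-initialised before first use, for example by declaring it `static`. The linear search is acceptable for a sketch; an indexed database table serves the same purpose at scale.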
To train the database, run:
    ./a.out dbase.db -g *good.msg -b *bad.msg
To classify a message, run:
    ./a.out dbase.db message.msg
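The classification step can be sketched as follows, under the two assumptions from Section 2.2. The function names, the neutral value 0.5 for tokens with no training data, and the clamping of per-token probabilities to [0.01, 0.99] are assumptions made for this example; only the 0.9 threshold comes from the abstract.

```c
#include <stddef.h>

#define SPAM_THRESHOLD 0.9   /* cutoff stated in the abstract */

/* Estimates P(spam | token) from raw counts, assuming equal priors.
 * Returns 0.5 when there is no usable data (an assumed neutral value) and
 * clamps the result to [0.01, 0.99] so that no single token is decisive. */
double token_spam_prob(unsigned spam_count, unsigned ham_count,
                       unsigned spam_msgs, unsigned ham_msgs)
{
    if (spam_msgs == 0 || ham_msgs == 0 || spam_count + ham_count == 0)
        return 0.5;

    double s = (double)spam_count / spam_msgs;   /* frequency in spam */
    double h = (double)ham_count / ham_msgs;     /* frequency in non-spam */
    double p = s / (s + h);

    if (p < 0.01) p = 0.01;
    if (p > 0.99) p = 0.99;
    return p;
}

/* Combines per-token probabilities under the independence assumption:
 * prod(p) / (prod(p) + prod(1 - p)). Returns 0.5 for an empty message. */
double combine(const double probs[], size_t n)
{
    double prod_p = 1.0;
    double prod_q = 1.0;

    for (size_t i = 0; i < n; i++) {
        prod_p *= probs[i];
        prod_q *= 1.0 - probs[i];
    }
    return prod_p / (prod_p + prod_q);
}

/* Returns 1 if the combined probability exceeds the threshold, else 0. */
int is_spam(const double probs[], size_t n)
{
    return combine(probs, n) > SPAM_THRESHOLD;
}
```

For very long messages the two products can underflow. Summing logarithms instead, or combining only the most informative tokens as reference [1] does, avoids this.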
4 Conclusion
Once you have enough spam messages and non-spam messages correctly classified, you can think about using a Bayesian filter. You really want a few hundred of each type, preferably more. You also want to make sure there isn't an unintended identifying feature of the spam messages or non-spam messages. For example, don't use non-spam messages from the past 6 months and only the last month of spam messages; the learning algorithm might decide that messages with old dates are non-spam messages and messages with new dates are spam messages. Don't try to pad the numbers with duplicates; it will overtrain the filter on the features in those messages.
5 References
[1] Paul Graham. "A Plan for Spam." August 2002. paulgrahamspam.html.
[2] Steven Hauser. "Statistical Spam Filter Works for Me." sofbot.com.
[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. "A Bayesian Approach to Filtering Junk E-Mail." Proceedings of AAAI-98 Workshop on Learning for Text Categorization.