Signed Approach for Mining Web Content Outliers
project report tiger
01-03-2010, 11:24 PM


Signed Approach for Mining Web Content Outliers

Abstract

The emergence of the Internet has brought about a revolution in information storage and retrieval. As most of the data on the web is unstructured and contains a mix of text, video, audio etc., there is a need to mine information to cater to the specific needs of users without losing important hidden information. Developing user-friendly, automated tools that provide relevant information quickly has therefore become a major challenge in web mining research. Most existing web mining algorithms concentrate on finding frequent patterns while neglecting the less frequent ones, which are likely to contain outlying data such as noise, irrelevant data and redundant data. This paper focuses on a Signed approach and full word matching against an organized domain dictionary for mining web content outliers. The Signed approach yields the relevant web documents as well as the outlying web documents. Because the dictionary is organized by the number of characters in a word, searching and retrieval of documents takes less time and less space.

Presented By
G. Poonkuzhali, K. Thiagarajan, K. Sarukesi and G. V. Uma


I. INTRODUCTION

With the exponential growth of information available on the Internet, updating incoming data and retrieving relevant information from the web quickly and efficiently is a growing concern. Most web search engines employ conventional information retrieval and data mining techniques to automatically discover useful and previously unknown information from web content. With the enormous growth of the web, users easily get lost in its rich hyperlink structure. In addition, as most of the data on the web is unstructured and contains a mix of text, video, audio etc., there is a need to mine information to cater to the specific needs of users [9]. Efforts are being made to make such data available, usually in some structured form such as matrix form, for further manipulation. Web mining is an emerging research area focused on resolving these problems. The proposed work aims to develop a new methodology to mine useful knowledge or information from web documents quickly and effectively. In general, web mining tasks can be classified into three major categories: web structure mining, web usage mining and web content mining. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web usage mining refers to the discovery of user access patterns from web usage logs.

G. Poonkuzhali is Assistant Professor in the Department of Computer Science and Engineering, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India, phone: 9444836861, email: Kuzhal_s@yahoo.co.in. K. Thiagarajan is Senior Lecturer in the Department of Mathematics, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India, email: vidhyamannan@yahoo.com. K. Sarukesi is Vice Chancellor of Hindusthan University, Chennai, email: profsaru@yahoo.com. G. V. Uma is Professor in the Department of Computer Science and Engineering, Anna University-Chennai, email: gvuma@annauniv.edu.

Web content mining aims to extract useful information from web pages based on their contents [1],[4],[10],[11]. It falls into two groups: techniques that directly mine the content of documents, and techniques that improve on the content search of other tools such as search engines. For web content mining, the data can be image, audio, text and video [15]-[16]. Existing web mining algorithms do not consider documents whose contents vary within the same category, which are called web content outliers. Generally, outliers are data that obviously deviate from the rest, disobey the general mode or behavior of the data, and disagree with other existing data. Outliers may also reflect true properties of the data, such as rare disastrous weather recorded in a meteorological database, which often has one or more properties whose values deviate seriously from the normal values. Such data may contain more valuable information than normal data.

Research on outlier detection broadly falls into the following categories:

A. Distribution based methods are conducted by the statistics community. These methods deploy some known distribution model and detect as outliers the points that deviate from the model.
B. Depth based algorithms organize objects in convex hull layers in data space according to peeling depth, and outliers are expected to have shallow depth values [13].
C. Deviation based techniques detect outliers by checking the characteristics of objects and identify an object that deviates from these features as an outlier.
D. Distance based algorithms rank all points using the distance of each point from its k-th nearest neighbor and order the points by this rank. The top n points in the ranked list are identified as outliers. Alternative approaches compute the outlier factor as the sum of distances from the k nearest neighbors.
E. Density based methods rely on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood.
Points with a high factor are indicated as outliers.

Unlike traditional outlier mining algorithms designed only for numeric data sets, a web outlier mining algorithm should be applicable to various types of data including text, hypertext, image, video etc. Web pages whose contents differ from the category from which they were taken constitute web content outliers [7]-[8]. Web content outlier mining concentrates on finding outliers such as noise, irrelevant pages and redundant pages among the web documents [10]-[11]. It can also be used to determine pages with entirely different contents from their parent web sites.

In the proposed system, web documents are extracted from search engines in response to a query given by the user. The obtained set of web documents D is then preprocessed, i.e., stop words, stem words and all data other than text, such as hyperlinks, sound and images, are removed. The output is a set of documents with white-space separated words, indexed in a two-dimensional format (i,j), where 'i' represents the web page and 'j' represents the word. Therefore, the first word from the first web page is indexed as (1,1), the second word from the first page is indexed as (1,2), and so on. The domain dictionary is arranged so that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and similarly up to 15-letter words, which is a reasonable upper bound for the number of characters in a word. Each page is mined individually to detect relevant and irrelevant documents using the signed approach. Finally, a relevant web document is obtained which contains the required information catering to the user's needs.
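
The preprocessing, (i,j) indexing and length-organized dictionary described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the tiny stop-word list and the regex-based tag stripping are all hypothetical choices, stemming is left out, and the signed scoring step itself is not shown because the text quoted here does not specify it.

```python
import re

MAX_WORD_LEN = 15  # upper bound on word length stated in the text

# Tiny illustrative stop-word list (an assumption; a real list would be larger).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "on"}


def preprocess(html):
    """Strip scripts, styles and tags (hyperlinks and images go with them),
    lowercase the remaining text and drop stop words."""
    text = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def index_words(pages):
    """Map (i, j) -> word, where i is the 1-based page number and
    j is the 1-based position of the word within that page."""
    return {
        (i, j): word
        for i, page in enumerate(pages, start=1)
        for j, word in enumerate(preprocess(page), start=1)
    }


def build_dictionary(domain_words):
    """Group domain words by length: 1-letter words, 2-letter words,
    and so on up to MAX_WORD_LEN. Longer words are ignored."""
    buckets = {n: set() for n in range(1, MAX_WORD_LEN + 1)}
    for word in domain_words:
        word = word.lower()
        if 1 <= len(word) <= MAX_WORD_LEN:
            buckets[len(word)].add(word)
    return buckets


def in_dictionary(word, buckets):
    """Full-word match: only the bucket for len(word) is searched."""
    return word in buckets.get(len(word), ())


# Hypothetical usage with two toy pages and a three-word domain dictionary.
pages = [
    "<p>The mining of <a href='x'>web</a> outliers</p>",
    "<p>Cooking pasta</p>",
]
index = index_words(pages)
buckets = build_dictionary(["web", "mining", "outlier"])
matches = {pos: in_dictionary(word, buckets) for pos, word in index.items()}
```

Grouping the dictionary by word length means a lookup only touches words of the same length as the query word, which is the reason the text gives for the reduced search time. How the per-word matches are turned into a relevant or outlying verdict for each page is left to the full report.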


full report
wasetjournals/waset/v56/v56-150.pdf
eelnazz
21-07-2011, 03:19 AM

Hello
I want to implement this paper, but I have some problems and questions.
For example, how can I preprocess the web content?
How can I provide a dataset, and how can I use it?
Please help me.

thank you
seminar addict
21-07-2011, 09:35 AM

You can refer to the page details of "Signed Approach for Mining Web Content Outliers" at the link below:

topicideashow-to-signed-approach-for-mining-web-content-outliers?pid=51835#pid51835