WEB MINING
seminar projects crazy
Active In SP
**

Posts: 604
Joined: Dec 2008
#1
31-01-2009, 12:52 AM


The purpose of Web mining is to develop methods and systems for discovering models of objects and processes on the World Wide Web and for web-based systems that show adaptive performance. Web Mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and the World Wide Web, and the more recent Semantic Web.

The World Wide Web has made an enormous amount of information electronically accessible. The use of email, news and markup languages like HTML allows users to publish and read documents at a world-wide scale and to communicate via chat connections, including information in the form of images and voice records. The HTTP protocol that enables access to documents over the network via Web browsers created an immense improvement in communication and access to information. For some years these possibilities were used mostly in the scientific world, but recent years have seen an immense growth in popularity, supported by the wide availability of computers and broadband communication. The use of the internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in "e-activities" such as e-commerce, e-learning, e-government and e-science. Independently of the development of the Internet, Data Mining expanded out of the academic world into industry. Methods and their potential became known outside the academic world, and commercial toolkits became available that allowed applications at an industrial scale. Numerous industrial applications have shown that models can be constructed from data for a wide variety of industrial problems. The World Wide Web is an interesting area for Data Mining because huge amounts of information are available. Data Mining methods can be used to analyze the behavior of individual users, access patterns of pages or sites, and properties of collections of documents.

Almost all standard data mining methods are designed for data that are organized as multiple "cases" that are comparable and can be viewed as instances of a single pattern, for example patients described by a fixed set of symptoms and diseases, applicants for loans, or customers of a shop. A "case" is typically described by a fixed set of features (or variables). Data on the Web have a different nature. They are not so easily comparable and have the form of free text, semi-structured text (lists, tables) often with images and hyperlinks, or server logs. The aim of learning models of documents has given rise to the interest in Text Mining methods for modeling documents in terms of their properties. Learning from the hyperlink structure has given rise to graph-based methods, and server logs are used to learn about user behavior.

Instead of searching for a document that matches keywords, it should be possible to combine information to answer questions. Instead of retrieving a plan for a trip to Hawaii, it should be possible to automatically construct a travel plan that satisfies certain goals and uses opportunities that arise dynamically. This gives rise to a wide range of challenges. Some of them concern the infrastructure, including the interoperability of systems and the languages for the exchange of information rather than data. Many challenges are in the area of knowledge representation, discovery and engineering. They include the extraction of knowledge from data and its representation in a form understandable by arbitrary parties, the intelligent questioning and the delivery of answers to problems as opposed to conventional queries, and the exploitation of formerly extracted knowledge in this process. The ambition of representing content in a way that can be understood and consumed by an arbitrary reader leads to issues in which cognitive sciences and even philosophy are involved, such as the understanding of an asset's intended meaning.
Use Search at http://topicideas.net/search.php wisely to get information about project topics and seminar ideas with report/source code along with pdf and ppt presentation
Reply
sriman
Active In SP
**

Posts: 2
Joined: Jan 2010
#2
12-01-2010, 11:20 AM

look like me yar
Reply
justlikeheaven
Active In SP
**

Posts: 247
Joined: Jan 2010
#3
26-01-2010, 09:56 AM

Web mining ppt can be found in these links:
kict.iiu.edu.my/amzeki/courses/dm/dm11%20Web%20Mining.ppt

and

cs.sunysb.edu/~cse634/presentations/Webmining-I.ppt
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#4
03-03-2011, 11:31 AM

PREPARED BY:
TRIPTI


.ppt   web mining.an application of soft computing.ppt (Size: 460 KB / Downloads: 145)
INTRODUCTION
Explosive growth in information available on www
Web browsers provide easy access to data & text
Finding the desired information is not an easy task
Profusion of resources prompted the need for web Mining
SOFT WEB MINING - a good candidate for developing automated tools to find, extract and evaluate the user's desired info from unlabeled, heterogeneous data.
WEB MINING
Discovery & analysis of useful info from www
Data can be collected at
 Server side
 Client side
 Proxy servers
 Can be obtained from organizational database
Characteristics of web Data
 Unlabeled
 Distributed
 Heterogeneous
 Semi-structured
 Time varying
WEB MINING COMPONENTS & METHODOLOGIES
INFORMATION RETRIEVAL

Deals with automatic retrieval of all relevant documents
while fetching as few non-relevant documents as possible
IR process mainly includes
 Document representation
By Furnkranz [36]: bag of words & hyperlink information
By Soderland [40]: sentences, phrases & named entities
 Indexing (collection of terms with pointers to place where documents can be found)
POPULAR INDEX
 ALTA-VISTA
 WEB CRAWLER(can scan millions of documents and store an index of words in the document)
 Searching for Document
Search engines are used (programs written to query and retrieve info)
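The indexing idea above (a collection of terms with pointers to the documents where they occur) can be sketched as a minimal inverted index with conjunctive keyword search. The document texts here are illustrative, not from the slides.

```python
# A minimal inverted index: maps each term to the set of documents
# containing it, supporting simple AND-style keyword queries.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Return documents containing every query term (AND semantics).
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    "d1": "web mining discovers patterns on the web",
    "d2": "data mining of server logs",
    "d3": "web crawlers index documents",
}
index = build_index(docs)
print(sorted(search(index, "web mining")))  # ['d1']
```

A real engine would add tokenization, stemming and ranking on top of this structure.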
INFORMATION SELECTION/EXTRACTION & PREPROCESSING
Task of identifying specific fragments of a single document that constitute its core semantic content
METHODS USED ARE
 Involves writing wrappers (hand coding) which map the documents to some data model
 Operates by interpreting the various sites as knowledge sources & extracting information from them. To do so, the system processes the site documents to extract relevant text fragments
 To extract info from hypertext, each page is approached with a set of questions and the problem therefore reduces to identifying the text fragments which answer those specific questions

INFORMATION SELECTION/EXTRACTION & PREPROCESSING
LSI (LATENT SEMANTIC INDEXING): preprocessing technique for IE.
When a user requests a web page it includes:
 variety of files
 Images
 Sound
 Video
 Html pages
The server contains relevant & irrelevant entities, which need to be removed using this preprocessing technique.
LSI transforms the original documents to a lower-dimensional space by analysing the correlational structure of terms
Similar documents that do not share the same terms can still be placed in the same category
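The dimensionality reduction behind LSI is a truncated SVD of the term-document matrix; a minimal sketch (the tiny matrix is illustrative, not from the slides):

```python
# Sketch of LSI: a truncated SVD of a term-document matrix projects
# documents into a lower-dimensional "concept" space, so documents
# with related (but not identical) terms can end up close together.
import numpy as np

# Rows = terms, columns = documents (raw term counts).
A = np.array([
    [1, 1, 0],   # "web"
    [1, 0, 1],   # "mining"
    [0, 1, 1],   # "data"
    [0, 0, 1],   # "logs"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep the 2 strongest latent "concepts"
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # documents in concept space

print(docs_k.shape)  # (3, 2): 3 documents, 2 latent dimensions
```

Similarity between documents is then computed in this reduced space (e.g. by cosine similarity) rather than on raw term overlap.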
GENERALIZATION
Uses Pattern Recognition and Machine Learning techniques
Machine learning systems learn about the user's interests rather than the web itself
Major OBSTACLE when learning about the web is the labelling problem
Data mining techniques require inputs labelled as (+ve) or (-ve)
FOR EXAMPLE
Given a large set of web pages labelled as (+ve) or (-ve) examples of homepages,
we can design a classifier that predicts whether an unknown page is a homepage or not. But unfortunately web pages are not labelled.
Clustering techniques do not require labelled inputs and outputs
Association Rule Mining(INTEGRAL PART OF THIS PHASE)
X=>Y
X, Y -> sets of items
Expresses that whenever a transaction T contains X, then T probably contains Y also
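The "probably" in the X => Y rule above is usually quantified with support and confidence, where confidence(X => Y) = support(X ∪ Y) / support(X). A minimal sketch over an illustrative transaction set:

```python
# Support and confidence for an association rule X => Y.
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Fraction of transactions containing X that also contain Y.
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

# {a} appears in 3 of 4 transactions; {a, b} in 2 of 4.
print(round(confidence({"a"}, {"b"}, transactions), 3))  # ~0.667
```

Algorithms such as Apriori enumerate frequent itemsets first and then derive high-confidence rules from them.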
ANALYSIS
Data Driven Problem
Presumes that there is sufficient data available to extract & analyse useful information
Important for validation & interpretation of mined patterns
Uses Online Analytical processing(OLAP)techniques
WEBMINER proposes an SQL-like querying mechanism for querying the discovered knowledge
WEB MINING CATEGORIES
WCM(Web Content Mining)

Deals with the discovery of useful information from the web contents/data/documents/services.
Web content includes:
Text
audio
Video
symbolic
metadata
hyperlinked data.
Web Text Data(3 TYPES)
1) unstructured data( free text)
2) semistructured data(HTML)
3) fully structured data( tables or databases).
WSM (Web Structure Mining)
Mining the structure of hyperlinks within the web itself
Structure represents the graph of the links in a site or between sites
Reveals more information than just the information contained in documents.
Rather than collecting the whole index, it focuses only on the links that are relevant and avoids irrelevant regions
Links pointing to a document indicate the popularity of the document,
while links coming out of a document indicate the variety of documents it references
WUM (WEB USAGE MINING)
Mines secondary data generated by the user’s interaction with web
Also known as web log mining
Works on user profiles, user access patterns, and mining navigation paths
Plays a key role in personalizing space, which is the need of the hour.
Uses techniques like:
 Association Rules
 Clustering
 Sequential Patterns
 Rough Sets
 Fuzzy Logic
LIMITATIONS OF EXISTING WEB MINING METHODS
INFORMATION RETRIEVAL

 Subjectivity, Imprecision, and Uncertainty
 Deduction
 Page Ranking
 Dynamism, Scale, and Heterogeneity
 INFORMATION EXTRACTION
 Based on the “wrapper” technique
 Limitation : Each wrapper is an IE system customized for a particular site and is not universally applicable.
 Ad hoc formatting conventions, used in one site, are rarely relevant elsewhere.
GENERALIZATION
 Clustering
 Outliers
 Association Rule Mining
ANALYSIS
 Knowledge discovery out of the information is a challenge to the analysts
 The output of knowledge mining algorithms is not suitable for direct human interpretation.
 The patterns discovered are mainly in mathematical form
SOFT COMPUTING & ITS RELEVANCE
SOFT COMPUTING - a collection of methodologies which provide information processing capabilities for handling real-life ambiguous situations
FUZZY LOGIC AND WEB MINING
FUZZY SETS – Their elements possess degrees of membership
Classically, membership of an element in a set was bivalent
- Element belongs to the set (1)
- Element does not belong to the set (0)
A fuzzy set is designated by a pair (A, m),
where A is a set and m : A -> [0,1]
Values strictly between 0 & 1 represent fuzzy members
Here the degree of truth of a statement can range between 0 & 1.
The degree is not restricted to the two truth values true (1) and false (0)
Deals with reasoning that is approximate rather than precise.
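The pair (A, m) above can be sketched directly as a membership function; the fuzzy set "tall" over heights and its linear ramp are an illustrative choice, not from the slides:

```python
# A fuzzy set (A, m): the membership function m maps elements to [0, 1]
# rather than to the crisp values {0, 1}.
def tall(height_cm):
    """Degree to which a height belongs to the fuzzy set 'tall'."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30  # linear ramp between 160 cm and 190 cm

print(tall(150), tall(175), tall(195))  # 0.0 0.5 1.0
```

A crisp set would only return 0 or 1; the values in between are what make the reasoning "approximate rather than precise".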
FUZZY LOGIC AND WEB MINING
YAGER described an IR language which enables a user to specify interrelationships between desired attributes of the documents sought, using linguistic quantifiers e.g. "at least", "most", "about half"
Q -> linguistic expression for the quantifier "most"
Represented by a fuzzy subset over I = [0,1]
For any proportion r belonging to I,
Q(r) -> degree to which r satisfies the concept indicated by
the quantifier Q.
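A quantifier like "most" is commonly sketched as a piecewise-linear membership function over proportions r in [0, 1]; the breakpoints below (0.3 and 0.8) are an illustrative choice, not values from the slides or from Yager's paper:

```python
# Fuzzy quantifier Q = "most": Q(r) is the degree to which a
# proportion r of satisfied criteria counts as "most".
def most(r):
    if r <= 0.3:
        return 0.0
    if r >= 0.8:
        return 1.0
    return (r - 0.3) / 0.5  # linear ramp between 0.3 and 0.8

# Degree to which "most criteria are satisfied" holds when 60% are:
print(round(most(0.6), 2))  # 0.6
```

In an IR setting, r would be the (possibly weighted) fraction of the user's desired attributes a document satisfies, and Q(r) its overall match score.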
Model proposed by Koczky & Gedeon
-Helps in retrieval of documents where it cannot be guaranteed that the user queries include actual words that occur in the documents to be retrieved
FUZZY LOGIC AND WEB MINING
Model proposed by Bordogna and Pasi for semi-structured (e.g. HTML) document retrieval
- d ∈ D is the representation of a document, where D is the set (archive) of documents
- t ∈ T, where T is the set of index terms
- The membership function gives the significance of term t in section s of document d
COMMERCIALLY AVAILABLE SYSTEMS
NZsearch
- Search engine based on Fuzzy Logic
- It considers entire phrase rather than individual words for the purpose of matching
DNS Search
- Uses FL to find the closest DNS entry to your typed URL.
- E.g. You type gogle.com
- System will give suggestions on possible close URLs
Finder
- Uses Multidimensional optimization to display best or “Most suited” matches to the query.
- Existing search engines provide exact match to the query.
- Finder goes beyond “yes” or “no” criterion used by SQL or Btrieve.
- Uses SCORING MODEL
- E.g. if one is looking for a blue car but the car in the database is red, it will not ignore the entry altogether but will give it a lower score.
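A scoring model in the spirit of the Finder example above can be sketched as follows; the similarity table and the car records are illustrative, not taken from the actual product:

```python
# Instead of a hard yes/no match, each record gets a similarity score,
# so a red car still appears (with a lower score) when the user asked
# for a blue one.
COLOR_SIM = {
    ("blue", "blue"): 1.0,
    ("blue", "red"): 0.3,
    ("blue", "green"): 0.4,
}

def score(query_color, record_color):
    return COLOR_SIM.get((query_color, record_color), 0.0)

cars = [("car1", "blue"), ("car2", "red"), ("car3", "green")]
ranked = sorted(cars, key=lambda c: score("blue", c[1]), reverse=True)
print([name for name, _ in ranked])  # ['car1', 'car3', 'car2']
```

This is what goes "beyond the yes/no criterion" of SQL-style exact matching: every candidate is ranked rather than filtered out.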
FL – Prospective areas of application
Provides human like deductive capability to the search engine.
Can be used for approximate term matching by compromising slightly on precision.
For “Page ranking”, the degree of closeness of hits in a document can be used e.g. variables like ”close”, “far”, “nearby” can be used.
NEURAL NETWORK AND WEB MINING
Parallel interconnected network of simple processing elements which is intended to interact with the objects of the real world in the same way as biological systems do.
Designated by
- Network Topology
- Weights
- Node characteristics
- Status updating rules
Characteristics
- Generalization capability
- Adaptivity to new data/information
- Speed due to massively parallel architecture
- Robustness to missing, confusing, ill-defined/noisy data.
- Capability for modeling non-linear decision boundaries.
NEURAL NETWORK AND WEB MINING
WISCONSIN ADAPTIVE WEB ASSISTANT (WAWA-IE+IR) SYSTEM
Suggested by Shavlik
Uses 2 network models
 SCORE LINK
Uses unsupervised learning.
 SCORE PAGE
Uses supervised learning in the form of advice from users. Here the system uses Knowledge Based Neural Nets (KBNNs) as its knowledge base to encode the initial knowledge of users, which is then refined.
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#5
04-03-2011, 03:34 PM


.ppt   improving web efficiency.ppt (Size: 157.5 KB / Downloads: 80)
INTELLIGENT WEB MINING
Improving the efficiency of web search engines
PRESENT SCENARIO
• INDEX BASED SEARCHING
• Page Ranking Algorithm.
PROBLEMS
• DIFFICULT FOR INEXPERIENCED USERS. (You don’t get what you want!)
• POLYSEMY PROBLEM DUE TO INDEX SEARCH.
• POSSIBLE FLAW IN PAGE RANKING ALGORITHM.
• REAL TIME EXAMPLE
• SEARCH FOR “BUSH” ON WWW.GOOGLE.COM
• 266,000,000 RESULTS!!!
• FIRST TEN PAGES ONLY HAS PRESIDENT BUSH.
• IS THERE ONLY PRESIDENT BUSH IN THIS WORLD?
HOW SEARCH ENGINES WORK?
• BY ANALYSING WEB PAGE STRUCTURE, USING DOM TREE STRUCTURE.
OUR PROPOSALS

TO OVERCOME THE PRESENT PROBLEMS:
• CONTEXT BASED SEARCH
• DUAL ROLE TREE STRUCTURE
• TAGGING SIMILAR WORDS TOGETHER
CONTEXT BASED SEARCH
• IDENTIFIES CONTEXTS IN WEB PAGES THROUGH AUTOMATED KEYWORD IDENTIFICATION.
• CONTEXT WORDS BECOME NODES OF CONTEXT BASED TREE.
• NODES ARE ORDERED BASED ON SIMILARITY WITH KEYED IN WORD.
• SEARCH ENGINE SEARCHES CONTEXT TREE.
• DISPLAY.
• DUAL ROLE BASED TREE = DOM TREE STRUCTURE + CONTEXT BASED TREE STRUCTURE
BUT, HOW TO CREATE CONTEXTS?
ANT ANALOGY
 AN ANT IDENTIFIES THE SMELL OF FOOD. HERE SMELL IS THE ATTRIBUTE. SIMILARLY, IDENTIFY ATTRIBUTES OF DATA AND SEARCH FOR THEM.
 AN ANT LOOKS IN LIKELY PLACES. SIMILARLY, SEARCH FOR LIKELY CLUSTERS USING CORRELATION ANALYSIS.
PROTOTYPE
Search for: BUSH
• PRESIDENT, SHRUBS, TRIBES (BUSHMEN) could be possible nodes of the context tree.
• PRIORITY WOULD BE GIVEN TO EVERY NODE.
• CHANCES ARE THE USER IS NOT DISAPPOINTED.
BENEFITS
• EFFICIENT
• UNNECESSARY INFORMATION WILL BE ABSENT.
• IMPROVES THE PAGE RANKING ALGORITHM.
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#6
05-03-2011, 04:08 PM


.doc   Web miningown.doc (Size: 83.5 KB / Downloads: 94)
1. INTRODUCTION
Web:
Nowadays the internet, otherwise called the web, is widely used in several fields. The advent of e-Commerce has revolutionized several business industries such as banking and insurance.
You can access information, pursue studies and communicate with educational institutions. Because of the dynamic nature of the web, you may face challenges locating any specific domain of information.
Mining:
Mining is a technique of information or knowledge discovery from millions of sources across the Web.
Web Mining:
Web mining is a specialized application of data mining. The web mining is a technique to process information available on web and search for useful data. Web mining enables you to discover web pages, text documents, multimedia files, images and other types of resources from web. Web mining is widely used in several fields.
2. WEB MINING APPLICATIONS
The various fields where web mining is applied are:
 E-Commerce
 Information filtering
 Fraud detection
 Education and research
E-Commerce:
In e-commerce, web mining helps in generating user profiles by capturing the choices of users. For example, web mining enables a user to search for an advertisement and information regarding a product of his interest. Internet advertising is one of the major fields in e-commerce where web mining is widely used. Advertising in a specific domain of an e-commerce web site or on a general web site is considered one of the major application areas of web mining.
Information filtering:
Information filtering is the method of identifying the most important results from a list of discovered frequent sets of data items, for which you can make use of web mining.
Fraud detection:
Fraud detection can be performed using web mining by maintaining a list of signatures of all the users. Web mining is also applied for plagiarism detection and research works.
3. WEB MINING TASKS
The general techniques and algorithms of data mining are also applicable in web mining. Web mining tasks can be decomposed into four subtasks:
 Resource Searching: Indicates the task of retrieving documents from the web.
 Information selection: denotes automatic extraction of information from the web documents. Several web mining tools such as Web Miner are available to perform this task.
 Generalization of patterns: Denotes automatic discovery of patterns across multiple web sites.
 Analysis of web documents: Denotes validation and analysis of the extracted patterns.
4. WEB MINING ISSUES
Several software tools are available to perform the various tasks of web mining. For example, you can download a complete web site with its structure using software such as Teleport Pro, Back Street Browser and Grab-a-Site. These tools are called offline browsers. Offline browsers enable you to download a copy of the web pages of a web site and explore the web pages offline. While browsing the downloaded web pages offline, you need not worry about the communication link and the cost of using the web.
An offline web browser can save all the files of a web site in a single folder. In this case, you will lose the path information of a file.
If there are two files with the same name, the previous file will be overwritten. Otherwise, you can save all the files with the original directory structure. This is called saving a mirror of a web site. In such a case, the home page of the web site is saved at the topmost level of the hierarchy of the downloaded files and directories. The directory containing the home page of the web site is called virtual root directory.
Various web analyzing tools are available to analyse statistics on a range of factors of a web site, such as the average number of web pages viewed by visitors. You can also analyse the behaviour of the visitors of the web site using these tools. The visitor information is usually stored in a log file on the server of a web site. The structure of a log file varies depending upon the type of web server; the structure of the log file of an Internet Information Server (IIS) differs from that of an Apache web server. The web analyzing tools read data regarding a web site from the log file.
The technique of retrieving visitor-based information from web servers' log files and applying this information to analyse data is known as web log mining. For example, consider that a web site consists of four web pages. Whenever you visit the web site, the home page is displayed by default. You can access the remaining three pages only by clicking the specific links provided in the home page.
There are two major types of log files used in web log mining:
1. Access log files
2. Agent log files
An access log file maintains a list of all the web pages that the visitors have requested. These web pages include the HTML files and their embedded graphic images and any other associated files such as texts. An agent log file consists of information about the browser that was used to explore the various web pages.
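Counting page requests from an access log can be sketched as below; the parsing assumes the widely used Common Log Format, and the sample lines are fabricated for illustration:

```python
# Minimal parse of Common Log Format access-log lines, counting
# successful (status 200) requests per page.
import re
from collections import Counter

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \d+'
)

lines = [
    '10.0.0.1 - - [15/Mar/2011:10:05:03 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [15/Mar/2011:10:05:09 +0000] "GET /about.html HTTP/1.0" 200 1024',
    '10.0.0.1 - - [15/Mar/2011:10:06:12 +0000] "GET /index.html HTTP/1.0" 200 2326',
]

hits = Counter()
for line in lines:
    m = LOG_RE.match(line)
    if m and m.group(2) == "200":  # group 1 = path, group 2 = status code
        hits[m.group(1)] += 1

print(hits.most_common(1))  # [('/index.html', 2)]
```

Web log mining tools build exactly this kind of aggregate (plus sessions and navigation paths) from much larger logs.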
You can use web analyzing software to determine which web page has been accessed the most by visitors, and thus verify the behaviour of the visitors. Examples of web analyzing software are OneStat, Webalizer and ClickTracks Analyzer. Webalizer works on the Linux platform while the other tools work on the Windows platform.
Creating and maintaining a community-specific web site is another important issue in web mining. People with the same interests can form a community for cooperation and sharing of web documents. For example, some universities and research institutes form a community. In the web site of each community, you can find hyperlinks to the other members of the community. This helps users navigate between different community-specific web sites to share information.
In mathematical terms, a community is a set of web pages where each page has multiple links to other members of the community, rather than links outside the community. You can visualize a community in terms of a graph. There can be two types of entities in a community: hubs and authorities. Hubs do not provide any information; rather, hubs point to the sources of information. In other words, hubs specify the locations where you can search for information. Examples of hubs are directories and online yellow pages. An authority is a source for which a single person or a group of persons is authorized to provide information about the subject concerned. These authorized persons are also responsible for updating the information at intervals. A good hub points to numerous authorities. On the other hand, a good authority is linked to by several hubs.
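The mutual reinforcement described above ("a good hub points to numerous authorities; a good authority is linked to by several hubs") is the basis of Kleinberg's HITS algorithm. A minimal sketch over an illustrative link graph:

```python
# HITS-style hub/authority scores: iterate two mutually reinforcing
# scores over a link graph until they stabilize.
import math

links = {  # page -> pages it links to
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "auth1": [],
    "auth2": ["auth1"],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority score: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = math.sqrt(sum(v * v for v in auth.values()))
    auth = {p: v / norm for p, v in auth.items()}
    # Hub score: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = math.sqrt(sum(v * v for v in hub.values()))
    hub = {p: v / norm for p, v in hub.items()}

best_authority = max(auth, key=auth.get)
print(best_authority)  # auth1: linked to by both hubs and by auth2
```

Communities then correspond to densely connected hub/authority clusters in the web graph.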
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#7
10-03-2011, 11:59 AM


.ppt   improving web efficiency1.ppt (Size: 155.5 KB / Downloads: 41)
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#8
15-03-2011, 10:05 AM

Abstract
Well-designed object-oriented programs typically consist of a few key classes that work tightly together to provide the bulk of the functionality. As such, these key classes are excellent starting points for the program comprehension process. We propose a technique that uses web mining principles on execution traces to discover these important and tightly interacting classes. Based on two medium-scale case studies – Apache Ant and Jakarta JMeter – and detailed architectural information from their developers, we show that our heuristic does in fact find a sizeable number of the classes deemed important by the developers.
Keywords: Reverse engineering, dynamic analysis, web mining, program comprehension
Introduction
Reverse engineering is defined as the analysis of a system in order to identify its current components and their dependencies and to create abstractions of the system's design [3]. A reverse engineering operation almost always takes place in service of a specific purpose, such as re-engineering to add a specific feature, maintenance to improve the efficiency of a process, reuse of some of its modules in a new system, etc. In order to perform any of these operations the software engineer must comprehend a given program sufficiently well to plan, design and implement modifications and/or additions. As such, program comprehension can be defined as the process the software engineer goes through when studying the software artifacts, with as goal the sufficient understanding of the system to perform the required operations [18]. Such software artifacts could include the source code, documentation and/or abstractions from the reverse engineering process. Gaining understanding of a program is a time-consuming task taking up to 40% of the time-budget of a maintenance operation [26, 21, 4]. The manner in which a programmer gets understanding of a software system varies greatly and depends on the individual, the magnitude of the program, the level of understanding needed, the kind of system, ... [17, 28, 8, 22]. As such, sizeable gains in overall efficiency can be attained by providing assistance to the software (reverse) engineer for his/her program understanding process. We propose a heuristic in this paper that can help the engineer with finding important classes that should be looked at first when starting the comprehension process. Studies and experiments reveal that the success of decomposing a program into effective mental models depends on one's general and program-specific domain knowledge [25].
While a number of different models for the cognition process have been identified, most models fall into one of three categories: top-down comprehension, bottom-up comprehension or a hybrid model combining the previous two [19]. The top-down model is traditionally employed by programmers with code domain familiarity. By drawing on their existing domain knowledge, programmers are able to efficiently reconcile application source code with system goals. The bottom-up model is often applied by programmers working on unfamiliar code [5]. To comprehend the application, they build mental models by evaluating program code against their general programming knowledge [18]. Because of the human cognition process, program understanding can never be a fully automated process: the programmer should be free to explore the software, with the help of specialized tools [9, 6]. These program exploration tools should identify those parts of the program that are likely to be interesting from a program understanding point of view [14]. For instance, in the case of object-oriented programs – which is the main focus of our work – program exploration tools should reveal those classes that form core parts of the design. Orthogonal to the selection of the cognitive strategy, i.e. which mental model to employ, is the choice between several technical strategies, namely (1) static analysis, i.e., by examining the source code, (2) dynamic analysis, i.e., by examining the program's behavior, or (3) a combination of both. In the context of object-oriented systems, due to polymorphism, static analysis is often imprecise with regard to the actual behavior of the application [27]. Dynamic analysis, however, allows us to create an exact image of the program's intended runtime behavior.
Our actual goal is to find frequently occurring interaction patterns between classes. These interaction patterns can help us build up understanding of the software. In this paper we propose a technique that applies data mining techniques to event traces of program runs. As such, our technique can be catalogued in the dynamic analysis context. The technique we use was originally developed to identify important hubs on the Internet, i.e., pages with many links to authoritative pages, based on only the links between web pages [16]. Hence, the Internet is viewed as a large graph. We verify that important classes in the program correspond to the hubs in the dynamic call-graph of a program trace. We apply the proposed technique to two medium-scale case studies, namely Apache Ant and Jakarta JMeter. The results show that hubiness is indeed a good measure for finding important classes in the system's architecture. The organization of the paper is as follows. First, in Section 2, we give an overview of the different steps in the process and the different algorithms we use. Section 3 shows how we plan to validate the results of our technique. Section 4 explains the data mining algorithm in detail, while in Section 5 the results of applying our technique on the two case studies are discussed. Section 6 explores related work, while Section 7 points to future research and concludes the paper.
2 Overview of our proposed technique
The technique we propose can be seen as a 4-step process. In this section we explain each of the 4 steps.
Define execution scenario. Applying dynamic analysis requires that the program is executed at least once. The execution scenario, i.e., which functionality of the program gets executed, is very important as it has a great influence on the results of the technique. For example, if the software engineer is reverse engineering a banking application and more specifically wants to know the inner workings of how interest rates are calculated, the execution scenario should at least contain one interest rate calculation. On the other hand, by keeping the execution scenario specific, e.g. only calculating the interest rate, and not executing money transfers, the final results will be more precise. In terms of UML, this would be the same as limiting the number of use cases [13].
Non-selective profiling. Once the execution scenario has been defined, the program must be executed according to the defined scenario. During the execution all calls to and returns from methods are logged in the event trace. For this step, we relied on a custom-made JVMPI profiler. Please note however that even for small and medium-scale software systems and precisely defined execution scenarios, event traces become very large. Table 1 gives an overview of some metric data for our two case studies.
Datamining. By examining the event trace we want to discover the classes in the system that play an active role in the execution scenario. Classes that have an active role depend on many other classes to perform functions for them. In Figure 1 we show an example of a compacted call graph. The compacted call graph is derived from the dynamic call graph; it shows an edge between two classes A → B if an instance of class A sends a message to an instance of class B. The weights on the edges give an indication of the tightness of the collaboration, as it is the number of distinct messages that are sent between instances of both classes.
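Building a compacted call graph from an event trace can be sketched as follows; the trace tuples (caller class, callee class, method) are illustrative, not taken from the paper's case studies:

```python
# Compacted call graph: an edge A -> B whose weight is the number of
# DISTINCT messages instances of A sent to instances of B, so repeated
# calls of the same method count once.
from collections import defaultdict

trace = [
    ("Main", "Parser", "parse"),
    ("Main", "Parser", "parse"),      # repeated call, same message
    ("Main", "Parser", "reset"),
    ("Parser", "Lexer", "nextToken"),
]

edges = defaultdict(set)
for caller, callee, method in trace:
    edges[(caller, callee)].add(method)

weights = {edge: len(methods) for edge, methods in edges.items()}
print(weights[("Main", "Parser")])  # 2 distinct messages: parse, reset
```

The hub-finding algorithm from the web-mining step is then run on this weighted graph rather than on the raw trace.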


download full report
http://wwwis.win.tue.nl/~tcalders/pubs/ANDYCSMR05.pdf
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#9
15-03-2011, 03:04 PM


.ppt   WEB MINING.ppt (Size: 1.58 MB / Downloads: 67)
INTRODUCTION
What is web mining?

Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
Why web usage mining?
E-commerce
E-business
How to perform web usage mining?

Web server log files were initially used by webmasters and system administrators to answer questions such as:
1. How much traffic are we getting?
2. How many requests fail?
3. What kinds of errors are being generated?
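These three questions can be answered directly from a server access log. A minimal sketch assuming the Common Log Format; the sample lines are made up:

```python
import re
from collections import Counter

# Common Log Format: host, identity, user, [time], "request", status, size.
LOG_RE = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)')

def summarize(log_lines):
    """Answer the three webmaster questions above: total traffic,
    failed requests, and which errors occur."""
    total, failed, errors = 0, 0, Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        total += 1
        status = int(m.group("status"))
        if status >= 400:          # 4xx/5xx responses count as failures
            failed += 1
            errors[status] += 1
    return total, failed, errors

sample = [
    '1.2.3.4 - - [15/Mar/2011:03:04:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '1.2.3.4 - - [15/Mar/2011:03:04:05 +0000] "GET /missing.html HTTP/1.0" 404 209',
    '5.6.7.8 - - [15/Mar/2011:03:05:00 +0000] "POST /login HTTP/1.0" 500 312',
]
print(summarize(sample))  # (3, 2, Counter({404: 1, 500: 1}))
```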
TAXONOMY OF WEB MINING
 Web content Mining:
 Web crawler: when searching Web pages, the main problems are:
Scale, Variety, Duplicates, Domain Name Resolution
Types of crawler:
1. Traditional Crawler
2. Periodic Crawler
3. Incremental Crawling
4. Focused Crawling.
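A minimal sketch of the "traditional" crawler from the list above, using only the Python standard library: it follows links breadth-first from a seed page and skips already-seen URLs (addressing the Duplicates problem mentioned earlier). The seed URL and page limit are whatever the caller supplies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl starting from `seed`; returns visited URLs."""
    queue, seen = deque([seed]), {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                     # unreachable or malformed URL: skip
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:        # duplicate elimination
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

A production crawler would also respect robots.txt, rate-limit requests, and normalize URLs; this sketch only illustrates the basic queue-and-visited-set structure.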
 Harvest system:
1. Collector-Internet Service Provider
2. Broker-Index and query interface
 Virtual Web View:
This approach is based on a multiple-layered database built on top of the Web.
Personalization:
With Web personalization, users can get more information on the Internet faster because Web sites know their interests and needs.
The Web site then uses the database to match users’ needs to the products or information provided at the site, with middleware facilitating the process.
Web Structure Mining
 The two techniques for structure mining:
1. Page Rank: PR is one of the methods Google uses to determine a page’s relevance or importance. The PR value for a page is calculated from the number of pages that point to it. PR is displayed on the toolbar of your browser if you’ve installed the Google toolbar.
Page Rank: The actual page rank for each page is calculated by Google.
Toolbar PR: The page rank is displayed in the Google toolbar in your
browser. This ranges from 0 to 10.
Backlink: If page A links out to page B, then page B is said to have a “backlink” from page A.
 Definition by Google:
We assume page A has pages T1…Tn which point to it. The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A.
The PR of a page A is given as follows:
PR(A)=(1-d)+d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))
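The formula above can be evaluated by simple fixed-point iteration: start every page at PR = 1 and repeatedly apply the equation until the values settle. A sketch in Python; the three-page link structure is a made-up example, not from the text:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(A) = (1-d) + d * sum(PR(T)/C(T)), where
    the T are the pages pointing to A and C(T) is the number of links
    going out of T (the formula quoted above).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # start every page at PR = 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if p in links[t])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# Hypothetical 3-page web: A and B link to each other and to C;
# C links back only to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]})
print(ranks)  # A ends up highest: it has two backlinks, one from
              # C, which spends its entire PR on that single link.
```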
2. Important Pages:
A page is important if important pages link to it.

Assume that the Web consists of only three pages: Netscape, Microsoft and Amazon. The links among these pages are shown in a figure (not reproduced here). In the limit, the solution is n = a = 6/5 and m = 3/5. That is, Netscape and Amazon each have the same importance, twice the importance of Microsoft.
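The figure with the link structure is missing above, but the stated limit n = a = 6/5, m = 3/5 is reproduced by the standard textbook link structure, which we assume here: Netscape links to itself and Amazon, Microsoft links to Amazon, and Amazon links to Netscape and Microsoft. Each page splits its importance equally over its out-links:

```python
# Assumed link structure (see the note above -- the original figure
# is not reproduced in this post).
links = {
    "Netscape":  ["Netscape", "Amazon"],
    "Microsoft": ["Amazon"],
    "Amazon":    ["Netscape", "Microsoft"],
}

# Undamped importance: each page distributes its current importance
# equally over its out-links; iterate until the values stop changing.
imp = {p: 1.0 for p in links}            # total importance starts at 3
for _ in range(100):
    new = {p: 0.0 for p in links}
    for p, outs in links.items():
        for q in outs:
            new[q] += imp[p] / len(outs)
    imp = new

print({p: round(v, 4) for p, v in imp.items()})
# -> {'Netscape': 1.2, 'Microsoft': 0.6, 'Amazon': 1.2}
```

The limit matches the text: Netscape and Amazon each converge to 6/5 and Microsoft to 3/5.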
The following problems are faced by this scheme on the Web:
a. Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance will “leak out of” the web.
b. Spider traps: a group of one or more pages that have no links out of the group eventually accumulates all the importance of the web.
Web usage Mining:
Web usage mining has three activities given below:

Preprocessing activities center around reformatting the web log data before processing.
Pattern discovery activities form the major portion of the mining activities because these activities look to find hidden pattern within log data.
Pattern analysis is the process of looking at and interpreting the results of discovery activities.
This application is quite different from traditional data mining applications such as the “Goods Basket” (market basket) model. We can interpret this problem from two aspects:
1. Weak Relations between user and site
2. Complicated behaviors
WEB MINING ARCHITECTURE
WebMiner system:

 This system divides the Web usage mining process into three main parts; its inputs are the access, referrer and agent logs and the HTML files that make up the site. The steps are:
Data cleaning
Transaction identification
Data integration
User identification
Session identification
Preprocessing:
Preprocessing in web usage mining includes the following steps:
 Collection of usage data for web visitors: some services require user registration.
 User identification: it is easy to identify different users, but it cannot be guaranteed that private registration information will not be misused by hackers.
 Session construction: a session is one visit. Two time constraints are needed for session construction: the time gap between any two consecutively accessed pages cannot exceed a defined threshold, and the duration of any session cannot exceed a defined threshold.
 Behavior recovery:
User behavior is recovered from the session for this user and defined as b = (S', R), where R is a relation among the pages of S'.
<0,292,300,304,350,326,512,510,512,515,513,292,319,350,517,286>
It includes two kinds of behavior
The first is that user behaviors are represented with only those unique accessed pages.
S’= <0,292,300,304,326,510,512, 513,515,319,350,517,286>
The second is that user behaviors are represented with those unique accessed pages and also the access sequence among these pages.
<0-292-300-304-350-326-512-510-513-515-319-517-286>
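The session construction and behavior recovery steps above can be sketched as follows. This is an assumed implementation: the 30-minute gap is a commonly used threshold, not taken from the text, and for brevity only the gap constraint (not the maximum session duration) is enforced.

```python
MAX_GAP = 30 * 60   # 30 minutes, an assumed threshold (in seconds)

def sessions(requests, max_gap=MAX_GAP):
    """Split one user's (timestamp, page) requests into sessions:
    a new session starts when the gap between two consecutive
    requests exceeds `max_gap`."""
    out, current = [], []
    last_t = None
    for t, page in sorted(requests):
        if last_t is not None and t - last_t > max_gap:
            out.append(current)
            current = []
        current.append(page)
        last_t = t
    if current:
        out.append(current)
    return out

def unique_pages(session):
    """Behavior recovery, first representation: the unique accessed
    pages, in order of first access."""
    seen, result = set(), []
    for page in session:
        if page not in seen:
            seen.add(page)
            result.append(page)
    return result

reqs = [(0, "0"), (10, "292"), (20, "300"), (40, "292"), (4000, "517")]
print(sessions(reqs))        # [['0', '292', '300', '292'], ['517']]
print(unique_pages(["0", "292", "300", "292"]))  # ['0', '292', '300']
```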
Applications
 Intelligent Web services
 Log analysis for security applications
 Contextual information access and retrieval
 Recommendation and personalization systems
 Fraud and misuse detection, such as credit-card fraud and network intrusion detection.
Services:
 User Modeling and Profiling
 Enabling Technologies
 Web content, usage, structure mining
Conclusion:
 In this paper we proposed a definition of Web mining and developed a taxonomy of the various ongoing efforts related to it.
 Companies find a new and better way to do business.
 However, an e-business cannot just build a web site and then sit back and reap the benefits; in most cases such an approach is fruitless.
 Companies have to implement web mining systems to understand their customers’ profiles, and to identify the strengths and weaknesses of their e-marketing efforts on the web through continuous improvement.
Reply
jacktorson
Active In SP
**

Posts: 5
Joined: Mar 2011
#10
15-03-2011, 04:20 PM

This is why it is necessary. The purpose of Web mining is to develop methods and systems for discovering models of objects and processes on the World Wide Web and for web-based systems that show adaptive performance. Web Mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and World Wide Web, and for the more recent SemanticWeb.
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#11
19-04-2011, 11:34 AM

Presented by:
Shaikh Farhat I


.pptx   Introduction.pptx (Size: 353.78 KB / Downloads: 41)
INTRODUCTION
The Web is huge, dynamic and diverse, and thus raises scalability, multimedia-data and temporal issues respectively.
Thus we are drowning in information and facing information overload. Information users can encounter problems when interacting with the Web.
Highly dynamic: explosive growth of the amount of content on the internet.
Web search engines return thousands of results, which are difficult to browse.
Online repositories are growing rapidly.
Online organizations generate a huge amount of data. How can we make the best use of this data?
Finding relevant information: irrelevance of many of the search results; inability to index all the information available on the web.
Creating new knowledge out of the information available on the web: this presumes that we already have a collection of web data and want to extract potentially useful knowledge from it.
Personalization of the information: this problem is often associated with the type and presentation of the information.
Learning about consumers or individual users: this problem is about knowing what customers do and want.
ROLE OF WEB MINING
Web mining techniques could be directly or indirectly used to solve the information overload problems described before.
Directly - application of web mining techniques directly addresses the problem.
Attack the problem with web mining techniques
Indirectly - web mining techniques are used as a part of a bigger application that addresses the problems mentioned before.
OTHER APPROACHES
Web mining is NOT the only useful tool; other useful techniques include:
1. DB (databases)
2. IR (information retrieval)
3. NLP (natural language processing): in-depth syntactic and semantic analysis
4. Web document community: standards, manually appended meta-information, maintained directories, etc.
DEFINITION OF WEB MINING
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services.
Reply
seminar class
Active In SP
**

Posts: 5,361
Joined: Feb 2011
#12
12-05-2011, 09:25 AM


.doc   Web mining11.doc (Size: 54 KB / Downloads: 45)

.ppt   slides.ppt (Size: 2.61 MB / Downloads: 66)
Web
Nowadays the internet, also called the web, is widely used in many fields. The advent of e-commerce has revolutionized several business industries such as banking and insurance.
You can access study material and communicate with educational institutions. Because of the dynamic nature of the web, locating information in any specific domain can be challenging.
Mining
Mining is a technique of information or knowledge discovery from millions of sources across the Web
Web Mining:
Web mining is a specialized application of data mining: a technique to process information available on the web and search for useful data. Web mining enables you to discover web pages, text documents, multimedia files, images and other types of resources on the web. It is widely used in several fields. The fields where web mining is applied include:
 E-Commerce
 Information filtering
 Fraud detection
 Plagiarism detection
 Education and research
E-Commerce
 In e-commerce, web mining helps in generating user profiles based on users’ choices. For example, web mining enables a user to search for an advertisement and information regarding a product of his interest. Internet advertising is one of the major fields in e-commerce where web mining is widely used. Advertising in a specific domain of an e-commerce web site or a general web site is considered one of the major application areas of web mining.
Information filtering
 Information filtering is the method of identifying the most important results from a list of discovered frequent data item sets, a task for which web mining can be used.
Fraud detection
 Fraud detection can be performed using web mining by maintaining a list of signatures of all the users. Web mining is also applied for plagiarism detection and research works.
The problems and difficulties in Web mining are
The size and span of World Wide Web (WWW) is huge and very wide. Thus it takes time to explore such a large volume of data.
The web is widely distributed, so any break in communication links interrupts the web mining process.
The information available in the web is highly dynamic and your web mining tools need to locate a specific part of the available information.
The online web documents contain several unstructured and semi structured data, which is difficult to process in comparison to structured data.
Interconnection of web pages may create difficulties in web mining. While searching for a topic, if you navigate between linked web sites via hyperlinks, you may fall into an infinite loop.
Hidden web resources cannot be located easily and are difficult targets for retrieving information.
Scope to present customized data to individual users is limited.
Web Mining Tasks and Characteristics
The general techniques and algorithms of data mining are also applicable in web mining. Web mining tasks can be decomposed into four subtasks:
Resource searching: denotes the task of retrieving documents from the web.
Information selection: denotes automatic extraction of information from the web documents. Several web mining tools, such as WebMiner, are available to perform this task.
Generalization of patterns: denotes automatic discovery of patterns across multiple web sites.
Analysis of web documents: denotes validation and analysis of the extracted patterns.
Reply
bhawnaAggarwal
Active In SP
**

Posts: 1
Joined: Oct 2011
#13
09-10-2011, 08:01 PM

plz can anyone guide me how can i make a crawler using web mining....plz...
Reply
seminar addict
Super Moderator
******

Posts: 6,592
Joined: Jul 2011
#14
10-10-2011, 09:56 AM



to get information about the topic "WEB MINING", refer to the links below

topicideashow-to-web-mining?page=2

topicideashow-to-web-mining

topicideashow-to-web-mining?pid=57579#pid57579
Reply
seminar addict
Super Moderator
******

Posts: 6,592
Joined: Jul 2011
#15
02-02-2012, 01:23 PM

WEB MINING


.pdf   web.pdf (Size: 177.1 KB / Downloads: 48)

Introduction
It is not exaggerated to say that the World Wide Web has had one of the most
exciting impacts on human society in the last 10 years. It has changed the ways of
doing business, providing and receiving education, managing organizations, etc.
The most direct effect is the complete change in how information is collected,
conveyed, and exchanged. Today, the Web has become the largest information source
available on this planet. The Web is a huge, explosive, diverse, dynamic and mostly
unstructured data repository, which supplies an incredible amount of information
and also raises the complexity of how to deal with that information from different
perspectives - those of users, Web service providers, and business analysts. Users
want effective search tools to find relevant information easily and precisely.

Terminology related to web mining

1.1.1 Web
Nowadays the internet, also called the web, is widely used in many fields. The
advent of e-commerce has revolutionized several business industries such as banking
and insurance. You can access study material and communicate with educational
institutions. Because of the dynamic nature of the web, locating information in any
specific domain can be challenging[10].

1.1.2 Mining
Mining is a technique of information or knowledge discovery from millions of sources
across the Web[10].

1.1.3 Web Mining
Web mining is a specialized application of data mining: a technique to process
information available on the web and search for useful data. Web mining enables
you to discover web pages, text documents, multimedia files, images and other
types of resources on the web. It is widely used in several fields[10].

1.2 Motivation
The World Wide Web is a popular and interactive medium for disseminating
information today. The Web is a huge, diverse, dynamic, widely distributed global
information service center. Users may encounter the following problems when
interacting with the Web:

1.2.1 Finding relevant information
Most people use a search service when they want to find specific information on
the Web. A user usually inputs a simple keyword query, and the result is a list of
pages ranked by their similarity to the query. Today's search tools suffer from low
precision and low recall, mainly because of wrong or incomplete keyword queries;
this leads to the irrelevance of many search results.
Reply

