DATA MINING AND WAREHOUSING
02-04-2010, 03:36 PM
Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Although data mining is a relatively new term, the technology is not.
The objective of this paper is to provide full-fledged information about the process of data mining and the steps involved in it. It also covers supporting techniques such as data cleaning and integration, and the schemas used for effective processing and mining.
12-04-2010, 08:39 PM
Many software projects accumulate a great deal of data, so we really need information about the effective maintenance and retrieval of data from the database. The newest, hottest technology to address these concerns is data mining and data warehousing.
Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behavior that the experts may miss because it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery; specifically, it is the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
29-05-2010, 09:52 PM
DATA MINING AND WAREHOUSING
Data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.
The Foundations of Data Mining:
Data mining takes us beyond data access and navigation to prospective and proactive information delivery. It is supported by technologies such as massive data collection, powerful multiprocessor computers, and data mining algorithms, and as a result it is ready for business deployment. Commercial databases are growing at unprecedented rates.
The Scope of Data Mining:
Its name is derived from searching for valuable business information in a large database, much as a miner searches for gold. Data mining technology can generate new business opportunities by providing these capabilities:
-Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in large databases.
-Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step.
The most commonly used techniques in data mining are:
-Artificial neural networks
-Nearest neighbor method
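As a rough illustration of the nearest neighbor method listed above, here is a minimal sketch in Python. The customer records, the feature choice (age, yearly purchases) and the function name are made up for the example; real tools would also scale the features and handle ties more carefully.

```python
from collections import Counter
import math

def nearest_neighbor_predict(training, query, k=3):
    """Classify `query` by majority vote among its k closest training records."""
    # Sort training records by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, yearly purchases) -> responded to an offer?
customers = [
    ((25, 2), "no"),
    ((30, 3), "no"),
    ((45, 20), "yes"),
    ((50, 25), "yes"),
    ((48, 18), "yes"),
]

prediction = nearest_neighbor_predict(customers, (47, 21))
print(prediction)
```

The idea is simply that a new record is given the class of the historical records it most resembles.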
An Architecture for Data Mining:
Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data. Sybase, Oracle, Redbrick, etc. can be used to implement this warehouse. An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse.
21-04-2011, 09:21 AM
Vikrant M. Kale
Abhijeet V. Thakare
Data Mining - 2.DOC (Size: 69 KB / Downloads: 53)
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analysis offered by data mining moves beyond the analysis of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online.
Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business.
What is Data Mining?
The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the size and number of databases are increasing even faster. The increase in use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data.
Information is at the heart of business operations, and decision makers use stored data to gain valuable insights into the business. Database management systems gave access to the data stored, but this was only a small part of what could be gained from the data. Traditional online transaction processing systems, OLTPs, are good at putting data into databases quickly, safely and efficiently but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. This is where Data Mining or Knowledge Discovery in Databases (KDD) has obvious benefits for any enterprise.
3. Data Mining Background:
Data Mining has drawn on a number of fields such as inductive learning, machine learning, statistics, etc.
Induction is the inference of information from data, and inductive learning is the model building process in which the environment, i.e. the database, is analyzed with a view to finding patterns. Similar objects are grouped in classes and rules are formulated whereby it is possible to predict the class of unseen objects. This process of classification identifies classes such that each class has a unique pattern of values, which forms the class description. The nature of the environment is dynamic, hence the model must be adaptive, i.e. it should be able to learn.
Inductive learning, where the system infers knowledge itself from observing its environment, has two main strategies:
* Supervised learning
* Unsupervised learning
Statistics has a solid theoretical foundation, but the results from statistics can be overwhelming and difficult to interpret, as they require user guidance as to where and how to analyze the data. Data Mining, however, allows the expert’s knowledge of the data and the advanced analysis techniques of the computer to work together. An example of statistical induction is estimating the average rate of failure of machines.
Machine Learning:
Machine learning is the automation of a learning process and learning is tantamount to the construction of rules based on observations of environmental states and transitions. This is a broad field, which includes not only learning from examples, but also reinforcement learning, learning with teacher, etc.
Some of the definitions of Data Mining are:
Data Mining includes a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.
Data Mining is the search for the relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database.
Basically data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer which is responsible for finding the patterns by identifying the underlying rules and features in the data.
4. Stages/Process in Data Mining
The following diagram summarizes some of the stages/processes identified in Data Mining and Knowledge Discovery.
22-04-2011, 09:35 AM
Data Mining & Data Warehousing.DOC (Size: 325.5 KB / Downloads: 78)
A data warehouse is a relational database management system designed specifically to meet the needs of decision support rather than those of transaction processing systems. Data warehousing is a powerful technique that makes it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. Data warehouses contain consolidated data from many sources, with summary information and covering a long time period. Data warehouses ranging in size from several gigabytes to terabytes are common. Data warehousing technology comprises a set of new concepts and tools, which support the knowledge worker (executive, manager and analyst) with information material for decision making. Thus, data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse.
Data Mining or Knowledge Discovery in Databases (KDD) is the nontrivial extraction of implicit, previously unknown, and useful information from data. Data mining can be defined as "a decision support process in which we search for patterns of information in data". Data mining uses sophisticated statistical analysis and modeling techniques to find patterns and relationships hidden in organizational databases. Once found, the information needs to be presented in a suitable form, with graphs, reports etc. Data Mining includes a number of different technical approaches for extraction of information such as clustering, data summarization, learning classification rules, finding dependency networks, analysing changes, and detecting anomalies. Basically data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
2. DATA WAREHOUSING (DWH):
The fundamental reason for building a data warehouse is to improve the quality of information in the organization. Data coming from internal and external sources, existing in a variety of forms from traditional structured data to unstructured data like text files or multimedia, is cleaned and integrated into a single repository. Data warehousing is needed because information systems must be divided into operational and informational systems. Operational systems support the day-to-day conduct of the business, and are optimized for fast response time of predefined transactions, with a focus on update transactions. Operational data is a current and real-time representation of the business state. In contrast, informational systems are used to manage and control the business. They support the analysis of data for decision making about how the enterprise will operate now and in the future. A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database etc. Data warehouses often focus on a specific activity or entity.
2.1 Characteristics of a Data warehouse:
There are generally four characteristics that describe a data warehouse, with derived data often listed as a fifth:
1) Subject-oriented: Data are organized according to subject instead of application e.g. an insurance company using a data warehouse would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.
2) Integrated: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", and in another as 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention, e.g. gender data is transformed to "m" and "f".
3) Time-variant: The data warehouse contains a place for storing data that are 5 to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
4) Non-volatile: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed. Modifications of the warehouse data take place only when modifications of the source data are propagated into the warehouse.
5) Derived Data: A data warehouse usually contains additional data that is not explicitly stored in the operational sources but is derived from operational data through some process. For example, operational sales data could be stored at several aggregation levels (weekly, monthly, quarterly sales) in the warehouse.
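Two of the characteristics above, integration and derived data, can be sketched in a few lines of Python. The application names, the meaning of 0 and 1, and the sales rows are assumptions made for the example only.

```python
from collections import defaultdict

# Hypothetical per-application gender codings, mapped to the warehouse convention.
GENDER_MAPS = {
    "app_a": {"m": "m", "f": "f"},
    "app_b": {0: "m", 1: "f"},  # assumption: 0 means male, 1 means female
}

def integrate_gender(source, value):
    """Translate a source-specific gender code into the warehouse coding."""
    return GENDER_MAPS[source][value]

def monthly_totals(daily_sales):
    """Derive monthly totals from (ISO date string, amount) rows."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[day[:7]] += amount  # the "YYYY-MM" prefix identifies the month
    return dict(totals)

sales = [("2005-11-01", 100.0), ("2005-11-15", 50.0), ("2005-12-02", 75.0)]
derived = monthly_totals(sales)
print(integrate_gender("app_b", 1), derived)
```

The monthly totals are not present in any operational source; they exist only in the warehouse, which is what makes them derived data.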
2.2 Data warehouse systems:
A data warehouse system (DWS) comprises the data warehouse and all components used for building, accessing and maintaining the DWH as shown in Figure 1. The center of a data warehouse system is the data warehouse itself. The data acquisition includes all programs, applications and legacy systems interfaces that are responsible for extracting data from operational sources, preparing and loading it into the warehouse. The access component includes all different applications that make use of the information stored in the warehouse. The typical components of a DWS are as follows:
1) Pre-Data Warehouse
2) Data Acquisition
3) Data Repositories
4) Front End Analytics
1) Pre-Data Warehouse: The pre-Data Warehouse zone provides the data for data warehousing. OLTP databases are where operational data are stored. OLTPs are designed for transaction speed and accuracy. An organization's daily operations access and modify operational databases. Data from these operational databases and any other external data sources are extracted by using interfaces such as JDBC. The Metadata Repository keeps track of the data currently stored in the DWH. Metadata ensures the accuracy of data entering the DWH, and that data has the right format and relevancy. Metadata is "data about data, or data describing the meaning of data".
2) Data Acquisition: Data acquisition is achieved by using the following five steps:
a) Extract: Data is extracted from operational databases and external sources by using interfaces such as JDBC.
b) Clean: Data is cleaned to minimize errors, fill in missing information, and remove low-level transaction information, which slows down query times.
c) Transform: The data is transformed to correct values and reconcile differences between multiple sources that arise from the use of homonyms, synonyms or different units of measurement.
d) Load: The cleaned & transformed data is finally loaded into the warehouse. Additional preprocessing such as sorting and generation of summary information is carried out at this stage. Data is partitioned and indexes are built for efficiency. Due to large volume of data, loading is a slow process.
e) Refresh: Data in the data warehouse is periodically refreshed to reflect updates to the data sources.
3) Data Repositories: The Data Warehouse repository is the database that stores active data of business value for an organization. There are variants of Data Warehouses: Data Marts and ODS. Data Marts are smaller Data Warehouses built on a departmental rather than on a company-wide level. Instead of running ad hoc queries against a huge data warehouse, data marts allow the efficient execution of predicted queries over a significantly smaller database. A Data Warehouse collects data and is the repository for historical data, so it is not always efficient for providing up-to-date analysis. For that purpose Operational Data Stores (ODS) are used. ODS hold recent data before migration to the Data Warehouse.
4) Front End Analytics: The front-end analytics portion of the Data Warehouse is used by different users to interact with data stored in the repositories. Data Mining is the discovery of useful patterns in data and is used for predictive analysis and classification. OLAP, Online Analytical Processing, is used to analyze historical data and slice the business information required. Reporting tools are used to provide reports on the data. Data are displayed to show relevancy to the business and keep track of key performance indicators. Data visualization tools are used to display data from the data repository. Data visualization is combined with Data Mining and OLAP tools, and shows relevancy and patterns.
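The five data acquisition steps described under 2) above (extract, clean, transform, load, refresh) can be sketched roughly as follows. This is only an illustration: the sources are in-memory lists instead of JDBC connections, and the field names, the cents-to-dollars rule and the dictionary "warehouse" are all invented for the example.

```python
def extract(sources):
    """Extract: gather raw rows from every source (here, in-memory lists)."""
    return [row for rows in sources for row in rows]

def clean(rows):
    """Clean: drop rows with no customer id and fill a missing amount with 0.0."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:
            continue
        cleaned.append({**row, "amount": row.get("amount") or 0.0})
    return cleaned

def transform(rows):
    """Transform: reconcile units by converting amounts given in cents to dollars."""
    out = []
    for row in rows:
        amount = row["amount"] / 100 if row.get("unit") == "cents" else row["amount"]
        out.append({"customer_id": row["customer_id"], "amount": amount})
    return out

def load(warehouse, rows):
    """Load: append rows and rebuild the summary (total per customer)."""
    warehouse["facts"].extend(rows)
    summary = {}
    for row in warehouse["facts"]:
        summary[row["customer_id"]] = summary.get(row["customer_id"], 0.0) + row["amount"]
    warehouse["summary"] = summary

def refresh(warehouse, new_sources):
    """Refresh: run the same pipeline on newly arrived source rows."""
    load(warehouse, transform(clean(extract(new_sources))))

warehouse = {"facts": [], "summary": {}}
source_a = [{"customer_id": 1, "amount": 20.0, "unit": "dollars"},
            {"customer_id": None, "amount": 5.0, "unit": "dollars"}]
source_b = [{"customer_id": 1, "amount": 1500, "unit": "cents"},
            {"customer_id": 2, "amount": None, "unit": "dollars"}]
refresh(warehouse, [source_a, source_b])
print(warehouse["summary"])
```

A real load step would also partition the data and build indexes, as noted in step d); the summary dictionary here only stands in for the generated summary information.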
2.3 Stages in Implementation: A DW implementation requires the integration of many products. Following are the steps of implementation:
Step1: Collect and analyze the business requirements.
Step2: Create a data model and physical design for the DW.
Step3: Define the Data sources.
Step4: Choose the DBMS and software platform for DW.
Step5: Extract the data from the operational data sources, transform it, clean it & load it into the DW model or data mart.
Step6: Choose the database access and reporting tools.
Step7: Choose the database connectivity software.
Step8: Choose the data analysis and presentation software.
Step9: Keep refreshing the data warehouse periodically.
07-05-2011, 11:48 AM
data warehousepapers-1.doc (Size: 271 KB / Downloads: 40)
The words “DATA WAREHOUSE & DATA MINING” seem interesting because in today’s world the competitive edge is coming less from optimization and more from the use of the information that these systems have been collecting over the years.
Data warehousing has quickly evolved into a unique and popular business application class. Early builders of data warehouses already consider their systems to be key components of their IT strategy and architecture.
In reviewing the development of data warehousing, we need to begin with a review of what had been done with the data before the evolution of data warehouses.
In this paper, primary emphasis is first given to the different types of data warehouses, the architecture of data warehouses and the analysis processes of data warehouses. The next section discusses ways to build data warehouses, along with the requirements they need. Another key topic, ETL tools, is also covered.
No discussion of data warehousing systems is complete without a review of “DATA MINING”. This section explores its processes and working, along with the different approaches and components that are commonly found in Data Mining.
Further evolution of the hardware and software technology will also continue to greatly influence the capabilities that are built into data warehouses. Data warehousing systems have become a key component of information technology architecture. A flexible enterprise data warehouse strategy can yield significant benefits for a long period.
The main ideas of data warehouses and data mining are presented in this paper with the help of several diagrams and examples. As a concluding point, it is shown how “DATA WAREHOUSES & DATA MINING” can be used in the near future.
"A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process".
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. It usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Data warehouses can be classified into three types:
Enterprise data warehouse: An enterprise data warehouse provides a central database for decision support throughout the enterprise.
Operational data store (ODS): This has a broad enterprise wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
Data Mart: Data mart is a subset of data warehouse and it supports a particular region, business unit or business function.
Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. This is most useful for users to access data since a database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
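To make the cube picture a little more concrete, here is a small sketch of slicing and aggregating a fact table in Python. The fact rows, the dimension names and the helper functions are hypothetical; an OLAP engine would do this over far larger tables with precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical fact table: each row holds dimension keys plus one measure.
facts = [
    {"product": "loan",    "region": "EU", "year": 2004, "amount": 100},
    {"product": "loan",    "region": "US", "year": 2005, "amount": 150},
    {"product": "savings", "region": "EU", "year": 2005, "amount": 200},
    {"product": "savings", "region": "US", "year": 2005, "amount": 50},
]

def slice_cube(rows, **fixed):
    """Slice: keep only the rows where each named dimension has the given value."""
    return [r for r in rows if all(r[dim] == val for dim, val in fixed.items())]

def roll_up(rows, dimension, measure="amount"):
    """Aggregate the measure along one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dimension]] += r[measure]
    return dict(totals)

by_region_2005 = roll_up(slice_cube(facts, year=2005), "region")
by_product = roll_up(facts, "product")
print(by_region_2005, by_product)
```

Fixing one dimension (year) gives a slice of the cube; summing along another (region or product) gives the totals a user would see in a report.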
A data mart is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.
In order to store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built. Therefore, source systems will be different in naming Conventions, variable measurements, encoding structures, and physical attributes of data.
Consider a bank that has several branches in several countries and millions of customers, and whose lines of business are savings and loans. The following example explains how the data is integrated from source systems to the target.
EXAMPLE OF SOURCE DATA:
Source System     Attribute Name              Column Name                 Data type      Values
Source System 1   Customer Application Date   CUSTOMER_APPLICATION_DATE   NUMERIC(8,0)   11012005
Source System 2   Customer Application Date   CUST_APPLICATION_DATE       DATE           11012005
Source System 3   Application Date            APPLICATION_DATE            DATE           01NOV2005
In the aforementioned example, attribute name, column name, data type and values are entirely different from one source system to another. This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.
EXAMPLE OF TARGET DATA (DATA WAREHOUSE):
Target System   Attribute Name              Column Name                 Data type   Values
Record #1       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005
Record #2       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005
Record #3       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005
In the above example of target data, attribute names, column names, and data types are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
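The bank example above could be handled by a small conversion routine like the sketch below. The source formats are an assumption: the tables do not say so explicitly, but reading 11012005 as MMDDYYYY and the target 01112005 as DDMMYYYY makes all three source values the same day (1 November 2005). The source keys and the function name are made up.

```python
from datetime import date, datetime

# Assumed source formats; the example tables do not state them explicitly.
SOURCE_FORMATS = {
    "source_1": "%m%d%Y",  # NUMERIC(8,0) such as 11012005, read as MMDDYYYY
    "source_2": "%m%d%Y",  # DATE shown as 11012005
    "source_3": "%d%b%Y",  # DATE shown as 01NOV2005 (English month names)
}

def to_warehouse_date(source, raw):
    """Parse a source value into one date type for CUSTOMER_APPLICATION_DATE."""
    text = str(raw)
    if source != "source_3":
        text = text.zfill(8)  # a numeric column drops leading zeros
    return datetime.strptime(text, SOURCE_FORMATS[source]).date()

records = [
    to_warehouse_date("source_1", 11012005),
    to_warehouse_date("source_2", "11012005"),
    to_warehouse_date("source_3", "01NOV2005"),
]
print([d.strftime("%d%m%Y") for d in records])
```

Once every source value is parsed into the same date type, the warehouse can display it in whatever single format it standardizes on.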
The primary concept of data warehousing is that the data stored for business analysis can most effectively be accessed by separating it from the data in the operational systems. Many of the reasons for this separation have evolved over the years. In the past, legacy systems archived data onto tapes as it became inactive and many analysis reports ran from these tapes or mirror data sources to minimize the performance impact on the operational systems. Data warehousing systems are most successful when data can be combined from more than one operational system. When the data needs to be brought together from more than one source application, it is natural that this integration be done at a place independent of the source applications. Before the evolution of structured data warehouses, analysts in many instances would combine data extracted from more than one operational system into a single spreadsheet or a database.
The data warehouse model needs to be extensible and structured such that the data from different applications can be added as a business case can be made for the data.
The architecture of the data warehouse and the data warehouse model greatly impact the success of the project.
Fig:1 ARCHITECTURE OF DATA WAREHOUSE:
The data warehouse architecture has several stages. The first stage in this architecture is:
Business modeling: Many organizations justify building data warehouses as an “act of faith”. This stage is necessary to identify the projected business benefits that should be derived from using the data warehouse.
Data Modeling: The second stage develops data models for the source systems and dimensional data models for the data warehouse.
In the third stage, several data source systems are brought together.
In the fourth stage, the ETL process stage, the ETL process is developed and data is extracted, transformed, and loaded.
The fifth stage is the target data, which takes the form of several data marts.
The last stage, called the “Business Intelligence stage”, generates business reports.
Data marts are generally described as subsets of the data warehouse. For example, for an organization selling products throughout the world, the four major dimensions are product, location, time and organization.
Interface with other data warehouses:
The data warehouse system is likely to be interfaced with other applications that use it as the source of operational system data. A data warehouse may feed data to other data warehouses or smaller data warehouses called data marts.
The operational system interfaces with the data warehouse often become increasingly stable and powerful. As the data warehouse becomes a reliable source of data that has been consistently moved from the operational systems, many downstream applications find that a single interface with the data warehouse is much easier and more functional than multiple interfaces with the operational applications. The data warehouse can be a better single and consistent source for many kinds of data than the operational systems. It is, however, important to remember that much of the operational state information is not carried over to the data warehouse. Thus, the data warehouse cannot be the source for all operational system interfaces.
Fig: 2 The Analysis processes of Data warehouse
Figure 2 illustrates the analysis processes that run against a data warehouse. Although a majority of the activity against today’s data warehouses is simple reporting and analysis, the sophistication of analysis at the high end continues to increase rapidly. Of course, analysis run against a data warehouse is simpler and cheaper than analysis run through the old methods. This simplicity continues to be a main attraction of data warehousing systems.
Four ways to build a data warehouse:
Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among the four major architectures: “top-down, bottom-up, hybrid and federated.” As a result, some firms fail to adopt a clear vision of how their data warehousing environments can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach, undermining their ability to respond to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create an environment that meets their needs.
Top-down vs. bottom-up:
The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field.
Inmon, who is credited with coining the term "data warehousing" in the early 1990s, advocates a top-down approach in which organizations first build a data warehouse followed by data marts.
The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is “summarized”, "dimensionalized" and “distributed” to one or more "dependent" data marts. These data marts are "dependent" because they derive all their data from the centralized data warehouse. A data warehouse surrounded by dependent data marts is often called a "hub-and-spoke" architecture.
Kimball, on the other hand, advocates a bottom-up approach because it starts and ends with data marts, negating the need for a physical data warehouse altogether. Without a data warehouse, the data marts in a bottom-up approach contain all the data -- both atomic and summary -- users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart (whether it is logically or physically deployed) builds on the next, reusing dimensions and facts so that users can query across data marts, if they want to, to obtain a single version of the truth.
smart paper boy
30-07-2011, 11:08 AM
ASHISH SAMWATSAR.pptx (Size: 157.12 KB / Downloads: 40)
DATA WAREHOUSING
It is a very important part of our life. From the time we start life, we have lots of data in our mind.
But we do not get it into an exact position in our mind.
When we do work systematically, it is called programming or planning.
When we wake up in the morning our mind is empty, just like an empty pen drive.
We are the only movable human beings, and there is a very vast database in our body: acids, bases, and so many liquids are present.
It is a thoughtful thing.
It is an integrated form.
smart paper boy
10-08-2011, 12:59 PM
DATA MINING AND WAREHOUSING.DOC (Size: 627 KB / Downloads: 35)
Data mining has become a popular buzzword but, in fact, promises to revolutionize commercial and scientific exploration. Databases range from millions to trillions of bytes of data. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve.
A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. It usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. Data in the warehouse can be seen as materialized views generated from the underlying multiple data sources. Materialized views are used to speed up query processing on large amounts of data. These views need to be maintained in response to updates in the source data. This is often done using incremental techniques that access data from underlying sources. In the data-warehousing scenario, accessing base relations can be difficult; sometimes data sources may be unavailable, since these relations are distributed across different sources.
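The idea of maintaining a materialized view incrementally can be sketched as below. This is a toy illustration under simple assumptions: the view is a per-region sales total, and each source update arrives as an inserted or deleted row, so the view is adjusted from the change alone without re-reading the base relations. The class and method names are invented.

```python
class SalesByRegionView:
    """A materialized aggregate kept in step with inserts and deletes at the source."""

    def __init__(self):
        self.totals = {}  # region -> [sum of amounts, number of rows]

    def apply(self, region, amount, sign=+1):
        """Apply one source change: sign=+1 for an insert, sign=-1 for a delete."""
        total = self.totals.setdefault(region, [0, 0])
        total[0] += sign * amount
        total[1] += sign
        if total[1] == 0:  # no source rows left for this region
            del self.totals[region]

    def query(self, region):
        """Answer from the stored view, never from the base data."""
        return self.totals.get(region, [0, 0])[0]

view = SalesByRegionView()
view.apply("EU", 100)
view.apply("EU", 40)
view.apply("US", 70)
view.apply("EU", 40, sign=-1)  # the 40 sale is deleted at the source
print(view.query("EU"), view.query("US"))
```

Keeping a row count next to the sum is what lets the view drop a region once its last source row has been deleted.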
This paper provides an introduction to the basic technologies of data mining, as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end-users.
In my view, data mining and warehousing can be introduced as a new information source that serves people, as follows.
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.
Data mining architecture:
How does data mining work?
Classes: Stored data is used to locate data in predetermined groups.
Clusters: Data items are grouped according to logical relationships or consumer preferences.
Associations: Data can be mined to identify associations.
Sequential patterns: Data is mined to anticipate behavior patterns and trends.
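For the associations item above, the two usual measures are support (how often items occur together) and confidence (how often the consequent appears when the antecedent does). A minimal sketch, with made-up shopping baskets:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction also containing `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))
print(confidence(baskets, {"bread"}, {"butter"}))
```

Note that `confidence` divides by the antecedent's support, so it raises an error if the antecedent never occurs; a real tool would filter out such rules first.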
The Data Mining Process:
Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
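To illustrate the genetic algorithm description above, here is a small sketch that evolves bit strings. Everything in it is an assumption chosen for brevity: the fitness function (count of 1 bits), the population size, one-point crossover, and a 20% mutation chance. The random generator is seeded so runs are repeatable.

```python
import random

def evolve(fitness, length=12, population_size=20, generations=40, seed=0):
    """Evolve bit strings toward higher fitness and return the best one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Natural selection: the fitter half survives and becomes the parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)       # genetic combination (one-point crossover)
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.2:               # mutation: flip one random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve(fitness=sum)
print(best, sum(best))
```

Because the parents are carried over unchanged each generation, the best fitness in the population can never get worse from one generation to the next.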
What technological infrastructure is required?
Size of the database: The more data being processed and maintained, the more powerful the system required.
Query complexity: The more complex the queries and the greater the number of queries being processed, the more powerful the system required.
Data Mining Infrastructure:
Ability to access data from many sources and consolidate it
Ability to score customers based on existing models