Application of Rough Sets in Data Mining
seminar project explorer
14-02-2011, 05:29 PM


Application of Rough Sets in Data Mining
A Project Report
Submitted in partial fulfilment of
the requirements for the award of the degree of
Master of Technology
in
Computer Science and Engineering
by
Abdul Nassar . A.A.
M105101
Department of Computer Science & Engineering
College of Engineering Trivandrum
Kerala - 695016
2010-11

Contents
1 Introduction
1.1 Data Mining
1.1.1 Clustering
1.1.2 Association Rules
1.1.3 Problem Statement
1.2 Objective
1.3 Advantages of RST Approach in Clustering
2 An Algorithm For Clustering Using Similarity - Measure In RST
2.0.1 Functional Requirements
3 Data Flow Diagram
4 Analysis of the Algorithm
5 Conclusion
Abstract
Data mining is the technique of extracting meaningful information from large and mostly
unorganized data banks.
Rough set theory is a mathematical approach proposed by Z. Pawlak in the early 1980s. It deals
with the classificatory analysis of information systems. The basic concepts of rough set theory
discussed in this project include the indiscernibility relation, reduct, core, upper approximation
and lower approximation.
Clustering is a major task in data mining. Clustering creates new groups and classes based
on the study of patterns and relationships between values of data in a data bank. Rough set
based indiscernibility relations can be used for clustering by measuring the similarity among
the data items. In the proposed approach the strict notion of indiscernibility is relaxed and
classes are formed on the basis that objects are similar rather than identical.



1 Introduction
Organizations worldwide generate large amounts of data, most of it unorganized. This
unorganized data must be processed to yield meaningful and useful information. To
organize large amounts of data, you use database management systems such as Oracle and
SQL Server, which require you to use SQL, a specialized query language, to retrieve data
from a database. However, SQL is not always adequate to meet end user requirements for
specialized and sophisticated information from a large unorganized data bank.
1.1 Data Mining
Data mining is the technique of extracting meaningful information from large and mostly
unorganized data banks. It is the process of performing automated extraction and generating
predictive information from large data banks.
The extraction of meaningful information from a large data bank is otherwise known as knowledge
discovery. One school of thought considers data mining to be a step in the process of knowledge
discovery in databases (KDD), while another school of thought considers data mining to be
synonymous with KDD.
Data mining makes use of various algorithms to perform a variety of tasks. These algorithms
examine the sample data of a problem and determine a model that comes close to solving the
problem. These models are classified as predictive and descriptive models. A predictive model
enables you to predict the values of data by making use of known results from a different set
of sample data. The data mining tasks that form part of the predictive model are:
1. Classification
2. Regression
3. Time series analysis
A descriptive model enables you to determine the patterns and relationships in sample data.
The data mining tasks that form part of the descriptive model are:
1. Clustering
2. Summarization
3. Association Rules
4. Sequence discovery.
1.1.1 Clustering
The use of clustering enables you to create new groups and classes based on the study of
patterns and relationship between values of data in a data bank.
1.1.2 Association Rules
The use of association rules enables you to establish association and relationships between large
and unclassified data items based on certain attributes and characteristics. Association rules
define certain rules of associativity between data items and then use those rules to establish
relationships.
1.1.3 Problem Statement
The problem is to determine how the concepts of Rough Set Theory (RST), namely indiscernibility,
reduct and core, can be used in the data mining area of clustering.
A rough set is a formal approximation of a crisp set in terms of a pair of sets, which give the
lower and upper approximation of the original set. Rough set theory is an emerging soft computing
tool with a wide range of applications, including problems in machine learning.
Data mining is one of the areas in which rough sets are widely used. Data mining is the process
of automatically searching large volumes of data for patterns using tools such as classification,
association rule mining and clustering. Rough set theory is a well-understood formal
framework for building data mining models in the form of logic rules, on the basis of which it
is possible to issue predictions that allow classifying new cases. The indiscernibility relation of RST
can be used as a measure of similarity, without any distance function, for clustering objects.
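The lower and upper approximations mentioned above can be illustrated with a minimal Python sketch. The table, attribute names and target set are hypothetical toy data, not taken from the report; the functions follow the standard rough set definitions (an equivalence class goes into the lower approximation if it lies entirely inside the target set, and into the upper approximation if it overlaps the target set at all).

```python
from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Group object ids whose values agree on every listed attribute."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())

def approximations(table, attributes, target):
    """Return the (lower, upper) approximation of a target set of object ids."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(table, attributes):
        if cls <= target:      # class lies entirely inside the target set
            lower |= cls
        if cls & target:       # class overlaps the target set
            upper |= cls
    return lower, upper

# Hypothetical toy information system: o1 and o2 are indiscernible.
table = {
    "o1": {"headache": "yes", "temp": "high"},
    "o2": {"headache": "yes", "temp": "high"},
    "o3": {"headache": "no", "temp": "high"},
    "o4": {"headache": "no", "temp": "normal"},
}
target = {"o1", "o3"}
lower, upper = approximations(table, ["headache", "temp"], target)
```

Here o1 cannot be told apart from o2, which lies outside the target set, so o1 belongs only to the upper approximation, while o3 is certain and belongs to both.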
1.2 Objective
The objective is to develop innovative algorithms and approaches for clustering and rule mining
by applying the concepts of Rough Set Theory. The project mainly concentrates on the application
of rough sets to clustering in data mining. The project is divided into two phases:
1. In the first phase of the project, the indiscernibility relation of RST is used for the
generation of clusters and an algorithm is developed for clustering of data.
2. In the second phase, the algorithm developed has to be implemented and tested on a
variety of databases of different sizes and for different applications.
1.3 Advantages of RST Approach in Clustering
1. Cluster formation is natural and easy.
2. The RST approach provides definitions and methods for finding which attribute separates one
classification from another.
3. It uses only internal information for the formation of clusters.
4. It relies on attribute reduction.
5. This approach handles uncertainty in clustering process.
6. It is rather easy to implement and can handle any volume of data.
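Attribute reduction, mentioned in the list above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's method: it treats a reduct as a minimal attribute subset that induces the same indiscernibility partition as the full attribute set, finds reducts by exhaustive search (exponential in the number of attributes, so only suitable for small examples), and takes the core as the intersection of all reducts. The data rows and attribute names are hypothetical.

```python
from itertools import combinations

def partition(rows, attributes):
    """Partition row indices by their values on the given attributes."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attributes), []).append(i)
    return sorted(groups.values())

def reducts(rows, attributes):
    """Minimal attribute subsets that preserve the full partition."""
    full = partition(rows, attributes)
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            # Skip supersets of a reduct already found: they are not minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if partition(rows, subset) == full:
                found.append(subset)
    return found

def core(rows, attributes):
    """Attributes common to every reduct."""
    return set.intersection(*(set(r) for r in reducts(rows, attributes)))

# Hypothetical data: attributes b and c always carry the same information.
rows = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
]
found = reducts(rows, ["a", "b", "c"])
```

In this toy table either b or c can be dropped without losing the ability to tell objects apart, but a cannot, so a is the only core attribute.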
2 An Algorithm For Clustering Using Similarity - Measure In RST
The algorithm has two requirements. The first is to group identical objects together
to form base clusters: all objects that belong to the same base cluster have identical
attribute values. This forms the first functional requirement of our algorithm. The process
is to identify and club objects having the same attribute values, which in turn forms the
base clusters.
For the second requirement, the strict notion of indiscernibility is relaxed. With
r-value ’n’, objects that differ in at most ’n’ attribute values are clubbed together
to form a cluster. The process starts from the base clusters, in which identical objects
are clubbed together. These base clusters are compared with each other and clubbed to
form new clusters when their objects differ in at most ’n’ attribute values.
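The similarity measure described above can be sketched in Python as a count of differing attribute values. The function name `difference` mirrors the pseudocode later in the report; representing objects as equal-length tuples and the helper `similar` are assumptions made for illustration.

```python
def difference(obj_a, obj_b):
    """Number of attribute positions at which two objects disagree."""
    if len(obj_a) != len(obj_b):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for x, y in zip(obj_a, obj_b) if x != y)

def similar(obj_a, obj_b, r):
    """True when the objects differ in at most r attribute values."""
    return difference(obj_a, obj_b) <= r

# Hypothetical objects described by (colour, size, shape).
first = ("red", "small", "round")
second = ("red", "large", "round")
```

With r = 0 the test reduces to strict indiscernibility, so `first` and `second` are kept apart; with r = 1 they are treated as similar and may be clubbed together.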
2.0.1 Functional Requirements
Requirement R1 - Generate Clusters with r = 0
1. Input: a database that contains data records.
2. Output: a database that contains groups of data records with the same attribute values.
Process
1. Identify data records with the same attribute values (r = 0) and store them together as a
group of identical records
2. Continue the above process to generate all such groups
3. Identify the remaining distinct records and make each one a separate group
4. Store all the data groups.
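Requirement R1 can be sketched in Python by using each record's attribute values as a dictionary key, so identical records fall into the same group in a single pass. This is one possible realisation, not the report's implementation; the function name `base_clusters` and the sample records are hypothetical.

```python
def base_clusters(records):
    """Group identical records (r = 0).

    Returns a dict mapping each distinct record to the list of positions
    at which it occurs, in first-seen order.
    """
    groups = {}
    for index, record in enumerate(records):
        groups.setdefault(tuple(record), []).append(index)
    return groups

# Hypothetical records with two attributes each.
records = [("a", "x"), ("b", "y"), ("a", "x"), ("c", "z"), ("b", "y")]
clusters = base_clusters(records)
```

A record that occurs only once, such as ("c", "z") here, ends up as a group of its own, which covers step 3 of the process without a separate pass.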
Requirement R2 - Generate Clusters with r = k
1. Input: a data file that contains the groups of data records with r = 0
2. Output: a database that contains groups of data records with r = k
Process
1. From the database with r = 0, generate groups of data records by clubbing the groups
whose records differ in at most ’k’ attribute values.
2. Repeat the process until the minimum number of clusters is formed.
3. Store all the data groups.
A Data Flow Diagram has been developed for the above said process. There are five pro-
cesses in the DFD, each of which can be refined further.
3 Data Flow Diagram
Figure 5: Data Flow Diagram.
An algorithm is proposed for clustering data based on the above approach. The algorithm
is very simple. When this algorithm is implemented and tested with various databases of small
and medium size, we expect to get encouraging results.
Algorithm - Basic Steps
1. Classify the objects with the same attribute values ( indiscernibility with r value = 0 )
to form base clusters. Form all such base clusters.
2. From the clusters thus formed, identify and club groups with indiscernibility r value = k
between them to form new groups.
3. Repeat step 2, such that maximum groups can be clubbed together thus attaining mini-
mum number of clusters with r = k.
Procedure BaseClusters( object[] );
Declare baseclust[size];
Begin
    K := 0;   { number of base clusters formed so far }
    For I := 1 to totalobjects do
    Begin
        If ( object[I] is not in baseclust ) then
        Begin
            { object[I] starts a new base cluster }
            K := K + 1;
            Addtobasecluster( baseclust[K], object[I] );
            { club every later object that is identical to object[I] }
            For J := I + 1 to totalobjects do
                If ( difference( object[I], object[J] ) == 0 ) then
                    Addtobasecluster( baseclust[K], object[J] );
        End;
    End;
End; { procedure ends }
The above pseudo code generates base clusters, that is, clusters with indiscernibility r value 0.
The pseudo code given below generates clusters with indiscernibility r value ’n’.
Procedure ClusterwithDifferN( baseclust[], n )
Declare clustN[size];
Begin
    Repeat
        Clear clustN[];
        K := 0;
        For I := 1 to last record do
        Begin
            If ( baseclust[I] is not in clustN[] ) then
            Begin
                { baseclust[I] starts a new group }
                K := K + 1;
                Addtoclustn( clustN[K], baseclust[I] );
                For J := I + 1 to last record do
                    If ( baseclust[J] is not in clustN[] ) and
                       ( difference( baseclust[I].object, baseclust[J].object ) <= n ) then
                        Addtoclustn( clustN[K], baseclust[J] );
            End;
        End;
        { the groups just formed are the input to the next pass }
        baseclust := clustN;
    Until no two groups in clustN have difference <= n.
End;
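The clubbing of base clusters with r value ’n’ could be sketched in Python as below. This is a hedged, greedy interpretation rather than the report's definitive procedure: each base cluster is assumed to be represented by one tuple of attribute values, a representative that is not within ’n’ of any earlier group leader starts a new group, and otherwise it joins the first such group. The names `cluster_with_differ_n` and `base_reps` are hypothetical.

```python
def difference(obj_a, obj_b):
    """Number of attribute positions at which two objects disagree."""
    return sum(1 for x, y in zip(obj_a, obj_b) if x != y)

def cluster_with_differ_n(base_reps, n):
    """Club base-cluster representatives that differ from a group leader
    in at most n attribute values. The first member of each group is its leader."""
    clusters = []
    for rep in base_reps:
        for cluster in clusters:
            if difference(cluster[0], rep) <= n:
                cluster.append(rep)   # join the earliest matching group
                break
        else:
            clusters.append([rep])    # no leader is close enough: new group
    return clusters

# Hypothetical base-cluster representatives with three attributes each.
base_reps = [
    ("a", "x", "1"),
    ("a", "x", "2"),
    ("b", "y", "1"),
    ("a", "y", "1"),
    ("c", "z", "3"),
]
result = cluster_with_differ_n(base_reps, 1)
```

Two caveats follow from this greedy choice. The outcome depends on input order: ("a", "y", "1") is within one attribute of both the first and the third representative and simply joins the earlier group. Members are only guaranteed to be within ’n’ of their leader, not of each other.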
4 Analysis of the Algorithm
The above algorithm for the generation of clusters is quite easy to implement. The RST
approach provides definitions and methods for finding which attribute separates one classifica-
tion from another and hence cluster formation is easy and natural. It uses only the internal
information to form clusters.
Even though the algorithm can handle small and medium sized databases effectively, there
may be some restrictions in the case of large databases due to system limitations
(available memory). In such cases, the following modification is suggested.
In the case of large databases, files may be used for the storage of intermediate clusters. A
part of the data structure/database is loaded into memory, processed to generate clusters and
stored to a file. The next part is then loaded and processed to generate clusters, which are
added to the file, and so on for the whole database. The file created may then be refined again until no more
refinem
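The chunked modification suggested above could look like the following Python sketch for base clusters (r = 0). It is an illustration under assumptions, not the report's design: records are read in fixed-size chunks, each chunk's identical records are counted and appended to a spill file in CSV form, and a final pass over the spill file merges groups that were split across chunks. The merge step assumes the distinct records fit in memory. The names `chunks` and `base_cluster_counts` are hypothetical, and an in-memory `io.StringIO` stands in for a real file.

```python
import csv
import io
from collections import Counter
from itertools import islice

def chunks(iterator, size):
    """Yield successive lists of at most `size` items from an iterator."""
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block

def base_cluster_counts(rows, chunk_size, spill):
    """Write per-chunk base clusters to `spill`, then refine them by
    merging identical records that landed in different chunks."""
    writer = csv.writer(spill)
    for block in chunks(iter(rows), chunk_size):
        for record, count in Counter(map(tuple, block)).items():
            writer.writerow([count, *record])
    spill.seek(0)
    merged = Counter()
    for count, *record in csv.reader(spill):
        merged[tuple(record)] += int(count)
    return merged

# Hypothetical records; a real run would pass an open file instead of StringIO.
rows = [("a", "x"), ("b", "y"), ("a", "x"), ("a", "x"), ("b", "y")]
merged = base_cluster_counts(rows, chunk_size=2, spill=io.StringIO(newline=""))
```

Only one chunk of raw records is held in memory at a time; the spill file carries the intermediate clusters between the chunk pass and the refinement pass.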
5 Conclusion
Rough Set Theory can be used for data mining applications like clustering and rule
generation. In clustering, the indiscernibility relation of rough set theory is utilized for the
generation of clusters. Using this concept, clusters are generated without making use of any
additional information such as a probability distribution or a membership function as in fuzzy
set theory. An algorithm is developed based on this concept and can be implemented easily.
Encouraging results are expected when the algorithm is tested on a variety of databases.
The rough set concepts of reduct, core, lower approximation and upper approximation are
used for rule mining. Important rules can be generated by considering the lower approximation
of the target set. By considering the generated rules as attributes and by constructing a
new decision table, a reduct rule set can be generated. The reduct rules thus generated are
the more important ones, and the set does not contain any rules with low rule importance.