Application of Rough Sets in Data Mining
seminar project explorer Active In SP Posts: 231 Joined: Feb 2011 
14022011, 05:29 PM
Application of Rough Sets in Data Mining
A Project Report Submitted in partial fulfilment of the requirements for the award of the degree of Master of Technology in Computer Science and Engineering by Abdul Nassar . A.A. M105101 Department of Computer Science & Engineering College of Engineering Trivandrum Kerala  695016 201011 Contents 1 Introduction 5 1.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Advantages of RST Approach in Clustering . . . . . . . . . . . . . . . . . . . . 7 2 An Algorithm For Clustering Using Similarity  Measure In RST 8 2.0.1 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Data Flow Diagram 9 4 Analysis of the Algorithm 12 5 Conclusion 13 3 Abstract Data mining is the technique of extracting meaningful information from large and mostly unorganized data banks. Rough set is a mathematical approach proposed by Z. Pawlak in the early 1980’s. It deals with classificatory analysis of information systems. The basic concepts of rough set theory discussed in this project and implimentation include indiscernibility relation, reduct, core, upper approximation and lower approximation. Clustering is a major task in Data mining. The use of clustering enables you to create new groups and classes based on the study of patterns and relationship between values of data in a data bank. Rough Set based Indiscernibility relations can be used for clustering by measuring the similarity among the data items. In the proposed approach the strict notion of indiscernibility is relaxed and classes are formed on the basis that objects are similar rather than identical. Application of Rough Sets in Data Mining.pdf (Size: 153.16 KB / Downloads: 89) 1 Introduction Organizations worldwide generate large amount of data, mostly unorganized. This unorga nized data requires processing to be done to generate meaningful and useful information. In order to organize large amount of data, you implement the concept of database management systems such as Oracle and SQL Server, which require you to use SQL, a specialized query language to retrieve data from a database. However, the use of SQL is not always adequate to meet the end user requirements of specialized and sophisticated information from an unorga nized large data bank. 1.1 Data Mining Data mining is technique of extracting meaningful information from large and mostly un organized data banks. It is the process of performing automated extraction and generating predictive information from large data banks. The extraction of meaningful information from a large bank is otherwise known as Knowledge discovery. One school of thought considers data mining as a step in the process of knowledge discovery in databases or KDD while other school of thought considers data mining considers synonym to KDD. Data mining makes use of various algorithms to perform a variety of tasks. These algorithms examine the sample data of a problem and determine a model that fits close to solving the problem. These models are classified as predictive and descriptive models.A predictive model enables you to predict the values of data by making use of known results from a different set of sample data. The data mining tasks that forms the part of predictive model are: 1. Classification 2. Regression 3. Time series analysis A descriptive model enables you to determine the patterns and relationships in a sample data. The data mining tasks that forms the part of descriptive model are: 1. Clustering 2. Summarization 3. Association Rules 4. Sequence discovery. 1.1.1 Clustering The use of clustering enables you to create new groups and classes based on the study of patterns and relationship between values of data in a data bank. 5 1.1.2 Association Rules The use of association rules enables you to establish association and relationships between large and unclassified data items based on certain attributes and characteristics. Association rules define certain rules of associativity between data items and then use those rules to establish relationships. 1.1.3 Problem Statement How the concepts of Rough Set Theory  Indiscernibility , Reduct and Core can be used in data mining area clustering. A rough set is a formal approximation of crisp set in terms of a pair of sets, which give the lower and upper approximation of the original set. Rough set is an emerging soft computing tool with wide range of applications, which includes problems in Machine Learning. Data mining is one of the areas in which rough sets are widely used. Data mining is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering etc. The rough set theory is a wellunderstood format framework for building data mining models in the form of logic rules, on the basis of which it is possible to issue predictions that allow classifying new cases. Indiscernibility relation of RST can be used as a measure of similarity without any distance function for clustering the object. 1.2 Objective By applying the concept of Rough Set Theory, develop/propose innovative algorithms/approaches in clustering/rule mining. The project and implimentation mainly concentrates the application of rough set for clus tering in data mining.The project and implimentation is divided into two phases 1. In the first phase of the project and implimentation, the indiscernibility relation of RST is used for the generation of clusters and an algorithm is developed for clustering of data. 2. In the second phase, The algorithm developed has to be implemented and tested on a variety of databases of different sizes and for different applications. 6 1.3 Advantages of RST Approach in Clustering 1. Cluster formation is natural and easy. 2. RST approach provides definitions and method for finding which attribute separates one classification from another. 3. It uses only internal information for the formation of clusters. 4. It rely on attribute reduction. 5. This approach handles uncertainty in clustering process. 6. It is rather easy to implement and can handle any volume of data. 7 2 An Algorithm For Clustering Using Similarity  Measure In RST Basically, there are two requirements. The first one is to form all the identical groups together to form base clusters. In the case of base clusters, all the attribute values of the objects that belong to the same cluster will be identical. This forms the first functional requirement of our algorithm. The process is to identify and club objects having the same attribute values, which in turn forms the base clusters. In the case of the second requirement, the strict notion of indiscernibility is relaxed. With rvalue ’n’, there may utmost ’n’ attribute values of object that may differ between them are clubbed together to form a cluster. The process basically starts form the base clusters, where, identical objects are clubbed together. These base clusters are compared each other and clubbed when there is a maximum difference of ’n’ attribute values between the objects to form new clusters. 2.0.1 Functional Requirements Requirement R1  Generate Clusters with r = 0 1. database that contains data records. 2. Generate database that contains groups of data records of same attribute values. Process 1. Identify data records with the same attribute value (r = 0) and store it, which forms identical groups 2. Continue the above process to generate all such groups 3. Identify all the distinct records/clusters and form each one as separate group 4. Store all the data groups. Requirement R2  Generate Clusters with r = k 1. Input data file that contains data records with r = 0 2. Generate a database that contains groups of data records with r = k Process 1. From the database with r = 0, generate groups of data records with attribute value difference of ’k’ between the groups of data records whose r value is 0. 2. Repeat the process to form the minimum number of clusters thus formed. 3. Repeat the process to form the minimum number of clusters thus formed. 4. Store all the data groups. A Data Flow Diagram has been developed for the above said process. There are five pro cesses in the DFD, each of which can be refined further. 8 3 Data Flow Diagram . Figure 5ata Flow Diagram. 9 An algorithm is proposed for clustering data based on the above approach. The algorithm is very simple. When this algorithm is implemented and tested with various databases of small and medium size, we expect to get encouraging results. Algorithm  Basic Steps 1. Classify the objects with the same attribute values ( indiscernibility with r value = 0 ) to form base clusters. Form all such base clusters. 2. From the clusters thus formed, identify and club groups with indiscernibility r value = k between them to form new groups. 3. Repeat step 2, such that maximum groups can be clubbed together thus attaining mini mum number of clusters with r = k. Procedure BaseClusters( object[] ); Declare baseclust[size]; Begin . K := 1; . Repeat . For I := 2 to totalobjects do . If ( difference( object[K], object[I] ) == 0 ) then . Begin . Addtobasecluster(object[K],object[I] ); . K := k + 1; . End; . Until all the objects are processed. . //Add the remaining distinct clusters into baseclust . I := 1; . Repeat . If ( object[I] is not in baseclust ) then . Begin . Addtobasecluster(object[K],object[I] ); . K := k + 1; . End; . I := I + 1; . Until all objects are processed. { procedure ends } End 10 The above pseudo code generates base clusters in which indiscernibility with r value is 0. The pseudo code given below generates clusters in which indiscernibility with r value ’n’. Procedure ClusterwithDifferN( baseclust[], n ) Declare clustN[size]; Begin . Repeat . K := 1; . I:= 1; . Repeat . For J := I + 1 to last record do . . . Begin . . . If ( difference( aseclust[I].object,baseclust[J].object )¡= n ) then . . . Begin . . . Addtoclustn( clustN[K], baseclust[I], baseclust[J] ) . . . K := K + 1; . . . . End; End;{ for } . . . I := I + 1; . Until all base cluster objects are processed. . //add remaining base cluster objects into clustern . . For M := 1 to last record do . . Begin . . . If ( baseclust[M] is not in clustN[] ) . . . . Begin . . . . . Addtoclustn( clustN[K], baseclust[M] ) . . . . . K := K + 1; . . . . End; . . End; . Until no two groups in clustN have difference n. End; 11 4 Analysis of the Algorithm The above algorithm for the generation of clusters is quite easy to implement. The RST approach provides definitions and methods for finding which attribute separates one classifica tion from another and hence cluster formation is easy and natural. It uses only the internal information to form clusters. Even though the algorithm can handle small and medium sized databases effectively, there may be some restrictions in the case of large sized databases due to the system limitations ( available memory ). In such cases, the following modification is suggested. In the case of large databases, files may be used for the storage of intermediate clusters. A part of the data structure/database is loaded in to memory, processed to generate clusters and store to a file. The next part is then loaded and processed to generate clusters and updated to the file and so on for the whole database. Then file created may be refined again until no more refinem 12 5 Conclusion Rough Set Theory can be used for Data Mining applications like Clustering and Rule gener ation. In clustering, the concept, indiscernibility relation of rough set theory is utilized for the generation of clusters. Using this concept, clusters are generated without making use of any additional information such as probability distribution or a membership function in fuzzy set theory. An algorithm is developed based on this concept and can easily implement. Encourag ing results are expected when the algorithm is tested on a variety of databases. The Rough Set concepts, Reduct, Core, Lower approximation and Upper approximation are used for Rule Mining. Important rules can be generated by considering the Lower approxima tion of the target set. By considering the generated rules as attributes and by constructing a new decision table, a reduct rule set can be generated. The reduct rules thus generated are more important, and it does not contain any rules with low rule importance. 13 References [1] Y. Y. Yao T. Y. Lin and L. A. Zadeh (Editors). “Data mining, rough sets and granular computing”. Physica Verlag, March 2002. [2] S.K. Pal and A. Skowron. “Rough fuzzy hybridization  a new trend in decision making”. Springer  Verlag, April 1999. 14 


