Fraud Detection in High Voltage Electricity Consumers Using Data Mining
29-10-2010, 12:11 AM
SARATH KUMAR P
Electrical and Electronics
College Of Engineering, Trivandrum
One of the greatest problems that Brazilian electrical energy distribution companies have to deal with is commercial losses, which result mainly from consumer fraud. To reduce losses, the companies carry out on-site (in loco) inspections to try to detect fraud. The inspections are made by technicians who visit the consumer and evaluate the equipment and electrical connections. Generally, company specialists indicate which consumers must undergo an inspection. However, due to the large number of consumers, it is almost impossible to evaluate the behavior of each one and indicate those suspected of fraud. It is also not viable to inspect all the units, because the number of fraudulent consumers is small compared to the total number of clients. This problem is present in all consumer classes, from residential to industrial. Still, high voltage electricity consumers account for major financial losses because of their high energy consumption and their differentiated fees for electrical demand (kW) and consumption (kWh). However, electrical energy distribution companies store client information in their databases. This information can be used as input to a data mining system that identifies clients with suspicious behavior, that is, good candidates for inspection. There are many works on fraud detection, but only a few of them address fraud detection for electrical energy consumers, using data mining techniques such as Decision Tree and Rough Sets. Basically, the methodology of those works is:
1. Generate a sample database of normal and fraudulent consumers;
2. Preprocess the data for the data mining tools;
3. Apply the tools to create decision rules;
4. Verify which consumers fit the rules with “fraud” decision and inspect them.
However, some important characteristics distinguish low voltage from high voltage consumers and make the methodology cited above impossible to apply. First, the number of high voltage electrical energy consumers (mainly industries) is small, which makes telemetering economically feasible, gives the opportunity to follow other variables in addition to consumption (kWh), and makes it possible to inspect all the clients over a relatively short period of time (annually). Second, committing high voltage electrical energy fraud is a complex and dangerous act, so the number of detected cases is small, and creating a consistent database of fraudulent consumers and deriving rules from it is almost impossible. Finally, the clients must inform the maximum electrical demand (contracted demand) that they will need each month; this information can be compared with the electrical demand actually registered, revealing possible abnormal consumption. So, the objective of this work is to present a methodology and an identification system for possible fraud by high voltage electrical energy consumers. First, a methodology based on an Artificial Intelligence technique called Self-Organizing Maps (SOM) is proposed. After that, some implementation details of the fraud detection system (FDS) are shown. Finally, the validation, results and conclusion are presented.
The methodology proposed here is based on a reference model known as Knowledge Discovery in Databases (KDD), widely used in data mining projects and implementations. The steps of the methodology are shown next.
2.1 Choice of the Variables and Data Consolidation
Among the variables (or attributes) considered for each consumer, there are those whose values change with time, called dynamic, and those that remain unaltered or are rarely updated, called static (or contract variables). The dynamic variables are the most important for fraud detection, because they represent the behavior of the consumer in the time domain. However, because there are new values for each time unit considered, the dynamic variables are more complex to handle and analyze. Table I shows the chosen set of static and dynamic variables. To obtain information about the high voltage consumers, the data collected by the measurement devices (telemeters) installed by the company at each consumer were used, as well as the information in the contracts between the consumers and the electrical energy company. The telemeter registers the data of each consumer every 15 minutes, which gives 96 registers per day and almost 3,000 per month. Registers are stored on the device and transmitted via GPRS at the end of the month. Online transmission, or transmission at the end of each day, is still too expensive for the company. The database of a Brazilian electrical energy distribution company was used in this work. The database stores data of approximately 2,000 high voltage consumers, so almost 6 million registers are added to it each month. When a client's data are selected for a desired time interval (12 months, for example), a large volume of data is returned, and it has to be prepared before the data mining tool of the next step can be applied. The consumption (kWh) variable, with 96 registers per day, was grouped by weekdays, Monday to Friday, that is, into blocks of 480 (5 x 96) registers. Therefore, register 1 represents the consumption on Monday at 00:00:00 and register 480 the consumption on Friday at 23:45:00. In this way, each client's consumption was converted into large weekly registers of 480 values each. Saturdays and Sundays were excluded because they are atypical days: on these days the client could be consuming normally, partially, or not carrying out any activity at all.
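The weekly grouping described above can be sketched as follows. This is only an illustration in Python (the original system was built in MATLAB); the function name `weekly_registers`, the assumption of contiguous 15-minute readings, and the choice to discard incomplete weeks are hypothetical and not taken from the source.

```python
from datetime import datetime, timedelta

READINGS_PER_DAY = 96                     # one reading every 15 minutes
WEEKDAYS = 5                              # Monday to Friday
WEEK_LEN = READINGS_PER_DAY * WEEKDAYS    # 480 values per weekly register


def weekly_registers(start, readings):
    """Group 15-minute kWh readings into Monday-Friday registers of 480 values.

    `start` is the timestamp of the first reading and the readings are assumed
    to be contiguous. Weekend readings are skipped and incomplete weeks are
    discarded (both choices are assumptions of this sketch).
    """
    weeks = {}
    step = timedelta(minutes=15)
    for i, kwh in enumerate(readings):
        ts = start + i * step
        if ts.weekday() >= WEEKDAYS:      # 5 = Saturday, 6 = Sunday
            continue
        monday = (ts - timedelta(days=ts.weekday())).date()
        # Position inside the week: 0 is Monday 00:00, 479 is Friday 23:45.
        slot = ts.weekday() * READINGS_PER_DAY + (ts.hour * 60 + ts.minute) // 15
        weeks.setdefault(monday, [None] * WEEK_LEN)[slot] = kwh
    return [w for _, w in sorted(weeks.items()) if None not in w]


# Synthetic data: two full calendar weeks starting on Monday 2007-01-01 00:00.
start = datetime(2007, 1, 1)
readings = [float(i) for i in range(14 * READINGS_PER_DAY)]
registers = weekly_registers(start, readings)
print(len(registers), len(registers[0]))
```

Each element of `registers` is one weekly register; the zero-based list index 0 corresponds to register 1 (Monday 00:00:00) and index 479 to register 480 (Friday 23:45:00).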
Table I. Chosen variables to be used in the methodology.
2.2 Self-Organizing Maps (SOM)
SOM is a specific Artificial Neural Network model with unsupervised learning that maps time-varying inputs according to their graphical representation, allowing the identification of clusters or patterns among the inputs. In other words, given a set of registers that can be visualized graphically, the SOM identifies groups of similar registers (clusters). An important characteristic of the SOM is that it needs no prior information or guidance about the clusters, so it can be used as a tool for identifying standard profiles in data without classification (or decision), like the data used here. To illustrate how the SOM was used as a data mining tool in the proposed methodology, the data of one client were selected and grouped by week (Monday to Friday). Figure 2.2.1 illustrates the consumption (kWh) for 68 weeks (the period of data collection). The x-axis holds the 480 values that compose a weekly register, from Monday (2) to Friday (6). The highlighted curve (black) represents the mean consumption of all the weeks (colored). It can be observed that there are many distinct weeks and that the behavior is similar from day to day, which is not necessarily the case for all consumers. When the same weeks were applied to the SOM, it found 2 clusters. The weeks that compose each cluster can be seen in Figs. 2.2.2 and 2.2.3. Analyzing Cluster 1 (Fig. 2.2.2) and Cluster 2 (Fig. 2.2.3), it is possible to notice that this consumer has a typical profile, represented by Cluster 1 (with 44 weeks), while Cluster 2 is an atypical profile with a relatively low mean consumption. Figure 2.2.4 shows all the weeks in chronological order on the x-axis and the mean consumption of the cluster on the y-axis. Now it can be clearly seen that Cluster 2 represents atypical and sporadic consumption until week 50. After this week, however, it is the only cluster.
Since the mean consumption of the weeks in Cluster 2 is 40% of the mean consumption of the weeks in Cluster 1, the suspicion could be raised that the client has been committing some type of electrical energy fraud since the 50th week. However, an immediate assumption of fraud may lead to many false positives, because atypical behavior is common for some clients, especially those whose production varies throughout the year due to the nature of their commercial or industrial activity. Thus, the application of the SOM to this problem needs to be complemented by other operations.
Fig. 2.2.1. Graphic with 68 weeks of one consumer.
Fig. 2.2.2. Graphic with the weeks of Cluster 1 (44 weeks).
Fig. 2.2.3. Graphic with the weeks of Cluster 2 (24 weeks).
Fig. 2.2.4. Graphic with the consumption mean for the weeks of each cluster.
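To make the clustering step more concrete, the following is a minimal sketch of a one-dimensional SOM in plain Python, applied to synthetic weekly registers. It is not the authors' MATLAB implementation: the map size of 2 nodes, the initialization from the lowest and highest mean weeks, the learning rate and neighborhood schedules, and the synthetic data (44 weeks around 100 kWh and 24 weeks around 40 kWh, chosen only to mirror the example above) are all assumptions.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two weekly registers."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(nodes, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(nodes)), key=lambda j: sq_dist(nodes[j], x))


def train_som(data, n_nodes=2, epochs=20, lr0=0.5, seed=0):
    """Train a 1-D SOM with n_nodes >= 2 and return the node weight vectors."""
    rng = random.Random(seed)
    # Assumed initialization: samples spread along the range of mean consumption.
    ranked = sorted(data, key=lambda w: sum(w) / len(w))
    nodes = [list(ranked[round(j * (len(ranked) - 1) / (n_nodes - 1))])
             for j in range(n_nodes)]
    sigma0 = n_nodes / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # linear decay of both schedules
        lr, sigma = lr0 * decay, sigma0 * decay + 0.01
        for x in rng.sample(data, len(data)):  # shuffled pass over the weeks
            bmu = best_matching_unit(nodes, x)
            for j, w in enumerate(nodes):
                # Gaussian neighborhood on the 1-D map around the winner.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k, xk in enumerate(x):
                    w[k] += lr * h * (xk - w[k])
    return nodes


# Synthetic weekly registers of 480 values each (illustrative data only).
data_rng = random.Random(1)
weeks = ([[100.0 + data_rng.gauss(0.0, 3.0) for _ in range(480)] for _ in range(44)]
         + [[40.0 + data_rng.gauss(0.0, 3.0) for _ in range(480)] for _ in range(24)])

nodes = train_som(weeks)
labels = [best_matching_unit(nodes, w) for w in weeks]
sizes = [labels.count(j) for j in range(len(nodes))]


def cluster_mean(j):
    """Mean consumption over all values of the weeks assigned to node j."""
    members = [w for w, lab in zip(weeks, labels) if lab == j]
    return sum(sum(w) for w in members) / (len(members) * len(members[0]))


ratio = cluster_mean(0) / cluster_mean(1)
print(sizes, round(ratio, 2))
```

In this sketch node 0 starts from the week with the lowest mean, so it ends up representing the low-consumption profile, and `ratio` plays the role of the 40% relation between the cluster means discussed above. How the number of clusters was chosen in the original work is not stated in the source.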
2.3 Automatic Behavior Analysis
In the same way that the SOM can identify the weekly profiles that a consumer has in a given time interval, it can also classify new weeks according to pre-computed clusters. Based on this, it is proposed that the behavior of a consumer be analyzed as follows:
1. Verify whether there is a consumption drop (negative variation) between the month under analysis and the previous month (a 30% drop, for example);
2. Select the last 12 months of (historical) data and organize them into weeks;
3. Compute the clusters of weeks with the SOM;
4. Assign each new week of the current month to one of the clusters found by the SOM (4 or 5 weeks per month);
5. Verify whether each new week adequately fits the cluster to which it was assigned (fitness), or whether it probably represents a new profile, unknown until now;
6. Verify whether the unknown profile is justified by modifications of the consumer contract that keep the ratio between the monthly registered electrical demand and the contracted electrical demand approximately constant (RD/CD = k).
Fig. 2.3.1. Flowchart indicating the steps of the behavior analysis of each consumer, each new month.
The flowchart presented in Fig. 2.3.1 illustrates the steps of the behavior analysis described above. In this analysis, it is assumed that all clients are consuming electrical energy normally. Those that present abrupt drops undergo a consumption behavior analysis, supported by the clusters found with the SOM. The methodology points out fraud suspects only when a really abnormal behavior is identified and is not explained by contractual modifications of the electrical demand.
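The decision logic of steps 1, 4, 5 and 6 can be sketched as below, taking the cluster centroids as already computed (steps 2 and 3). The source does not define the fitness measure or the tolerance on RD/CD, so the distance-to-centroid test, the `radii` input, and the thresholds `tolerance=1.2` and `rel_tol=0.15` are assumptions of this sketch, as are all function names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two weekly registers of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def consumption_dropped(prev_kwh, curr_kwh, threshold=0.30):
    """Step 1: relative drop from the previous month to the current month."""
    return (prev_kwh - curr_kwh) / prev_kwh >= threshold


def fits_known_profile(week, centroids, radii, tolerance=1.2):
    """Steps 4-5: assign the week to its nearest cluster and test the fit.

    radii[j] is the largest distance from a historical week of cluster j to
    its centroid. The test (distance <= tolerance * radius) is an assumed
    fitness measure, not the one used in the original work.
    """
    j = min(range(len(centroids)), key=lambda c: euclidean(week, centroids[c]))
    return euclidean(week, centroids[j]) <= tolerance * radii[j]


def demand_ratio_stable(k_hist, rd_now, cd_now, rel_tol=0.15):
    """Step 6: is the current RD/CD close to its historical value k?"""
    return abs(rd_now / cd_now - k_hist) <= rel_tol * k_hist


def is_suspect(prev_kwh, curr_kwh, new_weeks, centroids, radii,
               k_hist, rd_now, cd_now):
    """Flag a consumer only if every check along the flowchart fails."""
    if not consumption_dropped(prev_kwh, curr_kwh):
        return False                      # no abrupt drop: treated as normal
    if all(fits_known_profile(w, centroids, radii) for w in new_weeks):
        return False                      # drop matches a known profile
    return not demand_ratio_stable(k_hist, rd_now, cd_now)


# Toy example with 4-value "weeks" instead of 480 values, for brevity.
centroids, radii = [[100.0] * 4], [5.0]
new_weeks = [[60.0] * 4 for _ in range(4)]

fraud_like = is_suspect(1000.0, 600.0, new_weeks, centroids, radii,
                        k_hist=0.9, rd_now=55.0, cd_now=100.0)
contract_change = is_suspect(1000.0, 600.0, new_weeks, centroids, radii,
                             k_hist=0.9, rd_now=55.0, cd_now=60.0)
no_drop = is_suspect(1000.0, 950.0, new_weeks, centroids, radii,
                     k_hist=0.9, rd_now=55.0, cd_now=100.0)
print(fraud_like, contract_change, no_drop)
```

In the second call the contracted demand was lowered together with the registered demand, so RD/CD stays near k and the new profile is considered justified.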
3. FRAUD DETECTION SYSTEM
The methodology presented in the previous section formed the basis for the implementation of a fraud detection system (FDS). This system was integrated into the information system (IS) of a Brazilian electrical energy distribution company. It is important to emphasize that the FDS is not expected to replace the critical judgment and experience of the specialists. This is because the number of high voltage clients is much smaller than the number of ordinary clients (residential, for example). Thus, even when the system identifies consumers with a high level of fraud suspicion, this normally small number of cases can be reviewed by specialists. This subsequent analysis or verification of the suspicions by specialists eliminates inspections of false positives, which are consumers with the atypical behavior of suspects but who did not commit any illicit act. MATLAB was chosen as the FDS development platform because it comprises a series of toolboxes that facilitate data manipulation and analysis, including the use of the SOM. The FDS is executed as a monthly scheduled task (service). So, every month, the system performs the following tasks:
1. Select the clients that must be analyzed due to fraud suspicion;
2. For each client, select its data and apply the developed methodology;
3. Insert all fraud suspicions found into the database, together with justifications and additional information about the consumers, to facilitate the analysis of the suspicions, which will be performed by the specialists.
So that the specialists can analyze the suspicions, an interface within the IS was created. When viewing the suspicions, the specialists can: immediately launch an inspection; examine detailed information about the suspect consumer to understand what triggered the alarm; or clear the client of any suspicion, once the specialist knows the reason for the alarm raised by the FDS.
Fig. 3.1 illustrates the integration between the IS and the FDS, defining the operations that each system executes on the database.
Fig. 3.1. Integration between IS and FDS.
4. VALIDATION AND RESULTS
For the FDS validation, a simulation was carried out in which 156 random consumers with different behaviors were selected. First, all of them were analyzed with the FDS for fraud suspicion. Then, an intentional and temporary 30% drop was applied to the registered consumption of every consumer for a specific period of time, and the consumers were submitted to the FDS again. The numbers of suspicions before and after the intentional consumption drop are presented in Table II. Analyzing this result, it can be observed that the FDS is able to identify and flag as suspect the abrupt consumption drops that have no justification or similar precedent. In other words, if a consumer commits fraud, the FDS will very likely indicate this consumer as a suspect. The consumers that were still classified as normal even after the consumption modification (15%) are explained by the fact that they presented naturally atypical behavior in the same months of the previous year as those of the simulation period. Therefore, the FDS assumed that this abnormality was expected.
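The intentional drop used in the simulation can be reproduced on a series of readings along the lines below. This is a sketch under stated assumptions: the helper names `apply_drop` and `relative_drop`, the flat synthetic consumption, and the 30-day window are illustrative and do not come from the source.

```python
def apply_drop(readings, start, stop, drop=0.30):
    """Return a copy of readings scaled by (1 - drop) on indices [start, stop)."""
    return [v * (1.0 - drop) if start <= i < stop else v
            for i, v in enumerate(readings)]


def relative_drop(before, after):
    """Relative decrease of total consumption from one period to the next."""
    return (sum(before) - sum(after)) / sum(before)


PER_MONTH = 96 * 30                       # 15-minute readings in a 30-day month

# Synthetic consumer: 60 days of flat consumption, second month tampered.
original = [100.0] * (2 * PER_MONTH)
tampered = apply_drop(original, PER_MONTH, 2 * PER_MONTH)

drop = relative_drop(tampered[:PER_MONTH], tampered[PER_MONTH:])
print(round(drop, 2))
```

The tampered series can then be fed to the same analysis as any real month, which is how the before and after counts of suspicions are compared.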
Since the incidence of fraud among high voltage consumers is historically low (approximately 1 case per year), and the developed FDS is in its first months of operation, there is so far no record of a suspicion effectively confirmed as fraud after an on-site inspection. However, the suspicions raised by the FDS are helping the specialists to understand the behavior of their consumers, since only the most severe and intense abnormalities were observed before the implementation of the system.
One of the most relevant points of this work is the proposal of a practical methodology for applying data mining to a real problem. This methodology can be easily applied to other behavior analysis problems, especially when historical abnormal cases are limited. The SOM proved to be very efficient as a data mining tool. Identifying clusters in data is not a simple task, especially when the identification is unsupervised. The consumer clusters proved to be very consistent with reality, separating regular from atypical weeks, mainly when the consumer's commercial or industrial activities drastically influence its consumption profile. Fraud detection is a very complex problem, since the difference between a fraudulent profile and a naturally atypical one is very subtle. The developed FDS proved satisfactory because it could perceive alterations in the consumption profile and also compares this atypical behavior with the consumer's historical data. Therefore, when indicating a consumer as a suspect, the FDS does not declare that the consumer is committing fraud, but signals that consumption is lower than normal and, mainly, that the current consumption profile differs from the profile expected for this consumer. With the values found in the validation, it can be concluded that, in the event of fraud, the chance that the FDS points to the consumer as a suspect is high. Inevitably, some consumers will be flagged as suspects at some moments. After inspection and confirmation that these consumers are not fraudulent, the FDS may be fed with this new information and start to treat the new behavior as already known. It is important to highlight that the FDS is parametric and can work with strict or loose values. With its practical operation and corresponding monitoring, it will be possible to tune these parameters in a more convenient way.
J. E. Cabral, "Fraud Detection in High Voltage Electricity Consumers Using Data Mining," in Proc. 2008 IEEE Networking System, Sensing and Control Conf., pp. 761-766, 2008.
Y. Kou, "Survey of Fraud Detection Techniques," in Proc. 2004 IEEE International Conference on Networking, Sensing and Control, vol. 1, pp. 749-754, 2004.
J. R. Filho, "Fraud Identification in Electricity Company Customers Using Decision Tree," in Proc. 2004 IEEE SMC System, Man and Cybernetics Conf., pp. 3730-3734, 2004.
J. E. Cabral, "Methodology for Fraud Detection Using Rough Sets," in Proc. 2006 IEEE Granular Computing Conf., pp. 244-249, 2006.