A NEURAL NETWORK MODEL FOR DETECTING ANOMALOUS TRAFFIC IMPLEMENTING SELF-ORGANIZING MAPS
1 G Vijay Kumar, 2 Sumalatha, 3 A Ajay Kumar, 4 K V Ramana
1 Department of ECE, VR Siddhartha Engg. College, Vijayawada, India
2 Department of CSE, JNTU College of Engg., JNTU, Kakinada, India
3 Department of CSE, JNTU College of Engg., JNTU, Kakinada, India
4 Department of CSE, Sri Prakash Engineering College, TUNI, Andhra Pradesh, India
Integrated Network-Based Ohio University Network Detective Service (INBOUNDS) is a network-based intrusion detection system being developed at Ohio University. The Anomalous Network-Traffic Detection with Self Organizing Maps (ANDSOM) module for INBOUNDS detects anomalous network traffic based on the Self-Organizing Map algorithm. Each network connection is characterized by six parameters and specified as a six-dimensional vector. The ANDSOM module creates a Self-Organizing Map (SOM) having a two-dimensional lattice of neurons for each network service. During the training phase, normal network traffic is fed to the ANDSOM module, and the neurons in the SOM are trained to capture its characteristic patterns. During real-time operation, a network connection is fed to its respective SOM, and a "winner" is selected by finding the neuron that is closest in distance to it. The network connection is then classified as an intrusion if this distance is more than a pre-set threshold.
Keywords: Intrusion Detection, Anomaly Detection, Self-Organizing Maps
1. INTRODUCTION
We have seen an explosive growth of the Internet in the past two decades. As of January 2003, the Internet connected over 171 million hosts. With this tremendous growth has come our dependence on the Internet for more and more activities of our lives. Hence, it has become critical to protect the integrity and availability of our computer resources connected to the Internet. We have to protect our computer resources from malicious users on the Internet who try to steal, corrupt, or otherwise abuse them. Towards this goal, intrusion detection systems are being actively developed and increasingly deployed. Intrusion detection systems have commonly used two detection approaches, namely, misuse detection and anomaly detection. The misuse detection approach uses a database of "signatures" of well-known intrusions and a pattern matching scheme to detect intrusions in real time. The anomaly detection approach, on the other hand, tries to quantify the normal operation of the host, or the network as a whole, with various parameters and looks for anomalous values of those parameters in real time. Integrated Network-Based Ohio University Network Detective Service (INBOUNDS) is an intrusion detection system being developed at Ohio University. INBOUNDS uses the anomaly detection approach to detect intrusions, and is network-based, i.e., it can be used to passively monitor the network as a whole.
In this paper, we present the Self-Organizing Map based approach for detecting anomalous network behavior developed for INBOUNDS. We organize this paper as follows. In Section 2, we give a brief description of Self-Organizing Maps and follow up in Section 3 with details of related work in the intrusion detection domain based on Self-Organizing Maps. In Section 4, we describe the INBOUNDS system and the design of the Self-Organizing Map based module for detecting intrusions. In Section 5, we present some of our experimental results, and in Section 6 we conclude and give some recommendations for future work.
2. SELF-ORGANIZING MAPS
This section summarizes the concept, design, and implementation techniques of Self-Organizing Maps. The Self-Organizing Map algorithm performs a nonlinear, ordered, smooth mapping of high-dimensional input data manifolds onto the elements of a regular, low-dimensional array. The algorithm converts non-linear statistical relationships between data points in a high-dimensional space into geometrical relationships between points in a two-dimensional map, called the Self-Organizing Map (SOM).
The SOM learning principle is illustrated in Figure 1, which shows the SOM with circles representing neurons. The input data point is fed in parallel to all the neurons in the SOM. The winner neuron is colored black, and a square of side length 5 centered on it represents the neighborhood. During the learning phase, samples of data are collected from the input space and "shown" to the SOM. For this purpose, sample vectors representing input data covering the range of operational behavior are collected. The neurons in the SOM are initialized to values chosen from the range of sample data. The neurons can be assigned values linearly in the range (linear initialization), or assigned random values within the range (random initialization).
- Distance Measure. For the purpose of locating the winner neuron given the data sample, a suitable measure of distance has to be defined. The commonly used distance measures are the Euclidean and the Dot-product measures. In the Euclidean measure, given two points X (x1, x2, ..., xk) and Y (y1, y2, ..., yk) in k-dimensional space, the Euclidean distance is given by
$$d(X, Y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$
The winner is selected to be the neuron at the minimum Euclidean distance from the data sample.
Fig. 1. SOM Learning
A SOM can then be used to visualize the abstractions (clustering) of data points in the input space. The points in the SOM are called neurons, and are represented as multidimensional vectors. If the data points in the input space are characterized using k parameters and represented by k-dimensional vectors, the neurons in the SOM are also specified as k-dimensional vectors.
In the SOM Learning phase, the neurons in the SOM are trained to model the input space. This phase has the following two important characteristics:
- Competitive. Each sample data point from the input data space is shown in parallel to all the neurons in the SOM, and the "winner" is chosen to be the neuron that responds best. The k-dimensional values of the winner are adjusted so that it responds even better to similar input.
- Cooperative. A neighborhood is defined for the winner to include all neurons in its near vicinity in the SOM. The k-dimensional values of neurons in the neighborhood are also adjusted so that they too respond better to a similar input.
If the Dot-product measure is to be used, the input data points and the neurons in the SOM have to be normalized. Normalization of a vector V (v1, v2, ..., vk) is a process of transforming its components into
$$v_i' = \frac{v_i}{\sqrt{\sum_{j=1}^{k} v_j^2}}, \quad i = 1, \ldots, k$$
so that the modulus of the normalized vector is unity. The dot-product of the input data point is calculated individually with each of the neurons, where the dot-product of two normalized vectors X (x1, x2, ..., xk) and Y (y1, y2, ..., yk) is defined to be
$$X \cdot Y = \sum_{i=1}^{k} x_i y_i$$
The winner is selected to be the neuron that gives the maximum dot product.
- Neighborhood Function. The neighborhood function determines the size and nature of the neighborhood around the winner neuron. The commonly used neighborhood functions are Bubble and Gaussian. In the Bubble function, the neighborhood radius is specified by a variable σ, and all neurons within the neighborhood are adjusted by the same factor α toward the input. The parameter α, called the learning rate factor, and the neighborhood size σ are generally chosen to be monotonically decreasing functions of time.
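The two distance measures and the Bubble neighborhood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SOM is held as a dictionary from lattice position `(row, col)` to a weight vector, the neighborhood is taken to be a square clipped at the lattice edges (matching the square drawn in Figure 1), and all function names are hypothetical rather than taken from any SOM package.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def normalize(v):
    """Scale v to unit modulus; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

def dot(x, y):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))

def winner_euclidean(neurons, sample):
    """Lattice position of the neuron closest to the sample."""
    return min(neurons, key=lambda pos: euclidean(neurons[pos], sample))

def winner_dot(neurons, sample):
    """Lattice position of the neuron with the largest normalized dot product."""
    s = normalize(sample)
    return max(neurons, key=lambda pos: dot(normalize(neurons[pos]), s))

def bubble_neighborhood(winner, radius, rows, cols):
    """Lattice positions in the square of half-width `radius` around the winner."""
    wr, wc = winner
    return [(r, c)
            for r in range(max(0, wr - radius), min(rows, wr + radius + 1))
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1))]
```

With `radius=2` and a winner far enough from the lattice edge, `bubble_neighborhood` returns the 5-by-5 square of neurons described for Figure 1.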
The choice of the factors described in the previous section affects the nature of the SOM generated. Once these factors are decided upon, the following algorithm can be used to train the SOM. After the SOM is initialized, the learning process is carried out in two phases.
In the initial phase, a relatively large neighborhood radius is chosen. The learning rate factor α(t) is also chosen to have a high value, close to unity. This phase is carried out for a relatively small number of iterations. Most of the map organization happens in this phase. In the final fine-tuning phase, a smaller neighborhood radius and a smaller learning rate factor are chosen. This phase is carried out for a relatively large number of iterations. The adjustments made to the neurons are much smaller in this phase.
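The two-phase procedure can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions made for the example and are not prescribed by the text: random initialization, a square Bubble neighborhood, linear decay of both the radius and the learning rate within each phase, the iteration counts, and the names `train_phase` and `train_som`.

```python
import math
import random

def train_phase(neurons, samples, iterations, radius0, alpha0, rng):
    """Run one training phase, updating the neuron weights in place.

    neurons maps a lattice position (row, col) to a weight list.
    The radius and learning rate decay linearly (an assumption).
    """
    for t in range(iterations):
        frac = 1.0 - t / iterations
        alpha = alpha0 * frac
        radius = max(1, round(radius0 * frac))
        x = rng.choice(samples)
        # Competitive step: the winner is the neuron closest to the sample.
        win = min(neurons, key=lambda p: math.dist(neurons[p], x))
        # Cooperative step: move the winner and its neighbors toward the sample.
        for (r, c), w in neurons.items():
            if abs(r - win[0]) <= radius and abs(c - win[1]) <= radius:
                for i in range(len(w)):
                    w[i] += alpha * (x[i] - w[i])

def train_som(samples, rows, cols, seed=0):
    """Randomly initialize a rows x cols SOM and train it in two phases."""
    rng = random.Random(seed)
    dim = len(samples[0])
    lo = [min(s[i] for s in samples) for i in range(dim)]
    hi = [max(s[i] for s in samples) for i in range(dim)]
    # Random initialization within the range of the sample data.
    neurons = {(r, c): [rng.uniform(lo[i], hi[i]) for i in range(dim)]
               for r in range(rows) for c in range(cols)}
    # Initial phase: large radius, learning rate near unity, few iterations.
    train_phase(neurons, samples, 200, max(rows, cols) // 2, 0.9, rng)
    # Fine-tuning phase: small radius, small learning rate, many iterations.
    train_phase(neurons, samples, 2000, 1, 0.05, rng)
    return neurons
```

Because every update is a convex combination of a neuron and a sample, the trained neurons stay inside the bounding box of the sample data.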
In a k-dimensional space, the sample data and the SOM neurons appear as points. During the course of the learning algorithm described above, the neurons "move" in the k-dimensional space to characterize the sample data as closely as possible. While clusters of neurons would form at spaces where the sample data points are concentrated, fewer neurons would represent the space where sample data occur sparsely.
During operation, a real-time sample can be fed to the SOM, and its winner located. The sample is flagged normal if it is sufficiently close to the winner, and flagged anomalous if its distance from the winner is more than a preset threshold.
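As a minimal sketch of this decision rule, assuming the Euclidean measure and a SOM stored as a dictionary from lattice position to weight vector (the function name `classify` and the threshold value used in any call are illustrative only):

```python
import math

def classify(neurons, sample, threshold):
    """Return ("normal" | "anomalous", distance from the sample to its winner)."""
    distance = min(math.dist(w, sample) for w in neurons.values())
    label = "anomalous" if distance > threshold else "normal"
    return label, distance
```

The threshold trades missed intrusions against false alarms, so in practice it would be tuned per network service.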
3. RELATED WORK
In the work presented in this paper, SOM-based profiles of various network services such as web and email are built to perform anomaly-based intrusion detection. Using network-service profiles to perform intrusion detection is not new, however. One earlier paper discusses the approach used to build statistical profiles of network services for the EMERALD intrusion detection system. Another discusses a neural-network-based approach to developing connection signatures of common network services. Self-Organizing Maps have also been used in the intrusion detection domain for various purposes. One paper describes a Multi-level Perceptron/SOM prototype used to perform misuse-based intrusion detection. Another describes a system that uses SOMs as a monitor-stack to profile network data at various layers of the TCP/IP protocol stack. This system was used to detect buffer-overflow attacks by building profiles of application data based on the percentage of bytes that were alphabetical, numerical, control, or binary. A further paper describes a system that detects intrusions with neural networks trained using the Resilient Propagation Algorithm (RPROP), and that uses SOMs for clustering and visualizing the data.
Fig. 2. INBOUNDS Architecture Diagram
One paper describes a host-based intrusion detection system that uses multiple SOMs to build profiles of user sessions, and uses them to detect abnormal user activities. Another describes an anomaly-based intrusion detection system that characterizes each connection based on the following features:
- Duration of the connection
- Protocol type (TCP/UDP)
- Service type (HTTP/SMTP/etc.)
- Status of the connection
- Source bytes (bytes sent from source to destination)
- Destination bytes (bytes sent from destination to source)
SOMs based on these features were used to classify network traffic as normal or anomalous. In the approach presented in this paper, we characterize each network connection based on a different set of features, and build SOMs for each individual network service of interest.
4. SOM-BASED ANOMALY DETECTION
In this section, we describe the Anomalous Network-traffic Detection using Self-Organizing Maps (ANDSOM) module used by INBOUNDS for detecting intrusions. We briefly describe the design of the INBOUNDS system as a whole, and then describe the ANDSOM module in detail.
4.1 INBOUNDS Architecture
The INBOUNDS Architecture diagram is shown in Figure 2. Some of the modules of the INBOUNDS system are currently under development. The goal of this section is only to present a high-level view of the INBOUNDS system, so as to give proper context for describing the ANDSOM module. The Intrusion Detection Engine is the heart of the INBOUNDS system. Multiple Data Source modules feed real-time network data into the engine. This engine makes a decision on whether a network connection looks normal or anomalous. The Display module displays the real-time traffic seen in the network on which INBOUNDS is being run, with a GUI front-end. The Active Response module takes response actions against intrusions. The response actions include adding a rule to the network firewall (if available) to block the traffic, rate-limiting traffic, and closing the TCP connection by sending a RST packet to the sender. The Intrusion Detection module is present as a placeholder for a module that can make the final decision on whether the connection being analyzed is intrusive or not. The goal of this module is to incorporate the results of other modules, besides the ANDSOM module, and come up with a final decision. The modules that could be added in future include signature-based intrusion detection systems like SNORT and modules based on other paradigms, like the Bayesian module under development.
a. Data Source Module. The Data Source module feeds live network data packets to the Intrusion Detection Engine. The program tcpurify runs as the Data Source module, captures network packets off the wire, removes the application data from each packet, and reports only the first 64 bytes of each packet, covering the IP and TCP/UDP protocol headers. The tcpurify program can also obfuscate the sender and receiver IP addresses and provide anonymity to the two hosts involved in the network connection during traffic analysis.
b. Data Processor Module. This module receives the raw packets from the Data Source modules as input and runs the tcptrace program with the INBOUNDS module. The INBOUNDS module for tcptrace reports the following messages for every connection seen in the network. Open messages are reported upon seeing a new connection being opened in the network. The Open (O) message is of the following format:
O TimeStamp Protocol <src host:port> <dst host:port> Status
The TimeStamp field reports the time the connection was opened. The Protocol field indicates whether the protocol was TCP or UDP. Since UDP traffic, unlike TCP, has no implicit notion of a connection, the Data Processor module groups UDP traffic between unique <SourceIP, SourcePort, DestinationIP, DestinationPort> 4-tuples as connections, and uses a timeout mechanism to expire old connections. The <src host:port> and <dst host:port> fields together identify the end-points involved in the connection. For TCP connections, the Status field is reported as 0 if both SYN packets opening the connection were seen, and as 1 otherwise. For UDP connections, the Status field is always reported as 0.
c. Update messages are reported periodically during the lifetime of the connection. The Update (U) message is of the following format:
U TimeStamp Protocol <src host:port> <dst host:port> INTER ASOQ ASOA QAIT AQIT
The period with which successive Update (U) messages are reported is tunable and defaults to 60 seconds. The INTER field reports the interactivity of the connection, defined as the number of questions per second seen during the past period. A sequence of data packets seen from the sender to the receiver constitutes a single question until the receiver sends a packet carrying some data (pure TCP acknowledgments do not count), which marks the end of the question. The answers are similarly defined for the receiver-to-sender direction.
d. Close messages are reported when a previously open connection is closed in the network. The Close (C) message is of the following format:
C TimeStamp Protocol <src host:port> <dst host:port> Status
For TCP connections, the Status field is reported as 0 if both the FIN packets were seen during the connection close. If the connection was closed with a RST packet, the Status is reported as 1. UDP connections are reported as closed if they are found inactive for an expire-interval. The expire-interval is tunable, and defaults to 120 seconds. The Status field is always reported as 0 for UDP connections.
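The three message types lend themselves to a small parser that tracks per-connection state. The sketch below is illustrative only. It assumes whitespace-separated fields, end-points written as plain `host:port` tokens, numeric timestamps in seconds, and a 'U' message that carries the five values INTER, ASOQ, ASOA, QAIT, and AQIT after the end-points (the fields that the TRC2INP submodule is described as reading from 'U' messages). The `Connection` class and `parse_message` function are hypothetical names, not part of tcptrace or INBOUNDS.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Connection:
    protocol: str
    opened: float
    open_status: int
    updates: list = field(default_factory=list)  # (INTER, ASOQ, ASOA, QAIT, AQIT)
    closed: Optional[float] = None
    close_status: Optional[int] = None

def parse_message(line, table):
    """Apply one O/U/C message to the connection table (a dict)."""
    parts = line.split()
    kind, timestamp, protocol = parts[0], float(parts[1]), parts[2]
    key = (protocol, parts[3], parts[4])  # protocol plus both end-points
    if kind == "O":
        table[key] = Connection(protocol, timestamp, int(parts[5]))
    elif kind == "U":
        table[key].updates.append(tuple(float(v) for v in parts[5:10]))
    elif kind == "C":
        conn = table[key]
        conn.closed = timestamp
        conn.close_status = int(parts[5])
    else:
        raise ValueError("unknown message type: " + kind)
    return key
```

Keying the table on the protocol and both end-points mirrors the 4-tuple grouping described for UDP traffic.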
4.2 ANDSOM Module - Training
The ANDSOM module implements the SOM-based approach for intrusion detection. In the training phase, SOMs are built to model different network services like web, email, telnet, etc. For example, if we are trying to model web traffic in our network, a training data set is first collected by capturing dump files from the network containing a large number of sample web connections. It is important to make sure that intrusions themselves do not get into the training data set, since such intrusions may be perceived as normal by the SOM being built. For this, the signature-based intrusion detection system SNORT is run on the dump file, and connections reported by SNORT as intrusive are removed. However, it is still possible that intrusions are missed by SNORT if it has no rules to detect them. To make our system robust against this possibility, we could use multiple training data sets to generate multiple SOMs for each network service, and run the training data sets against the other maps to prune anomalies out of our model.
The submodules that make up the ANDSOM module are explained below:
1. TRC2INP Submodule. This submodule receives the 'O', 'U', and 'C' messages generated by the Data Processor module as input and generates six-dimensional vectors characterizing network connections. The parameters constituting the six dimensions are INTER, ASOQ, ASOA, L_QAIT, L_AQIT, and DOC. The INTER, ASOQ, and ASOA dimensions are calculated by averaging the INTER, ASOQ, and ASOA values from the 'U' messages received during the lifetime of the connection. The dimensions L_QAIT and L_AQIT represent the log base 10 values of the average QAIT and AQIT parameters calculated from the 'U' messages received during the course of the connection. The DOC dimension reports the Duration of Connection, and is the difference between the TimeStamps reported in the 'O' and 'C' messages of the connection.
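2. Normalizer Submodule. Building SOMs directly from the six-dimensional vectors reported by the TRC2INP submodule tends to bias the maps toward certain dimensions, as different dimensions tend to be in different units. Normalizing all dimensions to values from 0 to 1, for example, could help, but the dimension with the most variance would still tend to dominate the SOM formation. Hence we use a variance normalization procedure in this submodule to normalize the six-dimensional vectors: each dimension is shifted by its mean and scaled by its standard deviation, both computed over the training data set,
$$x'_d = \frac{x_d - \mu_d}{\sigma_d}$$
where \mu_d and \sigma_d are the mean and standard deviation of dimension d.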
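A minimal sketch of the two submodules, building the six-dimensional vector from a connection's messages and then applying variance normalization, might look as follows. The function names, the tuple layout of a 'U' record, the small `floor` that guards the logarithm against zero idle times, and the use of the population standard deviation are all assumptions made for this example.

```python
import math

def connection_vector(opened, closed, updates, floor=1e-6):
    """Build [INTER, ASOQ, ASOA, L_QAIT, L_AQIT, DOC] for one connection.

    updates is a list of (INTER, ASOQ, ASOA, QAIT, AQIT) tuples taken from
    the 'U' messages; opened and closed are the 'O' and 'C' timestamps.
    """
    n = len(updates)
    inter, asoq, asoa, qait, aqit = (
        sum(u[i] for u in updates) / n for i in range(5)
    )
    return [
        inter,
        asoq,
        asoa,
        math.log10(max(qait, floor)),  # log10 of the average QAIT
        math.log10(max(aqit, floor)),  # log10 of the average AQIT
        closed - opened,               # Duration of Connection
    ]

def fit_normalizer(vectors):
    """Per-dimension mean and (population) standard deviation."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
            for d in range(dims)]
    return means, stds

def normalize(vector, means, stds):
    """Shift each dimension by its mean and scale by its standard deviation."""
    return [(x - m) / s if s > 0 else 0.0
            for x, m, s in zip(vector, means, stds)]
```

The means and standard deviations fitted on the training set would be reused unchanged to normalize connections seen during real-time operation.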
1. SOM Initialization. The SOMs were initialized with the som_lininit function from the SOM Toolbox. This function uses Principal Component Analysis to arrive at the SOM dimensions by calculating the eigenvalues and eigenvectors of the auto-correlation matrix of the training data set. The orientation corresponding to the two largest eigenvalues, which are the directions in which the data set exhibits the most variance, is found. The ratio between the SOM dimensions is chosen based on the ratio of the two largest eigenvalues.
2. Initial Training Phase. The vsom program from the SOM_PAK package was used to train the neurons in the SOM. The number of training iterations in this phase was chosen to be low, on the order of a few thousand. In our experiments, for each network service, we typically had a few thousand samples, and this phase was done so that each sample was shown at most once to the SOM being built. The Gaussian neighborhood function was used, with the initial neighborhood radius set to the smaller of the SOM dimensions. The neighborhood radius decreased linearly to 1 at the end of the training. The learning rate factor α(t) was chosen to be 0.9 and reduced linearly to 0 at the end of training.
3. Final Training Phase. The vsom program was used again in the final training phase, and the number of training iterations was chosen to be high, in the hundreds of thousands. The number of iterations was set to 500 times the product of the lattice dimensions. The Gaussian neighborhood function with a low initial neighborhood radius of 5 and a low learning rate factor of 0.05 were set, and the map was fine-tuned in the Final Training Phase.
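The parameter choices in the two training phases can be collected into a small helper, sketched below. The names `linear_decay` and `training_plan` are hypothetical, and taking the number of initial-phase iterations equal to the number of samples is one reading of "each sample was shown at most once", assumed here for illustration. The sketch computes schedules only and does not call vsom.

```python
def linear_decay(start, end, t, total):
    """Value at iteration t of a quantity moving linearly from start to end."""
    return start + (end - start) * (t / total)

def training_plan(rows, cols, n_samples):
    """Training parameters for a rows x cols SOM, following the text."""
    initial = {
        "iterations": n_samples,      # assumed: one pass over the samples
        "radius_start": min(rows, cols),
        "radius_end": 1,
        "alpha_start": 0.9,
        "alpha_end": 0.0,
    }
    final = {
        "iterations": 500 * rows * cols,
        "radius_start": 5,
        "alpha_start": 0.05,
    }
    return initial, final
```

For a 19x25 lattice, the final phase runs for 500 x 19 x 25 = 237,500 iterations, which is consistent with "hundreds of thousands".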
5. EXPERIMENTAL RESULTS
In this section, we describe our experiments with the SOM models for Domain Name System (DNS) and web (Hyper-Text Transfer Protocol) traffic, and analyze the performance of the models built.
DNS traffic runs on top of both TCP and UDP. Although some DNS connections are observed on top of TCP, typically when two name servers transfer bulk domain information, the bulk of DNS traffic is found to be on top of UDP, and involves simple query-response exchanges of domain information. We collected dump files from our network yielding 8857 sample DNS connections, and a SOM of dimensions 19x25 was built and linearly initialized. The mean and standard deviation values of this data set found by the Normalizer submodule, shown in Table 1, illustrate the traits of DNS traffic found in our network. The mean INTER value of 0.65 and the mean DOC value of 2 seconds indicate that a DNS connection has, on average, 1.3 questions asked during the course of a connection. This is expected, since the bulk of DNS connections involve a single query-response.
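The figure of 1.3 questions follows from multiplying the mean interactivity (questions per second) by the mean connection duration, treating the two means as if they applied to a single typical connection:

$$0.65\ \text{questions/s} \times 2\ \text{s} = 1.3\ \text{questions}$$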
The ability of the SOM-based approach to correlate multiple aspects of a network connection (reported by the six parameters) to decide whether it looks normal or abnormal makes it a powerful technique for anomaly detection. The SOM model we built to characterize SMTP (email) traffic was also successful in detecting a Sendmail buffer overflow attack. The SOM-based approach seems to be particularly well suited to detecting buffer-overflow attacks, as they tend to differ from normal traffic behavior on the six dimensions. However, the ANDSOM module may not detect attacks that resemble normal operational behavior. An intrusion massaged to resemble normal traffic might go unnoticed. Another limitation is that although the behavior exhibited by the bulk of traffic for a network service can be modeled, corner-case behavior occurring infrequently may be classified as an intrusion, giving rise to false positives.
G Vijay Kumar
Lecturer, Dept. of ECE, VR Siddhartha Engg. College, Vijayawada-520 007. M. Tech Computer Science, JNTU, Kakinada, in progress (Proj. Sub.); B. E ECE, Madras University; Inter., MPC, BIE, AP. Membership in professional bodies: MBMSE, AMIETE. Experience: as Programmer/Tester, 16-09-2000 to 15-11-2001 (1 year); as Lecturer, 26-11-2001 to date. Worked as Lecturer in the Department of Electronics and Communication Engineering at Nova Engineering College, and presently working at V R Siddhartha Engineering College, Vijayawada, from 28-10-2002 to date.
Designation: Associate Professor. Department: Computer Science and Engineering. Qualification: M.Tech. Field of specialization: Computer Science and Engineering. Teaching, research, and industrial experience: 6 years, 2 months. Areas of expertise and interest: Network Security and Cryptography, Artificial Intelligence.