The SPID Algorithm - Statistical Protocol IDentification
Identifying which application layer protocol is being used within a network communication session is important when assigning Quality of Service priorities as well as when conducting network security monitoring. Currently most protocol identification is performed through signature matching algorithms that rely on strings or regular expressions as signatures. This project presents a protocol identification scheme called the Statistical Protocol Identification (SPID) algorithm, which reliably identifies the application layer protocol by using statistical measurements of flow data as well as application layer data.
The concept of Protocol Identification is also known as Application Identification, Port Independent Protocol Identification (PIPI), Protocol Discovery, Application Recognition and Traffic Classification.
Overview of the algorithm
SPID performs protocol identification by comparing the protocol model of an observed session to pre-calculated protocol models of known protocols. Protocol models contain a set of attribute fingerprints. Fingerprints are created through frequency analysis of various attributes, such as application layer data or flow features, and are represented as probability distributions. The PoC SPID application uses over 30 attribute meters, which are the functions that provide the distribution measurements for each specific attribute. An example of such an attribute meter is the basic ByteFrequencyMeter, which measures the frequency with which all of the possible 256 bytes occur in the application layer data. Other attribute meters perform much more advanced analysis of various properties in a session, such as measuring the frequency of various request-response combinations (e.g. HTTP behavior, where a ’GET’ request is followed by an ’HTTP’ response or FTP behavior where a ’220’ message is replied to with a ’USER’ command). The SPID algorithm also makes use of flow measurements (that do not require inspection of application layer data), such as packet sizes, packet inter-arrival times and packet order number- and direction combinations. Attribute fingerprints are represented in the form of probability distributions. This means that the data for each fingerprint is represented by two arrays (vectors) of discrete bins: one array of counter bins and one of probability bins. Values of the counter vectors represent the number of times an observation (analyzed packet) has caused the associated attribute meter to trigger that particular index number in the vector. Probability vectors are normalized versions of the counter vectors, with all values in every probability vector summing up to 1.0. A vector length of 256 is used in the PoC implementation; an implementation of the SPID algorithm can, however, use any length for these vectors.
Generation of Protocol Models
For observed sessions, a protocol model is created upon session establishment (e.g. after the TCP three-way handshake), consisting of a set of attribute fingerprints. Every packet with application layer data belonging to a session is called an observation. Each such observation is then fed into the attribute meters, which provide measurements that are stored in the session’s protocol model. Upon receiving such a measurement, the protocol model increments the fingerprint counters accordingly. For illustration, we assume an attribute fingerprint for the ByteFrequencyMeter from the first data packet observed in a HTTP session, i.e. a HTTP GET command. The counters would be incremented to
- 3 for the counter at index 84 (since there are three T’s in ’GET / HTTP/ 1.1’)
- 2 for counters at index 32, 47 and 49 (space, ’/’ and ’1’)
- 1 for counters at index 71, 69, 72, 80 and 46
- 0 for all other counters
All other attribute fingerprints belonging to the same protocol model will also increase their counters based on the sets of indices that are returned from their respective attribute meter. Subsequent packets in the same session will cause the fingerprint counter values to further increment. However, since one design goal of SPID is to keep time complexity low, we want to show in future work that utilizing only the first few packets provides sufficient precision. Protocol models for known protocols are generated from real network packet traces. These traces need to be preclassified, either manually or automatically, in order to be usable as training data for the SPID algorithm. The preclassified training data is converted to protocol model objects (one per protocol) by generating protocol models for each session and merging (i.e. adding) the fingerprints of the same protocol and attribute type. The more sessions are merged together for each protocol, the more reliable the fingerprint will be. As a rule of thumb, we found that 10% of the fingerprints’ vector lengths (i.e. approximately 25) turned out to be a rough measurement of the minimum number of training sessions needed to build a reliable protocol model.
Comparison of Protocol Models
Fingerprints of an observed session are compared to fingerprints of known protocol models by calculating the Kullback- Leibler (K-L) divergence (also known as relative entropy) between the probability distributions of the observed session and each protocol model, ranging from 0 (identical distributions) to 1. The K-L divergence is a value that represents how much extra information is needed to describe the values in the observed session by using a code, which is optimized for the known protocol model instead of using a code optimized for the session protocol model. The best match for an observed session is the attribute fingerprint which yields the smallest K-L divergence.
Protocol models of observed sessions are finally compared to protocol models of known protocols by calculating the KL divergences of the models’ attribute fingerprints. The best protocol match is the one with the smallest average K-L divergence of the underlying attribute fingerprints. A good approach is to assign a threshold value, where only K-L divergence average values below the threshold are considered matches. If none of the known protocol models match, the session is classified as ’unknown’ in order to avoid falsepositives for known models.
Download the proof-of-concept application (as source code or Windows executable)
A report describing the SPID algorithm is available at the web site of .SE (the Swedish Internet Infrastructure Foundation).
http://www.iis.se/docs/The_SPID_Algorithm_-_Statistical_Protocol_IDentification.pdf or http://issuu.com/erikh/docs/the_spid_algorithm
Research paper "Statistical Protocol IDentification with SPID: Preliminary Results" presented at SNCNW '09 in Uppsala:
http://spid.sourceforge.net/sncnw09-hjelmvik_john-CR.pdf or http://www.cse.chalmers.se/~johnwolf/publications/sncnw09-hjelmvik_john-CR.pdf
Presentation of "Statistical Protocol IDentification with SPID: Preliminary Results" from SNCNW'09: