Back-end Algorithm
Input: 5-minute flow file
For each flow record:
* Parse line
* Exclude if it's not TCP or UDP
* Exclude if the source or the destination does not belong to a defined subnet
* Store the source and destination end points in memory
* Store the unidirectional flow in memory (update the existing unidirectional flow if one has already been defined)
For each unidirectional flow in memory:
* Check if a valid mirror unidirectional flow exists:
  * If yes: merge the two flows to generate a bidirectional flow
    * Run each heuristic on the source and destination end points
    * Compute a detection probability using Bayesian inference run on the heuristic outputs
    * Label source and destination end points according to the detection probability (either client or server)
  * If no: this flow is only unidirectional
    * Label source end point as client/scanner and destination end point as invalid
For each end point in memory:
* Compute metrics for this end point
* Output the node to a result file
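
The merge and inference steps above can be sketched as follows. This is only an illustration under stated assumptions, not the back-end's actual code: the names `EndPoint`, `merge_flows`, and `server_probability` are hypothetical, "updating" an existing unidirectional flow is assumed to mean keeping its earliest timestamp, and the Bayesian step is assumed to be a naive-Bayes combination in which each heuristic has a known accuracy (the text does not say how the outputs are combined).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPoint:
    """Hypothetical end point key: IP address, port, and protocol."""
    ip: str
    port: int
    proto: str  # "tcp" or "udp"


def merge_flows(records):
    """Aggregate records into unidirectional flows, then pair mirror flows.

    `records` is an iterable of (src, dst, timestamp) tuples.
    Returns (bidirectional, unidirectional_only).
    """
    unidir = {}
    for src, dst, ts in records:
        key = (src, dst)
        # Assumption: updating a flow keeps its earliest timestamp.
        unidir[key] = min(ts, unidir.get(key, ts))

    bidir, alone, seen = [], [], set()
    for (src, dst), ts in unidir.items():
        if (src, dst) in seen:
            continue  # already merged as the mirror of an earlier flow
        mirror = (dst, src)
        if mirror in unidir:
            seen.add(mirror)
            bidir.append((src, dst, ts, unidir[mirror]))
        else:
            alone.append((src, dst, ts))
    return bidir, alone


def server_probability(votes, accuracy, prior=0.5):
    """Probability that the source is the server (naive-Bayes assumption).

    votes[h] is +1 if heuristic h picks the source, -1 if it picks the
    destination, and 0 if it cannot decide. accuracy[h] is an assumed
    probability, strictly between 0 and 1, that h is right when it decides.
    """
    p_src, p_dst = prior, 1.0 - prior
    for name, vote in votes.items():
        if vote == 0:
            continue  # undecided heuristics carry no evidence
        acc = accuracy[name]
        p_src *= acc if vote > 0 else 1.0 - acc
        p_dst *= 1.0 - acc if vote > 0 else acc
    return p_src / (p_src + p_dst)
```

Flows left in the second list returned by `merge_flows` correspond to the "If no" branch, where the source is labeled client/scanner and the destination invalid.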
Client/Server Detection Heuristics
- H.0 Flow timing. Let t1 and t2 be the timestamps of
the unidirectional flows constituting a bidirectional
flow. The source of the flow with the larger (more
recent) timestamp is likely the server. The difference between t1 and t2 indicates
how likely it is that this heuristic will identify the
correct end point as a server. If the timestamps are
identical, they cannot be used to decide which end
point is the server.
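
A minimal sketch of H.0, assuming each unidirectional flow carries a single numeric timestamp. The function name `h0_flow_timing` is hypothetical, and the returned gap is only the raw time difference: the text does not specify how that difference maps to a probability, so no mapping is attempted here.

```python
def h0_flow_timing(src, t_src, dst, t_dst):
    """Return (likely_server, gap) for H.0, or (None, 0.0) if undecided.

    t_src is the timestamp of the flow sent by src, t_dst the timestamp
    of the flow sent by dst. The sender of the later flow is taken to
    be the server, since servers answer after clients ask.
    """
    if t_src == t_dst:
        return None, 0.0  # identical timestamps: heuristic cannot decide
    server = src if t_src > t_dst else dst
    return server, abs(t_src - t_dst)
```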
- H.1 Port number. Let p1 and p2 be the port numbers
associated with a bidirectional flow. The end point
with the smaller port number is likely the server. If
the port numbers are identical, they cannot be used
to decide which end point is the server.
- H.2 Port number with threshold at 1024. If an end point
has a port number lower than 1024, then it is likely
a server. The value of 1024 corresponds to the limit
under which ports are considered privileged and
designated for well-known services. If both ports
are below 1024, or both are 1024 or above, this heuristic cannot be
used to decide which end point is the server.
- H.3 Port number advertised in /etc/services. If the port
number of an end point is listed in the standard
UNIX file /etc/services, which lists assigned and
registered port numbers, then it is
likely a server. If both port numbers or neither of them are in
/etc/services, this heuristic cannot be used to decide
which end point is the server.
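
The three port-based heuristics (H.1 to H.3) can be sketched as small functions that return the port judged to be on the server side, or `None` when the heuristic cannot decide. The function names are hypothetical. For H.3, the sketch assumes that Python's `socket.getservbyport`, which queries the system services database (typically /etc/services on UNIX), is an acceptable stand-in for parsing the file directly; the `lookup` parameter lets a caller substitute a different source.

```python
import socket


def h1_smaller_port(p1, p2):
    """H.1: the smaller port number is likely the server's."""
    if p1 == p2:
        return None
    return min(p1, p2)


def h2_privileged(p1, p2, threshold=1024):
    """H.2: a port below the threshold is likely the server's."""
    low1, low2 = p1 < threshold, p2 < threshold
    if low1 == low2:
        return None  # both below, or both at or above, the threshold
    return p1 if low1 else p2


def is_registered(port, proto="tcp"):
    """True if the system services database lists this port/protocol."""
    try:
        socket.getservbyport(port, proto)
        return True
    except OSError:
        return False


def h3_registered(p1, p2, proto="tcp", lookup=is_registered):
    """H.3: a port listed in the services database is likely the server's."""
    reg1, reg2 = lookup(p1, proto), lookup(p2, proto)
    if reg1 == reg2:
        return None  # both listed, or neither listed
    return p1 if reg1 else p2
```

Results of `h3_registered` with the default lookup depend on the services database of the machine running the code, so two hosts may disagree on less common ports.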
- H.4 Number of distinct ports related to a given end
point. If two or more different port numbers (in different flows) are associated with an end point, the
end point is likely a server. The number of different port numbers related to an end point provides an
indication of the probability that this heuristic will
correctly identify the server. This heuristic comes
from the fact that ports on the client side are often
randomly selected. Therefore, ports on the client side of a connection are less likely to be used in
other connections than ports on the server side. If both end points are related to the same number of ports, then this heuristic cannot be used to
decide which end point is the server.
- H.5 Number of distinct IP addresses related to a given
end point. This heuristic is identical to the previous
one but counts IP addresses instead of ports.
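
The counting behind H.4 and H.5 can be sketched as follows. This assumes that the ports and IP addresses "related to" an end point are those of its peers across all flows, which is one reading of the text; the names `peer_counts` and `pick_server` are hypothetical.

```python
from collections import defaultdict


def peer_counts(flows):
    """Count distinct peer ports and peer IPs for every end point.

    `flows` is an iterable of (a, b) pairs, where each end point is an
    (ip, port) tuple. Returns two dicts: ports per end point (H.4) and
    IP addresses per end point (H.5).
    """
    ports = defaultdict(set)
    ips = defaultdict(set)
    for a, b in flows:
        ports[a].add(b[1])
        ports[b].add(a[1])
        ips[a].add(b[0])
        ips[b].add(a[0])
    port_n = {ep: len(s) for ep, s in ports.items()}
    ip_n = {ep: len(s) for ep, s in ips.items()}
    return port_n, ip_n


def pick_server(counts, a, b):
    """Return the end point with the larger count, or None on a tie."""
    ca, cb = counts.get(a, 0), counts.get(b, 0)
    if ca == cb:
        return None  # same count: heuristic cannot decide
    return a if ca > cb else b
```

For example, a web server contacted from three client ports on two machines ends up with a port count of 3 and an IP count of 2, while each client end point has a count of 1 for both.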
- H.6 Number of distinct tuples related to a given end
point. This heuristic is identical to the previous
one but counts end points instead of single IP addresses. This heuristic is based on the observation
that each server typically has two or more clients
that use the service. Furthermore, even if only one real user accesses the service (e.g., identified by the
IP address of the user’s machine), the communication will likely require multiple connections and the
client side of the access often uses different port
numbers. Thus, multiple end points will be detected