Hostalias

Frederic Marchal Evgeniy Yakushev

The purpose

It is common for web sites to split the load over several servers by directing their users to servers with different names. In your logs, you may see, for instance, users sending requests to all of these hosts:

csi.gstatic.com
g0.gstatic.com
g1.gstatic.com
g3.gstatic.com
maps.gstatic.com
ssl.gstatic.com
t0.gstatic.com
t1.gstatic.com
t2.gstatic.com
t3.gstatic.com
www.gstatic.com

If the accesses are spread across several servers, the topsites report will be cluttered with many server entries that are, in the end, the same web service.

Sarg offers a mechanism to group those entries and rank them as they deserve in the reports. This is the hostalias file.

How to use it

Sarg's configuration file contains an option named hostalias. Its argument is a file name containing the grouping instructions.

The file contains text lines, each identifying one criterion to be matched against host names and one alias to replace all the matching names in the reports. There can be many criteria. They are tested until a match is found. There is, unfortunately, no guarantee about the order in which the criteria are tested, so avoid criteria that overlap. If no match is found, the original host name is left unchanged.

So, each line identifies a match. The string to match ends at the first space in the line. Anything after that space is the string that replaces the host name in the report. That second part is optional. If no replacement string is provided, the matching host names are replaced by the pattern string used in the hostalias file.
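The splitting rule described above can be illustrated with a short Python sketch. It is not sarg's code: the function name parse_hostalias_line is hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions made here for convenience.

```python
def parse_hostalias_line(line):
    """Split one hostalias line into (criterion, alias).

    Illustration only: this mimics the rule described in the text,
    it is not taken from sarg's source.
    """
    line = line.strip()  # assumption: surrounding whitespace is ignored
    if not line:
        return None  # assumption: blank lines carry no rule
    # The criterion ends at the first space; the rest is the alias.
    criterion, _, alias = line.partition(" ")
    # Without an alias, the criterion itself is shown in the reports.
    return criterion, alias or criterion


print(parse_hostalias_line("*.gstatic.com"))
print(parse_hostalias_line("*.freeav.net antivirus: freeav"))
```

The first call returns the pattern twice because no alias was given; the second keeps the space inside "antivirus: freeav" because only the first space separates the two parts.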

Simple name with one wildcard

The simplest way to match a host name is to use a string with, at most, one wildcard character *.

For instance, the above examples would be grouped as one report entry with a line such as:

*.gstatic.com

As no replacement string is provided, every access to a gstatic.com host will be grouped under the *.gstatic.com entry spelled like this.

The following examples replace the host name by a more meaningful string:

*.freeav.net antivirus: freeav
*.avgate.net antivirus: avgate

In both cases, each site is replaced by the string "antivirus: " followed by the antivirus name. The space in the alias is part of the output string (remember the criterion ends at the first space and anything left is the replacement string).
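To get a feel for which host names a single-wildcard pattern would catch, here is a minimal Python sketch. It only approximates the behaviour described above: the function name matches_wildcard is hypothetical, the comparison is case-sensitive by assumption, and the text does not say whether sarg treats the bare domain gstatic.com as a match for *.gstatic.com (this sketch does not).

```python
def matches_wildcard(pattern, host):
    """Return True if host matches a pattern with at most one '*'."""
    prefix, star, suffix = pattern.partition("*")
    if not star:
        # No wildcard: the whole name must be identical.
        return host == pattern
    # One wildcard: the host must begin with the text before '*' and end
    # with the text after it, without the two parts overlapping.
    return (len(host) >= len(prefix) + len(suffix)
            and host.startswith(prefix)
            and host.endswith(suffix))


hosts = ["g0.gstatic.com", "maps.gstatic.com", "www.example.org"]
print([h for h in hosts if matches_wildcard("*.gstatic.com", h)])
```

Running it prints the two gstatic.com hosts and leaves www.example.org out.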

IP address

IP addresses can be grouped by subnet, written as an address followed by a slash and the prefix length.

IPv4 addresses are supported:

80.190.143.224/27 antivirus: avira

And so are IPv6 addresses:

0::1/128 localhost6

Note: The leading zero in the IPv6 address is the only way I have found to show the IP address on sf.net (a line starting with two colons obviously means something special to this wiki). The correct IPv6 syntax (::1/128) is supported by sarg.
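If you are unsure which addresses a given prefix covers, Python's standard ipaddress module can list the range. This is only a way to check the criterion before putting it in the hostalias file; it says nothing about how sarg itself parses the line, and the two test addresses are arbitrary examples.

```python
import ipaddress

# The IPv4 criterion from the example above.
net4 = ipaddress.ip_network("80.190.143.224/27")
print(net4.num_addresses)   # number of addresses in the subnet
print(net4[0], net4[-1])    # first and last address of the range
print(ipaddress.ip_address("80.190.143.230") in net4)
print(ipaddress.ip_address("80.190.144.1") in net4)

# The IPv6 criterion, written with the correct syntax.
net6 = ipaddress.ip_network("::1/128")
print(ipaddress.ip_address("::1") in net6)
```

A /27 prefix covers 32 addresses, here 80.190.143.224 to 80.190.143.255, while a /128 prefix designates a single IPv6 address.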

Regular expression

If sarg was compiled with pcre, you can also use regular expressions to match hosts.

A regular expression criterion starts with the three characters "re:". The next character is the delimiter: the regular expression runs up to the next occurrence of that character.

For instance:

re:!www\d*\.megavideo\.com! www.megavideo.com

Here, the ! delimits the regular expression. Any character can be used. Just make sure you choose a character that doesn't appear in the regular expression.

Placeholders are allowed:

re:/khm\d*\.google\.(com|be)$/ khm.google.\1

In the above expression, if the top level domain matches either "com" or "be", the matched top level domain replaces the "\1" in the alias string. Therefore, khm5.google.com is replaced with khm.google.com while khm3.google.be is replaced with khm.google.be.
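The substitution can be tried out with a few lines of Python. This is a sketch, not sarg's implementation: Python's re module stands in for pcre (the two agree for this simple pattern), the function name apply_alias is hypothetical, and it assumes the expression may match anywhere in the host name unless anchored.

```python
import re

# Criterion and alias from the hostalias line above (delimiters removed).
pattern = re.compile(r"khm\d*\.google\.(com|be)$")
alias_template = r"khm.google.\1"


def apply_alias(host):
    """Return the alias for host, or host unchanged when nothing matches."""
    match = pattern.search(host)
    if match is None:
        return host
    # expand() substitutes \1 with the text captured by the first group.
    return match.expand(alias_template)


for host in ("khm5.google.com", "khm3.google.be", "www.google.com"):
    print(host, "->", apply_alias(host))
```

The first two hosts come out as khm.google.com and khm.google.be, and www.google.com is left unchanged because the expression does not match it.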

If this regular expression thing looks complicated, find a good tutorial about Perl regular expressions and start by building simple expressions.

This Perl one-liner may not help you if you are lost with regular expressions. But if you ever want to check a regular expression against a real access.log to see whether it matches anything at all, here is how I proceed:

perl -ne 'my @c=split;(my $u=$c[6])=~s!^\w+://!!;$u=~s!/.*!!;print "$u\n" if $u=~m!\.gstatic\.com$!;' access.log

In full, the program is:

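#!/usr/bin/perl
use strict;
use warnings;

# read the log line by line (this is what -n does in the one-liner)
while (<>) {
    # split the columns
    my @cols=split;
    # get the url column
    my $url=$cols[6];
    next unless defined $url;
    # remove the scheme (http://) if any
    $url=~s!^\w+://!!;
    # remove the path from the url
    $url=~s!/.*!!;
    # print it if it matches the regular expression
    print "$url\n" if $url=~m!\.gstatic\.com$!;
}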

The regular expression to check is the one after the print statement. It starts after the m! in the above example.


Related

Discussion: Report only second level domains
Wiki: Report options
Wiki: Types of reports sarg can produce
Wiki: USER and IP options
