Andrew Weiner - 2007-02-07

I have been trying to determine the different IP addresses of the Google cache servers. AWStats treats google.co as a search link and all the numerical addresses as cache. After some trial and error, it seems that the referrer address can be the same for both. For example, I can have a referrer of 216.239.39.100 for both a direct search and a cache result. Here are two actual log entries:

www.ittoolkit.com - - [06/Feb/2007:18:12:51 -0600] "GET / HTTP/1.1" 200 23470 "http://216.239.39.100/search?hl=en&q=ittoolkit&btnG=Google+Search" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; RTAx)"

www.ittoolkit.com - - [06/Feb/2007:18:07:17 -0600] "GET /images/brochure.gif HTTP/1.1" 200 23214 "http://216.239.39.100/search?q=cache:O3q6fpowGaAJ:www.ittoolkit.com/qtools.htm+ittoolkit&hl=en&ct=clnk&cd=2" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322;)"

You can see they are both from http://216.239.39.100

In the current Search_engines.pm file you will find the following:

$Revision: 1.41 $ - $Author: eldy $ - $Date: 2006/11/15 22:30:15 $

'google\.','google',
'216\.239\.(35|37|39|51)\.100','google_cache',
'216\.239\.(35|37|39|51)\.101','google_cache',
'216\.239\.5[0-9]\.104','google_cache',
'64\.233\.1[0-9]{2}\.104','google_cache',
'66\.102\.[1-9]\.104','google_cache',
'66\.249\.93\.104','google_cache',
'72\.14\.2[0-9]{2}\.104','google_cache',

With the above code, both example entries will be logged as 'google_cache' when the first is clearly not from cache.
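To make the problem concrete, here is a minimal Python sketch (not AWStats code) that runs the two referrers from the log entries above through the stock entries. It assumes the entries are tried in order as unanchored, case-insensitive regexes against the referrer host, which is my reading of how the list is used; the function and variable names are made up for illustration, and the dots are escaped so they match literally.

```python
import re
from urllib.parse import urlsplit

# Stock entries quoted above, in file order (pattern, engine id).
STOCK_RULES = [
    (r"google\.", "google"),
    (r"216\.239\.(35|37|39|51)\.100", "google_cache"),
    (r"216\.239\.(35|37|39|51)\.101", "google_cache"),
    (r"216\.239\.5[0-9]\.104", "google_cache"),
    (r"64\.233\.1[0-9]{2}\.104", "google_cache"),
    (r"66\.102\.[1-9]\.104", "google_cache"),
    (r"66\.249\.93\.104", "google_cache"),
    (r"72\.14\.2[0-9]{2}\.104", "google_cache"),
]

# Referrers copied from the two log entries.
SEARCH_REF = "http://216.239.39.100/search?hl=en&q=ittoolkit&btnG=Google+Search"
CACHE_REF = ("http://216.239.39.100/search?q=cache:O3q6fpowGaAJ:"
             "www.ittoolkit.com/qtools.htm+ittoolkit&hl=en&ct=clnk&cd=2")


def classify(referrer, rules):
    """Return the id of the first rule whose pattern matches the referrer host."""
    host = urlsplit(referrer).hostname or ""
    for pattern, engine in rules:
        if re.search(pattern, host, re.IGNORECASE):
            return engine
    return None


print(classify(SEARCH_REF, STOCK_RULES))  # google_cache
print(classify(CACHE_REF, STOCK_RULES))   # google_cache
```

Both referrers share the host 216.239.39.100, so a rule keyed on the address alone cannot tell a direct search from a cache hit.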

What I have done is roll up all Google referrers as 'google', eliminating the 'google_cache' test. I then use the Extras section to break out the types of referrer (direct vs. cache). Here is the code:

NEW GOOGLE DATACENTER CODE BY APW (Includes last octet test)

'64\.233\.(16[1379]|17[19]|18[3579])\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
'66\.102\.([179]|11)\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
'66\.249\.(8[1359]|9[13])\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
'72\.14\.2(0[3579]|1[1579]|2[13]|35|47|5[35])\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
'209\.85\.(129|135|143)\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
'216\.239\.(3[379]|41|5[1579]|63)\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',

Here it is without the last octet being tested:

NEW GOOGLE DATACENTER CODE BY APW (No last octet test)

'64\.233\.(16[1379]|17[19]|18[3579])\.','google',
'66\.102\.([179]|11)\.','google',
'66\.249\.(8[1359]|9[13])\.','google',
'72\.14\.2(0[3579]|1[1579]|2[13]|35|47|5[35])\.','google',
'209\.85\.(129|135|143)\.','google',
'216\.239\.(3[379]|41|5[1579]|63)\.','google',

Here it is consolidated into one line (includes last octet test):

'(64\.233\.(16[1379]|17[19]|18[3579])|66\.102\.([179]|11)|66\.249\.(8[1359]|9[13])|72\.14\.2(0[3579]|1[1579]|2[13]|35|47|5[35])|209\.85\.(129|135|143)|216\.239\.(3[379]|41|5[1579]|63))\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])','google',
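As a sanity check, the consolidated line should accept the same hosts as the six separate lines. The Python sketch below rebuilds both forms from the same pieces and compares them on a few addresses. The sample addresses are picked only to exercise the patterns (I am not claiming they are all real Google hosts), the helper names are hypothetical, and the dots are escaped so they match literally.

```python
import re

# Last-octet alternation shared by every entry.
LAST_OCTET = r"(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49])"

# First three octets of each datacenter range.
PREFIXES = [
    r"64\.233\.(16[1379]|17[19]|18[3579])",
    r"66\.102\.([179]|11)",
    r"66\.249\.(8[1359]|9[13])",
    r"72\.14\.2(0[3579]|1[1579]|2[13]|35|47|5[35])",
    r"209\.85\.(129|135|143)",
    r"216\.239\.(3[379]|41|5[1579]|63)",
]

# Six separate patterns versus one consolidated pattern.
SEPARATE = [re.compile(prefix + r"\." + LAST_OCTET) for prefix in PREFIXES]
COMBINED = re.compile("(" + "|".join(PREFIXES) + r")\." + LAST_OCTET)


def matches_separate(host):
    """True if any of the six per-range patterns matches the host."""
    return any(pattern.search(host) for pattern in SEPARATE)


def matches_combined(host):
    """True if the single consolidated pattern matches the host."""
    return COMBINED.search(host) is not None


for host in ["216.239.39.100", "64.233.161.104", "72.14.203.99",
             "216.239.40.100", "216.239.39.200", "10.0.0.1"]:
    assert matches_separate(host) == matches_combined(host)
    print(host, matches_combined(host))
```

The first three sample hosts match and the last three do not. Note that the patterns are unanchored, so the last-octet test is a prefix test: `1[89]` also accepts an octet such as 181.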

I have also modified the search term field to correctly extract the keywords from the cache referrer:

AWSTATS CANNOT DIFFERENTIATE BETWEEN CACHE AND SEARCH FROM URL (Cache test MUST be first)

'google','(q=cache:[0-9A-Za-z_]{12}:.*?\+|p=|q=|as_p=|as_q=)',
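Here is a rough Python sketch of what the modified search term pattern is meant to do. It assumes the pattern is stripped from the start of each query parameter and that whatever remains is the keyphrase; that is my understanding of the mechanism, not a copy of the AWStats code. The `+` after the cached page address is escaped so it matches a literal plus sign, and the names `STOCK_LIKE` and `extract_keywords` are hypothetical.

```python
import re
from urllib.parse import urlsplit, unquote_plus

# Modified pattern: the cache alternative comes first and swallows
# "cache:<12-char id>:<cached page>+" so only the keywords are left.
MODIFIED = re.compile(r"(q=cache:[0-9A-Za-z_]{12}:.*?\+|p=|q=|as_p=|as_q=)")

# Same pattern without the cache alternative, for comparison.
STOCK_LIKE = re.compile(r"(p=|q=|as_p=|as_q=)")

SEARCH_REF = "http://216.239.39.100/search?hl=en&q=ittoolkit&btnG=Google+Search"
CACHE_REF = ("http://216.239.39.100/search?q=cache:O3q6fpowGaAJ:"
             "www.ittoolkit.com/qtools.htm+ittoolkit&hl=en&ct=clnk&cd=2")


def extract_keywords(referrer, pattern):
    """Return the keyphrase from the first query parameter the pattern matches."""
    query = urlsplit(referrer).query
    for param in query.split("&"):
        match = pattern.match(param)  # anchored at the start of the parameter
        if match:
            return unquote_plus(param[match.end():])
    return None


print(extract_keywords(SEARCH_REF, MODIFIED))    # ittoolkit
print(extract_keywords(CACHE_REF, MODIFIED))     # ittoolkit
print(extract_keywords(CACHE_REF, STOCK_LIKE))   # cache:O3q6fpowGaAJ:www.ittoolkit.com/qtools.htm ittoolkit
```

Without the cache alternative, the cache id and page address end up in the keyphrase, which is the extraction problem described above.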

Here is the Extras section of the configuration file:

Can leave off last octet for greater speed

ExtraSectionName1="Google Breakout by Referrer Type"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="REFERER,(google\.|(64\.233\.(16[1379]|17[19]|18[3579])|66\.102\.([179]|11)|66\.249\.(8[1359]|9[13])|72\.14\.2(0[3579]|1[1579]|2[13]|35|47|5[35])|209\.85\.(129|135|143)|216\.239\.(3[379]|41|5[1579]|63))\.(1[89]|44|8[034]|9[13589]|10[0-7]|115|133|147|16[0-2]|18[49]))"
ExtraSectionFirstColumnTitle1="Standard Search:(q=) Cache Link:(q=cache)"

q=cache: must be first or regex will match q= for all referrers

ExtraSectionFirstColumnValues1="REFERER,(q=cache:|q=)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionStatTypes1=PHL
ExtraSectionAddAverageRow1=0
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=2
MinHitExtra1=1
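The ordering rule for the first-column regex can be seen with a small Python sketch. It assumes the captured group becomes the row label in the Extras table, which is how I understand ExtraSectionFirstColumnValues to work; the function name is hypothetical.

```python
import re

# Alternation in the order used in the config above.
CACHE_FIRST = re.compile(r"(q=cache:|q=)")
# Same alternatives in the wrong order, for comparison.
PLAIN_FIRST = re.compile(r"(q=|q=cache:)")

SEARCH_REF = "http://216.239.39.100/search?hl=en&q=ittoolkit&btnG=Google+Search"
CACHE_REF = ("http://216.239.39.100/search?q=cache:O3q6fpowGaAJ:"
             "www.ittoolkit.com/qtools.htm+ittoolkit&hl=en&ct=clnk&cd=2")


def column_value(referrer, pattern):
    """Return the text captured by the first group, or None if nothing matches."""
    match = pattern.search(referrer)
    return match.group(1) if match else None


print(column_value(SEARCH_REF, CACHE_FIRST))  # q=
print(column_value(CACHE_REF, CACHE_FIRST))   # q=cache:
print(column_value(CACHE_REF, PLAIN_FIRST))   # q=
```

Alternatives are tried left to right at each position, so with `q=` first the cache referrer is counted as a standard search and the table collapses to a single row.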

I will agree that most direct (non-cache) entries will be from google.co, but I like my stats to be as accurate as possible. I suggest that you at least use the modified addresses, as the ones in AWStats are not all-inclusive. Also, the current version of AWStats does not correctly extract the keyphrase from cache referrers; the search term pattern above corrects that. I hope this can be of help to others.

Enjoy!
Andrew

Google Datacenter Address sources link:
http://www.seocritique.com/datacentertool/

Additional background information:
http://www.webmasterworld.com/forum30/34828.htm
http://www.seroundtable.com/archives/004004.html