From: Charles C. <ch...@ru...> - 2005-02-07 08:43:55
|
Here are two new queries that I wrote to analyse search engine activity from the SQL access log. These can be inserted into the AnalyseAccessLog.inc file that I uploaded last week - the insertion point should be just before the final "}". The queries contain a list of DNS names for search engine crawlers/spiders/whatever - but this is based only on what I saw in my sites' accesslog. Any additions are welcome. regards, Charles } elseif ($queryName=="Search Bots") { // This queries for all entries in the SQL access log table that // have a dns name that I know to be a web search engine crawler and // categorises the results into time buckets as per the list below // 0 - 1 minute - 60 // 1 - 1 hour - 3600 = 60 * 60 // 2 - 1 day - 86400 = 60 * 60 * 24 // 3 - 1 week - 604800 = 60 * 60 * 24 * 7 // 4 - 1 month - 2629800 = 60 * 60 * 24 * 365.25 / 12 // 5 - 1 year - 31557600 = 60 * 60 * 24 * 365.25 $now = time(); $query = "SELECT " ."IF($now-time_stamp<60, '0 - last minute', IF($now -time_stamp<3600, '1 - 1 minute to 1 hour', IF($now-time_stamp<86400, '2 - 1 hour to 1 day', IF($now-time_stamp<604800, '3 - 1 day to 1 week', IF($now-time_stamp<2629800, '4 - 1 week to 1 month', IF($now-time_stamp<31557600, '5 - 1 month to 1 year', '6 - more than 1 year')))))) AS 'Time Scale', " ."remote_host AS 'Remote Host', " ."count(*) AS 'Access Count' " ."FROM $accesslog " ."WHERE (remote_host LIKE '%googlebot.com' " ."OR remote_host LIKE '%alexa.com' " ."OR remote_host LIKE '%inktomisearch.com' " ."OR remote_host LIKE '%msnbot.msn.com') " .($whereConditions ? 'AND '.$whereConditions : '') ."GROUP BY 'Time Scale', 'Remote Host'"; } elseif ($queryName=="Search Bot Page Hits") { // This queries for all entries in the SQL access log table that // have a dns name that I know to be a web search engine crawler and // displays the URI that was hit. // If PHPSESSID appears in the URI, just display the URI to the left of this $now = time(); $query = "SELECT " ."IF($now-time_stamp<60, '0 - last minute', IF($now -time_stamp<3600, '1 - 1 minute to 1 hour', IF($now-time_stamp<86400, '2 - 1 hour to 1 day', IF($now-time_stamp<604800, '3 - 1 day to 1 week', IF($now-time_stamp<2629800, '4 - 1 week to 1 month', IF($now-time_stamp<31557600, '5 - 1 month to 1 year', '6 - more than 1 year')))))) AS 'Time Scale', " ."remote_host AS 'Remote Host', " ."IF(instr(request_uri, 'PHPSESS')=0, request_uri, left(request_uri, instr(request_uri, 'PHPSESS')-2)) AS 'Request URI' " ."FROM $accesslog " ."WHERE (remote_host LIKE '%googlebot.com' " ."OR remote_host LIKE '%alexa.com' " ."OR remote_host LIKE '%inktomisearch.com' " ."OR remote_host LIKE '%msnbot.msn.com') " .($whereConditions ? 'AND '.$whereConditions : '') ."ORDER BY time_stamp"; |
From: Reini U. <ru...@x-...> - 2005-02-07 10:39:59
|
Charles Corrigan schrieb: > Here are two new queries that I wrote to analyse search engine activity > from the SQL access log. These can be inserted into the > AnalyseAccessLog.inc file that I uploaded last week - the insertion point > should be just before the final "}". I see now why you wanted to exclude the queries, but I still don't feel comfortable with the seperate .inc file. That would be the first and only case, and there are several file-based methods to list all our plugins (checking for .php, well). it violates our current (unwritten) plugin policies. Also with the name change from mode to queryName. I think we want to keep "mode" for consistency. > The queries contain a list of DNS > names for search engine crawlers/spiders/whatever - but this is based only > on what I saw in my sites' accesslog. Any additions are welcome. > } elseif ($queryName=="Search Bots") { ... > ."FROM $accesslog " > ."WHERE (remote_host LIKE '%googlebot.com' " > ."OR remote_host LIKE '%alexa.com' " > ."OR remote_host LIKE '%inktomisearch.com' " > ."OR remote_host LIKE '%msnbot.msn.com') " > .($whereConditions ? 'AND '.$whereConditions : '') > ."GROUP BY 'Time Scale', 'Remote Host'"; For search bots you can constrcut the hosts string from lib/ExternalReferrer.php > } elseif ($queryName=="Search Bot Page Hits") { > // This queries for all entries in the SQL access log table that > // have a dns name that I know to be a web search engine crawler and > // displays the URI that was hit. > // If PHPSESSID appears in the URI, just display the URI to the left > of this > > $now = time(); > $query = "SELECT " > ."IF($now-time_stamp<60, '0 - last minute', IF($now > -time_stamp<3600, '1 - 1 minute to 1 hour', > IF($now-time_stamp<86400, '2 - 1 hour to 1 day', > IF($now-time_stamp<604800, '3 - 1 day to 1 week', > IF($now-time_stamp<2629800, '4 - 1 week to 1 month', > IF($now-time_stamp<31557600, '5 - 1 month to 1 year', '6 - more All these are untranslated strings. -- Reini Urban http://xarch.tu-graz.ac.at/home/rurban/ |
From: Charles C. <ch...@ru...> - 2005-02-07 11:22:13
|
On Mon, February 7, 2005 18:39, Reini Urban said: > Charles Corrigan schrieb: >> Here are two new queries that I wrote to analyse search engine activity >> from the SQL access log. These can be inserted into the >> AnalyseAccessLog.inc file that I uploaded last week - the insertion >> point >> should be just before the final "}". > > I see now why you wanted to exclude the queries, but I still don't feel > comfortable with the seperate .inc file. The main reason that I wanted to put the queries in a separate file is that different database engines have different dialects of SQL and I know that some queries that I have written will only work with MySQL. The secondary reason for a separate file was to document the query file's API simply and completely enough to allow non-PhpWiki or even non-PHP programmers to alter the existing queries to make them work with their database engine or to add new queries. Can you suggest an alternative mechanism to achieve what I am getting at? > That would be the first and only case, and there are several file-based > methods to list all our plugins (checking for .php, well). > it violates our current (unwritten) plugin policies. > > Also with the name change from mode to queryName. > I think we want to keep "mode" for consistency. I thought long and hard about that before making the change as it really is a change in API. Again, to help people that are not familiar with programming PHP, I wanted a) the API name to match the parameter name and b) the API name to make sense > >> The queries contain a list of DNS >> names for search engine crawlers/spiders/whatever - but this is based >> only >> on what I saw in my sites' accesslog. Any additions are welcome. > >> } elseif ($queryName=="Search Bots") { > ... >> ."FROM $accesslog " >> ."WHERE (remote_host LIKE '%googlebot.com' " >> ."OR remote_host LIKE '%alexa.com' " >> ."OR remote_host LIKE '%inktomisearch.com' " >> ."OR remote_host LIKE '%msnbot.msn.com') " >> .($whereConditions ? 'AND '.$whereConditions : '') >> ."GROUP BY 'Time Scale', 'Remote Host'"; > > For search bots you can constrcut the hosts string from > lib/ExternalReferrer.php I will look into this, thanks. >> $now = time(); >> $query = "SELECT " >> ."IF($now-time_stamp<60, '0 - last minute', IF($now >> -time_stamp<3600, '1 - 1 minute to 1 hour', >> IF($now-time_stamp<86400, '2 - 1 hour to 1 day', >> IF($now-time_stamp<604800, '3 - 1 day to 1 week', >> IF($now-time_stamp<2629800, '4 - 1 week to 1 month', >> IF($now-time_stamp<31557600, '5 - 1 month to 1 year', '6 - more > > All these are untranslated strings. Good point, I will see what I can do - to get it "right" may require some support in the main php file. regards, Charles |
From: Reini U. <ru...@x-...> - 2005-02-07 12:13:38
|
Charles Corrigan schrieb: > On Mon, February 7, 2005 18:39, Reini Urban said: >>Charles Corrigan schrieb: >> >>>Here are two new queries that I wrote to analyse search engine activity >>>from the SQL access log. These can be inserted into the >>>AnalyseAccessLog.inc file that I uploaded last week - the insertion >>>point should be just before the final "}". >> >>I see now why you wanted to exclude the queries, but I still don't feel >>comfortable with the seperate .inc file. > > The main reason that I wanted to put the queries in a separate file is > that different database engines have different dialects of SQL and I know > that some queries that I have written will only work with MySQL. I see. Why not call the plugin AnalyseAccessLogSql then? The inc should be AnalyseAccessLogMySql.inc then. The AccessLog class has methods to support (read/write) both logfiles and sql logs. This plugin only deals with the fast SQL case. > The secondary reason for a separate file was to document the query file's > API simply and completely enough to allow non-PhpWiki or even non-PHP > programmers to alter the existing queries to make them work with their > database engine or to add new queries. > > Can you suggest an alternative mechanism to achieve what I am getting at? I will think about it. >>That would be the first and only case, and there are several file-based >>methods to list all our plugins (checking for .php, well). >>it violates our current (unwritten) plugin policies. >> >>Also with the name change from mode to queryName. >>I think we want to keep "mode" for consistency. > > I thought long and hard about that before making the change as it really > is a change in API. Again, to help people that are not familiar with > programming PHP, I wanted > a) the API name to match the parameter name and > b) the API name to make sense queryName is too low level. Too SQL specific. All these modes can also be supported by logfile analysis. (the file based backend) The plugin API has to deal with users and should be consistent. >>>The queries contain a list of DNS >>>names for search engine crawlers/spiders/whatever - but this is based >>>only >>>on what I saw in my sites' accesslog. Any additions are welcome. >>> $now = time(); >>> $query = "SELECT " >>> ."IF($now-time_stamp<60, '0 - last minute', IF($now >>>-time_stamp<3600, '1 - 1 minute to 1 hour', >>>IF($now-time_stamp<86400, '2 - 1 hour to 1 day', >>>IF($now-time_stamp<604800, '3 - 1 day to 1 week', >>>IF($now-time_stamp<2629800, '4 - 1 week to 1 month', >>>IF($now-time_stamp<31557600, '5 - 1 month to 1 year', '6 - more >> >>All these are untranslated strings. > > Good point, I will see what I can do - to get it "right" may require some > support in the main php file. Our current xgettext (automatic message string extractor) doesn't look at .inc files. So I would use php vars and define them in the plugin php file. _("1 minute to 1 hour") , _("1 hour to 1 day"), ... -- Reini Urban http://xarch.tu-graz.ac.at/home/rurban/ |
From: Charles C. <ch...@ru...> - 2005-02-07 15:54:57
|
>> The main reason that I wanted to put the queries in a separate file is >> that different database engines have different dialects of SQL and I know >> that some queries that I have written will only work with MySQL. > > I see. Why not call the plugin AnalyseAccessLogSql then? > The inc should be AnalyseAccessLogMySql.inc then. > > The AccessLog class has methods to support (read/write) both logfiles > and sql logs. This plugin only deals with the fast SQL case. OK, this would make sense - maybe with a shorter name? I looked at Request_AccessLog & Request_AccesslogEntry in Request.php and chose not to follow that API. For the actual transactional part of the system, I agree totally that the backend should be as transparent as possible. But for (added-value) reporting why limit to the least common denominator? Unless someone wants to step up to the plate and deliver a query engine against the access log file... >>>Also with the name change from mode to queryName. >>>I think we want to keep "mode" for consistency. >> >> I thought long and hard about that before making the change as it really >> is a change in API. Again, to help people that are not familiar with >> programming PHP, I wanted >> a) the API name to match the parameter name and >> b) the API name to make sense > > queryName is too low level. Too SQL specific. The queryName is designed to be just that, a name. mode has the implication of being built-in, whereas queryName does give the sense of it just being an arbitrary name. And if the plugin is renamed to AnalyseAccessLogSql it should be even less of an issue Regards, Charles |
From: Charles C. <ch...@ru...> - 2005-02-08 15:51:46
Attachments:
AccessLogSQL.php
AccessLogSQLMySQL.inc
|
OK, attached are two files that I hope are going in the right direction. Please review but do not commit to CVS yet as they are untested and an API change is inadequately documented. AccessLogSQL.php replaces AnalyseAccessLog.php and AccessLogSQLMySQL.inc replaces AnalyseAccessLog.php I will revise the wiki page that I provided earlier once the code changes are complete Regards, Charles |
From: Charles C. <ch...@ru...> - 2005-02-10 13:29:39
|
On 07 February 2005 18:40, Reini Urban [mailto:ru...@x-...] wrote >> ."WHERE (remote_host LIKE '%googlebot.com' " >> ."OR remote_host LIKE '%alexa.com' " >> ."OR remote_host LIKE '%inktomisearch.com' " >> ."OR remote_host LIKE '%msnbot.msn.com') " > >For search bots you can constrcut the hosts string from >lib/ExternalReferrer.php I took a look - this handles the URL for main site of the search engine but not necessarily the crawler engine URL. For example: google.com vs googlebot.com. Also, some of these sites use hand built indices rather than crawlers to automatically generate the raw material. I could add a new column to the array in ExternalReferrer.php for the crawler URLs but I only have the 4 crawlers above so far... Regards, Charles |