#38 Query string normalization: unescape and utf8 detection

closed
None
6
2014-09-13
2003-11-24
Che, Dong
No

It seems that query strings from both known and unknown
search engines need a query string normalizer. The
following functions could be added to it:
1 remove the Google cache: and related: commands:
$param =~ s/^(cache|related):[^+]+//;

2 unescape \xFF:

change \xc4\xbe\xd7\xd3\xc3\xc0 into %c4%be%d7%d3%c3%c0

$param =~ s/\\x([0-9a-f]{2})/%$1/gi;

3 URL decode with uri_unescape() from URI::Escape:
$param = uri_unescape($param);
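
Steps 2 and 3 can be tried together in a small standalone script. This is only a sketch: the helper name unescape_param is made up for the illustration, and the percent-decoding is written out as a plain substitution so that the script runs without extra modules.

```perl
use strict;
use warnings;

# Hypothetical helper name, used only for this illustration.
sub unescape_param {
    my ($param) = @_;
    # Step 2: rewrite a literal backslash-x escape (e.g. \xc4) as %c4.
    $param =~ s/\\x([0-9a-fA-F]{2})/%$1/g;
    # Step 3: replace each %HH with the byte it encodes. uri_unescape()
    # from URI::Escape performs the same substitution.
    $param =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    return $param;
}

# Mixed input: two backslash-x escapes followed by two %HH escapes.
my $bytes = unescape_param('\xc4\xbe%d7%d3');
print length($bytes), "\n";    # 4 raw bytes: c4 be d7 d3
```

Doing step 2 first means a single decoding pass handles both notations.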

4 UTF-8 detection:
if ( $string =~
m/^([\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$/
)
{
    # decode() and encode() come from the Encode module
    $param = decode("utf-8", $string);
    # re-encode the decoded characters, not the original bytes
    $param = encode($encoding, $param);
}
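
A self-contained sketch of step 4 follows. The helper name recode_if_utf8 and the choice of euc-cn as the target encoding are assumptions made for the example. The pattern is the one from this request, with the group made non-capturing and the end anchored with \z.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte pattern from this request (group made non-capturing, end anchored).
my $UTF8_RE = qr/^(?:[\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*\z/;

# Hypothetical helper: $encoding stands for the charset of the report page.
sub recode_if_utf8 {
    my ($string, $encoding) = @_;
    # Bytes that do not look like UTF-8 are returned untouched.
    return $string unless $string =~ $UTF8_RE;
    my $chars = decode('utf-8', $string);    # bytes -> characters
    return encode($encoding, $chars);        # characters -> target bytes
}

# UTF-8 input is converted; input already in the legacy charset is kept.
my $from_utf8 = recode_if_utf8("\xe4\xb8\xad", 'euc-cn');
my $legacy    = recode_if_utf8("\xd6\xd0",     'euc-cn');
print $from_utf8 eq $legacy ? "same bytes\n" : "different bytes\n";
```

A pattern check like this is a heuristic: a short byte string in a legacy charset can also happen to be valid UTF-8, for example the pair \xc4\xbe.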

5 replace spaces with '+'

first convert "+" and ";" back to spaces

$param =~ s/[+;]+/ /g;
$param =~ s/\s+/+/g;

6 trim spaces:
$param =~ s/^ +//;
$param =~ s/ +$//;
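
Steps 1, 5 and 6 could be combined roughly as below. The helper name tidy_param is hypothetical. The sketch also makes two assumptions: both "+" and ";" are treated as separators, as the description of step 5 says, and trimming is done before the spaces are turned back into '+', so that step 6 still has something to remove.

```perl
use strict;
use warnings;

# Hypothetical helper combining steps 1, 5 and 6; not AWStats code.
sub tidy_param {
    my ($param) = @_;
    # Step 1: drop a leading cache: or related: operator and its argument.
    $param =~ s/^(cache|related):[^+]+//;
    # Step 5, first half: "+" and ";" separators become spaces.
    $param =~ s/[+;]+/ /g;
    # Step 6: trim leading and trailing spaces (assumed to run here).
    $param =~ s/^ +//;
    $param =~ s/ +$//;
    # Step 5, second half: each run of whitespace becomes a single '+'.
    $param =~ s/\s+/+/g;
    return $param;
}

print tidy_param('cache:example.com/page+perl;;regex++'), "\n";    # perl+regex
```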

Discussion

  • Che, Dong
    2004-01-22

    Logged In: YES
    user_id=5008

    I think this conversion should be done during log processing.
    Suggestion:
    move this conversion from decodeutfkeys.pm into awstats.pl, line
    6538:

    convert \xf1\xbc\xcd\xac\xba\xa3 ==> %f1%bc%cd%ac%ba%a3

    $param =~ s/\\x(\w\w)/%$1/gi;