In the Retriever class (source file Retriever.cc), the
method Need2Get has a small bug that might cause the
same URL to be requested more than once. The original
code is:
int Retriever::Need2Get(char *u)
{
static String url;
url = u;
return !visited.Exists(url);
}
However, this code causes the following two URLs:
http://www.oso.com/news/issues/hottopic/school_violence_expand.html
and
http://www.oso.com/news/issues/hottopic//school_violence_expand.html
to be considered two different URLs (the only difference
is the double slash ['//'] after 'hottopic').
This is a real case: it occurs when indexing
http://www.oso.com/news/issues/hottopic/school_violence.html
which contains both of these links.
The solution is to use the url parser to remove the
double slash (since it normalizes the path):
int
Retriever::Need2Get(char *u)
{
static URL url;
url.parse(u);
return !visited.Exists(url.get());
}
(WARNING: I haven't had time to actually test it yet,
but it should work)
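
For anyone who wants to try the idea outside of ht://Dig, here is a minimal standalone sketch of the same approach: normalize the URL first, then look it up in the visited set. It does not use the real URL or Retriever classes. The names collapse_slashes, mark_visited and need2get are hypothetical, and collapsing repeated slashes is only a stand-in for whatever normalization URL::parse actually performs.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper (NOT part of ht://Dig): collapse runs of '/' in the
// path part of a URL. The "scheme://" prefix is copied unchanged, and
// nothing after the first '?' or '#' is modified.
std::string collapse_slashes(const std::string &url)
{
    std::string out;
    std::string::size_type start = 0;
    std::string::size_type sep = url.find("://");
    if (sep != std::string::npos) {
        start = sep + 3;                 // keep e.g. "http://" as-is
        out = url.substr(0, start);
    }

    bool in_path = true;
    for (std::string::size_type i = start; i < url.size(); ++i) {
        char c = url[i];
        if (c == '?' || c == '#')
            in_path = false;             // leave query/fragment alone
        // Skip a '/' that directly follows another '/' inside the path.
        if (in_path && c == '/' && i > start && out[out.size() - 1] == '/')
            continue;
        out += c;
    }
    return out;
}

// Stand-in for the Retriever's "visited" dictionary (assumption: a plain
// set of normalized URL strings is enough for this illustration).
static std::set<std::string> visited;

// Record a URL as visited, in normalized form.
void mark_visited(const std::string &u)
{
    visited.insert(collapse_slashes(u));
}

// Same role as Retriever::Need2Get: true if the URL still has to be fetched.
bool need2get(const std::string &u)
{
    return visited.count(collapse_slashes(u)) == 0;
}
```

With this sketch, once the single-slash URL from the report has been passed to mark_visited, need2get returns false for the double-slash variant, because both forms normalize to the same string.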
Logged In: YES
user_id=21420
Done in the 3.2 code, but certainly a good fix.
Thanks for bringing this up!