#30 Bug in Need2Get method (Retriever.cc)

resolved
closed-fixed
htdig (103)
5
2001-03-17
2001-03-14
riv
No

In the Retriever class (source file Retriever.cc), the
method Need2Get has a small bug that can cause the
same URL to be requested more than once. The original
code is:

int Retriever::Need2Get(char *u)
{
    static String url;
    url = u;

    return !visited.Exists(url);
}

However, this code treats the following two URLs as
different:
http://www.oso.com/news/issues/hottopic/school_violence_expand.html
and
http://www.oso.com/news/issues/hottopic//school_violence_expand.html

The only difference is the double slash ('//') after
'hottopic'. This is a real case, found when indexing
http://www.oso.com/news/issues/hottopic/school_violence.html
which contains both links.

The solution is to use the url parser to remove the
double slash (since it normalizes the path):
int
Retriever::Need2Get(char *u)
{
    static URL url;
    url.parse(u);

    return !visited.Exists(url.get());
}

(WARNING: I haven't had time to actually test it yet,
but it should work.)
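
To illustrate the idea outside of ht://Dig, here is a minimal
standalone sketch. It does not use the real URL or Retriever
classes: collapse_slashes and VisitedSet are hypothetical names
invented for this example, and the sketch assumes the server treats
'/' and '//' in the path as equivalent. Note that the check only
helps if the visited set is also filled with normalized URLs, which
the sketch does by normalizing on both insert and lookup.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of ht://Dig): collapse runs of '/' in the
// host/path part of a URL. The "://" after the scheme is preserved, and
// anything after '?' or '#' is copied unchanged.
std::string collapse_slashes(const std::string &url)
{
    std::string out;
    std::string::size_type start = 0;
    std::string::size_type sep = url.find("://");
    if (sep != std::string::npos) {
        start = sep + 3;
        out.assign(url, 0, start);   // copy "scheme://" verbatim
    }

    bool in_path = true;
    for (std::string::size_type i = start; i < url.size(); ++i) {
        char c = url[i];
        if (c == '?' || c == '#')
            in_path = false;         // stop normalizing at query/fragment
        if (in_path && c == '/' && out.size() > start && out.back() == '/')
            continue;                // skip a repeated slash
        out.push_back(c);
    }
    return out;
}

// Hypothetical stand-in for the retriever's "visited" table.
class VisitedSet {
public:
    // Record a URL as visited, keyed by its normalized form.
    void mark(const std::string &u) { seen_.insert(collapse_slashes(u)); }

    // True if the URL (after normalization) has not been seen yet.
    bool need2get(const std::string &u) const
    {
        return seen_.count(collapse_slashes(u)) == 0;
    }

private:
    std::unordered_set<std::string> seen_;
};

// Usage: the two URLs from this report now map to the same key.
void demo()
{
    VisitedSet visited;
    visited.mark(
        "http://www.oso.com/news/issues/hottopic/school_violence_expand.html");
    assert(!visited.need2get(
        "http://www.oso.com/news/issues/hottopic//school_violence_expand.html"));
    assert(visited.need2get(
        "http://www.oso.com/news/issues/hottopic/school_violence.html"));
}
```

The real URL parser may normalize more than this sketch does, so
treat it only as an illustration of why comparing normalized
forms avoids the duplicate request.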

Discussion

  • Geoff Hutchison - 2001-03-17
    • labels: --> htdig
    • milestone: --> resolved
    • assigned_to: nobody --> grdetil
    • status: open --> open-fixed
     
  • Geoff Hutchison - 2001-03-17

    Done in the 3.2 code, but certainly a good fix.

    Thanks for bringing this up!

     
  • Geoff Hutchison - 2001-03-17
    • status: open-fixed --> closed-fixed
     