From: Gilles D. <gr...@sc...> - 2002-09-21 02:45:29
Hi again, folks. Another bug I discovered in htdig while I was
experimenting with different approaches to indexing the Geocrawler
archives was that its removal of double slashes in URL::normalizePath()
may cause problems. Here's the code in question:
    //
    // Furthermore, get rid of "//". This could also cause loops
    //
    while ((i = _path.indexOf("//")) >= 0 && i < pathend)
    {
        String newPath;
        newPath << _path.sub(0, i).get();
        newPath << _path.sub(i + 1).get();
        _path = newPath;
        pathend = _path.indexOf('?');
        if (pathend < 0)
            pathend = _path.length();
    }
The problem with this is that it assumes the path refers to a standard
hierarchical filesystem, where a null path component is treated the same
as a ".". That assumption can break down when the path is processed by a
script rather than by the filesystem. (There was a rant about a similar
bug in URL handling in Office XP in Woody's Office Watch some time ago,
just before Office XP's release.) The easy fix I can think of would be
to prefix the while loop above with this:
if (config.Boolean("remove_double_slash", 1))
and set that attribute to true by default in htcommon/defaults.cc.
Setting it to false in your htdig.conf would turn off this feature when
it causes problems. It's still a good feature to have in most cases, for
normalizing conventional filesystem paths. The other approach I thought
of, which would take more work, would be to have a StringList attribute
called remove_url_path_part or something of the sort, which would define
all the substrings to be stripped out. The complication is that not
everything URL::normalizePath() strips out is a simple substring. I'm
leaning toward the simpler fix. Thoughts?
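For concreteness, the guarded loop in URL::normalizePath() would look something like this (attribute name tentative; the default of 1 preserves the current behaviour):

```cpp
//
// Furthermore, get rid of "//". This could also cause loops,
// but only do it if remove_double_slash is set (the default).
//
if (config.Boolean("remove_double_slash", 1))
{
    while ((i = _path.indexOf("//")) >= 0 && i < pathend)
    {
        String newPath;
        newPath << _path.sub(0, i).get();
        newPath << _path.sub(i + 1).get();
        _path = newPath;
        pathend = _path.indexOf('?');
        if (pathend < 0)
            pathend = _path.length();
    }
}
```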
The way I stumbled into this problem with Geocrawler was when I used a
start_url like
http://www.geocrawler.com/archives/3/8822/2002/8/
to index the Aug 2002 htdig-general archives. Normally, the path component
after the month is a starting document number for a page of 50 messages in
reverse chronological order. E.g., for August,
http://www.geocrawler.com/archives/3/8822/2002/8/0/
is the last 50 messages of the month,
http://www.geocrawler.com/archives/3/8822/2002/8/50/
is the next 50, and so on, leading to URLs for the individual messages
like this:
http://www.geocrawler.com/archives/3/8822/2002/8/100/9269993/
But if you omit the starting document number, as in the first URL above,
Geocrawler generates URLs for messages with a null starting number, as
http://www.geocrawler.com/archives/3/8822/2002/8//9269993/
which work fine until htdig removes one of the slashes, at which point
Geocrawler tries a non-existent starting number rather than recognising
the last number as a message ID (because its position in the path is
wrong), so you'd end up indexing a lot of "No Results Found" pages.
The workaround in this case was easy enough - I just used a starting
number of 0/ at the end of the start_url, and all was good. I also
used url_rewrite_rules (and later an external converter script which
also had other "cleanups") to normalize the starting number in all
message URLs to 0, so that you wouldn't get the same message indexed
under multiple start numbers. However, not all such situations will
have such an easy workaround.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)