Re: [htdig] htdig/case_sensitive


On Thu, 06 Jun 2002 16:56:52 -0400 (EDT) Geoff Hutchison <ghu...@ws...> wrote:
> On Tue, 4 Jun 2002, c.j.bol wrote:
> 
> > The number of visits grows from about 5000 to ~25000.
> > The solution would be case_sensitive=false, however then the
> > Unix-servers will be incomplete.
> 
> Truth be told, do you actually have pages on the UNIX server that are case
> sensitive? Do you really have Index.html and index.html in the same
> directory? How many documents do you have with mixed case?

I investigated 3 Unix servers (there are a lot more) and found 8651 URLs
with a mix of upper- and lowercase characters. All these URLs will be
ignored if case_sensitive=false.

> 
> In general, I'd say "probably pretty few or none." But it's something you
> should check. (IMHO, it's bad site design, but I digress.)
> 
> > Perhaps I don't understand the entire stuff with 'case_sensitive' but
> > would it not be a solution if one could set case_sensitive=false, and every
> > website was visited with a non-converted URL (as found in a document)
> > and only the URL-comparison to prevent multiple access for the same page
> > was done in lowercase? 
> 
> The problem with a case-insensitive system is how you treat multiple
> "identical" URLs. Which one is the "canonical" URL? It's not necessarily
> the first reference you see to a document, which is when the URL is placed
> in the database. So ht://Dig makes the all-lowercase URL the
> "canonical" URL, which is pretty much the standard on the web
> anyway. (Again, I'd ask--how many URLs do you have with mixed case?)
> 

If, with case_sensitive=false, only the comparison were done in lowercase,
and each page were fetched the first time with the URL exactly as it was
found in some page, you would miss only the pages that really exist under
multiple names differing only in case. All the other ones would be treated
correctly, so you would not miss any page with uppercase characters in its
name.
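
To make the idea concrete, here is a minimal sketch in Python. It is not
ht://Dig code: the function name plan_fetches and the example hosts are
made up for illustration. It only shows the behaviour I have in mind:
compare in lowercase, but fetch with the spelling as first found.

```python
def plan_fetches(found_urls):
    """Return the URLs to fetch, in discovery order, spelled as first found.

    Hypothetical helper, not part of ht://Dig.
    """
    seen = set()    # lowercase keys of URLs already scheduled
    to_fetch = []   # original spellings, one per lowercase key
    for url in found_urls:
        key = url.lower()       # lowercase is used for comparison only
        if key in seen:
            continue            # same page up to case: do not visit again
        seen.add(key)
        to_fetch.append(url)    # keep the URL exactly as it appeared
    return to_fetch


urls = [
    "http://unix.example/Docs/ReadMe.html",
    "http://win.example/INDEX.HTML",
    "http://win.example/index.html",         # case variant, skipped
    "http://unix.example/Docs/ReadMe.html",  # exact duplicate, skipped
]
print(plan_fetches(urls))
```

The mixed-case Unix URL is requested unchanged, so the server finds it,
while the two spellings of the Windows page count as one visit. Only a
Unix page that really exists twice under case-different names would lose
its second copy.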

KeesBol