From: Gilles D. <gr...@sc...> - 2002-06-07 14:29:52
According to Geoff Hutchison:
> > I investigated 3 unixservers (there are a lot more) and found 8651 URL's
> > with a mix of upper/lowercase characters. All these URL's will be
> > ignored if case_sensitive=false.
>
> No, they won't be ignored, but the resulting URL in the query results
> (which will be lowercased) may not work.

Actually, the problem doesn't just occur in search results, but also
when htdig goes to fetch the URL. The URLs are lowercased in
URL::normalizePath, so it happens before the document is fetched by
htdig.

When the case_sensitive attribute was first added in around 3.1.0b3,
it only affected matching of names in robots.txt. It didn't affect
the test for whether the URL was already "visited", i.e. fetched or
queued, and didn't cause the URL to be set to lowercase. In 3.1.4,
this behaviour was changed. I can't seem to find the discussions that
led to the decision to map URLs to lowercase, rather than just to use
case-insensitive comparisons, but I seem to recall it had to do with
normalizing the case of URLs, and this was seen at the time as a good
thing.

> > some page, you will miss only the pages that realy exist in multiple,
> > only case-different, names, but all the other ones are treated
> > correctly.
> > So then you won't miss any page with uppercases in it's name.
>
> When indexing, ht://Dig will not "miss" any pages with case_sensitive:
> false, it's simply a question of whether an all-lowercase URL will work
> to retrieve a mixed-case URL.

Yes, but with most UNIX-based web servers, URLs are case sensitive, and
mapping URLs to lowercase is likely to cause them to fail to be fetched.
Recent versions of Apache have modules to remap URLs so that you
essentially get a case-insensitive server, but I don't know how commonly
this is implemented on servers. For the time being, it's still a bad
idea to use case_sensitive: false to index a case-sensitive server.

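To make the distinction concrete, here is a rough sketch of the two
approaches: lowercasing the URL itself before fetching, versus only
comparing case-insensitively when testing whether a URL was already
visited. This is not the htdig code (which is C++); the function names,
the VisitedSet class and the example.com URL are all made up for
illustration, and the "server" is just a set of exact paths standing in
for a case-sensitive UNIX web server.

```python
# Illustrative sketch only; none of these names come from ht://Dig.

# Hypothetical case-sensitive server: only this exact URL exists.
SERVER_URLS = {"http://unix.example.com/Docs/ReadMe.html"}


def fetch_ok(url: str) -> bool:
    """Pretend fetch: succeeds only on an exact, case-sensitive match."""
    return url in SERVER_URLS


def lowercase_url(url: str) -> str:
    """Approach 1: map the whole URL to lowercase before it is fetched."""
    return url.lower()


class VisitedSet:
    """Approach 2: compare case-insensitively, but keep the URL as found."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def add(self, url: str) -> bool:
        """Return True if the URL is new, False if a case variant was seen."""
        key = url.lower()          # lowercase is used only as a lookup key
        if key in self._seen:
            return False
        self._seen[key] = url      # original spelling is kept for fetching
        return True

    def urls(self) -> list[str]:
        """URLs to fetch, in their original mixed-case spelling."""
        return list(self._seen.values())


found = "http://unix.example.com/Docs/ReadMe.html"

# Approach 1: the lowercased URL no longer matches the real path.
lowered_fetch = fetch_ok(lowercase_url(found))

# Approach 2: duplicates differing only in case are skipped,
# yet the URL that gets fetched is still the one the server knows.
visited = VisitedSet()
first = visited.add(found)
second = visited.add(found.upper())
kept_fetch = all(fetch_ok(u) for u in visited.urls())
```

With the second approach a case-variant link is treated as already
visited, but the fetch still uses the spelling that the case-sensitive
server actually recognizes.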
> But in any case, you can also index the UNIX servers and Windows servers
> separately with two different config files (i.e. one with
> case_sensitive: false and one with true). Use these for indexing and
> re-indexing. Then copy one of them to a database you'll use for
> searching and use htmerge to merge the other database into this new one.
> You'll have all the servers with correct URLs. (It's not quite as nice
> as having the per-server case_sensitive attribute, but it'll work.)
>
> <http://www.htdig.org/FAQ.html#q4.4>
> <http://www.htdig.org/FAQ.html#q4.5>
> <http://www.htdig.org/htmerge.html>

Yes, this is a reasonable solution for 3.1.6. (Older htmerge versions
had problems with merging the wordlists correctly.)

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)