From: Joe R. J. <jj...@cl...> - 2002-03-11 23:59:41
On Mon, 11 Mar 2002, Gilles Detillieux wrote:

> Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: Geoff Hutchison <ghu...@ws...>,
>     htd...@li...
> Subject: Re: [htdig] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote:
> > > > Unfortunately htdig removes the space. and looks for "filename.html" and
> > > > reports:
> > > >
> > > > Not found: http://domain.com/some/path/filename.html Ref:
> > > > http://domain.com/some/path/file.html
> > >
> > > Joe, I think you should understand that this isn't much help as a bug
> > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the
> > > space seem to "disappear?" Is it when it first encounters the link
> > > (parser error), as it normalizes and accepts/rejects the URL (retriever
> > > or URL parser error) or as it tries to fetch it?
> > >
> > > A bit more feedback would go a long way towards debugging this.
> >
> > Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one
> > file:
> > ----------------------------------8<-------------------------------
> > 0:0:0:http://domain.com/Path/To/: Trying local files
> > tried local file /domain.com/Path/To/index.html
> > tried local file /domain.com/Path/To/index.shtml
> > found existing file /domain.com/Path/To/index.htm
> > Read 5785 from document
> > Read a total of 5785 bytes
> > Tag: <html>, matched -1
> > Tag: <head>, matched -1
> > Tag: <title>, matched 0
> > word: Handouts@7
> > Tag: </title>, matched 1
> > title: Handouts
> > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> > word: Basic@696
> > word: UNIX@698
> > word: Commands@700
> > Tag: </a>, matched 3
> > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> > Commands)
> > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > ----------------------------------8<-------------------------------
> > ...
> > ----------------------------------8<-------------------------------
> > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > Local retrieval failed, trying HTTP
> > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> > User-Agent: htdig/3.1.6 (Se...@do...)
> > Referer: http://domain.com/Path/To/
> > Host: domain.com
> >
> > Header line: HTTP/1.1 404 Not Found
> > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> > ----------------------------------8<-------------------------------
> >
> > And it reports:
> > ----------------------------------8<-------------------------------
> > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> > ----------------------------------8<-------------------------------
>
> What most browsers do with unencoded spaces within URLs is a violation of
> RFC 1738 and RFC 2396. htdig does the correct thing, if not what some
> users would prefer it did. You can of course patch the URL class to leave
> the spaces in there, in violation of the standard, to conform with the
> incorrect behaviour of most browsers and, apparently, some really bad
> HTML code generators. That would save you from having to fix all the bad
> HTML code you're indexing. Spaces within URLs should always always be
> encoded as %20.
>
> See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
>
> My recommendation, if you have a choice, is to avoid spaces in filenames
> altogether, because they cause all sorts of grief. Some caching proxy
> servers mess up URLs with spaces, even if the space is properly encoded
> as %20.

I am sorry I missed that thread. I believe the above situation is
certainly becoming more and more pervasive. I vote +1 to tweak the HTML
parser to handle space in filenames.

Regards,

Joe
--
     _/   _/_/_/       _/       ____________    __o
     _/   _/   _/      _/     ______________ _-\<,_
 _/  _/   _/_/_/   _/  _/           ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        jj...@cl...
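
For illustration only, here is a small Python sketch (not htdig code) of the two behaviours discussed in this thread: dropping the space from the href, which yields the URL seen in the log above, versus encoding it as %20 as the RFCs require. The function names strip_spaces and encode_spaces are made up for this example, and the base URL and href are the ones from the log.

```python
from urllib.parse import quote, urljoin

BASE = "http://domain.com/Path/To/"
HREF = "fa01HP2-Basic Unix Commands.htm"  # href as written in the HTML


def strip_spaces(href: str) -> str:
    # Hypothetical helper: drops spaces, giving the same URL the log shows.
    return href.replace(" ", "")


def encode_spaces(href: str) -> str:
    # Hypothetical helper: percent-encodes the path, so " " becomes "%20".
    return quote(href, safe="/")


stripped_url = urljoin(BASE, strip_spaces(HREF))
encoded_url = urljoin(BASE, encode_spaces(HREF))

print(stripped_url)  # the URL that got the 404 in the log
print(encoded_url)   # the URL with the spaces encoded as %20
```

Fixing the HTML so the href already contains %20 gives the second form without touching the parser at all.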