From: Gilles D. <gr...@sc...> - 2002-03-12 23:32:53
According to Joe R. Jah:
> On Mon, 11 Mar 2002, Gilles Detillieux wrote:
...
> > What most browsers do with unencoded spaces within URLs is a violation of
> > RFC 1738 and RFC 2396.  htdig does the correct thing, if not what some
> > users would prefer it did.  You can of course patch the URL class to leave
> > the spaces in there, in violation of the standard, to conform with the
> > incorrect behaviour of most browsers and, apparently, some really bad
> > HTML code generators.  That would save you from having to fix all the bad
> > HTML code you're indexing.  Spaces within URLs should always be
> > encoded as %20.
> >
> > See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> > and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
> >
> > My recommendation, if you have a choice, is to avoid spaces in filenames
> > altogether, because they cause all sorts of grief.  Some caching proxy
> > servers mess up URLs with spaces, even if the space is properly encoded
> > as %20.
>
> You are absolutely right.  I made a patch from your tips in the above
> thread:
...
> Applied it and randig, and waited for the dig to finish, and waited, and
> waited, ...;(  Finally I killed the process.  I humbly switch my previous
> +1 vote to -1.

That's a bit surprising.  (Not the change in vote, but the fact that it
hung.)  I'm curious as to why that is.  Were you indexing through a proxy
server, and if so, which one?  Did it lock up solid without doing anything,
or did it seem to be doing something when you killed it?  Can you provide
any verbose output and/or a stack backtrace at the time you killed it?

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
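
For anyone cleaning up the HTML rather than patching htdig, here is a
minimal sketch of the %20 encoding recommended in the quoted text.  It is
not htdig code and says nothing about how htdig's URL class behaves.  The
function name encode_spaces and the www.example.com URL are made up for
illustration.  It assumes the only problem in the URL is literal spaces
and leaves every other character alone, including escapes that are
already there.

```python
def encode_spaces(url: str) -> str:
    """Return url with every literal space replaced by %20.

    Hypothetical helper for illustration only. Existing %XX escapes
    are not touched, so an already-encoded URL comes back unchanged.
    """
    return url.replace(" ", "%20")


if __name__ == "__main__":
    # Made-up URL with spaces in the directory and file names.
    print(encode_spaces("http://www.example.com/my docs/file name.html"))
```

Fixing the links at the source this way keeps the pages valid under
RFC 1738 and RFC 2396, although, as noted above, avoiding spaces in
filenames altogether is still the safer choice.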