From: Joe R. J. <jj...@cl...> - 2002-03-11 23:59:41
|
On Mon, 11 Mar 2002, Gilles Detillieux wrote: > Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST) > From: Gilles Detillieux <gr...@sc...> > To: jj...@cl... > Cc: Geoff Hutchison <ghu...@ws...>, > htd...@li... > Subject: Re: [htdig] "file name.html" -> "filename.html";( > > According to Joe R. Jah: > > On Sat, 9 Mar 2002, Geoff Hutchison wrote: > > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote: > > > > Unfortunately htdig removes the space. and looks for "filename.html" and > > > > reports: > > > > > > > > Not found: http://domain.com/some/path/filename.html Ref: > > > > http://domain.com/some/path/file.html > > > > > > Joe, I think you should understand that this isn't much help as a bug > > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the > > > space seem to "disappear?" Is it when it first encounters the link > > > (parser error), as it normalizes and accepts/rejects the URL (retriever > > > or URL parser error) or as it tries to fetch it? > > > > > > A bit more feedback would go a long way towards debugging this. 
> > > > Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one > > file: > > ----------------------------------8<------------------------------- > > 0:0:0:http://domain.com/Path/To/: Trying local files > > tried local file /domain.com/Path/To/index.html > > tried local file /domain.com/Path/To/index.shtml > > found existing file /domain.com/Path/To/index.htm > > Read 5785 from document > > Read a total of 5785 bytes > > Tag: <html>, matched -1 > > Tag: <head>, matched -1 > > Tag: <title>, matched 0 > > word: Handouts@7 > > Tag: </title>, matched 1 > > title: Handouts > > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2 > > word: Basic@696 > > word: UNIX@698 > > word: Commands@700 > > Tag: </a>, matched 3 > > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX > > Commands) > > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm' > > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm > > ----------------------------------8<------------------------------- > > ... > > ----------------------------------8<------------------------------- > > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files > > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm > > Local retrieval failed, trying HTTP > > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0 > > User-Agent: htdig/3.1.6 (Se...@do...) 
> > Referer: http://domain.com/Path/To/ > > Host: domain.com > > > > Header line: HTTP/1.1 404 Not Found > > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT > > ----------------------------------8<------------------------------- > > > > And it reports: > > ----------------------------------8<------------------------------- > > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/ > > ----------------------------------8<------------------------------- > > What most browsers do with unencoded spaces within URLs is a violation of > RFC 1738 and RFC 2396. htdig does the correct thing, if not what some > users would prefer it did. You can of course patch the URL class to leave > the spaces in there, in violation of the standard, to conform with the > incorrect behaviour of most browsers and, apparently, some really bad > HTML code generators. That would save you from having to fix all the bad > HTML code you're indexing. Spaces within URLs should always always be > encoded as %20. > > See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/ > and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/ > > My recommendation, if you have a choice, is to avoid spaces in filenames > altogether, because they cause all sorts of grief. Some caching proxy > servers mess up URLs with spaces, even if the space is properly encoded > as %20. I am sorry I missed that thread. I believe the above situation is certainly becoming more and more pervasive. I vote +1 to tweak the HTML parser to handle space in filenames. Regards, Joe -- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Joe R. J. <jj...@cl...> - 2002-03-12 23:09:52
|
On Mon, 11 Mar 2002, Gilles Detillieux wrote: > Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST) > From: Gilles Detillieux <gr...@sc...> > To: jj...@cl... > Cc: Geoff Hutchison <ghu...@ws...>, > htd...@li... > Subject: Re: [htdig] "file name.html" -> "filename.html";( > > According to Joe R. Jah: > > On Sat, 9 Mar 2002, Geoff Hutchison wrote: > > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote: > > > > Unfortunately htdig removes the space. and looks for "filename.html" and > > > > reports: > > > > > > > > Not found: http://domain.com/some/path/filename.html Ref: > > > > http://domain.com/some/path/file.html > > > > > > Joe, I think you should understand that this isn't much help as a bug > > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the > > > space seem to "disappear?" Is it when it first encounters the link > > > (parser error), as it normalizes and accepts/rejects the URL (retriever > > > or URL parser error) or as it tries to fetch it? > > > > > > A bit more feedback would go a long way towards debugging this. 
> > > > Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one > > file: > > ----------------------------------8<------------------------------- > > 0:0:0:http://domain.com/Path/To/: Trying local files > > tried local file /domain.com/Path/To/index.html > > tried local file /domain.com/Path/To/index.shtml > > found existing file /domain.com/Path/To/index.htm > > Read 5785 from document > > Read a total of 5785 bytes > > Tag: <html>, matched -1 > > Tag: <head>, matched -1 > > Tag: <title>, matched 0 > > word: Handouts@7 > > Tag: </title>, matched 1 > > title: Handouts > > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2 > > word: Basic@696 > > word: UNIX@698 > > word: Commands@700 > > Tag: </a>, matched 3 > > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX > > Commands) > > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm' > > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm > > ----------------------------------8<------------------------------- > > ... > > ----------------------------------8<------------------------------- > > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files > > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm > > Local retrieval failed, trying HTTP > > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0 > > User-Agent: htdig/3.1.6 (Se...@do...) 
> > Referer: http://domain.com/Path/To/ > > Host: domain.com > > > > Header line: HTTP/1.1 404 Not Found > > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT > > ----------------------------------8<------------------------------- > > > > And it reports: > > ----------------------------------8<------------------------------- > > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/ > > ----------------------------------8<------------------------------- > > What most browsers do with unencoded spaces within URLs is a violation of > RFC 1738 and RFC 2396. htdig does the correct thing, if not what some > users would prefer it did. You can of course patch the URL class to leave > the spaces in there, in violation of the standard, to conform with the > incorrect behaviour of most browsers and, apparently, some really bad > HTML code generators. That would save you from having to fix all the bad > HTML code you're indexing. Spaces within URLs should always always be > encoded as %20. > > See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/ > and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/ > > My recommendation, if you have a choice, is to avoid spaces in filenames > altogether, because they cause all sorts of grief. Some caching proxy > servers mess up URLs with spaces, even if the space is properly encoded > as %20. You are absolutely right. 
I made a patch from your tips in the above thread:

-----------------------8<-----------------------
*** htlib/URL.cc.orig Thu Feb 7 17:15:38 2002
--- htlib/URL.cc Tue Mar 12 12:54:45 2002
***************
*** 75,81 ****
  URL::URL(char *ref, URL &parent)
  {
      String temp(ref);
-     temp.remove(" \r\n\t");
      ref = temp;

      _host = parent._host;
--- 75,82 ----
  URL::URL(char *ref, URL &parent)
  {
      String temp(ref);
+     temp.remove("\r\n\t");
+     temp.chop(' ');
      ref = temp;

      _host = parent._host;
***************
*** 249,255 ****
  void URL::parse(char *u)
  {
      String temp(u);
-     temp.remove(" \t\r\n");
      char *nurl = temp;

      //
--- 250,257 ----
  void URL::parse(char *u)
  {
      String temp(u);
+     temp.remove("\t\r\n");
+     temp.chop(' ');
      char *nurl = temp;

      //
-----------------------8<-----------------------

Applied it and randig, and waited for the dig to finish, and waited, and waited, ...;( Finally I killed the process. I humbly switch my previous +1 vote to -1.

Regards,
Joe
-- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Gilles D. <gr...@sc...> - 2002-03-12 23:32:53
|
According to Joe R. Jah: > On Mon, 11 Mar 2002, Gilles Detillieux wrote: ... > > What most browsers do with unencoded spaces within URLs is a violation of > > RFC 1738 and RFC 2396. htdig does the correct thing, if not what some > > users would prefer it did. You can of course patch the URL class to leave > > the spaces in there, in violation of the standard, to conform with the > > incorrect behaviour of most browsers and, apparently, some really bad > > HTML code generators. That would save you from having to fix all the bad > > HTML code you're indexing. Spaces within URLs should always always be > > encoded as %20. > > > > See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/ > > and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/ > > > > My recommendation, if you have a choice, is to avoid spaces in filenames > > altogether, because they cause all sorts of grief. Some caching proxy > > servers mess up URLs with spaces, even if the space is properly encoded > > as %20. > > You are absolutely right. I made a patch from your tips in the above > thread: ... > Applied it and randig, and waited for the dig to finish, and waited, and > waited, ...;( Finally I killed the process. I humbly switch my previous > +1 vote to -1. That's a bit surprising. (Not the change in vote, but the fact that it hung.) I'm curious as to why that is. Were you indexing through a proxy server, and if so, which one? Did it lock up solid without doing anything, or did it seem to be doing something when you killed it? Can you provide any verbose output and/or a stack backtrace at the time you killed it? -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Joe R. J. <jj...@cl...> - 2002-03-13 00:44:32
|
On Tue, 12 Mar 2002, Gilles Detillieux wrote:
> Date: Tue, 12 Mar 2002 17:32:41 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: htd...@li...
> Subject: Re: [htdig-dev] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Mon, 11 Mar 2002, Gilles Detillieux wrote:
> ...
> > > My recommendation, if you have a choice, is to avoid spaces in filenames
> > > altogether, because they cause all sorts of grief. Some caching proxy
> > > servers mess up URLs with spaces, even if the space is properly encoded
> > > as %20.
> >
> > You are absolutely right. I made a patch from your tips in the above
> > thread:
> ...
> > Applied it and randig, and waited for the dig to finish, and waited, and
> > waited, ...;( Finally I killed the process. I humbly switch my previous
> > +1 vote to -1.
>
> That's a bit surprising. (Not the change in vote, but the fact that
> it hung.) I'm curious as to why that is. Were you indexing through
> a proxy server, and if so, which one? Did it lock up solid without
> doing anything, or did it seem to be doing something when you killed it?
> Can you provide any verbose output and/or a stack backtrace at the time
> you killed it?

The dig was entirely on the local server. When it got to this link:

<a href=" http://domain.com/path/to/page.htm" target="_blank">

in a file.shtml in a folder without any index files, it added the server URL to it, as if it were a relative URL, and went into an endless wild goose chase for made-up URLs like:

http://mydomain.com/ http:/domain.com/some/path/somefile.htm

until I killed the process. It somehow removed one "/" from the second http://;-/

Regards,
Joe
-- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Jessica B. <jes...@ya...> - 2002-03-14 15:42:07
|
Is there a way to match spaces in a regex inside a url_rewrite_rules parameter, so that you could just do:

url_rewrite_rules: (.*)[:space:](.*) \1%20\2

(of course, you'd have to repeat this same rule multiple times to handle multiple spaces)

I tried the above rule and it didn't seem to work. Characters inside the [brackets] were taken literally, and thus, the first s, p, a, c, or e were replaced with %20.

This may seem like a wimpy work-around, but it could be done without the need to modify any code internally, keeping htdig RFC2396 compliant at the same time.

So if you could help me with the regex I would appreciate it. |
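A note on the rule above: the behaviour reported (the first s, p, a, c, or e being replaced) is what POSIX bracket-expression syntax would produce, assuming the regex library in use follows POSIX. There, `[:space:]` on its own is an ordinary bracket expression listing the characters ':', 's', 'p', 'a', 'c' and 'e'; the character class only takes effect when nested inside a second pair of brackets, as `[[:space:]]`. The sketch below calls the POSIX regcomp()/regexec() functions directly to show the difference. The helper name matches() is made up for illustration and is not part of htdig, and whether a corrected pattern would help inside url_rewrite_rules is a separate question.

```cpp
#include <cassert>
#include <regex.h>
#include <string>

// Hypothetical helper (not htdig code): returns true if `pattern`,
// compiled as a POSIX extended regex, matches anywhere in `text`.
bool matches(const std::string &pattern, const std::string &text)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return false;               // treat a bad pattern as "no match"
    bool found = (regexec(&re, text.c_str(), 0, nullptr, 0) == 0);
    regfree(&re);
    return found;
}
```

With this helper, `[[:space:]]` matches "file name.html" but not "filename.html", while the single-bracket form matches a word such as "page" and does not match a lone space.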
From: Gilles D. <gr...@sc...> - 2002-03-14 21:38:58
|
According to Jessica Biola: > Is there a way to match spaces in a regex inside a > url_rewrite_rules parameter, so that you could just > do: > > url_rewrite_rules: (.*)[:space:](.*) \1%20\2 > > (of course, you'd have to repeat this same rule > multiple times to handle multiple spaces) I tried the > above rule and it didn't seem to work. Characters > inside the [brackets] were taken literally, and thus, > the first s, p, a, c, or e were replaced with %20. > > This may seem like a wimpy work-around, but it could > be done without the need to modify any code > internally, keeping htdig RFC2396 compliant at the > same time. > > So if you could help me with the regex I would > appreciate it. Interesting idea, but there are a few reasons it won't work: 1) As you discovered, the [:space:] character class isn't implemented. This may actually be a function of which regex code ends up being used. Some C libraries may implement this, but clearly that's not the case on your system. Even if your regex code does implement this, see point 3. 2) You can't use just a space in the regular expression, either with or without the brackets, because url_rewrite_rules is parsed as a string list, not a quoted string list, so there's no way to embed a literal space in your regular expression. 3) Even if you could get around the two problems above, it still wouldn't work because the URL class doesn't do the rewriting until AFTER it's parsed the URL, and so the spaces are already stripped out in accordance with RFC2396. By the way, any trick you'd use to make htdig handle spaces within URLs would be a violation of RFC2396, regardless of whether it required code changes or just config file changes. The standard says spaces should be stripped out. The way most web browsers handle spaces within URLs is also a violation of RFC2396. The question is whether/how we get htdig to do likewise. The change I had suggested previously, which Joe Jah wrote into a patch mostly does things correctly. 
Only one bit is missing. All white space characters other than the space itself are stripped out anywhere, and the chop() call strips off trailing spaces, but there's nothing in that patch to strip off leading spaces, which is what caused grief in Joe's test of his patch. What you could do is, in addition to Joe's patch, add the following at the very start of URL::URL(char *ref, URL &parent)...

    while (*ref == ' ')
        ref++;

and this at the very start of URL::parse(char *u)...

    while (*u == ' ')
        u++;

before ref or u is assigned to the String "temp".

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930 |
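To see the combined effect of the suggestion above without building htdig, here is a minimal stand-alone sketch. It assumes std::string in place of htdig's String class, and the function name cleanUrl() is invented for this example; it only mimics the first few lines of the two URL methods as described in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the start of URL::URL()/URL::parse():
// skip leading spaces via the pointer, drop tab/CR/LF anywhere,
// then chop trailing spaces. Interior spaces are left untouched.
std::string cleanUrl(const char *ref)
{
    while (*ref == ' ')             // advance past leading spaces
        ref++;

    std::string temp;
    for (; *ref; ref++)
        if (*ref != '\t' && *ref != '\r' && *ref != '\n')
            temp += *ref;           // plays the role of temp.remove("\r\n\t")

    while (!temp.empty() && temp.back() == ' ')
        temp.pop_back();            // plays the role of temp.chop(' ')
    return temp;
}
```

Fed the href from Joe's report, " http://domain.com/path/to/page.htm", this returns the URL without the leading space, so it would start with the scheme again instead of looking like a relative reference.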
From: Jessica B. <jes...@ya...> - 2002-03-15 02:22:47
|
> By the way, any trick you'd use to make htdig handle spaces within URLs
> would be a violation of RFC2396, regardless of whether it required code
> changes or just config file changes. The standard says spaces should
> be stripped out. The way most web browsers handle spaces within URLs is
> also a violation of RFC2396. The question is whether/how we get htdig
> to do likewise.

I'm okay with having my version of htdig violate RFCs (humming the melody "Breakin' the law, breakin' the law"), just as long as it can be toggled via a configuration parameter. I'll look into Joe's and your patches to see if it will fix my problems. Rather odd yet fortunate that I was running into the same problem at the same time. |
From: Joe R. J. <jj...@cl...> - 2002-03-15 00:10:34
|
Hi Gilles, To beat the dead horse;) Is there a simple way of removing spaces from the beginning of URL's ala: -----------------------8<----------------------- --- 75,82 ---- URL::URL(char *ref, URL &parent) { String temp(ref); + temp.remove("\r\n\t"); + temp.chop(' '); + temp.shift(' '); # or something ref = temp; // -----------------------8<----------------------- And somehow turning the rest of the space into %20 in the code? Regards, Joe -- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Gilles D. <gr...@sc...> - 2002-03-15 15:50:03
|
According to Joe R. Jah:
> Hi Gilles,
>
> To beat the dead horse;) Is there a simple way of removing spaces from the
> beginning of URL's ala:
> -----------------------8<-----------------------
> --- 75,82 ----
>   URL::URL(char *ref, URL &parent)
>   {
>       String temp(ref);
> +     temp.remove("\r\n\t");
> +     temp.chop(' ');
> +     temp.shift(' '); # or something
>       ref = temp;
>
>   //
> -----------------------8<-----------------------

Well, there's no "shift" method in the String class, nor anything that simply strips characters off the front of a String. However, did you not see my reply to Jessica Biola yesterday afternoon, in this same thread? I did cc the list. In that, I suggested a simple fix to advance the char * pointer past leading spaces before assigning to temp.

> And somehow turning the rest of the space into %20 in the code?

OK, this is a little bit more effort, because now you're expanding a single character into 3, so you can't do it in place. However, you could probably change the first few lines of the URL constructor and parse methods like this. First, change the "u" to "ref" in the parse method for consistency. Then, instead of simply assigning ref to temp as String temp(ref); and then removing white space characters, you can do this:

    static int allowspace = config.Boolean("allow_space_in_url", 0);
    String temp;
    while (*ref)
    {
        if (*ref == ' ' && temp.length() > 0 && allowspace)
        {
            // Replace space character with %20 if there's more non-space
            // characters to come...
            char *s = ref+1;
            while (*s && isspace(*s))
                s++;
            if (*s)
                temp << "%20";
        }
        else if (!isspace(*ref))
            temp << *ref;
        ref++;
    }

Then, you'll have to set allow_space_in_url: true in your htdig.conf to enable this feature.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930 |
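For anyone who wants to try the loop above outside htdig, the following is a rough stand-alone version. It is only a sketch under a few assumptions: std::string replaces htdig's String class, the bool parameter allowspace stands in for the proposed allow_space_in_url attribute, and the function name normalizeUrl() is invented here.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical free function mirroring the loop above; `allowspace`
// stands in for the proposed allow_space_in_url attribute.
std::string normalizeUrl(const char *ref, bool allowspace)
{
    std::string temp;
    while (*ref)
    {
        if (*ref == ' ' && !temp.empty() && allowspace)
        {
            // Encode the space only if a non-whitespace character follows
            const char *s = ref + 1;
            while (*s && std::isspace(static_cast<unsigned char>(*s)))
                s++;
            if (*s)
                temp += "%20";
        }
        else if (!std::isspace(static_cast<unsigned char>(*ref)))
            temp += *ref;           // keep every non-whitespace character
        ref++;
    }
    return temp;
}
```

With allowspace set, "file name.html" comes out as "file%20name.html"; with it unset, the result is "filename.html", which is the behaviour that started this thread. The casts to unsigned char are an addition of this sketch, since passing a negative char to isspace() is undefined.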
From: Joe R. J. <jj...@cl...> - 2002-03-15 22:16:08
|
On Fri, 15 Mar 2002, Gilles Detillieux wrote: > Date: Fri, 15 Mar 2002 09:49:47 -0600 (CST) > From: Gilles Detillieux <gr...@sc...> > To: "Joe R. Jah" <jj...@cl...> > Cc: "ht://Dig developers list" <htd...@li...> > Subject: Re: [htdig-dev] "file name.html" -> "filename.html";( > > Well, there's no "shift" method in the String class, nor anything that > simply strips characters off the front of a String. However, did you not > see my reply to Jessica Biola yesterday afternoon, in this same thread? I saw it, but I had not read it through until now;) > I did cc the list. In that, I suggested a simple fix to advance the > char * pointer past leading spaces before assigning to temp. It works like a charm;) thanks Gilles, the patch: ------------------------8<------------------------ *** htlib/URL.cc.orig Thu Feb 7 17:15:38 2002 --- htlib/URL.cc Fri Mar 15 12:15:41 2002 *************** *** 74,81 **** // URL::URL(char *ref, URL &parent) { String temp(ref); ! temp.remove(" \r\n\t"); ref = temp; _host = parent._host; --- 74,84 ---- // URL::URL(char *ref, URL &parent) { + while (*ref == ' ') + ref++; String temp(ref); ! temp.remove("\r\n\t"); ! temp.chop(' '); ref = temp; _host = parent._host; *************** *** 248,255 **** // void URL::parse(char *u) { String temp(u); ! temp.remove(" \t\r\n"); char *nurl = temp; // --- 251,261 ---- // void URL::parse(char *u) { + while (*u == ' ') + u++; String temp(u); ! temp.remove("\r\n\t"); ! temp.chop(' '); char *nurl = temp; // ------------------------8<------------------------ > > And somehow turning the rest of the space into %20 in the code? > > OK, this is a little bit more effort, because now you're expanding a > single character into 3, so you can't do it in place. However, you > could probably change the first few lines of the URL constructor and > parse methods like this. First, change the "u" to "ref" in the parse > method for consistency. 
Then, instead of simply assigning ref to temp
> as String temp(ref); and then removing white space characters, you can
> do this:

The above patch already allows non-(leading/trailing) space. The code below would just convert the allowed space into %20. I believe the term allow_space_in_url would be more expressive as convert_space_to_%20, or something;)

> static int allowspace = config.Boolean("allow_space_in_url", 0);
> String temp;
> while (*ref)
> {
>     if (*ref == ' ' && temp.length() > 0 && allowspace)
>     {
>         // Replace space character with %20 if there's more non-space
>         // characters to come...
>         char *s = ref+1;
>         while (*s && isspace(*s))
>             s++;
>         if (*s)
>             temp << "%20";
>     }
>     else if (!isspace(*ref))
>         temp << *ref;
>     ref++;
> }
>
>
> Then, you'll have to set allow_space_in_url: true in your htdig.conf
> to enable this feature.

In any case, I do not see the rationale behind this option. We do not give an option for allowing non-(leading/trailing) space, but we give one for converting them to %20;-/ Unless we somehow integrate your option, allow_space_in_url, in the entire patch;)

Regards,
Joe
-- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Gilles D. <gr...@sc...> - 2002-03-15 22:34:44
|
According to Joe R. Jah:
> On Fri, 15 Mar 2002, Gilles Detillieux wrote:
> > According to Joe R. Jah:
...
> > > And somehow turning the rest of the space into %20 in the code?
> >
> > OK, this is a little bit more effort, because now you're expanding a
> > single character into 3, so you can't do it in place. However, you
> > could probably change the first few lines of the URL constructor and
> > parse methods like this. First, change the "u" to "ref" in the parse
> > method for consistency. Then, instead of simply assigning ref to temp
> > as String temp(ref); and then removing white space characters, you can
> > do this:
>
> The above patch already allows in non-(leading/trailing) space. The code
> below would just convert the allowed space into %20. I believe the term
> allow_space_in_url would be more expressive as convert_space_to_%20, or
> something;)

No, the code below does two things: 1) if allow_space_in_url is not set, the code works like the standard 3.1.x code does, i.e. it strips out all white space characters, and 2) if allow_space_in_url is set, the code strips out all white space characters other than the space itself - for the space character (ASCII 20 hex) it strips leading and trailing spaces and converts the spaces within the URL to %20. The name allow_space_in_url is correct, because if the attribute is false, no spaces are allowed - they're stripped out, just as the currently released code does, in accordance with RFC 2396. However, if you prefer encode_space_in_url we can go with that. We're not going to start putting all sorts of weird punctuation characters like "%" in attribute names.

> > static int allowspace = config.Boolean("allow_space_in_url", 0);
> > String temp;
> > while (*ref)
> > {
> >     if (*ref == ' ' && temp.length() > 0 && allowspace)
> >     {
> >         // Replace space character with %20 if there's more non-space
> >         // characters to come...
> > char *s = ref+1; > > while (*s && isspace(*s)) > > s++; > > if (*s) > > temp << "%20"; > > } > > else if (!isspace(*ref)) > > temp << *ref; > > ref++; > > } > > > > > > Then, you'll have to set allow_space_in_url: true in your htdig.conf > > to enable this feature. > > At any case, I do not see the rationale behind this option. We do not > give an option for allowing non-(leading/trailing) space, but we give one > for converting them to %20;-/ Unless we somehow integrate your option, > allow_space_in_url, in the entire patch;) Maybe my description of the code above helps you see the rationale more clearly. The attribute selects both behaviours, not just the encoding. The reason to make it user-selectable option is that some users may actually prefer htdig to follow the standards rather than ignore them like MS/AOL do. I'm not sure what you mean by integrating my option in the entire patch. The code above should be complete on its own, as a change to vanilla 3.1.6 URL.cc code. You don't need to integrate it with earlier proposed changes - just put it in both URL methods you were changing before and make a patch out of it. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Joe R. J. <jj...@cl...> - 2002-03-16 00:13:39
|
On Fri, 15 Mar 2002, Gilles Detillieux wrote: > Date: Fri, 15 Mar 2002 16:34:36 -0600 (CST) > From: Gilles Detillieux <gr...@sc...> > To: jj...@cl... > Cc: "ht://Dig developers list" <htd...@li...> > Subject: Re: [htdig-dev] "file name.html" -> "filename.html";( > > No, the code below does two things: 1) if allow_space_in_url is not > set, the code works like the standard 3.1.x code does, i.e. in strips > out all white space characters, and 2) if allow_space_in_url is set, > the code strips out all white space characters other than the space > itself - for the space character (ASCII 20 hex) it strips leading and > trailing spaces and converts the spaces within the URL to %20. The name > allow_space_in_url is correct, because if the attribute is false, > no spaces are allowed - they're stripped out, just as the currently > released code does, in accordance with RFC 2396. However, if you prefer > encode_space_in_url we can go with that. We're not going to start putting > all sorts of wierd punctuation characters like "%" in attribute names. > > > > static int allowspace = config.Boolean("allow_space_in_url", 0); > > > String temp; > > > while (*ref) > > > { > > > if (*ref == ' ' && temp.length() > 0 && allowspace) > > > { > > > // Replace space character with %20 if there's more non-space > > > // characters to come... > > > char *s = ref+1; > > > while (*s && isspace(*s)) > > > s++; > > > if (*s) > > > temp << "%20"; > > > } > > > else if (!isspace(*ref)) > > > temp << *ref; > > > ref++; > > > } > > Maybe my description of the code above helps you see the rationale more > clearly. The attribute selects both behaviours, not just the encoding. > The reason to make it user-selectable option is that some users may > actually prefer htdig to follow the standards rather than ignore them > like MS/AOL do. > > I'm not sure what you mean by integrating my option in the entire patch. > The code above should be complete on its own, as a change to vanilla > 3.1.6 URL.cc code. 
You don't need to integrate it with earlier proposed > changes - just put it in both URL methods you were changing before and > make a patch out of it. I misunderstood. Here is the patch: -------------------------------------8<------------------------------------- *** htlib/URL.cc.031202 Thu Feb 7 17:15:38 2002 --- htlib/URL.cc Fri Mar 15 15:25:27 2002 *************** *** 74,82 **** // URL::URL(char *ref, URL &parent) { ! String temp(ref); ! temp.remove(" \r\n\t"); ! ref = temp; _host = parent._host; _port = parent._port; --- 74,97 ---- // URL::URL(char *ref, URL &parent) { ! static int allowspace = config.Boolean("allow_space_in_url", 0); ! String temp; ! while (*ref) ! { ! if (*ref == ' ' && temp.length() > 0 && allowspace) ! { ! // Replace space character with %20 if there's more non-space ! // characters to come... ! char *s = ref+1; ! while (*s && isspace(*s)) ! s++; ! if (*s) ! temp << "%20"; ! } ! else if (!isspace(*ref)) ! temp << *ref; ! ref++; ! } _host = parent._host; _port = parent._port; *************** *** 243,255 **** } //***************************************************************************** ! // void URL::parse(char *u) // Given a URL string, extract the service, host, port, and path from it. // ! void URL::parse(char *u) { ! String temp(u); ! temp.remove(" \t\r\n"); char *nurl = temp; // --- 258,286 ---- } //***************************************************************************** ! // void URL::parse(char *ref) // Given a URL string, extract the service, host, port, and path from it. // ! void URL::parse(char *ref) { ! static int allowspace = config.Boolean("allow_space_in_url", 0); ! String temp; ! while (*ref) ! { ! if (*ref == ' ' && temp.length() > 0 && allowspace) ! { ! // Replace space character with %20 if there's more non-space ! // characters to come... ! char *s = ref+1; ! while (*s && isspace(*s)) ! s++; ! if (*s) ! temp << "%20"; ! } ! else if (!isspace(*ref)) ! temp << *ref; ! ref++; ! 
} char *nurl = temp; // -------------------------------------8<------------------------------------- But it failed to follow any link;(I must have misread your instructions;) any ideas? Regards, Joe -- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |
From: Gilles D. <gr...@sc...> - 2002-03-16 00:36:13
|
According to Joe R. Jah: > I misunderstood. Here is the patch: > -------------------------------------8<------------------------------------- > *** htlib/URL.cc.031202 Thu Feb 7 17:15:38 2002 > --- htlib/URL.cc Fri Mar 15 15:25:27 2002 > *************** > *** 74,82 **** > // > URL::URL(char *ref, URL &parent) > { > ! String temp(ref); > ! temp.remove(" \r\n\t"); > ! ref = temp; Here's your error right here. You shouldn't have deleted that third line, only the first two. The last one is needed because the constructor then uses the ref pointer to walk through the cleaned up URL string. You should set ref = temp; right after the close of the while loop below. > > _host = parent._host; > _port = parent._port; > --- 74,97 ---- > // > URL::URL(char *ref, URL &parent) > { > ! static int allowspace = config.Boolean("allow_space_in_url", 0); > ! String temp; > ! while (*ref) > ! { > ! if (*ref == ' ' && temp.length() > 0 && allowspace) > ! { > ! // Replace space character with %20 if there's more non-space > ! // characters to come... > ! char *s = ref+1; > ! while (*s && isspace(*s)) > ! s++; > ! if (*s) > ! temp << "%20"; > ! } > ! else if (!isspace(*ref)) > ! temp << *ref; > ! ref++; > ! } You need... ref = temp; here, after the loop. Without it, ref is still pointing to the end of the original URL, not the start of the cleaned up one, so the rest of the code will think it got an empty string as a URL. > _host = parent._host; > _port = parent._port; The second section, in URL::parse() looks fine to me, because in there, there were only two lines that you removed at the start, and you left the assignment of temp to nurl. We'll make a programmer out of you yet, Joe. :-) -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
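The pointer problem described above can be reproduced in isolation. In this sketch (std::string stands in for htdig's String, and copyNoSpace() is a made-up name), the loop advances ref as it copies, so once it finishes ref points at the terminating NUL of the original string. Code that keeps reading through ref sees an empty string unless ref is pointed back at the cleaned-up copy, which is the job of the `ref = temp;` line in the constructor.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical illustration, not htdig code: copy `ref` into `temp`
// without whitespace and return where `ref` points afterwards.
const char *copyNoSpace(const char *ref, std::string &temp)
{
    while (*ref)
    {
        if (!std::isspace(static_cast<unsigned char>(*ref)))
            temp += *ref;
        ref++;
    }
    return ref;     // now at the terminating NUL of the ORIGINAL string
}
```

The returned pointer has length zero when read as a C string, while temp.c_str() yields the cleaned-up URL; re-pointing at the latter is the std::string equivalent of assigning temp back to ref.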
From: Joe R. J. <jj...@cl...> - 2002-03-16 03:36:40
|
On Fri, 15 Mar 2002, Gilles Detillieux wrote:
> Date: Fri, 15 Mar 2002 18:36:06 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: "ht://Dig developers list" <htd...@li...>
> Subject: Re: [htdig-dev] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > I misunderstood. Here is the patch:
> > -------------------------------------8<-------------------------------------
> > *** htlib/URL.cc.031202 Thu Feb 7 17:15:38 2002
> > --- htlib/URL.cc Fri Mar 15 15:25:27 2002
> > ***************
> > *** 74,82 ****
> > //
> > URL::URL(char *ref, URL &parent)
> > {
> > ! String temp(ref);
> > ! temp.remove(" \r\n\t");
> > ! ref = temp;
>
> Here's your error right here. You shouldn't have deleted that third
> line, only the first two. The last one is needed because the constructor
> then uses the ref pointer to walk through the cleaned up URL string.
> You should set ref = temp; right after the close of the while loop
> below.
...
> You need...
>
> ref = temp;
>
> here, after the loop. Without it, ref is still pointing to the end of
> the original URL, not the start of the cleaned up one, so the rest of
> the code will think it got an empty string as a URL.
...
> The second section, in URL::parse() looks fine to me, because in there,
> there were only two lines that you removed at the start, and you left
> the assignment of temp to nurl.

Thank you, Gilles. I corrected the patch and placed it in the patch site:

ftp://ftp.ccsf.edu/htdig-patches/3.1.6/fileSpace.1

I believe this patch is safer than the fileSpace.0 because it does not leave any space in the URL; besides, its use is optional. I switch back my vote to +1;)

To apply, save it to your server, change to the htdig-3.1.6 source directory, and run:

patch -p0 < /path/to/fileSpace.1

> We'll make a programmer out of you yet, Joe. :-)

;-/

Regards,
Joe
-- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... |