From: Gilles D. <gr...@sc...> - 2002-03-14 21:38:58
|
According to Jessica Biola:
> Is there a way to match spaces in a regex inside a
> url_rewrite_rules parameter, so that you could just
> do:
>
> url_rewrite_rules: (.*)[:space:](.*) \1%20\2
>
> (of course, you'd have to repeat this same rule
> multiple times to handle multiple spaces) I tried the
> above rule and it didn't seem to work. Characters
> inside the [brackets] were taken literally, and thus,
> the first s, p, a, c, or e were replaced with %20.
>
> This may seem like a wimpy work-around, but it could
> be done without the need to modify any code
> internally, keeping htdig RFC2396 compliant at the
> same time.
>
> So if you could help me with the regex I would
> appreciate it.
Interesting idea, but there are a few reasons it won't work:
1) As you discovered, the [:space:] character class isn't implemented.
This may actually be a function of which regex code ends up being used.
Some C libraries may implement this, but clearly that's not the case on
your system. Even if your regex code does implement this, see point 3.
2) You can't use just a space in the regular expression, either with
or without the brackets, because url_rewrite_rules is parsed as a
string list, not a quoted string list, so there's no way to embed a
literal space in your regular expression.
3) Even if you could get around the two problems above, it still wouldn't
work because the URL class doesn't do the rewriting until AFTER it's
parsed the URL, and so the spaces are already stripped out in accordance
with RFC2396.
By the way, any trick you'd use to make htdig handle spaces within URLs
would be a violation of RFC2396, regardless of whether it required code
changes or just config file changes. The standard says spaces should
be stripped out. The way most web browsers handle spaces within URLs is
also a violation of RFC2396. The question is whether/how we get htdig
to do likewise.
The change I had suggested previously, which Joe Jah wrote into a patch,
mostly does things correctly. Only one bit is missing: all whitespace
characters other than the space itself are stripped out anywhere, and
the chop() call strips off trailing spaces, but there's nothing in that
patch to strip off leading spaces, which is what caused grief in Joe's
test of his patch.
What you could do is, in addition to Joe's patch, add the following
at the very start of URL::URL(char *ref, URL &parent)...

    while (*ref == ' ')
        ref++;

and this at the very start of URL::parse(char *u)...

    while (*u == ' ')
        u++;

before ref or u is assigned to the String "temp".
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Jessica B. <jes...@ya...> - 2002-03-14 15:42:07
|
Is there a way to match spaces in a regex inside a url_rewrite_rules parameter, so that you could just do:

url_rewrite_rules: (.*)[:space:](.*) \1%20\2

(of course, you'd have to repeat this same rule multiple times to handle multiple spaces) I tried the above rule and it didn't seem to work. Characters inside the [brackets] were taken literally, and thus, the first s, p, a, c, or e were replaced with %20.

This may seem like a wimpy work-around, but it could be done without the need to modify any code internally, keeping htdig RFC2396 compliant at the same time.

So if you could help me with the regex I would appreciate it.
|
|
From: Geoff H. <ghu...@ws...> - 2002-03-14 04:28:06
|
The switch from GDBM to the Berkeley DB was quite some time ago. I don't think there's much to say except that you'll see performance improvements and get a stable, solid database. The Berkeley DB documentation is really quite helpful.

-Geoff

On Monday, March 11, 2002, at 03:10 AM, Kang-Jin Lee wrote:
> Hi,
>
> I see that you have switched from gdbm to db in Htdig.
> I am currently looking for a replacement for gdbm and would like to know
> if you want to share some experiences?
>
> Thank you
>
> kj
> --
> Kang-Jin Lee <lee @ arco.de>
|
|
From: Joe R. J. <jj...@cl...> - 2002-03-13 00:44:32
|
On Tue, 12 Mar 2002, Gilles Detillieux wrote:

> Date: Tue, 12 Mar 2002 17:32:41 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: htd...@li...
> Subject: Re: [htdig-dev] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Mon, 11 Mar 2002, Gilles Detillieux wrote:
> ...
> > > My recommendation, if you have a choice, is to avoid spaces in filenames
> > > altogether, because they cause all sorts of grief. Some caching proxy
> > > servers mess up URLs with spaces, even if the space is properly encoded
> > > as %20.
> >
> > You are absolutely right. I made a patch from your tips in the above
> > thread:
> ...
> > Applied it and randig, and waited for the dig to finish, and waited, and
> > waited, ...;( Finally I killed the process. I humbly switch my previous
> > +1 vote to -1.
>
> That's a bit surprising. (Not the change in vote, but the fact that
> it hung.) I'm curious as to why that is. Were you indexing through
> a proxy server, and if so, which one? Did it lock up solid without
> doing anything, or did it seem to be doing something when you killed it?
> Can you provide any verbose output and/or a stack backtrace at the time
> you killed it?

The dig was entirely on the local server. When it got to this link:

<a href=" http://domain.com/path/to/page.htm" target="_blank">

in a file.shtml in a folder without any index files, it added the server URL to it, as if it were a relative URL, and went into an endless wild goose chase for made-up URLs like:

http://mydomain.com/ http:/domain.com/some/path/somefile.htm

until I killed the process. It somehow removed one "/" from the second http://;-/

Regards,

Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Gilles D. <gr...@sc...> - 2002-03-13 00:15:14
|
Hi again, John. I believe I have resolved this problem to Sean Downey's satisfaction.

According to js...@in...:
> Hi everyone,
>
> I have an issue with htdig that hopefully should be fairly
> simple to resolve, however, I have managed to get $500 out
> of my company to offer as a bounty to whoever manages to sort it
> out. It's a bug with date parsing on FreeBSD that means all
> dates are incorrectly stored.
>
> Please email me for more details.
>
> Cheers,
>
> John Senior.
>
> --
> js...@in...

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Gilles D. <gr...@sc...> - 2002-03-12 23:32:53
|
According to Joe R. Jah:
> On Mon, 11 Mar 2002, Gilles Detillieux wrote:
...
> > What most browsers do with unencoded spaces within URLs is a violation of
> > RFC 1738 and RFC 2396. htdig does the correct thing, if not what some
> > users would prefer it did. You can of course patch the URL class to leave
> > the spaces in there, in violation of the standard, to conform with the
> > incorrect behaviour of most browsers and, apparently, some really bad
> > HTML code generators. That would save you from having to fix all the bad
> > HTML code you're indexing. Spaces within URLs should always always be
> > encoded as %20.
> >
> > See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> > and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
> >
> > My recommendation, if you have a choice, is to avoid spaces in filenames
> > altogether, because they cause all sorts of grief. Some caching proxy
> > servers mess up URLs with spaces, even if the space is properly encoded
> > as %20.
>
> You are absolutely right. I made a patch from your tips in the above
> thread:
...
> Applied it and randig, and waited for the dig to finish, and waited, and
> waited, ...;( Finally I killed the process. I humbly switch my previous
> +1 vote to -1.

That's a bit surprising. (Not the change in vote, but the fact that
it hung.) I'm curious as to why that is. Were you indexing through
a proxy server, and if so, which one? Did it lock up solid without
doing anything, or did it seem to be doing something when you killed it?
Can you provide any verbose output and/or a stack backtrace at the time
you killed it?

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Joe R. J. <jj...@cl...> - 2002-03-12 23:09:52
|
On Mon, 11 Mar 2002, Gilles Detillieux wrote:

> Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: Geoff Hutchison <ghu...@ws...>,
>     htd...@li...
> Subject: Re: [htdig] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote:
> > > > Unfortunately htdig removes the space. and looks for "filename.html" and
> > > > reports:
> > > >
> > > > Not found: http://domain.com/some/path/filename.html Ref:
> > > > http://domain.com/some/path/file.html
> > >
> > > Joe, I think you should understand that this isn't much help as a bug
> > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the
> > > space seem to "disappear?" Is it when it first encounters the link
> > > (parser error), as it normalizes and accepts/rejects the URL (retriever
> > > or URL parser error) or as it tries to fetch it?
> > >
> > > A bit more feedback would go a long way towards debugging this.
> >
> > Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one
> > file:
> > ----------------------------------8<-------------------------------
> > 0:0:0:http://domain.com/Path/To/: Trying local files
> > tried local file /domain.com/Path/To/index.html
> > tried local file /domain.com/Path/To/index.shtml
> > found existing file /domain.com/Path/To/index.htm
> > Read 5785 from document
> > Read a total of 5785 bytes
> > Tag: <html>, matched -1
> > Tag: <head>, matched -1
> > Tag: <title>, matched 0
> > word: Handouts@7
> > Tag: </title>, matched 1
> > title: Handouts
> > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> > word: Basic@696
> > word: UNIX@698
> > word: Commands@700
> > Tag: </a>, matched 3
> > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> > Commands)
> > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > ----------------------------------8<-------------------------------
> > ...
> > ----------------------------------8<-------------------------------
> > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > Local retrieval failed, trying HTTP
> > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> > User-Agent: htdig/3.1.6 (Se...@do...)
> > Referer: http://domain.com/Path/To/
> > Host: domain.com
> >
> > Header line: HTTP/1.1 404 Not Found
> > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> > ----------------------------------8<-------------------------------
> >
> > And it reports:
> > ----------------------------------8<-------------------------------
> > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> > ----------------------------------8<-------------------------------
>
> What most browsers do with unencoded spaces within URLs is a violation of
> RFC 1738 and RFC 2396. htdig does the correct thing, if not what some
> users would prefer it did. You can of course patch the URL class to leave
> the spaces in there, in violation of the standard, to conform with the
> incorrect behaviour of most browsers and, apparently, some really bad
> HTML code generators. That would save you from having to fix all the bad
> HTML code you're indexing. Spaces within URLs should always always be
> encoded as %20.
>
> See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
>
> My recommendation, if you have a choice, is to avoid spaces in filenames
> altogether, because they cause all sorts of grief. Some caching proxy
> servers mess up URLs with spaces, even if the space is properly encoded
> as %20.

You are absolutely right. I made a patch from your tips in the above thread:

-----------------------8<-----------------------
*** htlib/URL.cc.orig	Thu Feb 7 17:15:38 2002
--- htlib/URL.cc	Tue Mar 12 12:54:45 2002
***************
*** 75,81 ****
  URL::URL(char *ref, URL &parent)
  {
      String temp(ref);
-     temp.remove(" \r\n\t");
      ref = temp;
      _host = parent._host;
--- 75,82 ----
  URL::URL(char *ref, URL &parent)
  {
      String temp(ref);
+     temp.remove("\r\n\t");
+     temp.chop(' ');
      ref = temp;
      _host = parent._host;
***************
*** 249,255 ****
  void URL::parse(char *u)
  {
      String temp(u);
-     temp.remove(" \t\r\n");
      char *nurl = temp;

      //
--- 250,257 ----
  void URL::parse(char *u)
  {
      String temp(u);
+     temp.remove("\t\r\n");
+     temp.chop(' ');
      char *nurl = temp;

      //
-----------------------8<-----------------------

Applied it and randig, and waited for the dig to finish, and waited, and
waited, ...;( Finally I killed the process. I humbly switch my previous
+1 vote to -1.

Regards,

Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Gilles D. <gr...@sc...> - 2002-03-12 03:39:49
|
According to Emma Jane Hogbin:
> >I think your assessment of the problem, and proposed solution, are
> >both bang-on. The stuff between the <script> and </script> tag should
> >be stripped out entirely and not parsed for HTML tags.
>
> I just found this:
>
> noindex_start, noindex_end
>     type: string
>     used by: htdig
>     default: <!--htdig_noindex--> <!--/htdig_noindex-->
>     description: ...
>
> noindex_start: <SCRIPT
> noindex_end: </SCRIPT>
>
> Maybe the default could also have the example "<script" tags?
> Can you have multiple values for this?

No, right now there is support for only one value in each attribute. We've talked many times of extending it to support multiple values, but so far no one has taken the time to implement it. Ironically, I felt when getting 3.1.6 out that this was less of a priority now that the HTML parser had built-in support for ignoring stuff between <script> and </script> tags.

As for the default value, the idea was to have a custom set of tags that only htdig would recognize, and it seems from the e-mails we've seen on the list that these defaults are fairly widely used, so changing the defaults isn't likely to be popular.

> >Of course, you can avoid this problem in your HTML if you properly put
> >inline JavaScript code inside an HTML comment. E.g.:
>
> I didn't think I had my JS set up this way but when I went in to check I
> actually have the proper comments in there. I thought that the parser read
> HTML comments but didn't index them.... and this bug is to do with the
> parser getting stuck with stuff it's reading even if it's not indexing it.

htdig's HTML parser completely strips out HTML comments, if these are properly formed (i.e. they have an even number of dashes), so if you have valid HTML comment delimiters around your JavaScript, the parser shouldn't be thrown off by any "<" signs in the code. Can you show us an example of the comment delimiters you're using around the JavaScript?

... and in a later message ...

> This little bit (from Geoff's old email) is the same as setting
> noindex_start and end attributes, right?
>
>     case 29: // "script"
>         noindex |= TAGscript;
>         nofollow |= TAGscript;
>         break;

No, these are two very different things. The noindex_start and noindex_end handling is done in the first pass through the HTML, at the same time the comments are stripped out, and the text between these tags is completely removed from the in-memory copy of the document. Setting the noindex and nofollow flags is done in the second pass, and during that pass the parser still looks for tags in the code even when noindex is set, because it may be the matching closing tag that it finds at that point. Geoff's suggestion was to handle <script> and <style> tags in the first pass, rather than in the second.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Joe R. J. <jj...@cl...> - 2002-03-11 23:59:41
|
On Mon, 11 Mar 2002, Gilles Detillieux wrote:

> Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST)
> From: Gilles Detillieux <gr...@sc...>
> To: jj...@cl...
> Cc: Geoff Hutchison <ghu...@ws...>,
>     htd...@li...
> Subject: Re: [htdig] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote:
> > > > Unfortunately htdig removes the space. and looks for "filename.html" and
> > > > reports:
> > > >
> > > > Not found: http://domain.com/some/path/filename.html Ref:
> > > > http://domain.com/some/path/file.html
> > >
> > > Joe, I think you should understand that this isn't much help as a bug
> > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the
> > > space seem to "disappear?" Is it when it first encounters the link
> > > (parser error), as it normalizes and accepts/rejects the URL (retriever
> > > or URL parser error) or as it tries to fetch it?
> > >
> > > A bit more feedback would go a long way towards debugging this.
> >
> > Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one
> > file:
> > ----------------------------------8<-------------------------------
> > 0:0:0:http://domain.com/Path/To/: Trying local files
> > tried local file /domain.com/Path/To/index.html
> > tried local file /domain.com/Path/To/index.shtml
> > found existing file /domain.com/Path/To/index.htm
> > Read 5785 from document
> > Read a total of 5785 bytes
> > Tag: <html>, matched -1
> > Tag: <head>, matched -1
> > Tag: <title>, matched 0
> > word: Handouts@7
> > Tag: </title>, matched 1
> > title: Handouts
> > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> > word: Basic@696
> > word: UNIX@698
> > word: Commands@700
> > Tag: </a>, matched 3
> > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> > Commands)
> > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > ----------------------------------8<-------------------------------
> > ...
> > ----------------------------------8<-------------------------------
> > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > Local retrieval failed, trying HTTP
> > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> > User-Agent: htdig/3.1.6 (Se...@do...)
> > Referer: http://domain.com/Path/To/
> > Host: domain.com
> >
> > Header line: HTTP/1.1 404 Not Found
> > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> > ----------------------------------8<-------------------------------
> >
> > And it reports:
> > ----------------------------------8<-------------------------------
> > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> > ----------------------------------8<-------------------------------
>
> What most browsers do with unencoded spaces within URLs is a violation of
> RFC 1738 and RFC 2396. htdig does the correct thing, if not what some
> users would prefer it did. You can of course patch the URL class to leave
> the spaces in there, in violation of the standard, to conform with the
> incorrect behaviour of most browsers and, apparently, some really bad
> HTML code generators. That would save you from having to fix all the bad
> HTML code you're indexing. Spaces within URLs should always always be
> encoded as %20.
>
> See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
>
> My recommendation, if you have a choice, is to avoid spaces in filenames
> altogether, because they cause all sorts of grief. Some caching proxy
> servers mess up URLs with spaces, even if the space is properly encoded
> as %20.

I am sorry I missed that thread. I believe the above situation is
certainly becoming more and more pervasive. I vote +1 to tweak the HTML
parser to handle space in filenames.

Regards,

Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Gilles D. <gr...@sc...> - 2002-03-11 23:41:32
|
According to Jim Cole:
> results in the parser missing all remaining links on the page. If
> the '<' is removed or replaced (e.g. with a '>'), the page is
> properly indexed. This occurs with 3.1.6; I haven't tried it with
> a 3.2.0b4 snapshot.

Yes, the <script> tag handling in 3.1.6 is a backport from 3.2.0b4 and possibly earlier betas, so they will likely do the same thing.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Gilles D. <gr...@sc...> - 2002-03-11 23:38:34
|
According to Geoff Hutchison:
> On Friday, March 8, 2002, at 05:20 PM, Jim Cole wrote:
> > It does look like there is a problem with the parser. If a '<'
> > occurs in a script element, it appears that the parser becomes
> > somewhat confused with regard to the remaining document content.
> > For example
>
> Yes, this sounds like a bug to me. Actually, the <script> sections and
> probably other sections as well should be simply skipped by the parser.
> Right now the code does this:
>
>     case 29: // "script"
>         noindex |= TAGscript;
>         nofollow |= TAGscript;
>         break;
>
> In short, the parser doesn't *index* the bits inside <script></script>
> tags, but it does *look* at them. So it hit that "<" character and
> figured it was a new tag.
>
> I would think that we want to treat <script> and probably <style>
> sections like comments--find the ending tag and completely ignore
> everything inside.

I think your assessment of the problem, and proposed solution, are both bang-on. The stuff between the <script> and </script> tags should be stripped out entirely and not parsed for HTML tags.

Of course, you can avoid this problem in your HTML if you properly put inline JavaScript code inside an HTML comment. E.g.:

<script>
<!--
JavaScript code here
// -->
</script>

I'm amazed at how frequently people/programs fail to do this. It's what you're supposed to do to avoid problems with non-JavaScript-aware web clients.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Gilles D. <gr...@sc...> - 2002-03-11 23:29:42
|
According to Michael Clarke:
> When we dig our intranet (4 gigs of data) apche web logs become chocca
> with requests. So much that we have to purge thelogs each dig. Is
> there a way of minimising requests / digging so apache doesn't record
> soo much data?

See http://www.htdig.org/attrs.html#local_urls

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
|
|
From: Williams, D. A. - D. <DAW...@aa...> - 2002-03-11 22:59:52
|
Michael,

Apache documents an optional module which allows you to tweak the environment at: http://httpd.apache.org/docs/mod/mod_setenvif.html

I use that and the CustomLog conf setting to not log internal requests:

SetEnvIf Remote_Addr "10\.10\.." dont_log_local
CustomLog /var/log/httpd/access_log combined env=!dont_log_local

The Apache docs mention you can test for user agent as well, so I would expect testing for htdig should be pretty straightforward. See also: http://httpd.apache.org/docs/logs.html#conditional

Hope that helps a bit,
-David

==========================
The views, opinions, and judgments expressed are solely my own. Message contents have not been reviewed or approved by AARP.

-----Original Message-----
From: Michael Clarke [mailto:Mic...@ir...]
Sent: Monday, March 11, 2002 5:36 PM
To: htd...@li...
Subject: [htdig-dev] Diging and logging by aache

Hi all.

When we dig our intranet (4 gigs of data) apche web logs become chocca with requests. So much that we have to purge thelogs each dig. Is there a way of minimising requests / digging so apache doesn't record soo much data?

Michael Clarke
IRD Open Systems Team
Level 4, Telecom House
13-27 Manners Street
Wellington
Phone: +64 (04) 8031423
Mobile: +64 021 455 218
email: mic...@ir...
email: ma...@ir...

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev
|
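[Editor's sketch] Following David's pointer, a user-agent variant might look like the following. The dont_log_htdig variable name and the "htdig" pattern are illustrative, not from the original message; the verbose output earlier in this thread shows htdig sending a User-Agent of the form "htdig/3.1.6".

```apache
# Sketch: mark requests whose User-Agent contains "htdig", then
# exclude those requests from the access log.
SetEnvIf User-Agent "htdig" dont_log_htdig
CustomLog /var/log/httpd/access_log combined env=!dont_log_htdig
```

Matching on User-Agent rather than Remote_Addr keeps other traffic from the indexing host in the log, at the cost of trusting the crawler to identify itself.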
|
From: Michael C. <Mic...@ir...> - 2002-03-11 22:36:26
|
Hi all.

When we dig our intranet (4 gigs of data) apche web logs become chocca with requests. So much that we have to purge thelogs each dig. Is there a way of minimising requests / digging so apache doesn't record soo much data?

Michael Clarke
IRD Open Systems Team
Level 4, Telecom House
13-27 Manners Street
Wellington
Phone: +64 (04) 8031423
Mobile: +64 021 455 218
email: mic...@ir...
email: ma...@ir...
|
|
From: Neal R. <ne...@ri...> - 2002-03-11 07:20:38
|
> > With the branching of mifluz and libhtdig an LGPL license seems
> > more appropriate. Using the LGPL license would encourage the use of
> > libhtdig & the mifluz library by the widest possible set of developers,
> > hopefully enhancing the project's quality and feature set.
>
> I'm not sure that mifluz will be relicensed under the LGPL. Keep in mind
> that it's now "GNU mifluz."

I noticed that.. but what I did also notice is that I didn't see that the mifluz code officially had copyright assigned to the FSF. If not, Loic is free to relicense the source however he chooses as the copyright holder... and since mifluz is a derivative work of HtDig, this group would also have some say. Loic, comments or feedback?

> So in order to get a new license for ht://Dig as a whole, I think you'd
> need to see a few things happen first:
> a) mifluz dual licensed.
> b) Andrew agree to allow an LGPL for his code in ht://Dig.
>
> This is, of course, ignoring the question of other contributors, which
> are many.

Yes.. forming a steering committee gives that committee the 'authority' to make decisions on behalf of the "ht://Dig Group" which consists of people who have contributed code.

> I'm not a lawyer, nor do I want to be one. I would suggest that a scheme
> more in line with previous practice would be at a minimum that the
> htdig-dev list should vote on such a steering committee. This may, in
> fact, be little more than a "rubber stamp," but I would also think major
> decisions such as relicensing should require more general votes as well.
> (I look towards Debian, for example.)

A nomination and voting process (a la Debian) for the committee would certainly be a very fair way to do things!

> I don't think most of us would know the benefits/drawbacks of being an
> official non-profit. I am aware that this would probably entail some
> paperwork overhead in terms of tax purposes in the U.S. Also keep in
> mind that many contributors are not in the U.S. and as such may not wish
> to be bound to various U.S. regulations.

Yes.. it was just an idea. I'm not sure exactly what kind of organization Debian is, but it works for them, with all its worldwide contributors. OpenBSD looks to be a formal Canadian non-profit.

> I understand some of the motivations for such moves, but personally
> before there's much talk about licenses, I'd want to know that a license
> change would even be possible. If mifluz is completely GPL'ed, period,
> then there's not much a steering committee for ht://Dig could do.

Again, that depends on whether Loic has assigned the mifluz copyrights to the FSF. The FSF certainly encourages that, but there are examples of 'GNU' projects where that is not the case.

My main motivation is that a library-style release of htdig could have an LGPL and not compromise the spirit of the project. There are of course workarounds... making libhtdig/htdig a server... that are possible, but lots of work.

Thanks.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
|
|
From: Joe R. J. <jj...@cl...> - 2002-03-10 18:28:04
|
FYI,
Sun Mar 10 10:19:35 PST 2002
ssl.9 41
timet_enddate.1 28
Makefile.0 23
documentation.1 21
metadate.0 17
NUL.0 16
documentation.2 13
redirect.0 10
AdjustableLoggingPatch.tar.gz 7
titleSpace.0 2
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Geoff H. <ghu...@us...> - 2002-03-10 08:14:05
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug)
* Not all htsearch input parameters are handled properly: PR#648. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up.
(Should really only attempt to use SQL for doc_db and related, not word_db)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Geoff H. <ghu...@ws...> - 2002-03-09 19:53:20
|
On Friday, March 8, 2002, at 02:30 PM, Neal Richter wrote:

> This would mean that calling libhtdig would be problematic from PHP
> code. This is further complicated by the interpretive nature of PHP... a
> 'function call' in a PHP page is not a function call per se. It becomes a
> function call during the interpretation of the page, after the page (or
> the php.ini) loads the needed module (xxx.so).

I hate legal issues. I was tempted to be flippant and write that it's up to the user to decide how to use ht://Dig and PHP. But of course if you're writing PHP wrappers for libhtdig, there's a rather implicit use for them.

> With the branching of mifluz and libhtdig an LGPL license seems
> more appropriate. Using the LGPL license would encourage the use of
> libhtdig & the mifluz library by the widest possible set of developers,
> hopefully enhancing the project's quality and feature set.

I'm not sure that mifluz will be relicensed under the LGPL. Keep in mind that it's now "GNU mifluz."

For the moment, I'm going to ignore the question of contacting all copyright holders in ht://Dig and asking them about a dual license. I'm quite certain, in terms of number of lines of code outside of the Berkeley DB, that Andrew and Loic are probably the largest contributors. So in order to get a new license for ht://Dig as a whole, I think you'd need to see a few things happen first:

a) mifluz dual licensed.
b) Andrew agreeing to allow an LGPL for his code in ht://Dig.

This is, of course, ignoring the question of other contributors, which are many.

> Idea: Form an official steering committee consisting of the
> developers that have CVS commit access. This committee would become the
> representatives of the contributing developers as a whole. The committee
> would then have the power to make re-licensing decisions for
> "The ht://Dig Group" as listed in each source code file.

So far, the "steering" has been via the open htdig-dev mailing list. Such decisions usually involve pushing towards various releases, whether certain changes should be implemented in a "frozen" tree before a release, and the merits of certain approaches.

I'm not a lawyer, nor do I want to be one. I would suggest that a scheme more in line with previous practice would be, at a minimum, that the htdig-dev list should vote on such a steering committee. This may, in fact, be little more than a "rubber stamp," but I would also think major decisions such as relicensing should require more general votes as well. (I look towards Debian, for example.)

> any. If HtDig wanted to become an official entity (non-profit etc.) there
> would be costs, if not it may be as simple as posting the notice on the
> web-site.

I don't think most of us would know the benefits/drawbacks of being an official non-profit. I am aware that this would probably entail some paperwork overhead for tax purposes in the U.S. Also keep in mind that many contributors are not in the U.S. and as such may not wish to be bound to various U.S. regulations.

I understand some of the motivations for such moves, but personally, before there's much talk about licenses, I'd want to know that a license change would even be possible. If mifluz is completely GPL'ed, period, then there's not much a steering committee for ht://Dig could do.

-Geoff |
|
From: Geoff H. <ghu...@ws...> - 2002-03-09 19:36:02
|
On Friday, March 8, 2002, at 05:20 PM, Jim Cole wrote:

> It does look like there is a problem with the parser. If a '<'
> occurs in a script element, it appears that the parser becomes
> somewhat confused with regard to the remaining document content.
> For example

Yes, this sounds like a bug to me. Actually, the <script> sections, and probably other sections as well, should simply be skipped by the parser. Right now the code does this:

> case 29: // "script"
>   noindex |= TAGscript;
>   nofollow |= TAGscript;
>   break;

In short, the parser doesn't *index* the bits inside <script></script> tags, but it does *look* at them. So it hit that "<" character and figured it was a new tag. I would think that we want to treat <script> and probably <style> sections like comments: find the ending tag and completely ignore everything inside.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/ |
|
From: Jim C. <gre...@yg...> - 2002-03-08 23:20:31
|
Emma Jane Hogbin's bits of Thu, 7 Mar 2002 translated to:
>>OK, but what exactly is on the page? It certainly didn't find anything
>>significant to index or links other than the images you pointed out.
>
>The page has four bread crumb items, a bunch of image navigation buttons,
>eight left nav text links, and over 20 text links (in a list). None of the
>words on the page are getting put into the word db. i.e. the page has a
>list of Colleges and none of the names of the colleges show up when I do a
>search.
>
>>Either the HTML parser is missing a lot, or there isn't much on the page
>>to index.
>
>I think it's the first option, which scares me. :(
It does look like there is a problem with the parser. If a '<'
occurs in a script element, it appears that the parser becomes
somewhat confused with regard to the remaining document content.
For example
<head>
<title>Title</title>
<script language="javascript">
var i;
for ( i = 0; i < 5; i++ ) {}
</script>
</head>
results in the parser missing all remaining links on the page. If
the '<' is removed or replaced (e.g. with a '>'), the page is
properly indexed. This occurs with 3.1.6; I haven't tried it with
a 3.2.0b4 snapshot.
Assuming that this is in fact a bug rather than a misunderstanding
of expected functionality, and the cause of problem is not obvious,
I would be willing to do a bit of debugging.
Jim
|
|
From: Neal R. <ne...@ri...> - 2002-03-08 20:30:47
|
Hey,

So here's a topic for discussion. The libhtdig project I've been working on is basically repackaging htdig into a shared library. I will be finishing a set of PHP wrappers for libhtdig shortly. This raises a couple of questions:

1. The FSF states that the newest PHP 4.x license is incompatible with the GPL. It's basically a BSD-style license with an advertising clause, which the FSF doesn't like (the non-advertising BSD license is kosher with the GPL). This would mean that calling libhtdig would be problematic from PHP code. This is further complicated by the interpretive nature of PHP... a 'function call' in a PHP page is not a function call per se. It becomes a function call during the interpretation of the page, after the page (or the php.ini) loads the needed module (xxx.so).

2. The original intent of the HtDig project was to write a stand-alone package to do web-site searching... and the GPL is appropriate for this. With the branching of mifluz and libhtdig, an LGPL license seems more appropriate. Using the LGPL license would encourage the use of libhtdig and the mifluz library by the widest possible set of developers, hopefully enhancing the project's quality and feature set.

PHP 3.x, OpenOffice, Mozilla, and other projects have chosen a dual-license strategy. This could be an option here: either dual license the entire project, or relicense the inner portions of the project under the LGPL (leaving the major components of the individual binaries under the GPL). This approach would necessitate some care in bringing other GPLed code into HtDig in the future. If possible, the original copyright holder would need to be contacted to relicense the specific code under the LGPL.

3. Many Open Source projects have a kind of steering committee that is empowered to make these kinds of decisions... glibc, gcc, FreeBSD, X11, etc.

Idea: Form an official steering committee consisting of the developers that have CVS commit access. This committee would become the representatives of the contributing developers as a whole. The committee would then have the power to make re-licensing decisions for "The ht://Dig Group" as listed in each source code file. Note that the committee could have policies in place that respect the thoughts of the developers at large and make decisions accordingly. Having a steering committee doesn't imply draconian powers.

Forming the steering committee may involve a few steps; I will ask our Legal Department how this could be done. The initial feeling is that it's pretty easy to do... post notice on the website, etc. Please forward any other legal questions to me and I'll try to get answers. We do have a relationship with an IP lawyer familiar with the GPL if it comes to that. RightNow Tech would also consider covering any reasonable costs associated with any legal work required to form the steering committee, if any. If HtDig wanted to become an official entity (non-profit etc.) there would be costs; if not, it may be as simple as posting the notice on the web-site.

Thoughts? I'll be away from e-mail this weekend, so I'll respond to any questions directed to me on Monday.

Thanks, and give libhtdig_api.h a look!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
|
From: Neal R. <ne...@ri...> - 2002-03-08 19:45:53
|
Hey all,

Soon there will be a new snapshot of the libhtdig project at http://www.htdig.org/files/contrib/other/

Status: The indexing (htdig), merging (htmerge), and searching (htsearch) API calls are functional. The htfuzzy API is not yet working.

TODO:
- Make APIs of the other utils binaries.
- Implement 'indicators'. An indicator is returned by an xxxx_open() call, used by any follow-up calls, and freed by calling xxxx_close(). This is similar to functionality in SQL libraries. This kind of feature is a first step in making libhtdig usable in a server or apache-module type of setting.

I'm posting a follow-up e-mail shortly about licensing questions.

Thanks.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
|
From: Neal R. <ne...@ri...> - 2002-03-08 19:34:40
|
Hello,

Available very soon at http://www.htdig.org/files/contrib/wordlists/ will be a file called miltilang_stopwords.tgz

This file contains separate stopword files for these languages:

cs_CZ de_DE el_GR en_GB en_US es_ES fi_FI fr_CA fr_FR it_IT nl_NL pt_BR sv_SE

The majority of the foreign-language files contain the English translation in a comment above the word. When appropriate, the words retain the original accenting. These lists are quite aggressive, so I would recommend editing them to your needs before using them with htdig.

Happy stop-wording!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
|
From: Steve F. <ht...@ht...> - 2002-03-08 02:18:20
|
Hi all,

Just joined the list, and I have a question about custom headers. I have several virtual hosts on one machine and have successfully limited searches for each host. However, I would like to have a different header for the results from different hosts, mainly so I can add a logo with a link back to the home page for the particular site. I have looked through the docs and can't see a way to do this. Is it possible?

Regards,
Steve

--
-------------------------------------------------
"Minds are like parachutes, they work best when open"
Support free speech; visit http://www.efa.org.au/
Heads Together Systems Pty Ltd http://www.hts.com.au
Email: ht...@ht... Tel: 612 9982 6767 Fax: 612 9981 3081 |
|
From: Geoff H. <ghu...@ws...> - 2002-03-06 19:55:20
|
On Wed, 6 Mar 2002, Willy Calderon wrote:

> OK, I setenv LDFLAGS -L/usr/lib32 and then ./configure and it ran. Again,
> during the make process, I get another error, except this time with previous
> warnings in /usr/lib32/
> [snip]
> ld32: WARNING 84 : /usr/lib32/libsocket.so is not used for resolving any
> symbol.

These warnings are fine; the libraries are included even though they may not be needed. The IRIX linker is much more verbose about this than any others.

> ld32: ERROR 33 : Unresolved text symbol "alloca" -- 1st referenced by
> ../htlib/libht.a(regex.o).

Ugh. Strange that the configure script managed to get things to work, but it doesn't here. What you want to do is remove regex.* and take the regex target out of the htlib Makefile. This should use the built-in IRIX regex routines rather than the GNU regex that's included.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/ |