Archived messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | | | | | | | | | | 47 | 74 | 66 |
| 2002 | 95 | 102 | 83 | 64 | 55 | 39 | 23 | 77 | 88 | 84 | 66 | 46 |
| 2003 | 56 | 129 | 37 | 63 | 59 | 104 | 48 | 37 | 49 | 157 | 119 | 54 |
| 2004 | 51 | 66 | 39 | 113 | 34 | 136 | 67 | 20 | 7 | 10 | 14 | 3 |
| 2005 | 40 | 21 | 26 | 13 | 6 | 4 | 23 | 3 | 1 | 13 | 1 | 6 |
| 2006 | 2 | 4 | 4 | 1 | 11 | 1 | 4 | 4 | | 4 | | 1 |
| 2007 | 2 | 8 | 1 | 1 | 1 | | 2 | | 1 | | | |
| 2008 | 1 | | 1 | 2 | | | 1 | | 1 | | | |
| 2009 | | | 2 | | 1 | | | | | | | |
| 2010 | | | | | | | | | | | | 1 |
| 2011 | | | 1 | | 1 | | | | | 1 | | |
| 2012 | | | | | | | 1 | | | | | |
| 2013 | | | | 1 | | | | | | | | |
| 2016 | 1 | | | | | | | | | | | |
| 2017 | | | | | | | | | | | 1 | |
|
From: Neal R. <ne...@ri...> - 2003-10-10 19:58:47
|
Thanks! This is a lot of work. Everyone: please let me know what kind of time you'd be willing to put in to get this stuff tested. When we do the code-freeze on Monday, other than fixing the Jim/Gilles/robots.txt issue, we should be testing and fixing until we've covered the test cases. Let's get this done so we can talk about 4.0 and restructuring.

Neal.

On Wed, 8 Oct 2003, Lachlan Andrew wrote:
> On Tue, 7 Oct 2003 21:55, Lachlan Andrew wrote:
> > I'll start by tentatively taking attributes starting with A-D, and
> > then post a list of those which I *don't* think I'll be able to test.
>
> Scratch that...
>
> Attached is a partial grouping of attributes which may be able to be
> tested together. Feel free to re-group them.
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Neal R. <ne...@ri...> - 2003-10-10 19:54:33
|
> 1) Recognize that regex is the wrong tool for the job, and go back to
> using StringMatch for robots.txt handling as in 3.2. However, keep
> the test of the URL's path in Retriever::IsValidURL(), and make sure
> Server::push() tests the path only and not the full URL.
>
> 2) Stick to using regex, and Jim's patch, but fix Server::push() to
> test the path. Also, it would be a good idea to add the meta character
> escaping code from my patch.
>
> 3) Back out Jim's patch and apply mine, which addresses all the concerns
> I brought up, and is the simplest fix to the existing code (pre-Jim's
> patch). My fix adds a pattern to the regex to skip over the protocol
> and server name parts of the URL, so the match is effectively anchored to
> the path component of the URL, even though the pattern matching is done
> (consistently) on the whole URL.
>
> Whichever of the 3 you choose, it's going to need testing in any case.
> Fix 1 would be the ideal one, I think, but I didn't (and still don't)
> have time to do that, so I opted for the simple fix in the above
> mentioned e-mail. Given that fix 2 is now partially implemented,
> maybe the best course of action would be to follow through and address
> the 2 missing pieces. My fix (no. 3) has the disadvantage that future
> developers will be scratching their heads trying to figure out my regular
> expression for skipping to the path portion of the whole URL.

If your fix #3 is fully implemented, versus Jim's needing more work, I'd rather go with #3.

Jim: what is your estimate for finishing #2? I don't want to step on toes, just get 3.2 out. After 3.2 we all need to have a pow-wow about what the next step is.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
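For readers following the thread, here is a minimal, self-contained sketch of what option 1 boils down to: an anchored, literal prefix comparison of each Disallow entry against the URL's path component only. The helper names are invented for illustration; this is not ht://Dig's StringMatch or URL code.

```cpp
// Illustrative only: anchored, literal matching of robots.txt Disallow
// entries against the path component of a URL (the behaviour option 1
// describes). Not ht://Dig source code.
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: return the path component of an absolute URL.
static std::string url_path(const std::string &url)
{
    std::string::size_type scheme = url.find("://");
    if (scheme == std::string::npos)
        return url;                                // already a bare path
    std::string::size_type slash = url.find('/', scheme + 3);
    return slash == std::string::npos ? std::string("/") : url.substr(slash);
}

// True if the path starts with any Disallow entry (anchored match).
static bool disallowed(const std::string &url,
                       const std::vector<std::string> &disallow)
{
    std::string path = url_path(url);
    for (size_t i = 0; i < disallow.size(); i++)
        if (path.compare(0, disallow[i].size(), disallow[i]) == 0)
            return true;
    return false;
}

int main()
{
    std::vector<std::string> disallow;
    disallow.push_back("/cgi-bin/");
    // Disallowed: the path itself starts with /cgi-bin/
    std::cout << disallowed("http://example.com/cgi-bin/a", disallow) << "\n";
    // Allowed: /cgi-bin/ appears in the path but not at its start, which is
    // exactly the false rejection an unanchored whole-URL match would cause.
    std::cout << disallowed("http://example.com/docs/cgi-bin/a", disallow) << "\n";
    return 0;
}
```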
|
From: Gilles D. <gr...@sc...> - 2003-10-08 20:59:16
|
According to Jim Cole:
> The patch I provided is apparently not a complete fix. Gilles shot it
> down, pointing out that it misses at least one case. He offered an
> alternative patch, but asked for feedback before committing it. I don't
> think the person that initially reported the problem ever responded
> regarding either patch. This information never ended up being appended
> to the bug report; my bad I guess since I opened it. Sorry.
>
> Gilles' response can be found at
> http://sourceforge.net/mailarchive/message.php?msg_id=5470006.
>
> I don't know whether it makes more sense to undo the patch and reopen
> the report or accept the partial fix and open a new one for the missing
> case. To be honest I never got around to taking the time to understand
> the part of the problem that my initial patch missed.
...
> > Fix robot exclusions. Jim Cole's patch with bug report #765726.

I think properly explaining it requires a bit of a history lesson.

Back in 3.1.x, htdig only checked against the disallow list during a Server::push(), and it only checked the url.path() portion, not the full URL. It did this using StringMatch::Compare(), rather than a regex, so it was an "anchored" match - i.e. the pattern had to match the path from the start. It worked, but the problem was that by the time htdig did the Server::push(), it had already created a db.docdb entry for the URL, which htmerge had to delete, which led to a lot of confusion because the message from htmerge was ambiguous as to the reason for deleting the record.

Enter 3.2, and in Dec. 1999, Geoff made some pretty radical changes to robots.txt handling, including using regex (though I'm still not sure why*, as robots.txt entries don't use regular expressions), and checking for disallowed paths in Retriever::IsValidURL(). The latter change was to check the URLs earlier on, before a db.docdb entry is needlessly created, to avoid the problem above, and to give a meaningful error message saying why the URL is rejected. All that is great, but the change to regex introduced this bug: the match was now an unanchored match of the whole URL, rather than an anchored match of the path only. This gave rise to the problem of false negatives, rejecting a URL if a disallowed path matched any portion of the URL's path, or even the server name. The change also left the test in Server::push(), so that URLs are essentially screened twice. What's not clear to me is if this test is still needed, or whether the test in Retriever::IsValidURL() is now sufficient. Finally, the change to regex matching also may lead to another potential problem - characters in Disallow lines could be mistaken for regex meta characters, as they're not escaped. I don't know how likely this is, but it's a theoretical possibility.

Jim, your patch fixes the problem with false negatives, by forcing an anchored search of the path (by inserting ^ before each Disallowed path). The inconsistency here is that the match is done on the path only in the test in Retriever::IsValidURL() - the test in Server::push() still tries the (now anchored) match on the whole URL. If there was indeed a reason to keep this old test in Server::push(), that test is now broken. This patch also doesn't do anything about escaping regex meta characters in Disallow patterns.

I can see taking one of the following courses of action to remedy the problem:

1) Recognize that regex is the wrong tool for the job, and go back to using StringMatch for robots.txt handling as in 3.2. However, keep the test of the URL's path in Retriever::IsValidURL(), and make sure Server::push() tests the path only and not the full URL.

2) Stick to using regex, and Jim's patch, but fix Server::push() to test the path. Also, it would be a good idea to add the meta character escaping code from my patch.

3) Back out Jim's patch and apply mine, which addresses all the concerns I brought up, and is the simplest fix to the existing code (pre-Jim's patch). My fix adds a pattern to the regex to skip over the protocol and server name parts of the URL, so the match is effectively anchored to the path component of the URL, even though the pattern matching is done (consistently) on the whole URL.

Whichever of the 3 you choose, it's going to need testing in any case. Fix 1 would be the ideal one, I think, but I didn't (and still don't) have time to do that, so I opted for the simple fix in the above mentioned e-mail. Given that fix 2 is now partially implemented, maybe the best course of action would be to follow through and address the 2 missing pieces. My fix (no. 3) has the disadvantage that future developers will be scratching their heads trying to figure out my regular expression for skipping to the path portion of the whole URL.

As a separate design decision, it might be worth trying to determine for sure whether the test in Server::push() is actually needed or not. If it's redundant, it should either be tossed out for the sake of efficiency, or kept for the sake of "defensive programming" if it doesn't really add any inefficiency. If it's kept, though, it should be kept in a way that's consistent with the new test in Retriever::IsValidURL(). As it stands now, as of this Sunday's CVS commit, it's confusing and misleading to have a functional test of the path in Retriever::IsValidURL(), and a non-functional test of the whole URL in Server::push().

* Pure speculation as to the motive for moving to regex from StringMatch: maybe Geoff intended eventually to eliminate the StringMatch class altogether from the ht://Dig code base. I know it had caused us some grief in the early 3.1.x development, but I thought we had gotten past that. Maybe there were other issues or potential problems still to be faced. Geoff, care to comment?

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada) |
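A self-contained sketch of the idea behind fix no. 3, combined with the meta-character escaping mentioned in fix no. 2, using POSIX regex calls. The leading pattern and the helper are illustrative assumptions, not the actual patch:

```cpp
// Illustrative only (not the committed patch): build one extended regex in
// which a leading "^[a-z]*://[^/]*" consumes the protocol and server name,
// so each escaped Disallow path must match right at the start of the URL's
// path component even though the whole URL is tested.
#include <regex.h>
#include <iostream>
#include <string>
#include <vector>

// Escape characters that POSIX extended regex would treat as meta characters.
static std::string escape_meta(const std::string &s)
{
    const std::string meta = "\\^$.[]|()*+?{}";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); i++) {
        if (meta.find(s[i]) != std::string::npos)
            out += '\\';
        out += s[i];
    }
    return out;
}

int main()
{
    std::vector<std::string> disallow;
    disallow.push_back("/cgi-bin/");
    disallow.push_back("/tmp/");

    std::string pattern = "^[a-z]*://[^/]*(";     // skip scheme and host
    for (std::string::size_type i = 0; i < disallow.size(); i++) {
        if (i) pattern += "|";
        pattern += escape_meta(disallow[i]);
    }
    pattern += ")";

    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    const char *urls[] = { "http://example.com/cgi-bin/test.cgi",
                           "http://example.com/docs/cgi-bin/test.cgi" };
    for (int i = 0; i < 2; i++)
        std::cout << urls[i]
                  << (regexec(&re, urls[i], 0, 0, 0) == 0 ? "  -> disallowed\n"
                                                          : "  -> allowed\n");
    regfree(&re);
    return 0;
}
```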
|
From: Neal R. <ne...@ri...> - 2003-10-08 19:26:01
|
Hey,

Not at the moment. HtDig isn't UTF compatible yet. I would look into Namazu if I were you: http://www.searchtools.com/tools/namazu.html

Thanks

On Tue, 7 Oct 2003, C.J. Collier wrote:
> Hey all,
>
> Can anyone tell me if htdig supports multiple locales at the same time?
> I've got a contract with Japan's Computational Biology Research Center,
> and they're looking for support for CJK locales in their search engine.
>
> Is htdig the right product for the project?

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Neal R. <ne...@ri...> - 2003-10-08 19:16:14
|
> I think this only became an issue because of persistent connections.
> Correct me if I'm wrong, but I think htdig's behaviour in the past
> (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
> and upon seeing the headers if it decided it didn't need to refetch the
> file, it would simply close the connection right away and not read the
> stream of data for the file. No wasted bandwidth, but maybe it caused
> some unnecessary overhead on the server, which probably started serving
> up each file (including running CGI scripts if that's what made the page)
> before realising the connection was closed.

True, but we can override the current setting if '-i' is given to force head_before_get=false.

> The critical part of the above, which I was trying to explain before, is
> point 4 (a). If a document hasn't changed, htdig would need somehow to
> keep track of every link that document had to others, so that it could
> keep traversing the hierarchy of links as it crawls its way through
> to every "active" page on the site. That would require additional
> information in the database that htdig doesn't keep track of right now.
> Right now, the only way to do a complete crawl is to reparse every
> document.

Yep, this is true. On the plus side, if we do keep and maintain that list, I've got a stack of research papers talking about what can be done with that list to make searching better. It opens up a world of possibilities for improving relevance ranking, learning relationships between pages, etc.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Lachlan A. <lh...@us...> - 2003-10-07 22:28:42
|
On Tue, 7 Oct 2003 21:55, Lachlan Andrew wrote:
> I'll start by tentatively taking attributes starting with A-D, and
> then post a list of those which I *don't* think I'll be able to
> test.

Scratch that...

Attached is a partial grouping of attributes which may be able to be tested together. Feel free to re-group them.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org) |
|
From: C.J. C. <cjc...@co...> - 2003-10-07 19:48:21
|
Hey all, Can anyone tell me if htdig supports multiple locales at the same time? I've got a contract with Japan's Computational Biology Research Center, and they're looking for support for CJK locales in their search engine. Is htdig the right product for the project? Cheers, C.J. |
|
From: Gilles D. <gr...@sc...> - 2003-10-07 18:06:13
|
According to Gabriele Bartolini:
> > Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
> > server is kind enough to give us the timestamp on the document in the header.
> > If the timestamps are the same we don't bother to download it.
>
> Yep, you are right. I remember that was one of the reasons why I wrote the
> code for the 'HEAD' method (also to avoid downloading an entire document
> whose content-type is not parsable).

I think this only became an issue because of persistent connections. Correct me if I'm wrong, but I think htdig's behaviour in the past (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET, and upon seeing the headers if it decided it didn't need to refetch the file, it would simply close the connection right away and not read the stream of data for the file. No wasted bandwidth, but maybe it caused some unnecessary overhead on the server, which probably started serving up each file (including running CGI scripts if that's what made the page) before realising the connection was closed.

Now, head_before_get=TRUE would add a bit of bandwidth for htdig -i, because it would get the headers twice, but it's obviously a big advantage for update digs, especially when using persistent connections. Closing the connection whenever htdig decides not to fetch a file would kill any speed advantage you'd otherwise gain from persistent connections, not to mention the extra load on the server.

> > > I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> > > does NOT change. If Y is the only document with a link to X, and Y does
> > > not change, it will still have the link to X, so X is still "valid".
> > > However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> > > then how will it find the link to X to validate X's presence in the db?
>
> I must admit I am not very comfortable with the incremental indexing code.
> Anyway, when I was thinking of the same procedure for ht://Check (not yet
> done, as I said) I came up with this (I will try to stay on a logical level):
>
> 1) As you said, mark all the documents as, let's say, 'Reference_obsolete'
> 2) Read the start URL and mark all the URLs in the start URL to be
>    retrieved (eventually add them in the index of documents)
> 3) Loop until there are no URLs to be retrieved
>
> 4) For every URL, through a pre-emptive head call, get to know if it is
>    changed:
>       a) not changed: let's get all the URLs linked to it and mark them
>          "to be retrieved" or something like that
>       b) yes: let's download it again and mark all the new links as "to
>          be retrieved"
>
> 5) Purge all the obsolete URLs
>
> This approach would solve your second "flaw" Neal (I guess so).

The critical part of the above, which I was trying to explain before, is point 4 (a). If a document hasn't changed, htdig would need somehow to keep track of every link that document had to others, so that it could keep traversing the hierarchy of links as it crawls its way through to every "active" page on the site. That would require additional information in the database that htdig doesn't keep track of right now. Right now, the only way to do a complete crawl is to reparse every document.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada) |
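To make the trade-off concrete, here is a small sketch of the decision head_before_get enables on an update dig. The function and variable names are invented for illustration; they are not htdig's actual Transport or Retriever interfaces:

```cpp
// Sketch only: decide whether a GET is needed, given the Last-Modified time
// reported by a HEAD request and the timestamp stored in db.docdb.
#include <ctime>
#include <iostream>

static bool need_get(time_t last_modified,   // 0 if the server sent no header
                     time_t stored_time,     // time recorded at the last dig
                     bool   initial_dig)     // true when htdig -i is used
{
    if (initial_dig)
        return true;                 // -i re-fetches everything anyway
    if (last_modified == 0)
        return true;                 // no timestamp: play it safe and GET
    return last_modified > stored_time;   // unchanged documents are skipped
}

int main()
{
    time_t stored = 1065000000;      // arbitrary example timestamp
    std::cout << need_get(stored, stored, false) << "\n";        // 0: skip GET
    std::cout << need_get(stored + 3600, stored, false) << "\n"; // 1: re-fetch
    std::cout << need_get(stored, stored, true) << "\n";         // 1: -i dig
    return 0;
}
```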
|
From: Lachlan A. <lh...@us...> - 2003-10-07 12:02:55
|
On Mon, 6 Oct 2003 05:08, Neal Richter wrote:
> Mutiny? Ok, Pirate Lachlan.

*grin* I don't mean to step on the toes of those who have worked very hard for many years. Apologies to them if I was curt.

> I think the issue is that we don't
> have a plan to get it to the next step...

Good point.

> I suggest that on next Sunday Oct 12 we
> institute a 'code-freeze' and spend the remainder of Oct testing
> the code and call it 3.2RC1. 3.2RC1 goes out Sun Nov 2.

Good idea. Do you think the db.words.db bug will be sorted out by then? Does anyone have access to Cygwin?

> Proposed plan:
>
> Testing would consist of us figuring out some way to divide up
> test points. The basic idea is to go through the process of
> changing the configuration file in reasonable ways and
> indexing/searching to formally exercise those verbs. Ideally
> we'd like to test each of the approx 200 verbs. A similar method
> for each executable's command-line switches is needed.

Thanks for taking the lead on this, Neal. The testing sounds like a big job!

I'll start by tentatively taking attributes starting with A-D, and then post a list of those which I *don't* think I'll be able to test.

Would it be worth writing extra scripts in .../test/ so that we can re-test easily after bug fixes and before future releases? Or would that be more effort than it is worth?

> The next step would be devoting ourselves to quickly handling
> bugs that come in from users from some specified period and release
> '3.2.1'. I put this in so we don't leave all this work up to
> Gilles, our devoted maintainer of 3.1.X.

Hear, hear!!

> Formal testing is yucky and delays us stamping it 3.2-final, but
> I'd rather spend a month doing this than see silly errors found by
> a score of endusers.

Absolutely.

> The next step would be coming up with a test plan, basically a list
> of test points so we can divvy up the work. This needs to be
> complete by next weekend.

Aye, aye, Cap'n!
Lachlan

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org) |
|
From: Joe R. J. <jj...@cl...> - 2003-10-06 21:19:42
|
On Mon, 6 Oct 2003, Gilles Detillieux wrote:
> Date: Mon, 6 Oct 2003 15:29:20 -0500 (CDT)
> From: Gilles Detillieux <gr...@sc...>
> To: Gabriele Bartolini <bar...@in...>
> Cc: lh...@us..., htd...@li...
> Subject: Re: [htdig-dev] Release plans?
>
> According to Gabriele Bartolini:
> > >I vote that, once the Windows db.words.db bug is ironed out, we
> > >release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could
> > >they please mail it to the group?
> >
> > I vote +1.
> >
> > I guess the actual snapshot is the best 3.2 version so far. So ... better a
> > 3.2.0b5 than a buggy 3.2.0b3 still around ...
>
> I'd agree! Lachlan & Neal, you'd probably know better than anyone whether
> the code is ready for a beta release, as you've done most of the work on it
> in the past several months. If you say it's ready, then I say go for it.
>
> +1
+1
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Gilles D. <gr...@sc...> - 2003-10-06 20:29:34
|
According to Gabriele Bartolini:
> > I vote that, once the Windows db.words.db bug is ironed out, we
> > release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could
> > they please mail it to the group?
>
> I vote +1.
>
> I guess the actual snapshot is the best 3.2 version so far. So ... better a
> 3.2.0b5 than a buggy 3.2.0b3 still around ...

I'd agree! Lachlan & Neal, you'd probably know better than anyone whether the code is ready for a beta release, as you've done most of the work on it in the past several months. If you say it's ready, then I say go for it.

+1

As for who holds the keys, it's really up to the developers as a group to decide on these matters. Beyond that, the only holding of the keys there's been is that Geoff has personally put the finishing touches on each release since 3.1.0b1, so he knows best the ins and outs of how it's done. If any of you want to take that over, just say the word, and I'm sure Geoff would share with you his procedures, scripts, checklists, etc. (It would be good to have all that on the developer web site for posterity in any case, don't you think?)

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada) |
|
From: Jim C. <gre...@yg...> - 2003-10-05 23:32:29
|
The patch I provided is apparently not a complete fix. Gilles shot it down, pointing out that it misses at least one case. He offered an alternative patch, but asked for feedback before committing it. I don't think the person that initially reported the problem ever responded regarding either patch. This information never ended up being appended to the bug report; my bad I guess since I opened it. Sorry.

Gilles' response can be found at http://sourceforge.net/mailarchive/message.php?msg_id=5470006.

I don't know whether it makes more sense to undo the patch and reopen the report or accept the partial fix and open a new one for the missing case. To be honest I never got around to taking the time to understand the part of the problem that my initial patch missed.

Jim

On Sunday, October 5, 2003, at 04:46 AM, Lachlan Andrew wrote:
> htdig/htdig Retriever.cc Server.cc
> Update of /cvsroot/htdig/htdig/htdig
> In directory sc8-pr-cvs1:/tmp/cvs-serv13262/htdig
>
> Modified Files:
> 	Retriever.cc Server.cc
> Log Message:
> Fix robot exclusions. Jim Cole's patch with bug report #765726. |
|
From: Neal R. <ne...@ri...> - 2003-10-05 19:10:52
|
On Sun, 5 Oct 2003, Lachlan Andrew wrote:
> Greetings all,
>
> Neal recently suggested releasing a new interim release in September.
> Since yet another deadline has passed, could I ask that those who
> hold the "keys" to www.htdig.org set some guidelines for when we
> can release the next beta? It was suggested that we, the developers,
> do that. The problem is that we can't enforce the timelines...

Mutiny? Ok, Pirate Lachlan.

I think the issue is that we don't have a plan to get it to the next step...

> I vote that, once the Windows db.words.db bug is ironed out, we
> release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could
> they please mail it to the group?

I second this idea and suggest that on next Sunday, Oct 12, we institute a 'code-freeze' and spend the remainder of Oct testing the code and call it 3.2RC1. 3.2RC1 goes out Sun Nov 2.

Proposed plan:

Testing would consist of us figuring out some way to divide up test points. The basic idea is to go through the process of changing the configuration file in reasonable ways and indexing/searching to formally exercise those verbs. Ideally we'd like to test each of the approx 200 verbs. A similar method for each executable's command-line switches is needed. We could probably vote on proposed changes to default config values.

Of course we need to test-index some websites, but doing that on a large scale is difficult... I propose we set some reasonable goal here and handle indexing/parsing bugs as we see them after release.

The goals for 3.2RC2 are to fix any bugs that come up. The only allowable commits in this time-frame are those that fix bugs or clean up memory leaks. When it's ready we institute a 'code-freeze' and formally test the parts affected by bug fixes. At the end of this we can vote for calling it 3.2 final.

The next step would be devoting ourselves to quickly handling bugs that come in from users for some specified period and release '3.2.1'. I put this in so we don't leave all this work up to Gilles, our devoted maintainer of 3.1.X.

Formal testing is yucky and delays us stamping it 3.2-final, but I'd rather spend a month doing this than see silly errors found by a score of end users.

The next step would be coming up with a test plan, basically a list of test points so we can divvy up the work. This needs to be complete by next weekend.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Neal R. <ne...@ri...> - 2003-10-05 18:43:04
|
On Fri, 3 Oct 2003, Gilles Detillieux wrote:
> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change. If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?

Doh! Yep, now I see it, and yep, it's a problem! We're going to need to implement the logic Gabriele/Lachlan discussed to cure this.

On Sun, 5 Oct 2003, Lachlan Andrew wrote:
> > However I would strongly recommend we enable head_before_get by
> > default. We're basically wasting bandwidth like drunken sailors
> > with it off!!!
>
> Good suggestion. If we want some code bloat, we could have an "auto"
> mode, which would use head_before_get unless -i is specified, but
> not when -i is specified (since we'll always have to do the "get"
> anyway)...

This would be a two-line change: we simply make 'true' the default and manually change it to false with a line of code that executes when '-i' is given! Post 3.2 we could remove the config verb altogether.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
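The "two-line change" can be illustrated with a self-contained stand-in for the configuration object; the map below is an assumption used for the sketch and is not htdig's Configuration class:

```cpp
// Sketch only: head_before_get defaults to true, and the -i (initial dig)
// switch forces it back to false, since a full GET is unavoidable then.
#include <iostream>
#include <map>
#include <string>

int main(int argc, char **argv)
{
    std::map<std::string, std::string> config;   // stand-in for Configuration
    config["head_before_get"] = "true";          // proposed new default

    bool initial = false;                        // set when -i is parsed
    for (int i = 1; i < argc; i++)
        if (std::string(argv[i]) == "-i")
            initial = true;

    if (initial)
        config["head_before_get"] = "false";     // the "two-line change"

    std::cout << "head_before_get=" << config["head_before_get"] << "\n";
    return 0;
}
```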
|
From: Gabriele B. <bar...@in...> - 2003-10-05 12:22:57
|
> I vote that, once the Windows db.words.db bug is ironed out, we
> release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could
> they please mail it to the group?

I vote +1.

I guess the actual snapshot is the best 3.2 version so far. So ... better a 3.2.0b5 than a buggy 3.2.0b3 still around ...

Ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno |
|
From: Jesse op d. B. <ht...@op...> - 2003-10-05 12:02:40
|
Hi all,

If the mainstream of servers and/or operating systems work correctly with the latest snapshot and there are only obscure bugs, then my personal view is that we should release a new beta, maybe an RC. There are still people that download, install and use 3.2b3...

----- Original Message -----
From: "Lachlan Andrew" <lh...@us...>
To: <htd...@li...>
Sent: Sunday, October 05, 2003 1:04 PM
Subject: [htdig-dev] Release plans?

> Greetings all,
>
> Neal recently suggested releasing a new interim release in September.
> Since yet another deadline has passed, could I ask that those who
> hold the "keys" to www.htdig.org set some guidelines for when we
> can release the next beta? It was suggested that we, the developers,
> do that. The problem is that we can't enforce the timelines...
>
> I vote that, once the Windows db.words.db bug is ironed out, we
> release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could
> they please mail it to the group?
>
> Thanks in advance to all involved,
> Lachlan
>
> (Do I sound like a broken record? :)

--Jesse |
|
From: Lachlan A. <lh...@us...> - 2003-10-05 11:09:55
|
Phew... then it might not be my fault :) Sorry, I have no more ideas, so I'll leave you in Neal's capable hands...

Cheers,
Lachlan

On Sat, 4 Oct 2003 01:36, Steve Eidemiller wrote:
> it didn't appear to work

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org) |
|
From: Lachlan A. <lh...@us...> - 2003-10-05 11:06:43
|
Greetings all,

Neal recently suggested releasing a new interim release in September. Since yet another deadline has passed, could I ask that those who hold the "keys" to www.htdig.org set some guidelines for when we can release the next beta? It was suggested that we, the developers, do that. The problem is that we can't enforce the timelines...

I vote that, once the Windows db.words.db bug is ironed out, we release 3.2.0b5 / 3.2.0rc1. If anyone has a reason *not* to, could they please mail it to the group?

Thanks in advance to all involved,
Lachlan

(Do I sound like a broken record? :)

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org) |
|
From: Gabriele B. <bar...@in...> - 2003-10-05 09:29:16
|
>If we get around to implementing Google's link analysis, as Geoff
>suggested, then we may be able to fix the problem properly. It seems
>that any fix will have to look at all links *to* a page, and then
>mark as "obsolete" those *links* where (a) the link-from page ("Y")
>is changed and (b) it no longer contains the link. After the dig,
>all pages must be checked (in the database), and those with no links
>which are not obsolete can themselves be marked as obsolete.
Yep ... that's exactly what I wrote ... Sorry Lachlan.
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Gabriele B. <bar...@in...> - 2003-10-05 09:29:13
|
Ciao guys,
> Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
>server is kind enough to give us the timestamp on the document in the header.
>If the timestamps are the same we don't bother to download it.
Yep, you are right. I remember that was one of the reasons why I wrote the
code for the 'HEAD' method (also to avoid downloading an entire document
whose content-type is not parsable).
> > I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> > does NOT change. If Y is the only document with a link to X, and Y does
> > not change, it will still have the link to X, so X is still "valid".
> > However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> > then how will it find the link to X to validate X's presence in the db?
I must admit I am not very comfortable with the incremental indexing code.
Anyway, when I was thinking of the same procedure for ht://Check (not yet
done, as I said) I came up with this (I will try to stay on a logical level):
1) As you said, mark all the documents as, let's say, 'Reference_obsolete'
2) Read the start URL and mark all the URLs in the start URL to be
retrieved (eventually add them in the index of documents)
3) Loop until there are no URLs to be retrieved
4) For every URL, through a pre-emptive head call, get to know if it is
changed:
a) not changed: let's get all the URLs linked to it and mark them
"to be retrieved" or something like that
b) yes: let's download it again and mark all the new links as "to
be retrieved"
5) Purge all the obsolete URLs
This approach would solve your second "flaw" Neal (I guess so).
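A rough, self-contained sketch of the bookkeeping in steps 1-5 above; the types, the stubbed changed() check and the stored link lists are invented for illustration and are not ht://Check or ht://Dig code:

```cpp
// Sketch only: mark everything obsolete, traverse from the start URL using
// stored (or freshly parsed) links, then purge whatever was never reached.
#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

struct DocRecord {
    bool obsolete;                        // step 1: everything starts obsolete
    std::vector<std::string> links;       // outgoing links kept from the last dig
};

static bool changed(const std::string &)  // stand-in for the pre-emptive HEAD call
{
    return false;                         // pretend nothing changed in this demo
}

int main()
{
    std::map<std::string, DocRecord> db;
    db["http://example.com/"].links.push_back("http://example.com/a.html");
    db["http://example.com/a.html"];                  // a leaf page
    db["http://example.com/orphan.html"];             // no longer linked from anywhere

    for (std::map<std::string, DocRecord>::iterator i = db.begin(); i != db.end(); ++i)
        i->second.obsolete = true;                    // step 1

    std::queue<std::string> todo;                     // step 2: seed with start URL
    std::set<std::string> seen;
    todo.push("http://example.com/");
    seen.insert("http://example.com/");

    while (!todo.empty()) {                           // step 3
        std::string url = todo.front(); todo.pop();
        DocRecord &doc = db[url];
        doc.obsolete = false;
        if (changed(url)) {
            // step 4b: would re-download and re-parse, refreshing doc.links
        }
        // steps 4a/4b: queue everything this document links to
        for (size_t j = 0; j < doc.links.size(); j++)
            if (seen.insert(doc.links[j]).second)
                todo.push(doc.links[j]);
    }

    for (std::map<std::string, DocRecord>::iterator i = db.begin(); i != db.end(); ++i)
        if (i->second.obsolete)                       // step 5
            std::cout << "purge " << i->first << "\n";
    return 0;
}
```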
Ciao ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Lachlan A. <lh...@us...> - 2003-10-05 08:08:14
|
Greetings Neal,
On Sat, 4 Oct 2003 11:00, Neal Richter wrote:
> If the timestamps are the same we don't bother to download it.
>
> > I think you misinterpreted what Lachlan suggested, i.e. the case
> > where Y does NOT change. If Y is the only document with a link
> > to X, and Y does not change, it will still have the link to X, so
> > X is still "valid". However, if Y didn't change, and htdig
> > (without -i) doesn't reindex Y, then how will it find the link to
> > X to validate X's presence in the db?
>
> Changing Y is the point!
Agreed, changing Y is what triggers the current bug. However, I
believe that a simple fix of the current bug will introduce a *new*
bug for the more common case that Y *doesn't* change. Reread
Gilles's scenario and try to answer his question. I'd explain it
more clearly, but I don't have a napkin handy :)

If we get around to implementing Google's link analysis, as Geoff
suggested, then we may be able to fix the problem properly. It seems
that any fix will have to look at all links *to* a page, and then
mark as "obsolete" those *links* where (a) the link-from page ("Y")
is changed and (b) it no longer contains the link. After the dig,
all pages must be checked (in the database), and those with no links
which are not obsolete can themselves be marked as obsolete.
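A small sketch of the link-level bookkeeping described above, with invented structures; ht://Dig does not currently keep a reverse link table like this:

```cpp
// Sketch only: keep incoming links per page; when a changed page is
// re-parsed, drop the links it no longer contains, then treat pages with
// no remaining incoming links as obsolete.
#include <iostream>
#include <map>
#include <set>
#include <string>

typedef std::map<std::string, std::set<std::string> > IncomingLinks; // target -> sources

// Called after a *changed* page `from` has been re-parsed and is now known
// to link only to the URLs in `still_linked`.
static void update_links(IncomingLinks &incoming, const std::string &from,
                         const std::set<std::string> &still_linked)
{
    for (IncomingLinks::iterator t = incoming.begin(); t != incoming.end(); ++t)
        if (still_linked.find(t->first) == still_linked.end())
            t->second.erase(from);        // this particular link is obsolete
}

int main()
{
    IncomingLinks incoming;
    incoming["http://example.com/x.html"].insert("http://example.com/y.html");

    std::set<std::string> y_links;        // Y changed and now links to nothing
    update_links(incoming, "http://example.com/y.html", y_links);

    for (IncomingLinks::iterator t = incoming.begin(); t != incoming.end(); ++t)
        if (t->second.empty())            // no non-obsolete links left
            std::cout << "obsolete: " << t->first << "\n";
    return 0;
}
```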
> However I would strongly recommend we enable head_before_get by
> default. We're basically wasting bandwidth like drunken sailors
> with it off!!!
Good suggestion. If we want some code bloat, we could have an "auto"
mode, which would use head_before_get unless -i is specified, but
not when -i is specified (since we'll always have to do the "get"
anyway)...
Cheers,
Lachlan
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Neal R. <ne...@ri...> - 2003-10-04 01:11:25
|
I've seen this. The short answer is: don't use MinGW.

MinGW is missing many of the things we take for granted in a 'normal' Unix and that Cygwin supplies. If you do some Googling you'll find some information to read about what MinGW lacks. Basically MinGW attempts to use the incomplete POSIX.1 subsystem in WinNT to provide a Unix-like environment. Cygwin does a much more complete job by finishing off the holes in WinNT's POSIX.1 subsystem, fixing what is buggy, and replacing what sucked.

Please try this in a Cygwin shell:

cp ./db/db.h.win32 ./db/db.h
cp ./db/db_config.h.win32 ./db/db_config.h
cp ./include/htconfig.h.win32 ./include/htconfig.h
make -f Makefile.win32

That should fire off a build using Microsoft's compilers...

Thanks!
Neal.

On Fri, 3 Oct 2003, Steve Eidemiller wrote:
> Neal,
>
> I ran into a little trouble with the build using the htdig-3.2.0b4-20030928 full snapshot. Here were the steps I used:
>
> 1. Installed MinGW 3.1.0-1 (newbie alert!!)
> 2. Added mingw-zlib 1.1.4-1 to my Cygwin setup (per configure's suggestion)
> 3. Launched Cygwin
> 4. export CC='gcc -mno-cygwin'
> 5. export CXX='gcc -mno-cygwin'
> 6. ./configure --host=mingw32 --build=mingw32 --target=mingw32 --prefix=c:/htdig
> 7. make
> 8. Went to lunch :-)
> 9. Observed the following output leading to the error(s):
>
> ================
> gcc -mno-cygwin -DHAVE_CONFIG_H -I. -I. -I. -I./../htlib -g -O2 -c os_oflags.c -DDLL_EXPORT -DPIC -o .libs/os_oflags.o
> os_oflags.c: In function `CDB___db_omode':
> os_oflags.c:69: error: `S_IRGRP' undeclared (first use in this function)
> os_oflags.c:69: error: (Each undeclared identifier is reported only once
> os_oflags.c:69: error: for each function it appears in.)
> os_oflags.c:71: error: `S_IWGRP' undeclared (first use in this function)
> os_oflags.c:73: error: `S_IROTH' undeclared (first use in this function)
> os_oflags.c:75: error: `S_IWOTH' undeclared (first use in this function)
> make[2]: *** [os_oflags.lo] Error 1
> make[2]: Leaving directory `/home/htdig320b4/htdig-3.2.0b4-20030928-Win32/db'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory `/home/htdig320b4/htdig-3.2.0b4-20030928-Win32/db'
> make: *** [all-recursive] Error 1
> ================
>
> Looks like the #define's at the end of db_int.h aren't happening? They appear dependent on some Win32 flags. Do I need another export or configure directive for Win32? Thanks for your patience, I'm a bit of a newbie here :)
>
> Thanx
> -Steve
>
> >>> Neal Richter <ne...@ri...> 10/02/03 04:41PM >>>
>
> Hey,
> I have produced a set of makefiles for native Windows binaries. You do need Cygwin to run 'make' (the makefiles are for GNU make). The makefiles use the Microsoft compiler.
>
> Could you get a copy of the latest snapshot and try to do the build? I'll work with you to get it fixed if it's still broken.
>
> We've tested older snapshots of HtDig compiled Win32 native and run nearly a million documents through it....
>
> If this doesn't satisfy your needs, I'd be willing to put in some time looking at the cygwin build.
>
> Neal Richter
> Knowledgebase Developer
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
|
From: Neal R. <ne...@ri...> - 2003-10-04 01:02:22
|
> But if we need to re-spider everything, don't we need to re-index all
> documents, whether they've changed or not? If so, then we need to do
> htdig -i all the time. If we don't reparse every document, we need some
> other means to re-validate every document to which an unchanged document
> has links.
Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
server is kind enough to give us the timestamp on the document in the header.
If the timestamps are the same we don't bother to download it.
> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change. If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?
Changing Y is the point! I think my original description was unclear.
Bug #1
1) Website contains page X. There is at least one page that contains a
link to X.
2) Remove all links to X in the website, but don't delete it. Run htdig
without the -i option.
3) Do a search and notice that page X is still returned, even though it
technically isn't in the 'website' anymore... it is orphaned on the
webserver.
Bug #2
1) make start_url contain two separate websites & set up filters
accordingly.
2) run htdig -i.... all is OK
3) remove one of the websites from the start_url
4) rerun htdig without -i.
5) do a search and note that the removed website's pages are still
returned!
> > > I'd be inclined not to fix this until after we've released the next
> > > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
>
> I'd be inclined to agree. If it comes down to the possibility of
> losing valid documents in the db vs. keeping invalid ones, I'd prefer
> the latter behaviour. Until we can find a way to ensure all currently
> linked documents remain in the db, without having to reparse them all,
> then I think the current behaviour is the best compromise. If you
> want to reparse everything to ensure a clean db with accurate linkages,
> that's what -i is for.
If you change all pages to remove a link to a page that doesn't get
deleted, the HTTP header will change and HtDig re-downloads it, thus
giving correct behavior.
The fix accomplishes this. There is no danger of 'losing valid
documents'. The datestamp in the http header with the proper logic will
guarantee proper behavior. If a page changes, it's re-downloaded
and reparsed and its links are examined for changes. Orphaned pages are
never revisited, and are purged after the spider is done.
I've spent hours inside a debugger examining how the spider does
things... I will continue to look for efficiency gains.
This bug is minor, and a decent workaround exists... so I agree with
waiting to commit the fix.
I'll sit on it and come up with an actual test case at the appropriate
time to demonstrate the bug. It's just plain inefficient the way we
currently do it, we revisit pages that don't need it and carry cruft in the
database that is deadweight.
However I would strongly recommend we enable head_before_get by default.
We're basically wasting bandwidth like drunken sailors with it off!!!
Thanks.
Jessica: I'm heading to the Pub here in Bozeman, MT. I'll draw some
stuff on napkins for ya!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Jesse op d. B. <ht...@op...> - 2003-10-03 20:01:55
|
Hello,

Thanks for mirroring the htdig project. Your mirror has been added to the list as of 03 Oct 2003.

----- Original Message -----
From: "Mirrors Admin" <mi...@so...>
To: <htd...@li...>
Sent: Thursday, October 02, 2003 11:36 AM
Subject: [htdig-dev] New UK Mirror

> Hi there,
>
> We have just set up a new ht://Dig mirror in the UK (London) which is updated
> nightly and is on a 2Mbit dedicated line.
>
> ht://Dig Web Site:
> http://www.sourcekeg.co.uk/htdig/
>
> ht://Dig Files Web Site:
> http://www.sourcekeg.co.uk/htdig/files/
>
> ht://Dig Patch Web Site:
> http://www.sourcekeg.co.uk/htdig/htdig-patches/
>
> ht://Dig Developer Web/FTP Site:
> http://www.sourcekeg.co.uk/htdig/dev/
>
> Please add this to the ht://Dig official mirror page accordingly and use
> "Onino" in the Organisation field, pointing to http://www.onino.co.uk
>
> If you need a contact email address, you can use mi...@so...
>
> Best Regards,
>
> Jonathan Menmuir
> mi...@so...

--Jesse op den Brouw. |
|
From: Steve E. <Ste...@ch...> - 2003-10-03 18:44:05
|
Neal,

I ran into a little trouble with the build using the htdig-3.2.0b4-20030928 full snapshot. Here were the steps I used:

1. Installed MinGW 3.1.0-1 (newbie alert!!)
2. Added mingw-zlib 1.1.4-1 to my Cygwin setup (per configure's suggestion)
3. Launched Cygwin
4. export CC='gcc -mno-cygwin'
5. export CXX='gcc -mno-cygwin'
6. ./configure --host=mingw32 --build=mingw32 --target=mingw32 --prefix=c:/htdig
7. make
8. Went to lunch :-)
9. Observed the following output leading to the error(s):

================
gcc -mno-cygwin -DHAVE_CONFIG_H -I. -I. -I. -I./../htlib -g -O2 -c os_oflags.c -DDLL_EXPORT -DPIC -o .libs/os_oflags.o
os_oflags.c: In function `CDB___db_omode':
os_oflags.c:69: error: `S_IRGRP' undeclared (first use in this function)
os_oflags.c:69: error: (Each undeclared identifier is reported only once
os_oflags.c:69: error: for each function it appears in.)
os_oflags.c:71: error: `S_IWGRP' undeclared (first use in this function)
os_oflags.c:73: error: `S_IROTH' undeclared (first use in this function)
os_oflags.c:75: error: `S_IWOTH' undeclared (first use in this function)
make[2]: *** [os_oflags.lo] Error 1
make[2]: Leaving directory `/home/htdig320b4/htdig-3.2.0b4-20030928-Win32/db'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/home/htdig320b4/htdig-3.2.0b4-20030928-Win32/db'
make: *** [all-recursive] Error 1
================

Looks like the #define's at the end of db_int.h aren't happening? They appear dependent on some Win32 flags. Do I need another export or configure directive for Win32? Thanks for your patience, I'm a bit of a newbie here :)

Thanx
-Steve

>>> Neal Richter <ne...@ri...> 10/02/03 04:41PM >>>

Hey,
I have produced a set of makefiles for native Windows binaries. You do need Cygwin to run 'make' (the makefiles are for GNU make). The makefiles use the Microsoft compiler.

Could you get a copy of the latest snapshot and try to do the build? I'll work with you to get it fixed if it's still broken.

We've tested older snapshots of HtDig compiled Win32 native and run nearly a million documents through it....

If this doesn't satisfy your needs, I'd be willing to put in some time looking at the cygwin build.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485 |
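One conventional way such undeclared-permission-bit errors get worked around on Win32 builds is a block of fallback defines like the one below. This is an assumption about a possible fix, not a patch taken from this thread or from ht://Dig's db_int.h:

```cpp
/* Sketch only: supply the group/other permission bits that MinGW's
   <sys/stat.h> does not define, so code such as CDB___db_omode() compiles.
   Win32 has no such permission bits, so defining them as 0 simply drops them. */
#include <sys/stat.h>

#ifndef S_IRGRP
#define S_IRGRP 0
#endif
#ifndef S_IWGRP
#define S_IWGRP 0
#endif
#ifndef S_IROTH
#define S_IROTH 0
#endif
#ifndef S_IWOTH
#define S_IWOTH 0
#endif
```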