From: Neal R. <ne...@ri...> - 2003-10-22 21:20:47
Hey all,

Please go to SourceForge and look at the open bugs if you can; there are 18 in the 'Status:Open' state now. There are 6 bugs in the 'Status:Open & Group:Include_in_3.2' state.

Gabriele: did you fix this one already?

[ 594790 ] rundig doesn't index Apache w/mod_zip

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Neal R. <ne...@ri...> - 2003-10-22 20:43:32
Lachlan wrote:
> 2. We are in feature freeze, and scheduled to release in one week's
> time, at the end of October. We should minimise changes to the code.
> Has a bug report been filed for this issue yet? Wasn't the plan to
> have no CVS commits without reference to a bug number?

Gabriele: Please create a SourceForge bug for this when you change it... and clue us all in on what the 'net change' is after the commits ;-).

As far as the release goes, we need to get some kind of testing report made and updated... I'll try to post something by tomorrow.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Neal R. <ne...@ri...> - 2003-10-22 19:51:00
Gabriele wrote:
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental dig
>
> In both cases we need to write better documentation for this
> attribute (especially for option 2, where we should talk about the
> benefits of a HEAD call in the incremental dig).
>
> I must confess I would prefer option 2, as I think users must have full
> control of the tool, and IMHO by adding a default behaviour of HEAD before
> GET to the system we've done our part.

OK, you've convinced me; it IS useful to have this switch be user controlled. I wasn't aware of the non-compliant servers causing an issue. Clearly 'automatic' behaviour in that case is a bad thing. Go with option 2.

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Lachlan A. <lh...@us...> - 2003-10-22 15:11:45
Greetings Gilles,

In htcommon/defaults.cc, startyear is specified as 1970, so your config file would have to explicitly clear startyear to say no date is given. The reason for startyear being specified in defaults.cc is that the default value should be in attrs.html, which is automatically generated.

The three fixes I can think of (in order of my preference) are:
1. Set the (hard-coded) default value of startday in htsearch/Display.cc to 0 instead of 1. I'm not sure if this would work, and it may break other things.
2. Leave startyear empty in defaults.cc and manually hack attrs.html.
3. Leave startyear undocumented.

Opinions?

Cheers,
Lachlan

On Wed, 22 Oct 2003 08:30, Gilles Detillieux wrote:
> even though these dozen or so web pages were definitely in the database,
> and came out into db.docs after an htdump (with a m:0 field),
> htsearch would not show these in search results. I looked at the
> code, and the only thing that I can see that would cause this is if
> the startyear, startmonth or startday input parameters were set,
> causing the timet_startdate value in Display.cc to be greater than
> 0. But I didn't set these! I ran htsearch from the command line,
> so I know I wasn't passing it these values as input parameters, and
> the config file I used didn't define these as attributes either.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Lachlan A. <lh...@us...> - 2003-10-22 13:35:08
Greetings all,

I've only been following this thread loosely, but my opinions are:

1. In version 3.2.1 (or 3.3, or 4.0) there should be three possible settings: true, false, auto. That way the user has complete control, but doesn't need to exert it.

2. We are in feature freeze, and scheduled to release in one week's time, at the end of October. We should minimise changes to the code. Has a bug report been filed for this issue yet? Wasn't the plan to have no CVS commits without reference to a bug number?

Cheers,
Lachlan

On Wed, 22 Oct 2003 08:30, Gabriele Bartolini wrote:
> So ... we have 2 possibilities now:
>
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental dig
>
> I must confess I would prefer option 2, as I think users must
> have full control of the tool and IMHO by adding a default
> behaviour of HEAD before GET to the system we've done our part.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

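Lachlan's three-state suggestion could look something like the sketch below. The names here are illustrative, not the actual ht://Dig API: the point is that "auto" defers to the crawler (HEAD only on incremental digs), while "true"/"false" let the user force the behaviour either way.

```cpp
#include <cassert>
#include <string>

// Hypothetical tri-state head_before_get attribute, as Lachlan proposes.
enum class HeadBeforeGet { True, False_, Auto };

// Parse the attribute value from the config file; anything other than
// "true"/"false" falls back to Auto.
HeadBeforeGet parseHeadBeforeGet(const std::string &value)
{
    if (value == "true")  return HeadBeforeGet::True;
    if (value == "false") return HeadBeforeGet::False_;
    return HeadBeforeGet::Auto;
}

// Decide whether to send a HEAD request before the GET. In Auto mode
// the decision is made by the dig type: HEAD only on incremental digs.
bool shouldSendHead(HeadBeforeGet setting, bool incrementalDig)
{
    switch (setting) {
    case HeadBeforeGet::True:   return true;
    case HeadBeforeGet::False_: return false;
    default:                    return incrementalDig;  // Auto
    }
}
```

This would give the user complete control without requiring them to exert it, which is the property Lachlan is after.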
From: Gilles D. <gr...@sc...> - 2003-10-22 09:47:55
According to Gabriele Bartolini:
> However, I would pick these general cases, where the user should disable
> the attribute (please revise it):
>
> Case A - Persistent connections on
> 1) the majority of documents are HTML (this means we "always" want to GET them)
> 2) the server does not support HEAD (I have seen cases like this unfortunately)

OK, that sounds pretty important. I hadn't heard that one before. Persistent connections are only on for HTTP/1.1 servers, so what you're saying is that there are servers out there that claim to be 1.1 compliant but don't support the HEAD request. Wouldn't this be an argument against overriding head_before_get during an incremental dig?

> 3) cases where the persistent communication between htdig and the server
> does not work at 100%: there can be some problems with persistent
> connections and HEAD calls (I experience this kind of problems sometimes
> with ht://Check and some NT servers)

Again, is this going to be a problem if we don't allow turning off head_before_get during an update dig?

> Case B - Persistent connection off
> 1) same as case A
> 2) same as case A

In this case, the server could be HTTP/1.1 or 1.0. Either way, the same question applies. If the user needs a way to tell htdig to deal nicely with these questionably compliant servers, then wouldn't they need a way of turning off head_before_get unconditionally, whether it's an update dig or an initial one?

This is what I was getting at before about this option never being explained adequately. On the surface, it seemed to be rather useless, but with these new revelations that have come out of your testing, it seems there may indeed be a need for turning this off in some cases. That's the sort of thing that should be documented so others (developers and end-users) know what you'd use this for.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

From: Gabriele B. <bar...@in...> - 2003-10-22 00:30:59
At 16.01 21/10/2003 -0500, Gilles Detillieux wrote:
> > 2) the server does not support HEAD (I have seen cases like this
> > unfortunately)
>
> OK, that sounds pretty important. I hadn't heard that one before.

I meant that some server administrators may turn off the HEAD method (in Apache you can use the Limit directive).

> but don't support the HEAD request. Wouldn't this be an argument against
> overriding head_before_get during an incremental dig?

I guess it is a matter of choosing the less painful solution. In the normal case (p/c on and hbg on) overriding is not done; however, in the incremental dig, one more request is made (HEAD) without success and hopefully - after that - the document gets retrieved. There is a bit of overhead for sure, but the question is: is it better to have a bit of overhead in some cases (a minority), or to prevent users from getting the benefit of always using a working HEAD call when updating the database?

The other way is to remove the override and leave everything in the hands of the user (I would not mind this - of course, providing better documentation). With the changes done yesterday we have moved towards a clearer situation anyway, because:

- head before get is now true by default
- head before get has been detached from persistent connections and has become independent

> > 3) cases where the persistent communication between htdig and the server
> > does not work at 100%: there can be some problems with persistent
> > connections and HEAD calls (I experience this kind of problems sometimes
> > with ht://Check and some NT servers)
>
> Again, is this going to be a problem if we don't allow turning off
> head_before_get during an update dig?

I guess this could be fixable, because the problem comes up with persistent connections - which may still be disabled.

> with these questionably compliant servers, then wouldn't they need a way
> of turning off head_before_get unconditionally, whether it's an update
> dig or an initial one?

Yes, that'd be great. Again, I guess we have to balance what we can do to make things easier for the user while, at the same time, leaving users enough freedom to configure their systems the way they want. Also, with 3.2, the server and URL blocks have added more dimensions to the space of configurability available to users, and the more "clear" attributes are available, the better the tool gets.

> This is what I was getting at before about this option never being
> explained adequately.

You're right.

> On the surface, it seemed to be rather useless,
> but with these new revelations that have come out of your testing, it
> seems there may indeed be a need for turning this off in some cases.
> That's the sort of thing that should be documented so others (developers
> and end-users) know what you'd use this for.

So ... we have 2 possibilities now:

1) leave the code as is
2) remove the overriding of the head before get in the incremental dig

In both cases we need to write better documentation for this attribute (especially with option 2, where we should talk about the benefits of a HEAD call in the incremental dig).

I must confess I would prefer option 2, as I think users must have full control of the tool, and IMHO by adding a default behaviour of HEAD before GET to the system we've done our part.

So tell me what you think, especially you, Gilles and Neal, who have followed this thread. I am more than happy to (in case) change the code again today.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gilles D. <gr...@sc...> - 2003-10-21 22:55:13
Hey, guys. I ran into something weird when I was testing out the allow_numbers changes last week, which I haven't been quite able to explain or track down in the code.

Of the pages on my site that I was indexing, about a dozen were from a CGI script that puts out a Last-Modified header to set the date appropriately in search results. Because of a recent bug in the script, which I just fixed last week, it turns out that the Last-Modified headers were coming out with no date on them, so htdig was giving them a modtime of 0 (i.e. the epoch). This is different behaviour from htdig 3.1.6, which gave them the current time instead. It may be that the 3.2 code should be fixed to do likewise, as it seems the more sensible behaviour.

However, that's not the weird thing. What was odd is that even though these dozen or so web pages were definitely in the database, and came out into db.docs after an htdump (with a m:0 field), htsearch would not show them in search results. I looked at the code, and the only thing that I can see that would cause this is if the startyear, startmonth or startday input parameters were set, causing the timet_startdate value in Display.cc to be greater than 0. But I didn't set these! I ran htsearch from the command line, so I know I wasn't passing it these values as input parameters, and the config file I used didn't define these as attributes either.

I know the problem was the 0 modtime, because when I fixed the CGI script to return a proper Last-Modified header, the pages showed up in htsearch, with no other changes being made. Does anyone know of anything else that might explain this behaviour? I'd start putting trace prints in htsearch to track this down, but I have too many high-priority things right now to spend much time on ht://Dig right away. htsearch -vvvv didn't give any indication of what might be going on - the URLs in question never even showed up in the output.

I don't think I'd consider this a showstopper, but it does seem odd that htsearch rejects any modtime value at all when none of those parameters have been specified. This, coupled with the fact that htdig will assign a 0 modtime if it can't parse the Last-Modified header (as opposed to a missing Last-Modified header, which should be taken as the current time if I'm not mistaken), could lead to others having similar problems.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

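The behaviour Gilles describes can be reduced to a sketch. The function below is hypothetical (the real logic lives in htsearch/Display.cc and is more involved), but it shows why a document whose Last-Modified header failed to parse falls out of the results: once a start date greater than zero is in effect, a modtime of 0 can never pass the comparison.

```cpp
#include <cassert>
#include <ctime>

// Illustrative reduction of the htsearch date filter: a document is
// shown only if its modification time is not earlier than the start
// date. With startyear defaulting to 1970 in defaults.cc,
// timet_startdate can end up greater than 0, so a document whose
// Last-Modified failed to parse (modtime == 0) is silently excluded.
bool passesDateFilter(time_t modtime, time_t timet_startdate)
{
    if (timet_startdate > 0 && modtime < timet_startdate)
        return false;   // rejected as "older" than the start date
    return true;
}
```

With a start date of 0 (no date given) the epoch-dated document would pass, which matches Lachlan's suggestion of clearing the hard-coded defaults.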
From: Gabriele B. <bar...@in...> - 2003-10-21 08:53:08
I read my e-mail again and I think I should have written this sentence another way:

> 2) performing HEAD calls only in the incremental dig (either with or
> without persistent connections)

I meant: "in the incremental dig, perform just HEAD calls". I guess you guys understood: "HEAD is performed only in incremental digs". If so ... I am sorry about that and my English.

Ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gabriele B. <bar...@in...> - 2003-10-21 07:49:58
Hi guys,

At 17.02 20/10/2003 -0500, Gilles Detillieux wrote:
> wrong, I think it would be helpful if the code automatically did the right
> thing in most circumstances, and if the documentation for this attribute
> made it clear in which circumstances it would make sense to turn it off.

Yep, I think so too. Anyway, I modified defaults.cc by putting the attribute in a 'true' default state and by explaining that:

- during an incremental dig, the value is overridden;
- in general, it is recommended to leave this value on.

I did not specify cases in which the attribute should be turned off, as I thought I would have generated more confusion for the user. However, I would pick these general cases where the user should disable the attribute (please revise it):

Case A - Persistent connections on
1) the majority of documents are HTML (this means we "always" want to GET them)
2) the server does not support HEAD (I have seen cases like this, unfortunately)
3) cases where the persistent communication between htdig and the server does not work 100%: there can be some problems with persistent connections and HEAD calls (I experience this kind of problem sometimes with ht://Check and some NT servers)

Case B - Persistent connections off
1) same as case A
2) same as case A
3) I have never experienced any problem as in case A.3 with persistent connections disabled

> Well, it seems to me that there are actually two different cases where
> htdig does an initial dig. The obvious one is when the user specifies
> -i, which sets the initial flag. The less obvious one is when htdig is
> run without -i, but with no existing database, or with an empty one.
> What matters is whether there are URLs in the database or not. If there
> are none, then you'll never reject a document as "not changed".

OK. Good point. I think I changed the Retriever class to perform this check as well.

Also, during an incremental dig, if debug > 1, I show a notice message saying that any head before get attribute configuration is overridden and that HEAD is always enabled. Sounds good?

Ciao and thanks,
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gabriele B. <bar...@in...> - 2003-10-21 07:47:46
Hi guys,
I have given a higher debug level (2 instead of 1) to the display of
the configuration information of a Server (performed in the constructor); I
remember I enabled it when we first started the server configuration and -
sorry - I found it really frustrating now. I think that a level 1 debug was
inappropriate for that, and I guess level 2 is enough.
Ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
From: Gilles D. <gr...@sc...> - 2003-10-20 23:00:35
According to Gabriele Bartolini:
> At 14.39 14/10/2003 -0500, Gilles Detillieux wrote:
> > It would be a good idea, in the general case, for us all to learn how
> > to properly override config parameters in the code, so that a server
> > block or URL block definition doesn't override an internal override
>
> Maybe I am missing something. I am not aware of a way that allows us to
> override block definitions through the Configuration classes. Can you
> please point it out to me? Sorry.

No, I think you're right that it can't be done in the Configuration class right now. It seems the only way to override block attribute definitions in the code is to add logic where the attribute is used and ignore the attribute if it's appropriate to do so. The logic now is that server block definitions override globals, and URL block definitions override both (for attribute definitions that can be used in server or URL blocks). So, there's no way in the code to globally override a server or URL block definition.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

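The precedence Gilles describes can be modelled in a few lines. This is illustrative only (the real HtConfiguration class has its own lookup machinery): a URL block overrides a server block, which overrides the global value, and nothing short of per-call-site logic can override a block definition from the code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the attribute lookup order: URL block first,
// then server block, then the global defaults. An empty string means
// the attribute is not defined anywhere.
std::string lookupAttribute(const std::string &name,
                            const std::map<std::string, std::string> &urlBlock,
                            const std::map<std::string, std::string> &serverBlock,
                            const std::map<std::string, std::string> &globals)
{
    auto it = urlBlock.find(name);
    if (it != urlBlock.end()) return it->second;      // URL block wins
    it = serverBlock.find(name);
    if (it != serverBlock.end()) return it->second;   // then server block
    it = globals.find(name);
    return it != globals.end() ? it->second : "";     // then global default
}
```

Because the most specific block always wins, a global "internal override" in the code would have to ignore the looked-up value at the point of use, which is exactly the workaround Gilles mentions.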
From: Neal R. <ne...@ri...> - 2003-10-20 22:46:43
The overall question I have is this (it was pointed out by someone in an
earlier mail):

Given that calling HEAD enables us to short-circuit files with invalid
mime-types, isn't it nearly always beneficial to call HEAD, even when
doing an 'initial dig'?

The answer to this question may influence your choice of what to commit,
but the description below looks good to me if we want to never call HEAD
during an initial dig.

Thanks.
Thanks.
On Sun, 19 Oct 2003, Gabriele Bartolini wrote:
>
> > I think what we've had here is informative debate. You as much as
> >anyone else wrote the networking code, so for me it's your decision. I
> >think the new TRUE default is fine.
>
> OK. Any other opinions?
>
> > If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
>
> So ... is it ok for you guys if I go on with the Retriever, Document and
> HtHTTP patch as suggested in the previous e-mails?
>
> Basically, in order to perform always a HEAD call during an incremental
> indexing, I need to store the information in both the Retriever and
> Document class. Is that right for you? In particular, I suggest this enum:
>
> enum RetrieverType {
> Retriever_Initial,
> Retriever_Incremental
> };
>
> and then change the constructor this way:
>
> Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
> Retriever_Initial);
>
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
>
> if(!initial) // Switch the retriever type to Incremental
> retriever_type = Retriever_Incremental;
>
> therefore, when we instantiate the main retriever object, we just simply
> add this:
>
> Retriever retriever(Retriever_logUrl, retriever_type);
>
> Please let me know.
>
> Ciao and thanks,
> -Gabriele
> --
> Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
> maintainer
> Current Location: Melbourne, Victoria, Australia
> bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
> Inferno
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
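Neal's point about short-circuiting on invalid mime-types can be sketched as follows. The helper names are hypothetical (htdig's real check is driven by its configured content-type handling), but the idea is the same: if the Content-Type reported by a cheap HEAD request is not something the indexer can parse, the potentially large GET is skipped entirely.

```cpp
#include <cassert>
#include <string>

// Illustrative only: decide from a HEAD response's Content-Type
// whether a full GET is worth issuing at all.
bool isParsableContentType(const std::string &contentType)
{
    // A crawler typically only parses a small set of text types.
    return contentType.rfind("text/html", 0) == 0
        || contentType.rfind("text/plain", 0) == 0;
}

// Returns true when the document should be fetched with GET.
bool worthFetching(const std::string &headContentType)
{
    return isParsableContentType(headContentType);
}
```

On a site with many images or binaries, each skipped GET saves a full transfer, which is why HEAD can pay off even on an initial dig.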
From: Gilles D. <gr...@sc...> - 2003-10-20 22:03:50
According to Gabriele Bartolini:
> > I think what we've had here is informative debate. You as much as
> >anyone else wrote the networking code, so for me it's your decision. I
> >think the new TRUE default is fine.
>
> OK. Any other opinions?
I think it was just a matter of not understanding what the attribute did or
didn't do, and in which circumstances it would be useful to change it.
Because of the potential for serious performance degradation when you get it
wrong, I think it would be helpful if the code automatically did the right
thing in most circumstances, and if the documentation for this attribute
made it clear in which circumstances it would make sense to turn it off.
> > If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
>
> So ... is it ok for you guys if I go on with the Retriever, Document and
> HtHTTP patch as suggested in the previous e-mails?
I think that's what Neal was getting at when he said it's your decision.
You wrote the networking code, so you know better than anyone else what's
needed to make this particular change. It sounds reasonable to me that
you'd need to make changes to these classes, as that's where the needed
decisions must be made about the appropriate default action.
> Basically, in order to perform always a HEAD call during an incremental
> indexing, I need to store the information in both the Retriever and
> Document class. Is that right for you? In particular, I suggest this enum:
>
> enum RetrieverType {
> Retriever_Initial,
> Retriever_Incremental
> };
>
> and then change the constructor this way:
>
> Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
> Retriever_Initial);
>
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
>
> if(!initial) // Switch the retriever type to Incremental
> retriever_type = Retriever_Incremental;
>
> therefore, when we instantiate the main retriever object, we just simply
> add this:
>
> Retriever retriever(Retriever_logUrl, retriever_type);
>
> Please let me know.
Well, it seems to me that there are actually two different cases where
htdig does an initial dig. The obvious one is when the user specifies
-i, which sets the initial flag. The less obvious one is when htdig is
run without -i, but with no existing database, or with an empty one.
What matters is whether there are URLs in the database or not. If there
are none, then you'll never reject a document as "not changed".
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
From: Gabriele B. <bar...@in...> - 2003-10-19 13:37:42
> I think what we've had here is informative debate. You as much as
>anyone else wrote the networking code, so for me it's your decision. I
>think the new TRUE default is fine.
OK. Any other opinions?
> If you've perfected this logic in ht://Check, then we should probably
>consider syncing with your net code after 3.2 is done.
So ... is it ok for you guys if I go on with the Retriever, Document and
HtHTTP patch as suggested in the previous e-mails?
Basically, in order to perform always a HEAD call during an incremental
indexing, I need to store the information in both the Retriever and
Document class. Is that right for you? In particular, I suggest this enum:
enum RetrieverType {
Retriever_Initial,
Retriever_Incremental
};
and then change the constructor this way:
Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
Retriever_Initial);
In 'htdig.cc', we check whether the dig is an initial dig or not and:
if(!initial) // Switch the retriever type to Incremental
retriever_type = Retriever_Incremental;
therefore, when we instantiate the main retriever object, we just simply
add this:
Retriever retriever(Retriever_logUrl, retriever_type);
Please let me know.
Ciao and thanks,
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
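Put together, Gabriele's proposal amounts to something like this compilable reduction (class and member names are simplified from the real Retriever sources): the dig type is decided once in htdig.cc and handed to the Retriever, which can then force a HEAD request on incremental digs regardless of the head_before_get setting.

```cpp
// Sketch of the proposed change: the dig type travels with the
// Retriever, so the networking layer can always issue HEAD before GET
// when updating an existing database.
enum RetrieverType {
    Retriever_Initial,
    Retriever_Incremental
};

class Retriever {
public:
    explicit Retriever(RetrieverType t = Retriever_Initial) : type(t) {}

    // On an incremental dig the HEAD call is always worthwhile: it lets
    // the crawler skip documents that have not changed.
    bool alwaysHeadBeforeGet() const { return type == Retriever_Incremental; }

private:
    RetrieverType type;
};

// In htdig.cc, the type would be chosen from the -i flag (and, per
// Gilles' point, from whether the database already contains URLs):
//   RetrieverType retriever_type =
//       initial ? Retriever_Initial : Retriever_Incremental;
//   Retriever retriever(retriever_type);
```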
From: Lachlan A. <lh...@us...> - 2003-10-17 23:56:27
Thanks for the offer of testing, Ted.

Regarding test cases, I think that the main part of testing at the moment is actually generating the test cases. Essentially, we have to use each of the features of ht://Dig and make sure that it works as documented. Neal has suggested testing each of the configuration attributes and command line arguments. If we're keen, we should also test each template variable.

From the attached list of attributes, select a group of attributes. Write a config file which sets each of them to some value. One by one, change the attribute in a way which should produce an observable change, and make sure you observe that change. For example, if you were testing the "meta" group, you would check that, with create_url_list=true, it correctly creates a list of URLs retrieved, and that with create_url_list=false, it doesn't create such a list.

This testing may be very simplistic, but it does reveal bugs.

Thanks again,
Lachlan

On Tue, 14 Oct 2003 23:23, Ted Stresen-Reuter wrote:
> I'm happy to do testing on Mac OS X.
> When requesting help testing, please provide a test case (the steps
> one must take to complete the test) and the intended behavior (so
> the testers know what to look for and what shouldn't be appearing).

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Lachlan A. <lh...@us...> - 2003-10-17 13:38:34
On Wed, 15 Oct 2003 04:01, Gilles Detillieux wrote:
> I got thoroughly confused in reading your patch, though, because
> it is reversed, with the new code appearing in the first file and
> the old code in the second, rather than the other way around.

Oops... :)

> Taking that into account, though, the patch seems right to me. I
> think it should be committed ASAP.

Done. Could you (or someone) please confirm it and close the bug report?

Thanks,
Lachlan

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Gilles D. <gr...@sc...> - 2003-10-16 21:41:19
I tried a "make check" of the current CVS tree on both Red Hat Linux
6.2 and Red Hat Linux 9. On 6.2, it passes with flying colours, if
I remember to chmod +x test/t_url first. On RH9, I ran into a couple
different problems.
First of all, there's the infamous WordType::instance undefined error
which has dogged Mac OS X users. Has anyone yet come up with a fix for
this, other than manually linking and changing the order of the libraries?
I tried a hack to a couple test programs to help prod the linker into
loading the needed modules from the libraries, but I don't think it's
the ideal solution (it causes a warning when t_url runs the url.cc code,
and I imagine testnet would do likewise if I got Apache to run). My hack
is below.
The other problem is the 5 tests that require Apache fail because I
can't get it to run. I commented out or modified all the lines in
test/conf/httpd.conf that were causing httpd to give error messages,
but it still won't start up. Apache 2.0.40 seems to have some problems
with the conf file in our distribution, but I wasted too many fruitless
hours yesterday to figure out what it needs. Anyone else had better luck?
I don't personally consider this a showstopper for the upcoming 3.2.0rc1,
but it would be nice to have this all working reliably in the final release.
On the bright side, the RH9 build does pass the other 9 tests, and perhaps
more importantly, it has no trouble indexing the SCRC's web site.
Here's my ugly hack to get it to link on RH9...
--- test/testnet.cc.orig 2003-07-21 07:40:22.000000000 -0500
+++ test/testnet.cc 2003-10-15 13:21:44.000000000 -0500
@@ -7,6 +7,7 @@
#include "HtHTTP.h"
#include "HtHTTPBasic.h"
#include "HtDateTime.h"
+#include "WordContext.h"
#include <URL.h>
#ifdef HAVE_STD
@@ -75,6 +76,9 @@ int main(int ac, char **av)
// Flag variable for errors
int _errors = 0;
+ // Needed to satisfy linker dependencies...
+ (void) WordContext::Initialize();
+
///////
// Retrieving options from command line with getopt
///////
--- test/url.cc.orig 2003-07-21 07:40:22.000000000 -0500
+++ test/url.cc 2003-10-15 13:19:33.000000000 -0500
@@ -38,6 +38,7 @@ using namespace std;
#include "HtConfiguration.h"
#include "URL.h"
+#include "WordContext.h"
// These should probably be tested individually
@@ -114,6 +115,7 @@ static void dourl(params_t* params)
{
if(verbose) cerr << "Test WordKey class with " <<
params->url_parents << " and " << params->url_children << "\n";
+ (void) WordContext::Initialize();
HtConfiguration* const config= HtConfiguration::config();
config->Defaults(defaults);
dolist(params);
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
From: Neal R. <ne...@ri...> - 2003-10-16 04:00:29
> but maybe in future releases we could use other HTTP headers (i.e. cookies,
> language, etc.) and a pre-emptive HEAD could save time in an initial dig as
> well.

Yep... even on an initial dig, HEAD is a good idea, unless the website is almost all HTML pages with few images, which seems pretty pie-in-the-sky at this point.

> 2) I share the library with ht://Check, which massively uses this option as
> it has to retrieve every document - images too - and a HEAD call could save a
> lot of time in the initial dig. I'd love to keep the logic of the net
> library as similar as possible.
>
> Please let me know if the Retriever and Document class changes make sense
> to you guys and I will modify the code.

I think what we've had here is informative debate. You as much as anyone else wrote the networking code, so for me it's your decision. I think the new TRUE default is fine.

If you've perfected this logic in ht://Check, then we should probably consider syncing with your net code after 3.2 is done.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Gabriele B. <bar...@in...> - 2003-10-15 23:55:42
|
Cheers Neal,

>It seems like it's to our advantage to always do a HEAD call, unless it's
>an initial dig, where it is wasteful... and that the state of
>persistent_connections is irrelevant to this decision.

Let me try to understand. What you suggest is:

1) killing head_before_get
2) performing HEAD calls only in the incremental dig (either with or
without persistent connections)
3) unlinking the Head-before-Get mechanism from the persistent connections
one

If so, that could work for me (for number 1 I will go with whatever you
guys decide). I had not understood that from the earlier messages, sorry.

Even though - personally - I would not kill the attribute, because:

1) It could be useful in cases where we don't know whether a document is
parsable according to the *usual* means of exclusion (that is to say, the
URL). I know that so far we only take the content-type into consideration,
but maybe in a future release we could use other HTTP headers (i.e.
cookies, language, etc.), and a pre-emptive HEAD could save time in an
initial dig as well.

2) I share the library with ht://Check, which uses this option heavily, as
it has to retrieve every document - images too - and a HEAD call could
save a lot of time in the initial dig. I'd love to keep the logic of the
net library as similar as possible.

3) Killing the attribute would not save us from changing the code to store
information about the retrieval status in the Retriever and Document
classes (unless we intend to use some class variables - which I hate).

>I don't have a problem keeping head_before_get, as long as we make the
>default TRUE.

That's the default.

Please let me know if the Retriever and Document classes changes make
sense to you guys and I will modify the code.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in...
| http://www.prato.linux.it/~gbartolini | ICQ#129221447 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno |
|
From: Neal R. <ne...@ri...> - 2003-10-15 19:12:23
|
>2) persistent connections off: we perform a GET call and if the document
>is not what we want we simply close the connection (we anticipate it).
When persistent_connections is off and we do a GET and see a MIME type we
don't like, we just close the connection. Isn't this just a tad bit 'ugly'
and possibly abusive to the webserver? At a minimum it wastes the
webserver's time starting the GET only to have the connection closed
prematurely. It will definitely waste some processor time on the server
buffering up the data to send, as the server CPU is much faster than the
latency of the network connection. We are also causing potential
server memory allocation churn, which would affect SWAP on a highly loaded
webserver.
> I can think of this possible solution. The scenario above is still valid
> (IMHO) for the initial dig case; I would modify it for the incremental dig
> as mentioned yesterday, as follows:
>
> if "persistent_connections" (on a server basis) is set to on:
> enable persistent connections
> else
> disable them
>
> if incremental or ("head_before_get" and "persistent_connections" are both
> set to on) - I have to modify yesterday's patch a bit
> enable head before get
> else
> disable head before get
OK.. just to be absolutely clear.. if we can design logic that will
optimally set head_before_get automatically based upon the state of
persistent_connections, what is the reason for keeping it around as a
user-configurable setting?
It seems like it's to our advantage to always do a HEAD call, unless it's
an initial dig, where it is wasteful... and that the state of
persistent_connections is irrelevant to this decision.
If this is not the case, please reply with a clear example of a situation
where we have some advantage in NOT setting head_before_get automatically.
I don't have a problem keeping head_before_get, as long as we make the
default TRUE.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: <And...@wi...> - 2003-10-15 15:12:46
|
Hey. Not to harp, but I tried building the CVS a little bit ago and it
still didn't set the HAVE_SSL_H stuff right, in that I had to go into the
.h file and change the #define by hand. Sorry, I don't have the details; I
can't get to that machine right now, but I just thought I'd mention it.

a

Andy Bach, Sys. Mangler
Internet: and...@wi...
VOICE: (608) 261-5738 FAX 264-5030

"We are either doing something, or we are not.
'Talking about' is a subset of 'not'."
-- Mike Sphar in alt.sysadmin.recovery
|
|
From: Jesse op d. B. <ht...@op...> - 2003-10-15 07:49:01
|
I vote -1 for killing, if its function is described clearly and doesn't
change between any two succeeding versions of htdig, and 0 otherwise.
There's no harm in having this option, is there? If you don't want it,
just turn it off.

----- Original Message -----
From: "Gabriele Bartolini" <bar...@in...>

> > I'm with you on this one.. we should just kill head_before_get. I would
> > vote for killing it instead of hacking the logic.
>
> Hi guys, I hope that after my previous message you could change your mind.
> I vote -1 for killing this attribute.
>
> Ciao,
> -Gabriele

--Jesse
|
|
From: Gabriele B. <bar...@in...> - 2003-10-14 23:29:11
|
> I'm with you on this one.. we should just kill head_before_get. I would
> vote for killing it instead of hacking the logic.

Hi guys, I hope that after my previous message you could change your mind.
I vote -1 for killing this attribute.

Ciao,
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Gabriele B. <bar...@in...> - 2003-10-14 23:28:11
|
At 14.39 14/10/2003 -0500, Gilles Detillieux wrote:
>It would be a good idea, in the general case, for us all to learn how
>to properly override config parameters in the code, so that a server
>block or URL block definition doesn't override an internal override
Maybe I am missing something. I am not aware of a way that allows us to
override block definitions through the Configuration classes. Can you
please point it out to me? Sorry.
>doing an initial dig, then it will only ever take effect when doing an
>update (or incremental) dig with persistent connections turned on.
No no ... wait. I have never talked about turning off head before get. Let
me try and give an explanation about this parameter.
I remember we issued the 'head_before_get' attribute because of this: when
requesting a non-parsable document we generally had 3 options:
1) persistent connections on:
a) head before get on: we perform a HEAD call and notice that the
document's content-type is not what we want so we simply avoid doing the
GET call
b) head before get off: we perform a GET call but in this case we
must receive all the content returned by the server, otherwise we have to
close the connection - that's not what we want in general.
2) persistent connections off: we perform a GET call and if the document is
not what we want we simply close the connection (we anticipate it).
IMHO the 'head_before_get' attribute can make a difference in some cases
with persistent connections on, and only the webmaster can see the
difference in performance between turning it on or off. If we don't have
many multimedia files we can simply turn it off (avoiding a 'double'
call), whereas if we have big files to be downloaded (especially from the
Internet) this attribute can make a real difference, as a pre-emptive HEAD
call would tell us the type of document we are requesting and potentially
save us a big download.
>not as versed in HTTP/1.1 as you are. It seems to me that htdig should
>always be doing a HEAD before a GET when doing incremental digs through
>persistent connections.
Yes. And not only there. Even when performing an initial dig, if the user
wants it, we must enable it.
I can think of this possible solution. The scenario above is still valid
(IMHO) for the initial dig case; I would modify it for the incremental dig
as mentioned yesterday, as follows:
if "persistent_connections" (on a server basis) is set to on:
enable persistent connections
else
disable them
if incremental or ("head_before_get" and "persistent_connections" are both
set to on) - I have to modify yesterday's patch a bit
enable head before get
else
disable head before get
In this way, for an initial dig the user can choose whether to activate
persistent connections and head before get, whereas for incremental digs
the user's settings get overridden.
For me this sounds good. There can be issues regarding the way of doing it;
I thought that adding some object variables in the Retriever and Document
class would be fine. Unless there is a way of overriding specific settings
through the Configuration classes.
Please let me know.
>By the way, Gabriele, good call on the Accept-Encoding header. It's a
>simple, elegant fix to a troublesome bug. You're right that adding
>support for gzip encoding is a feature request, and not a bug fix,
>and should be done after the upcoming release (not before). Good work.
Thank you. However, following Neal's directives, could one of you try it
and let me know, so I can close the bug?
Ciao ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|