Thread: Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-15 23:55:42

Cheers Neal,

>It seems like it's to our advantage to always do a HEAD call, unless it's
>an initial dig, where it is wasteful...  and that the state of
>persistent_connections is irrelevant to this decision.

Let me try to understand. What you suggest is:

1) killing head_before_get
2) performing HEAD calls only in the incremental dig (either with or 
without persistent connections)
3) Unlinking the Head before Get mechanism from the persistent connections one

If so, that works for me (on point 1 I will do whatever you guys 
decide). I had not understood this from the earlier messages, sorry.

Personally, though, I would not kill the attribute, because:

1) It could be useful in cases where we cannot tell whether a document is 
parsable from the *usual* means of exclusion (that is to say, the URL). I 
know that so far we only take the content-type into consideration, but maybe 
in a future release we could use other HTTP headers (e.g. cookies, language, 
etc.), and a pre-emptive HEAD could save time in an initial dig as well.
2) I share the library with ht://Check, which uses this option heavily because 
it has to retrieve every document - images too - and a HEAD call can save a 
lot of time in the initial dig. I'd love to keep the logic of the net 
library as similar as possible between the two.
3) Killing the attribute would not spare us from changing the code in order to 
store information about the retrieval status in the Retriever and Document 
classes (unless we intend to use some class variables - which I hate)

>I don't have a problem keeping head_before_get, as long as we make the
>default TRUE.

That's the default.

Please let me know if the Retriever and Document class changes make sense 
to you guys and I will modify the code.

Ciao ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check 
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The 
Inferno

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Neal R. <ne...@ri...> - 2003-10-16 04:00:29

> but maybe in a future release we could use other HTTP headers (e.g. cookies,
> language, etc.), and a pre-emptive HEAD could save time in an initial dig as
> well.

  Yep.. even on an initial dig HEAD is a good idea.. unless the website is
almost all HTML pages with few images... which seems pretty pie-in-the-sky
at this point.

> 2) I share the library with ht://Check, which uses this option heavily because
> it has to retrieve every document - images too - and a HEAD call can save a
> lot of time in the initial dig. I'd love to keep the logic of the net
> library as similar as possible between the two.

> Please let me know if the Retriever and Document class changes make sense
> to you guys and I will modify the code.

  I think what we've had here is informative debate.  You as much as
anyone else wrote the networking code, so for me it's your decision.  I
think the new TRUE default is fine.

  If you've perfected this logic in ht://Check, then we should probably
consider syncing with your net code after 3.2 is done.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-19 13:37:42

>   I think what we've had here is informative debate.  You as much as
>anyone else wrote the networking code, so for me it's your decision.  I
>think the new TRUE default is fine.

OK. Any other opinions?

>   If you've perfected this logic in ht://Check, then we should probably
>consider syncing with your net code after 3.2 is done.

So ... is it OK with you guys if I go ahead with the Retriever, Document and 
HtHTTP patch as suggested in the previous e-mails?

Basically, in order to always perform a HEAD call during an incremental 
dig, I need to store the information in both the Retriever and 
Document classes. Is that all right with you? In particular, I suggest this enum:

         enum  RetrieverType {
                 Retriever_Initial,
                 Retriever_Incremental
         };

and then change the constructor this way:

         Retriever(RetrieverLog flags = Retriever_noLog,
                   RetrieverType t = Retriever_Initial);

In 'htdig.cc', we check whether the dig is an initial dig or not and:

         if(!initial) // Switch the retriever type to Incremental
                 retriever_type = Retriever_Incremental;

Then, when we instantiate the main retriever object, we simply pass 
the type along:

         Retriever retriever(Retriever_logUrl, retriever_type);

Please let me know.
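
The proposal above can be sketched as a small self-contained fragment. This is only an illustration under stated assumptions: the RetrieverLog values, the private member names and the makeRetriever() helper are stand-ins invented here, not the actual ht://Dig declarations.

```cpp
#include <cassert>

// Hypothetical stand-in for the existing logging flags.
enum RetrieverLog { Retriever_noLog, Retriever_logUrl };

// The enum proposed in the message.
enum RetrieverType { Retriever_Initial, Retriever_Incremental };

class Retriever {
public:
    // Both arguments are defaulted, so existing callers keep compiling.
    explicit Retriever(RetrieverLog flags = Retriever_noLog,
                       RetrieverType t = Retriever_Initial)
        : log_(flags), type_(t) {}

    RetrieverType type() const { return type_; }
    bool isIncremental() const { return type_ == Retriever_Incremental; }

private:
    RetrieverLog log_;   // assumed member name
    RetrieverType type_; // assumed member name
};

// Mirrors the check described for htdig.cc: 'initial' is true when the
// dig was requested as an initial dig.
Retriever makeRetriever(bool initial) {
    RetrieverType retriever_type = Retriever_Initial;
    if (!initial)  // switch the retriever type to Incremental
        retriever_type = Retriever_Incremental;
    return Retriever(Retriever_logUrl, retriever_type);
}
```

Defaulting the new parameter to Retriever_Initial keeps the old one-argument constructor calls valid, which seems to be the point of the suggested signature.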

Ciao and thanks,
-Gabriele

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gilles D. <gr...@sc...> - 2003-10-20 22:03:50

According to Gabriele Bartolini:
> >   I think what we've had here is informative debate.  You as much as
> >anyone else wrote the networking code, so for me it's your decision.  I
> >think the new TRUE default is fine.
> 
> OK. Any other opinions?

I think it was just a matter of not understanding what the attribute did or
didn't do, and in which circumstances it would be useful to change it.
Because of the potential for serious performance degradation when you get it
wrong, I think it would be helpful if the code automatically did the right
thing in most circumstances, and if the documentation for this attribute
made it clear in which circumstances it would make sense to turn it off.

> >   If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
> 
> So ... is it ok for you guys if I go on with the Retriever, Document and 
> HtHTTP patch as suggested in the previous e-mails?

I think that's what Neal was getting at when he said it's your decision.
You wrote the networking code, so you know better than anyone else what's
needed to make this particular change.  It sounds reasonable to me that
you'd need to make changes to these classes, as that's where the needed
decisions must be made about the appropriate default action.

> Basically, in order to perform always a HEAD call during an incremental 
> indexing, I need to store the information in both the Retriever and 
> Document class. Is that right for you? In particular, I suggest this enum:
> 
>          enum  RetrieverType {
>                  Retriever_Initial,
>                  Retriever_Incremental
>          };
> 
> and then change the constructor this way:
> 
>          Retriever(RetrieverLog flags = Retriever_noLog,
>                    RetrieverType t = Retriever_Initial);
> 
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
> 
>          if(!initial) // Switch the retriever type to Incremental
>                  retriever_type = Retriever_Incremental;
> 
> therefore, when we instantiate the main retriever object, we just simply 
> add this:
> 
>          Retriever retriever(Retriever_logUrl, retriever_type);
> 
> Please let me know.

Well, it seems to me that there are actually two different cases where
htdig does an initial dig.  The obvious one is when the user specifies
-i, which sets the initial flag.  The less obvious one is when htdig is
run without -i, but with no existing database, or with an empty one.
What matters is whether there are URLs in the database or not.  If there
are none, then you'll never reject a document as "not changed".
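
The two cases above reduce to a single test. A minimal sketch, assuming a hypothetical helper whose parameters stand for the -i flag and the number of URLs already in the database (neither name comes from the htdig source):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: 'initialFlag' stands for the -i option and
// 'urlsInDatabase' for the number of URLs already stored.
bool isEffectivelyInitialDig(bool initialFlag, std::size_t urlsInDatabase) {
    // With no known URLs, no document can be rejected as "not changed",
    // so the run behaves like an initial dig even without -i.
    return initialFlag || urlsInDatabase == 0;
}
```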

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-21 07:49:58

Hi guys,

At 17.02 20/10/2003 -0500, Gilles Detillieux wrote:
>wrong, I think it would be helpful if the code automatically did the right
>thing in most circumstances, and if the documentation for this attribute
>made it clear in which circumstances it would make sense to turn it off.

Yep. I think so too. Anyway, I modified defaults.cc, setting the 
attribute's default to 'true' and explaining that:

- during an incremental dig, the value is overridden;
- in general, it is recommended to leave this value on.

I did not specify cases in which the attribute should be turned off, as I 
thought that would only confuse users further.

However, I would pick these general cases where the user should disable 
the attribute (please review them):

Case A - Persistent connections on
1) the majority of documents are HTML (this means we "always" want to GET them)
2) the server does not support HEAD (I have seen cases like this unfortunately)
3) the persistent communication between htdig and the server is not fully 
reliable: there can be problems with persistent connections and HEAD calls 
(I sometimes experience this kind of problem with ht://Check and some NT 
servers)

Case B - Persistent connections off
1) same as case A
2) same as case A
3) I have never experienced any problem like the one in case A.3 with 
persistent connections disabled

>Well, it seems to me that there are actually two different cases where
>htdig does an initial dig.  The obvious one is when the user specifies
>-i, which sets the initial flag.  The less obvious one is when htdig is
>run without -i, but with no existing database, or with an empty one.
>What matters is whether there are URLs in the database or not.  If there
>are none, then you'll never reject a document as "not changed".

OK. Good point. I think I changed the Retriever class in order to perform 
this check as well. Also, during an incremental dig, if debug > 1 I show a 
notice message, saying that any head before get attribute configuration is 
overridden and that HEAD is always enabled.
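
The override described here could look roughly like the following. It is a sketch of the behaviour as stated in this message, not the committed code; the function name, its parameters and the wording of the notice are all assumptions.

```cpp
#include <cassert>
#include <iostream>

enum RetrieverType { Retriever_Initial, Retriever_Incremental };

// Hypothetical helper: decides whether a HEAD request precedes each GET.
// 'headBeforeGetAttr' stands for the configured head_before_get value.
bool useHeadBeforeGet(RetrieverType type, bool headBeforeGetAttr, int debug) {
    if (type == Retriever_Incremental) {
        // During an incremental dig the configured value is ignored.
        if (debug > 1)
            std::cerr << "notice: head_before_get setting overridden, "
                         "HEAD is always enabled in an incremental dig\n";
        return true;
    }
    // Initial dig: honour whatever the user configured.
    return headBeforeGetAttr;
}
```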

Sounds good?

Ciao and thanks,
-Gabriele

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-22 00:30:59

At 16.01 21/10/2003 -0500, Gilles Detillieux wrote:
> > 2) the server does not support HEAD (I have seen cases like this
> > unfortunately)
>OK, that sounds pretty important.  I hadn't heard that one before.

I meant that some server administrators may turn off the HEAD method (in 
Apache you can use the Limit directive).

>but don't support the HEAD request.  Wouldn't this be an argument against
>overriding head_before_get during an incremental dig?

I guess it is a matter of choosing the least painful solution. In the normal 
case (p/c on and hbg on) no overriding is done; however, in the 
incremental dig, one more request (HEAD) is made without success and 
hopefully - after that - the document gets retrieved. There is a bit of 
overhead for sure, but the question is: is it better to have a bit of 
overhead in a minority of cases, or to prevent users from getting the 
benefit of always using a working HEAD call when updating the database?

The other way is to remove the override and leave everything in the hands 
of the user (I would not mind this - provided, of course, that we write 
better documentation).

With the changes made yesterday we have moved towards a clearer situation 
anyway, because:
- head before get is now true by default
- head before get has been detached from persistent connections and has 
become independent

> > 3) cases where the persistent communication between htdig and the server
> > does not work at 100%: there can be some problems with persistent
> > connections and HEAD calls (I experience this kind of problems sometimes
> > with ht://Check and some NT servers)
>
>Again, is this going to be a problem if we don't allow turning off
>head_before_get during an update dig?

I guess this could be worked around, because the problem comes up with 
persistent connections - which can still be disabled.

>with these questionably compliant servers, then wouldn't they need a way
>of turning off head_before_get unconditionally, whether it's an update
>dig or an initial one?

Yes, that'd be great.

Again, I guess we have to balance making things easier for the user against 
leaving users enough freedom to configure their systems the way they want. 
Also, with 3.2, the server and URL blocks have added more dimensions to the 
space of configurability available to users, and ... the more "clear" 
attributes are available, the closer the toy gets to perfect.

>This is what I was getting at before about this option never being
>explained adequately.

You're right.

>   On the surface, it seemed to be rather useless,
>but with these new revelations that have come out of your testing, it
>seems there may indeed be a need for turning this off in some cases.
>That's the sort of thing that should be documented so others (developers
>and end-users) know what you'd use this for.

So ... we have 2 possibilities now:

1) leave the code as is
2) remove the overriding of the head before get in the incremental dig

In both cases we need to write better documentation for this 
attribute (especially for option 2, where we should talk about the 
benefits of a HEAD call in the incremental dig).

I must confess, I would prefer option 2, as I think users must have full 
control of the tool and IMHO by making HEAD before GET the default 
behaviour we've done our part.

So tell me what you think, especially you, Gilles and Neal, who have 
followed this thread. I am more than happy to change the code again 
today if needed.

Ciao ciao
-Gabriele

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Lachlan A. <lh...@us...> - 2003-10-22 13:35:08

Greetings all,

I've only been following this thread loosely, but my opinions are:

1. In version 3.2.1 (or 3.3, or 4.0) there should be three possible
settings:  true, false, auto.  That way the user has complete
control, but doesn't need to exert it.

2. We are in feature freeze, and scheduled to release in one week's
time, at the end of October.  We should minimise changes to the code.
Has a bug report been filed for this issue yet?  Wasn't the plan to
have no CVS commits without reference to a bug number?
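
The three-valued setting from point 1 might be modelled as below. This is a sketch only: the type and function names are invented for illustration, and the policy chosen for "auto" (HEAD on update digs, no HEAD on initial digs) is just one possible reading of the thread, not an agreed design.

```cpp
#include <cassert>
#include <string>

enum class HeadMode { Off, On, Auto };

// Hypothetical parser for a three-valued head_before_get attribute.
// Returns false and leaves 'out' untouched when the value is not recognised.
bool parseHeadMode(const std::string& value, HeadMode& out) {
    if (value == "true")  { out = HeadMode::On;   return true; }
    if (value == "false") { out = HeadMode::Off;  return true; }
    if (value == "auto")  { out = HeadMode::Auto; return true; }
    return false;
}

// Explicit settings always win; only Auto lets the program decide.
// Assumed Auto policy: send HEAD during incremental digs only.
bool shouldSendHead(HeadMode mode, bool incrementalDig) {
    switch (mode) {
        case HeadMode::On:  return true;
        case HeadMode::Off: return false;
        case HeadMode::Auto: break;
    }
    return incrementalDig;
}
```

With this shape the user keeps full control through "true" and "false", while "auto" leaves room for smarter defaults later.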

Cheers,
Lachlan

On Wed, 22 Oct 2003 08:30, Gabriele Bartolini wrote:

> So ... we have 2 possibilities now:
>
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental
> dig
>
> I must confess. I would prefer option 2, as I think users must
> have full control of the tool and IMHO by adding a default
> behaviour of HEAD before GET to the system we've done our part.

-- 
lh...@us...
ht://Dig developer DownUnder  (http://www.htdig.org)

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Neal R. <ne...@ri...> - 2003-10-22 20:43:32

Lachlan wrote:
> 2. We are in feature freeze, and scheduled to release in one week's
> time, at the end of October.  We should minimise changes to the code.
> Has a bug report been filed for this issue yet?  Wasn't the plan to
> have no CVS commits without reference to a bug number?

  Gabriele:  Please create a sourceforge bug for this when you change
it... and clue us all in on what the 'net change' is after the commits
;-).

  As far as the release goes, we need to get some kind of testing report
made and updated... I'll try and post something by tomorrow.

  Thanks.

Neal Richter

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-26 00:17:13

At 13.48 22/10/2003 -0600, Neal Richter wrote:
>   Gabriele:  Please create a sourceforge bug for this when you change
>it... and clue us all in on what the 'net change' is after the commits
>;-).

Sorry ... I forgot to open the bug before. Done everything.

Ciao
-Gabriele

[htdig-dev] Testing Tasks

From: Neal R. <ne...@ri...> - 2003-10-23 17:46:19

Hey all,
	I used Lachlan's breakdown (nearly unchanged) to create 19
testing tasks @ Sourceforge.

Please visit http://sourceforge.net/projects/htdig/ and navigate to
Tasks-> Testing 3.2

Some of them are fairly long, some are short.

I'm not assigning any to anyone, it's up to each of us to grab a task and
complete it.  I'm also leaving it up to each person to decide how deep to
test.  We need to get reasonable coverage.

I would also encourage each of you to use valgrind to check for memory
leaks while you are testing. Again, the depth you go looking for them is
up to you.
http://developer.kde.org/~sewardj/

If you need a Sourceforge account, please register yourself and send me an
email with your account and I'll add you to the htDig project.

If you find an error during testing please:
1) Create a bug
2) Contact appropriate developer or fix it yourself
3) Test fix
4) Commit fix (if you are fixing it)
5) Update status of bug
6) Have a second person either test fix or verify that committed
   code looks OK.. their choice.

So the standards for release of 3.2RC1 are:
1) No important bugs in 'Include_in_3.2' queue
2) All testing tasks completed.

Sourceforge is very cool!

Thanks all!

Neal Richter

[htdig-dev] Re: Testing Tasks

From: Lachlan A. <lh...@us...> - 2003-10-25 09:32:39

Greetings all,

Thanks, Neal, for your work setting up the tasks!

When I run  valgrind,  I get
==4058== Conditional jump or move depends on uninitialised value(s)
==4058==    at 0x40300421: CDB___lock_put_nolock (lock.c:650)
from various contexts.

I don't fancy changing the BDB code (my last foray was rather
forgettable :)  Does anyone think that this is an issue, or should we
ignore it?

Cheers,
Lachlan

On Fri, 24 Oct 2003 01:09, Neal Richter wrote:

> I would also encourage each of you to use valgrind to check for
> memory leaks while you are testing. Again, the depth you go looking
> for them is up to you.


Re: [htdig-dev] Re: Testing Tasks

From: Neal R. <ne...@ri...> - 2003-10-26 23:01:35

Yea, ignore the BDB errors.

And most errors about uninitialized memory reads, and "Conditional jump or
move depends on uninitialised value(s)", are spurious.  They are produced by
the compiler's code generation and there isn't too much that can be done to
eliminate them at the C/C++ level.

You are kicking butt on the testing tasks..

On Fri, 24 Oct 2003, Lachlan Andrew wrote:

> Greetings all,
>
> Thanks, Neal, for your work setting up the tasks!
>
> When I run  valgrind,  I get
> ==4058== Conditional jump or move depends on uninitialised value(s)
> ==4058==    at 0x40300421: CDB___lock_put_nolock (lock.c:650)
> from various contexts.
>
> I don't fancy changing the BDB code (my last foray was rather
> forgettable :)  Does anyone think that this is an issue, or should we
> ignore it?
>
> Cheers,
> Lachlan
>
> On Fri, 24 Oct 2003 01:09, Neal Richter wrote:
>
> > I would also encourage each of you to use valgrind to check for
> > memory leaks while you are testing. Again, the depth you go looking
> > for them is up to you.
>

Neal Richter

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Neal R. <ne...@ri...> - 2003-10-22 19:51:00

Gabriele wrote:
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental dig
>
> In both cases we need to write down a better documentation for this
> attribute (especially in the option 2 where we should talk about the
> benefits of a HEAD call in the incremental dig).
>
> I must confess. I would prefer option 2, as I think users must have full
> control of the tool and IMHO by adding a default behaviour of HEAD before
> GET to the system we've done our part.

  OK, you've convinced me, it IS useful to have this switch be user
controlled..  I wasn't aware of the non-compliant servers causing an
issue.  Clearly 'automatic' behavior in that case is a bad thing.
Go with option 2.

  Thanks!

Neal Richter

[htdig-dev] Sourceforge Bugs

From: Neal R. <ne...@ri...> - 2003-10-22 21:20:47

Hey all,
	Please go to sourceforge and look at the open bugs if you can,
there are 18 'Status:Open' now.  There are 6 bugs in the 'Status:Open &
Group:Include_in_3.2' state.

Gabriele:  Did you fix this one already?
[ 594790 ] rundig doesn't index Apache w/mod_zip

Thanks.

Neal Richter

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-23 17:33:29

At 13.45 22/10/2003 -0600, Neal Richter wrote:
>   OK, you've convinced me, it IS useful to have this switch be user
>controlled..  I wasn't aware of the non-compliant servers causing an
>issue.  Clearly 'automatic' behavior in that case is a bad thing.
>Go with option 2.

Roger that. :-)

-Gabriele

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gilles D. <gr...@sc...> - 2003-10-25 23:23:25

According to Gabriele Bartolini:
> At 13.45 22/10/2003 -0600, Neal Richter wrote:
> >   OK, you've convinced me, it IS useful to have this switch be user
> >controlled..  I wasn't aware of the non-compliant servers causing an
> >issue.  Clearly 'automatic' behavior in that case is a bad thing.
> >Go with option 2.
> 
> Roger that. :-)

I guess the only safe way to automate the selection of this would be
for htdig to keep track, on a server by server basis, to see if a server
responds favourably to HEAD requests.  If it doesn't, then it would turn
off this action for this server, but otherwise it seems it would almost
always be an advantage to keep it on.  But now we're getting into the
area of feature requests, not bug fixes, so this should wait till after
the upcoming release.
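
The per-server bookkeeping suggested above might be sketched like this. The class and method names are hypothetical and nothing like it exists in the code under discussion; keying on a "host:port" string is also an assumption.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical tracker: remembers servers whose HEAD requests failed,
// so later documents on those servers go straight to GET.
class HeadSupportTracker {
public:
    // HEAD stays enabled for any server that has not failed yet.
    bool shouldTryHead(const std::string& server) const {
        return noHead_.count(server) == 0;
    }

    // Called when a server responds unfavourably to a HEAD request.
    void recordHeadFailure(const std::string& server) {
        noHead_.insert(server);
    }

private:
    std::set<std::string> noHead_;  // servers with HEAD switched off
};
```

One failure disables HEAD for the whole server in this sketch, which is the coarse behaviour a per-server scheme implies.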

If I'm not mistaken, as the code now stands, htdig will assume a document
is inaccessible if the HEAD request fails, and so it won't try the GET on
that document at all (unless head_before_get is explicitly set to false).
So, properly automating this selection would require some code changes
to the HtHTTP class to implement this -- not something we want to start
monkeying with at the eleventh hour before release.

I think the current compromise is best, but it should be given a good
pounding to make sure it's solid.


Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-25 09:24:27

At 10.19 23/10/2003 -0500, Gilles Detillieux wrote:
>I guess the only safe way to automate the selection of this would be
>for htdig to keep track, on a server by server basis, to see if a server
>responds favourably to HEAD requests.  If it doesn't, then it would turn
>off this action for this server, but otherwise it seems it would almost
>always be an advantage to keep it on.  But now we're getting into the

That's what actually happens with persistent connections. However, for 
instance, the 'Limit' directive in Apache can be set per directory or 
location, and I would not risk disabling the attribute for every document 
on the server just because one failed. Again, I guess that the 'webmaster' 
is the one who knows his scenario better than anyone.

>If I'm not mistaken, as the code now stands, htdig will assume a document
>is inaccessible if the HEAD request fails, and so it won't try the GET on
>that document at all (unless head_before_get is explicitly set to false).

Hmmm ... looking at the code in HtHTTP::Request(), we should add an 'if' 
statement for the case where the server returns a 405 status code. We 
should also add a proper Document Status for this in the Transport class 
(Document_method_not_allowed?).

Basically, when we issue a HEAD request and get a "method not allowed" 
response, we should GET the resource.
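
A minimal sketch of that fallback, assuming a hypothetical decision function called after the HEAD response arrives. It is not the real HtHTTP::Request() logic; the enum and its values are invented for illustration, and content-type checks are left out.

```cpp
#include <cassert>

// Hypothetical outcome of inspecting a HEAD response.
enum NextStep { Step_Get, Step_Skip };

// Decide what to do with a document once HEAD has returned 'httpStatus'.
NextStep afterHead(int httpStatus) {
    assert(httpStatus >= 100 && httpStatus <= 599);  // valid HTTP status range

    if (httpStatus == 405)
        return Step_Get;   // method not allowed: fall back to GET
    if (httpStatus >= 200 && httpStatus < 300)
        return Step_Get;   // HEAD succeeded: go on and fetch the body
    return Step_Skip;      // 304, 4xx, 5xx, ...: no GET in this sketch
}
```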

What do you think? Anyway, I am going to open a feature request for this so 
we keep it in mind.

>So, properly automating this selection would require some code changes
>to the HtHTTP class to implement this -- not something we want to start
>monkeying with at the eleventh hour before release.

I agree.

Ciao,
-Gabriele

Re: [htdig-dev] Sourceforge Bugs

From: Gabriele B. <bar...@in...> - 2003-10-24 00:54:21

>Gabriele:  Did you fix this one already?
>[ 594790 ] rundig doesn't index Apache w/mod_zip

Yep ... and [828628] too. Sorry, I had understood that someone other than me 
should close them after testing. As far as I'm concerned they are both fixed 
and closed.

Thanks,
-Gabriele

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gilles D. <gr...@sc...> - 2003-10-22 09:47:55

According to Gabriele Bartolini:
> However, I would pick these general cases, where the user should disable 
> the attribute (please revise it):
> 
> Case A - Persistent connections on
> 1) the majority of documents are HTML (this means we "always" want to GET them)
> 2) the server does not support HEAD (I have seen cases like this unfortunately)

OK, that sounds pretty important.  I hadn't heard that one before.
Persistent connections are only on for HTTP/1.1 servers, so what you're
saying is that there are servers out there that claim to be 1.1 compliant
but don't support the HEAD request.  Wouldn't this be an argument against
overriding head_before_get during an incremental dig?

> 3) cases where the persistent communication between htdig and the server 
> does not work at 100%: there can be some problems with persistent 
> connections and HEAD calls (I experience this kind of problems sometimes 
> with ht://Check and some NT servers)

Again, is this going to be a problem if we don't allow turning off
head_before_get during an update dig?

> Case B - Persistent connection off
> 1) same as case A
> 2) same as case A

In this case, the server could be HTTP/1.1 or 1.0.  Either way, the same
question applies.  If the user needs a way to tell htdig to deal nicely
with these questionably compliant servers, then wouldn't they need a way
of turning off head_before_get unconditionally, whether it's an update
dig or an initial one?

This is what I was getting at before about this option never being
explained adequately.  On the surface, it seemed to be rather useless,
but with these new revelations that have come out of your testing, it
seems there may indeed be a need for turning this off in some cases.
That's the sort of thing that should be documented so others (developers
and end-users) know what you'd use this for.


Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Neal R. <ne...@ri...> - 2003-10-20 22:46:43

The overall question I have is this (it was pointed out by someone in an
earlier mail):

Given that calling HEAD enables us to short-circuit files with invalid
mime-types.. isn't it nearly always beneficial to call HEAD, even when
doing an 'initial-dig'?
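
The short-circuit in question amounts to checking the Content-Type from the HEAD response before paying for the GET. A rough sketch follows; the helper name and the list of accepted types are assumptions made for illustration, not htdig's actual parser table.

```cpp
#include <cassert>
#include <string>

// Hypothetical check run on the Content-Type header of a HEAD response.
// Returns true when the document looks worth fetching with GET.
bool worthGetting(const std::string& contentType) {
    // Compare only the media type, dropping parameters such as
    // "; charset=utf-8". find() returns npos when there is no ';',
    // in which case substr() keeps the whole string.
    const std::string type = contentType.substr(0, contentType.find(';'));

    // Assumed set of indexable types.
    return type == "text/html" || type == "text/plain";
}
```

On a site full of images or archives, every rejected type saves one full download at the cost of one small HEAD exchange.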

The answer to this question may influence your choice of what to commit,
but the description below looks good to me if we want to never call HEAD
during an initial dig.

Thanks.

On Sun, 19 Oct 2003, Gabriele Bartolini wrote:

>
> >   I think what we've had here is informative debate.  You as much as
> >anyone else wrote the networking code, so for me it's your decision.  I
> >think the new TRUE default is fine.
>
> OK. Any other opinions?
>
> >   If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
>
> So ... is it ok for you guys if I go on with the Retriever, Document and
> HtHTTP patch as suggested in the previous e-mails?
>
> Basically, in order to perform always a HEAD call during an incremental
> indexing, I need to store the information in both the Retriever and
> Document class. Is that right for you? In particular, I suggest this enum:
>
>          enum  RetrieverType {
>                  Retriever_Initial,
>                  Retriever_Incremental
>          };
>
> and then change the constructor this way:
>
>          Retriever(RetrieverLog flags = Retriever_noLog,
>                    RetrieverType t = Retriever_Initial);
>
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
>
>          if(!initial) // Switch the retriever type to Incremental
>                  retriever_type = Retriever_Incremental;
>
> therefore, when we instantiate the main retriever object, we just simply
> add this:
>
>          Retriever retriever(Retriever_logUrl, retriever_type);
>
> Please let me know.
>
> Ciao and thanks,
> -Gabriele

Neal Richter

Re: [htdig-dev] head_before_get attribute (was: 3.2RC1 Feature Freeze)

From: Gabriele B. <bar...@in...> - 2003-10-21 08:53:08

I reread my e-mail and I think I should have worded this sentence 
differently:

>2) performing HEAD calls only in the incremental dig (either with or 
>without persistent connections)

I meant: "in the incremental dig perform just HEAD calls". I guess you guys 
understood: "HEAD is performed only in incremental digs".

If so ... I am sorry about that and about my English.

Ciao
-Gabriele