From: Neal R. <ne...@ri...> - 2003-10-22 21:20:47
Hey all,

Please go to SourceForge and look at the open bugs if you can; there are 18 in the 'Status:Open' state now. There are 6 bugs in the 'Status:Open & Group:Include_in_3.2' state.

Gabriele: did you fix this one already?

[ 594790 ] rundig doesn't index Apache w/mod_zip

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Neal R. <ne...@ri...> - 2003-10-22 20:43:32
Lachlan wrote:
> 2. We are in feature freeze, and scheduled to release in one week's
> time, at the end of October. We should minimise changes to the code.
> Has a bug report been filed for this issue yet? Wasn't the plan to
> have no CVS commits without reference to a bug number?

Gabriele: Please create a SourceForge bug for this when you change it... and clue us all in on what the 'net change' is after the commits ;-).

As far as the release goes, we need to get some kind of testing report made and updated... I'll try to post something by tomorrow.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Neal R. <ne...@ri...> - 2003-10-22 19:51:00
Gabriele wrote:
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental dig
>
> In both cases we need to write better documentation for this
> attribute (especially for option 2, where we should talk about the
> benefits of a HEAD call in the incremental dig).
>
> I must confess I would prefer option 2, as I think users must have full
> control of the tool, and IMHO by adding a default behaviour of HEAD before
> GET to the system we've done our part.

OK, you've convinced me; it IS useful to have this switch be user controlled. I wasn't aware of the non-compliant servers causing an issue. Clearly 'automatic' behaviour in that case is a bad thing. Go with option 2.

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Lachlan A. <lh...@us...> - 2003-10-22 15:11:45
Greetings Gilles,

In htcommon/defaults.cc, startyear is specified as 1970, so your config file would have to explicitly clear startyear to say no date is given. The reason for startyear being specified in defaults.cc is that the default value should be in attrs.html, which is automatically generated.

The three fixes I can think of (in order of my preference) are:
1. Set the (hard-coded) default value of startday in htsearch/Display.cc to 0 instead of 1. I'm not sure if this would work, and it may break other things.
2. Leave startyear empty in defaults.cc and manually hack attrs.html.
3. Leave startyear undocumented.

Opinions?

Cheers,
Lachlan

On Wed, 22 Oct 2003 08:30, Gilles Detillieux wrote:
> even though these dozen or so web pages were definitely in the database,
> and came out into db.docs after an htdump (with a m:0 field),
> htsearch would not show these in search results. I looked at the
> code, and the only thing that I can see that would cause this is if
> the startyear, startmonth or startday input parameters were set,
> causing the timet_startdate value in Display.cc to be greater than
> 0. But I didn't set these! I ran htsearch from the command line,
> so I know I wasn't passing it these values as input parameters, and
> the config file I used didn't define these as attributes either.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Lachlan A. <lh...@us...> - 2003-10-22 13:35:08
Greetings all,

I've only been following this thread loosely, but my opinions are:

1. In version 3.2.1 (or 3.3, or 4.0) there should be three possible settings: true, false, auto. That way the user has complete control, but doesn't need to exert it.

2. We are in feature freeze, and scheduled to release in one week's time, at the end of October. We should minimise changes to the code. Has a bug report been filed for this issue yet? Wasn't the plan to have no CVS commits without reference to a bug number?

Cheers,
Lachlan

On Wed, 22 Oct 2003 08:30, Gabriele Bartolini wrote:
> So ... we have 2 possibilities now:
>
> 1) leave the code as is
> 2) remove the overriding of the head before get in the incremental dig
>
> I must confess I would prefer option 2, as I think users must
> have full control of the tool and IMHO by adding a default
> behaviour of HEAD before GET to the system we've done our part.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

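Lachlan's three-state suggestion could look something like the sketch below. The names here are illustrative, not the actual ht://Dig API: the point is that "auto" defers to the crawler (HEAD only on incremental digs), while "true"/"false" let the user force the behaviour either way.

```cpp
#include <cassert>
#include <string>

// Hypothetical tri-state head_before_get attribute, as Lachlan proposes.
enum class HeadBeforeGet { True, False_, Auto };

// Parse the attribute value from the config file; anything other than
// "true"/"false" falls back to Auto.
HeadBeforeGet parseHeadBeforeGet(const std::string &value)
{
    if (value == "true")  return HeadBeforeGet::True;
    if (value == "false") return HeadBeforeGet::False_;
    return HeadBeforeGet::Auto;
}

// Decide whether to send a HEAD request before the GET. In Auto mode
// the decision is made by the dig type: HEAD only on incremental digs.
bool shouldSendHead(HeadBeforeGet setting, bool incrementalDig)
{
    switch (setting) {
    case HeadBeforeGet::True:   return true;
    case HeadBeforeGet::False_: return false;
    default:                    return incrementalDig;  // Auto
    }
}
```

This would give the user complete control without requiring them to exert it, which is the property Lachlan is after.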
From: Gilles D. <gr...@sc...> - 2003-10-22 09:47:55
According to Gabriele Bartolini:
> However, I would pick these general cases, where the user should disable
> the attribute (please revise it):
>
> Case A - Persistent connections on
> 1) the majority of documents are HTML (this means we "always" want to GET them)
> 2) the server does not support HEAD (I have seen cases like this unfortunately)

OK, that sounds pretty important. I hadn't heard that one before. Persistent connections are only on for HTTP/1.1 servers, so what you're saying is that there are servers out there that claim to be 1.1 compliant but don't support the HEAD request. Wouldn't this be an argument against overriding head_before_get during an incremental dig?

> 3) cases where the persistent communication between htdig and the server
> does not work at 100%: there can be some problems with persistent
> connections and HEAD calls (I experience this kind of problems sometimes
> with ht://Check and some NT servers)

Again, is this going to be a problem if we don't allow turning off head_before_get during an update dig?

> Case B - Persistent connection off
> 1) same as case A
> 2) same as case A

In this case, the server could be HTTP/1.1 or 1.0. Either way, the same question applies. If the user needs a way to tell htdig to deal nicely with these questionably compliant servers, then wouldn't they need a way of turning off head_before_get unconditionally, whether it's an update dig or an initial one?

This is what I was getting at before about this option never being explained adequately. On the surface, it seemed to be rather useless, but with these new revelations that have come out of your testing, it seems there may indeed be a need for turning this off in some cases. That's the sort of thing that should be documented so others (developers and end-users) know what you'd use this for.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

From: Gabriele B. <bar...@in...> - 2003-10-22 00:30:59
At 16.01 21/10/2003 -0500, Gilles Detillieux wrote:
> > 2) the server does not support HEAD (I have seen cases like this
> > unfortunately)
>
> OK, that sounds pretty important. I hadn't heard that one before.

I meant that some server administrators may turn off the HEAD method (in Apache you can use the Limit directive).

> but don't support the HEAD request. Wouldn't this be an argument against
> overriding head_before_get during an incremental dig?

I guess it is a matter of choosing the less painful solution. In the normal case (p/c on and hbg on) overriding is not done; however, in the incremental dig, one more request is made (HEAD) without success and hopefully - after that - the document gets retrieved. There is a bit of overhead for sure, but the question is: is it better to have a bit of overhead in some cases (a minority), or to prevent users from getting the benefit of always using a working HEAD call when updating the database?

The other way is to remove the override and leave everything in the hands of the user (I would not mind this - of course, providing better documentation). With the changes done yesterday we have moved towards a clearer situation anyway, because:

- head before get is now true by default
- head before get has been detached from persistent connections and has become independent

> > 3) cases where the persistent communication between htdig and the server
> > does not work at 100%: there can be some problems with persistent
> > connections and HEAD calls (I experience this kind of problems sometimes
> > with ht://Check and some NT servers)
>
> Again, is this going to be a problem if we don't allow turning off
> head_before_get during an update dig?

I guess this could be fixable, because the problem comes up with persistent connections - which may still be disabled.

> with these questionably compliant servers, then wouldn't they need a way
> of turning off head_before_get unconditionally, whether it's an update
> dig or an initial one?

Yes, that'd be great. Again, I guess we have to balance what we can do to make things easier for the user while, at the same time, leaving users enough freedom to configure their systems the way they want. Also, with 3.2, the server and URL blocks have added more dimensions to the space of configurability available to users, and the more "clear" attributes are available, the better the tool gets.

> This is what I was getting at before about this option never being
> explained adequately.

You're right.

> On the surface, it seemed to be rather useless,
> but with these new revelations that have come out of your testing, it
> seems there may indeed be a need for turning this off in some cases.
> That's the sort of thing that should be documented so others (developers
> and end-users) know what you'd use this for.

So ... we have 2 possibilities now:

1) leave the code as is
2) remove the overriding of the head before get in the incremental dig

In both cases we need to write better documentation for this attribute (especially with option 2, where we should talk about the benefits of a HEAD call in the incremental dig).

I must confess I would prefer option 2, as I think users must have full control of the tool, and IMHO by adding a default behaviour of HEAD before GET to the system we've done our part.

So tell me what you think, especially you, Gilles and Neal, who have followed this thread. I am more than happy to (in case) change the code again today.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gilles D. <gr...@sc...> - 2003-10-21 22:55:13
Hey, guys. I ran into something weird when I was testing out the allow_numbers changes last week, which I haven't been quite able to explain or track down in the code.

Of the pages on my site that I was indexing, about a dozen were from a CGI script that puts out a Last-Modified header to set the date appropriately in search results. Because of a recent bug in the script, which I just fixed last week, it turns out that the Last-Modified headers were coming out with no date on them, so htdig was giving them a modtime of 0 (i.e. the epoch). This is different behaviour from htdig 3.1.6, which gave them the current time instead. It may be that the 3.2 code should be fixed to do likewise, as it seems the more sensible behaviour.

However, that's not the weird thing. What was odd is that even though these dozen or so web pages were definitely in the database, and came out into db.docs after an htdump (with a m:0 field), htsearch would not show them in search results. I looked at the code, and the only thing that I can see that would cause this is if the startyear, startmonth or startday input parameters were set, causing the timet_startdate value in Display.cc to be greater than 0. But I didn't set these! I ran htsearch from the command line, so I know I wasn't passing it these values as input parameters, and the config file I used didn't define these as attributes either.

I know the problem was the 0 modtime, because when I fixed the CGI script to return a proper Last-Modified header, the pages showed up in htsearch, with no other changes being made. Does anyone know of anything else that might explain this behaviour? I'd start putting trace prints in htsearch to track this down, but I have too many high-priority things right now to spend much time on ht://Dig right away. htsearch -vvvv didn't give any indication of what might be going on - the URLs in question never even showed up in the output.

I don't think I'd consider this a showstopper, but it does seem odd that htsearch rejects any modtime value at all when none of those parameters have been specified. This, coupled with the fact that htdig will assign a 0 modtime if it can't parse the Last-Modified header (as opposed to a missing Last-Modified header, which should be taken as the current time if I'm not mistaken), could lead to others having similar problems.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

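The behaviour Gilles describes can be reduced to a sketch. The function below is hypothetical (the real logic lives in htsearch/Display.cc and is more involved), but it shows why a document whose Last-Modified header failed to parse falls out of the results: once a start date greater than zero is in effect, a modtime of 0 can never pass the comparison.

```cpp
#include <cassert>
#include <ctime>

// Illustrative reduction of the htsearch date filter: a document is
// shown only if its modification time is not earlier than the start
// date. With startyear defaulting to 1970 in defaults.cc,
// timet_startdate can end up greater than 0, so a document whose
// Last-Modified failed to parse (modtime == 0) is silently excluded.
bool passesDateFilter(time_t modtime, time_t timet_startdate)
{
    if (timet_startdate > 0 && modtime < timet_startdate)
        return false;   // rejected as "older" than the start date
    return true;
}
```

With a start date of 0 (no date given) the epoch-dated document would pass, which matches Lachlan's suggestion of clearing the hard-coded defaults.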
From: Gabriele B. <bar...@in...> - 2003-10-21 08:53:08
I read my e-mail again and I think I should have written this sentence another way:

> 2) performing HEAD calls only in the incremental dig (either with or
> without persistent connections)

I meant: "in the incremental dig, perform just HEAD calls". I guess you guys understood: "HEAD is performed only in incremental digs". If so ... I am sorry about that and my English.

Ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gabriele B. <bar...@in...> - 2003-10-21 07:49:58
Hi guys,

At 17.02 20/10/2003 -0500, Gilles Detillieux wrote:
> wrong, I think it would be helpful if the code automatically did the right
> thing in most circumstances, and if the documentation for this attribute
> made it clear in which circumstances it would make sense to turn it off.

Yep, I think so too. Anyway, I modified defaults.cc by putting the attribute in a 'true' default state and by explaining that:

- during an incremental dig, the value is overridden;
- in general, it is recommended to leave this value on.

I did not specify cases in which the attribute should be turned off, as I thought I would have generated more confusion for the user. However, I would pick these general cases where the user should disable the attribute (please revise it):

Case A - Persistent connections on
1) the majority of documents are HTML (this means we "always" want to GET them)
2) the server does not support HEAD (I have seen cases like this, unfortunately)
3) cases where the persistent communication between htdig and the server does not work 100%: there can be some problems with persistent connections and HEAD calls (I experience this kind of problem sometimes with ht://Check and some NT servers)

Case B - Persistent connections off
1) same as case A
2) same as case A
3) I have never experienced any problem as in case A.3 with persistent connections disabled

> Well, it seems to me that there are actually two different cases where
> htdig does an initial dig. The obvious one is when the user specifies
> -i, which sets the initial flag. The less obvious one is when htdig is
> run without -i, but with no existing database, or with an empty one.
> What matters is whether there are URLs in the database or not. If there
> are none, then you'll never reject a document as "not changed".

OK. Good point. I think I changed the Retriever class to perform this check as well.

Also, during an incremental dig, if debug > 1, I show a notice message saying that any head before get attribute configuration is overridden and that HEAD is always enabled. Sounds good?

Ciao and thanks,
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno

From: Gabriele B. <bar...@in...> - 2003-10-21 07:47:46
Hi guys,
I have given a higher debug level (2 instead of 1) to the display of
the configuration information of a Server (performed in the constructor); I
remember I enabled it when we first started the server configuration and -
sorry - I found it really frustrating now. I think that a level 1 debug was
inappropriate for that, and I guess level 2 is enough.
Ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
From: Gilles D. <gr...@sc...> - 2003-10-20 23:00:35
According to Gabriele Bartolini:
> At 14.39 14/10/2003 -0500, Gilles Detillieux wrote:
> > It would be a good idea, in the general case, for us all to learn how
> > to properly override config parameters in the code, so that a server
> > block or URL block definition doesn't override an internal override
>
> Maybe I am missing something. I am not aware of a way that allows us to
> override block definitions through the Configuration classes. Can you
> please point it out to me? Sorry.

No, I think you're right that it can't be done in the Configuration class right now. It seems the only way to override block attribute definitions in the code is to add logic where the attribute is used and ignore the attribute if it's appropriate to do so. The logic now is that server block definitions override globals, and URL block definitions override both (for attribute definitions that can be used in server or URL blocks). So, there's no way in the code to globally override a server or URL block definition.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)

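The precedence Gilles describes can be modelled in a few lines. This is illustrative only (the real HtConfiguration class has its own lookup machinery): a URL block overrides a server block, which overrides the global value, and nothing short of per-call-site logic can override a block definition from the code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the attribute lookup order: URL block first,
// then server block, then the global defaults. An empty string means
// the attribute is not defined anywhere.
std::string lookupAttribute(const std::string &name,
                            const std::map<std::string, std::string> &urlBlock,
                            const std::map<std::string, std::string> &serverBlock,
                            const std::map<std::string, std::string> &globals)
{
    auto it = urlBlock.find(name);
    if (it != urlBlock.end()) return it->second;      // URL block wins
    it = serverBlock.find(name);
    if (it != serverBlock.end()) return it->second;   // then server block
    it = globals.find(name);
    return it != globals.end() ? it->second : "";     // then global default
}
```

Because the most specific block always wins, a global "internal override" in the code would have to ignore the looked-up value at the point of use, which is exactly the workaround Gilles mentions.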
From: Neal R. <ne...@ri...> - 2003-10-20 22:46:43
The overall question I have is this (it was pointed out by someone in an
earlier mail):

Given that calling HEAD enables us to short-circuit files with invalid
mime-types, isn't it nearly always beneficial to call HEAD, even when
doing an 'initial dig'?

The answer to this question may influence your choice of what to commit,
but the description below looks good to me if we want to never call HEAD
during an initial dig.

Thanks.
Thanks.
On Sun, 19 Oct 2003, Gabriele Bartolini wrote:
>
> > I think what we've had here is informative debate. You as much as
> >anyone else wrote the networking code, so for me it's your decision. I
> >think the new TRUE default is fine.
>
> OK. Any other opinions?
>
> > If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
>
> So ... is it ok for you guys if I go on with the Retriever, Document and
> HtHTTP patch as suggested in the previous e-mails?
>
> Basically, in order to perform always a HEAD call during an incremental
> indexing, I need to store the information in both the Retriever and
> Document class. Is that right for you? In particular, I suggest this enum:
>
> enum RetrieverType {
> Retriever_Initial,
> Retriever_Incremental
> };
>
> and then change the constructor this way:
>
> Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
> Retriever_Initial);
>
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
>
> if(!initial) // Switch the retriever type to Incremental
> retriever_type = Retriever_Incremental;
>
> therefore, when we instantiate the main retriever object, we just simply
> add this:
>
> Retriever retriever(Retriever_logUrl, retriever_type);
>
> Please let me know.
>
> Ciao and thanks,
> -Gabriele
> --
> Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
> maintainer
> Current Location: Melbourne, Victoria, Australia
> bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
> Inferno
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
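Neal's point about short-circuiting on invalid mime-types can be sketched as follows. The helper names are hypothetical (htdig's real check is driven by its configured content-type handling), but the idea is the same: if the Content-Type reported by a cheap HEAD request is not something the indexer can parse, the potentially large GET is skipped entirely.

```cpp
#include <cassert>
#include <string>

// Illustrative only: decide from a HEAD response's Content-Type
// whether a full GET is worth issuing at all.
bool isParsableContentType(const std::string &contentType)
{
    // A crawler typically only parses a small set of text types.
    return contentType.rfind("text/html", 0) == 0
        || contentType.rfind("text/plain", 0) == 0;
}

// Returns true when the document should be fetched with GET.
bool worthFetching(const std::string &headContentType)
{
    return isParsableContentType(headContentType);
}
```

On a site with many images or binaries, each skipped GET saves a full transfer, which is why HEAD can pay off even on an initial dig.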
From: Gilles D. <gr...@sc...> - 2003-10-20 22:03:50
According to Gabriele Bartolini:
> > I think what we've had here is informative debate. You as much as
> >anyone else wrote the networking code, so for me it's your decision. I
> >think the new TRUE default is fine.
>
> OK. Any other opinions?
I think it was just a matter of not understanding what the attribute did or
didn't do, and in which circumstances it would be useful to change it.
Because of the potential for serious performance degradation when you get it
wrong, I think it would be helpful if the code automatically did the right
thing in most circumstances, and if the documentation for this attribute
made it clear in which circumstances it would make sense to turn it off.
> > If you've perfected this logic in ht://Check, then we should probably
> >consider syncing with your net code after 3.2 is done.
>
> So ... is it ok for you guys if I go on with the Retriever, Document and
> HtHTTP patch as suggested in the previous e-mails?
I think that's what Neal was getting at when he said it's your decision.
You wrote the networking code, so you know better than anyone else what's
needed to make this particular change. It sounds reasonable to me that
you'd need to make changes to these classes, as that's where the needed
decisions must be made about the appropriate default action.
> Basically, in order to perform always a HEAD call during an incremental
> indexing, I need to store the information in both the Retriever and
> Document class. Is that right for you? In particular, I suggest this enum:
>
> enum RetrieverType {
> Retriever_Initial,
> Retriever_Incremental
> };
>
> and then change the constructor this way:
>
> Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
> Retriever_Initial);
>
> In 'htdig.cc', we check whether the dig is an initial dig or not and:
>
> if(!initial) // Switch the retriever type to Incremental
> retriever_type = Retriever_Incremental;
>
> therefore, when we instantiate the main retriever object, we just simply
> add this:
>
> Retriever retriever(Retriever_logUrl, retriever_type);
>
> Please let me know.
Well, it seems to me that there are actually two different cases where
htdig does an initial dig. The obvious one is when the user specifies
-i, which sets the initial flag. The less obvious one is when htdig is
run without -i, but with no existing database, or with an empty one.
What matters is whether there are URLs in the database or not. If there
are none, then you'll never reject a document as "not changed".
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
From: Gabriele B. <bar...@in...> - 2003-10-19 13:37:42
> I think what we've had here is informative debate. You as much as
>anyone else wrote the networking code, so for me it's your decision. I
>think the new TRUE default is fine.
OK. Any other opinions?
> If you've perfected this logic in ht://Check, then we should probably
>consider syncing with your net code after 3.2 is done.
So ... is it ok for you guys if I go on with the Retriever, Document and
HtHTTP patch as suggested in the previous e-mails?
Basically, in order to perform always a HEAD call during an incremental
indexing, I need to store the information in both the Retriever and
Document class. Is that right for you? In particular, I suggest this enum:
enum RetrieverType {
Retriever_Initial,
Retriever_Incremental
};
and then change the constructor this way:
Retriever(RetrieverLog flags = Retriever_noLog, RetrieverType t =
Retriever_Initial);
In 'htdig.cc', we check whether the dig is an initial dig or not and:
if(!initial) // Switch the retriever type to Incremental
retriever_type = Retriever_Incremental;
therefore, when we instantiate the main retriever object, we just simply
add this:
Retriever retriever(Retriever_logUrl, retriever_type);
Please let me know.
Ciao and thanks,
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
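Put together, Gabriele's proposal amounts to something like this compilable reduction (class and member names are simplified from the real Retriever sources): the dig type is decided once in htdig.cc and handed to the Retriever, which can then force a HEAD request on incremental digs regardless of the head_before_get setting.

```cpp
// Sketch of the proposed change: the dig type travels with the
// Retriever, so the networking layer can always issue HEAD before GET
// when updating an existing database.
enum RetrieverType {
    Retriever_Initial,
    Retriever_Incremental
};

class Retriever {
public:
    explicit Retriever(RetrieverType t = Retriever_Initial) : type(t) {}

    // On an incremental dig the HEAD call is always worthwhile: it lets
    // the crawler skip documents that have not changed.
    bool alwaysHeadBeforeGet() const { return type == Retriever_Incremental; }

private:
    RetrieverType type;
};

// In htdig.cc, the type would be chosen from the -i flag (and, per
// Gilles' point, from whether the database already contains URLs):
//   RetrieverType retriever_type =
//       initial ? Retriever_Initial : Retriever_Incremental;
//   Retriever retriever(retriever_type);
```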
From: Lachlan A. <lh...@us...> - 2003-10-17 23:56:27
Thanks for the offer of testing, Ted.

Regarding test cases, I think that the main part of testing at the moment is actually generating the test cases. Essentially, we have to use each of the features of ht://Dig and make sure that it works as documented. Neal has suggested testing each of the configuration attributes and command line arguments. If we're keen, we should also test each template variable.

From the attached list of attributes, select a group of attributes. Write a config file which sets each of them to some value. One by one, change the attribute in a way which should produce an observable change, and make sure you observe that change. For example, if you were testing the "meta" group, you would check that, with create_url_list=true, it correctly creates a list of URLs retrieved, and that with create_url_list=false, it doesn't create such a list.

This testing may be very simplistic, but it does reveal bugs.

Thanks again,
Lachlan

On Tue, 14 Oct 2003 23:23, Ted Stresen-Reuter wrote:
> I'm happy to do testing on Mac OS X.
> When requesting help testing, please provide a test case (the steps
> one must take to complete the test) and the intended behavior (so
> the testers know what to look for and what shouldn't be appearing).

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Lachlan A. <lh...@us...> - 2003-10-17 13:38:34
On Wed, 15 Oct 2003 04:01, Gilles Detillieux wrote:
> I got thoroughly confused in reading your patch, though, because
> it is reversed, with the new code appearing in the first file and
> the old code in the second, rather than the other way around.

Oops... :)

> Taking that into account, though, the patch seems right to me. I
> think it should be committed ASAP.

Done. Could you (or someone) please confirm it and close the bug report?

Thanks,
Lachlan

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)

From: Gilles D. <gr...@sc...> - 2003-10-16 21:41:19
I tried a "make check" of the current CVS tree on both Red Hat Linux
6.2 and Red Hat Linux 9. On 6.2, it passes with flying colours, if
I remember to chmod +x test/t_url first. On RH9, I ran into a couple
different problems.
First of all, there's the infamous WordType::instance undefined error
which has dogged Mac OS X users. Has anyone yet come up with a fix for
this, other than manually linking and changing the order of the libraries?
I tried a hack to a couple test programs to help prod the linker into
loading the needed modules from the libraries, but I don't think it's
the ideal solution (it causes a warning when t_url runs the url.cc code,
and I imagine testnet would do likewise if I got Apache to run). My hack
is below.
The other problem is the 5 tests that require Apache fail because I
can't get it to run. I commented out or modified all the lines in
test/conf/httpd.conf that were causing httpd to give error messages,
but it still won't start up. Apache 2.0.40 seems to have some problems
with the conf file in our distribution, but I wasted too many fruitless
hours yesterday to figure out what it needs. Anyone else had better luck?
I don't personally consider this a showstopper for the upcoming 3.2.0rc1,
but it would be nice to have this all working reliably in the final release.
On the bright side, the RH9 build does pass the other 9 tests, and perhaps
more importantly, it has no trouble indexing the SCRC's web site.
Here's my ugly hack to get it to link on RH9...
--- test/testnet.cc.orig 2003-07-21 07:40:22.000000000 -0500
+++ test/testnet.cc 2003-10-15 13:21:44.000000000 -0500
@@ -7,6 +7,7 @@
#include "HtHTTP.h"
#include "HtHTTPBasic.h"
#include "HtDateTime.h"
+#include "WordContext.h"
#include <URL.h>
#ifdef HAVE_STD
@@ -75,6 +76,9 @@ int main(int ac, char **av)
// Flag variable for errors
int _errors = 0;
+ // Needed to satisfy linker dependencies...
+ (void) WordContext::Initialize();
+
///////
// Retrieving options from command line with getopt
///////
--- test/url.cc.orig 2003-07-21 07:40:22.000000000 -0500
+++ test/url.cc 2003-10-15 13:19:33.000000000 -0500
@@ -38,6 +38,7 @@ using namespace std;
#include "HtConfiguration.h"
#include "URL.h"
+#include "WordContext.h"
// These should probably be tested individually
@@ -114,6 +115,7 @@ static void dourl(params_t* params)
{
if(verbose) cerr << "Test WordKey class with " <<
params->url_parents << " and " << params->url_children << "\n";
+ (void) WordContext::Initialize();
HtConfiguration* const config= HtConfiguration::config();
config->Defaults(defaults);
dolist(params);
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
From: Neal R. <ne...@ri...> - 2003-10-16 04:00:29
> but maybe in future releases we could use other HTTP headers (i.e. cookies,
> language, etc.) and a pre-emptive HEAD could save time in an initial dig as
> well.

Yep... even on an initial dig, HEAD is a good idea, unless the website is almost all HTML pages with few images, which seems pretty pie-in-the-sky at this point.

> 2) I share the library with ht://Check, which massively uses this option as
> it has to retrieve every document - images too - and a HEAD call could save a
> lot of time in the initial dig. I'd love to keep the logic of the net
> library as similar as possible.
>
> Please let me know if the Retriever and Document class changes make sense
> to you guys and I will modify the code.

I think what we've had here is informative debate. You as much as anyone else wrote the networking code, so for me it's your decision. I think the new TRUE default is fine.

If you've perfected this logic in ht://Check, then we should probably consider syncing with your net code after 3.2 is done.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Gabriele B. <bar...@in...> - 2003-10-15 23:55:42
|
Cheers Neal,

>It seems like it's to our advantage to always do a HEAD call, unless it's
>an initial dig, where it is wasteful... and that the state of
>persistent_connections is irrelevant to this decision.

Let me try to understand. What you suggest is:

1) killing head_before_get
2) performing HEAD calls only in the incremental dig (either with or
without persistent connections)
3) unlinking the Head-before-Get mechanism from the persistent connections
one

If so, that could work for me (for number 1 I will go with whatever you
guys decide). I had not understood that from the earlier messages, sorry.

Even though - personally - I would not kill the attribute, because:

1) It could be useful in cases where we don't know whether a document is
parsable according to the *usual* means of exclusion (that is to say, the
URL). I know that so far we only take the content-type into consideration,
but maybe in a future release we could use other HTTP headers (i.e.
cookies, language, etc.), and a pre-emptive HEAD could save time in an
initial dig as well.

2) I share the library with ht://Check, which uses this option heavily, as
it has to retrieve every document - images too - and a HEAD call could
save a lot of time in the initial dig. I'd love to keep the logic of the
net library as similar as possible.

3) Killing the attribute would not save us from changing the code to store
information about the retrieval status in the Retriever and Document
classes (unless we intend to use some class variables - which I hate).

>I don't have a problem keeping head_before_get, as long as we make the
>default TRUE.

That's the default.

Please let me know if the Retriever and Document classes changes make
sense to you guys and I will modify the code.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in...
| http://www.prato.linux.it/~gbartolini | ICQ#129221447 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno |
|
From: Neal R. <ne...@ri...> - 2003-10-15 19:12:23
|
>2) persistent connections off: we perform a GET call and if the document
>is not what we want we simply close the connection (we anticipate it).
When persistent_connections is off and we do a GET and see a MIME type we
don't like, we just close the connection. Isn't this just a tad bit 'ugly'
and possibly abusive to the webserver? At a minimum it wastes the
webserver's time starting the GET only to have the connection closed
prematurely. It will definitely waste some processor time on the server
buffering up the data to send, as the server CPU is much faster than the
latency of the network connection. We are also causing potential
server memory allocation churn, which would affect SWAP on a highly loaded
webserver.
> I can think of this possible solution. The scenario above is still valid
> (IMHO) for the initial dig case; I would modify it for the incremental dig
> as mentioned yesterday, as follows:
>
> if "persistent_connections" (on a server basis) is set to on:
> enable persistent connections
> else
> disable them
>
> if incremental or ("head_before_get" and "persistent_connections" are both
> set to on) - I have to modify yesterday's patch a bit
> enable head before get
> else
> disable head before get
OK.. just to be absolutely clear.. if we can design logic that will
optimally set head_before_get automatically based upon the state of
persistent_connections, what is the reason for keeping it around as a
user-configurable setting?
It seems like it's to our advantage to always do a HEAD call, unless it's
an initial dig, where it is wasteful... and that the state of
persistent_connections is irrelevant to this decision.
If this is not the case, please reply with a clear example of a situation
where we have some advantage in NOT setting head_before_get automatically.
I don't have a problem keeping head_before_get, as long as we make the
default TRUE.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: <And...@wi...> - 2003-10-15 15:12:46
|
Hey. Not to harp, but I tried building the CVS a little bit ago and it
still didn't set the HAVE_SSL_H stuff right, in that I had to go into the
.h file and change the #define by hand. Sorry, I don't have the details; I
can't get to that machine right now, but I just thought I'd mention it.

a

Andy Bach, Sys. Mangler
Internet: and...@wi...
VOICE: (608) 261-5738 FAX 264-5030

"We are either doing something, or we are not.
'Talking about' is a subset of 'not'."
-- Mike Sphar in alt.sysadmin.recovery
|
|
From: Jesse op d. B. <ht...@op...> - 2003-10-15 07:49:01
|
I vote -1 for killing, if its function is described clearly and doesn't
change between any two succeeding versions of htdig, and 0 otherwise.
There's no harm in having this option, is there? If you don't want it,
just turn it off.

----- Original Message -----
From: "Gabriele Bartolini" <bar...@in...>

> > I'm with you on this one.. we should just kill head_before_get. I would
> > vote for killing it instead of hacking the logic.
>
> Hi guys, I hope that after my previous message you could change your mind.
> I vote -1 for killing this attribute.
>
> Ciao,
> -Gabriele

--Jesse
|
|
From: Gabriele B. <bar...@in...> - 2003-10-14 23:29:11
|
> I'm with you on this one.. we should just kill head_before_get. I would
> vote for killing it instead of hacking the logic.

Hi guys, I hope that after my previous message you could change your mind.
I vote -1 for killing this attribute.

Ciao,
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Gabriele B. <bar...@in...> - 2003-10-14 23:28:11
|
At 14.39 14/10/2003 -0500, Gilles Detillieux wrote:
>It would be a good idea, in the general case, for us all to learn how
>to properly override config parameters in the code, so that a server
>block or URL block definition doesn't override an internal override
Maybe I am missing something. I am not aware of a way that allows us to
override block definitions through the Configuration classes. Can you
please point it out to me? Sorry.
>doing an initial dig, then it will only ever take effect when doing an
>update (or incremental) dig with persistent connections turned on.
No no ... wait. I have never talked about turning off head before get. Let
me try and give an explanation about this parameter.
I remember we issued the 'head_before_get' attribute because of this: when
requesting a non-parsable document we generally had 3 options:
1) persistent connections on:
a) head before get on: we perform a HEAD call and notice that the
document's content-type is not what we want so we simply avoid doing the
GET call
b) head before get off: we perform a GET call but in this case we
must receive all the content returned by the server, otherwise we have to
close the connection - that's not what we want in general.
2) persistent connections off: we perform a GET call and if the document is
not what we want we simply close the connection (we anticipate it).
IMHO the 'head_before_get' attribute can make a difference in some cases
with persistent connections on, and only the webmaster can see the
difference in performance between turning it on or off. If we don't have
many multimedia files we can simply turn it off (avoiding a 'double'
call), whereas if we have big files to be downloaded (especially from the
Internet) this attribute can make a real difference, as a pre-emptive HEAD
call would tell us the type of document we are requesting and potentially
save us a big download.
>not as versed in HTTP/1.1 as you are. It seems to me that htdig should
>always be doing a HEAD before a GET when doing incremental digs through
>persistent connections.
Yes. And not only there. Even when performing an initial dig, if the user
wants it, we must enable it.
I can think of this possible solution. The scenario above is still valid
(IMHO) for the initial dig case; I would modify it for the incremental dig
as mentioned yesterday, as follows:
if "persistent_connections" (on a server basis) is set to on:
enable persistent connections
else
disable them
if incremental or ("head_before_get" and "persistent_connections" are both
set to on) - I have to modify yesterday's patch a bit
enable head before get
else
disable head before get
In this way, for an initial dig the user can choose whether to activate
persistent connections and head before get, whereas for incremental digs
the user's settings get overridden.
For me this sounds good. There can be issues regarding the way of doing it;
I thought that adding some object variables in the Retriever and Document
class would be fine. Unless there is a way of overriding specific settings
through the Configuration classes.
Please let me know.
>By the way, Gabriele, good call on the Accept-Encoding header. It's a
>simple, elegant fix to a troublesome bug. You're right that adding
>support for gzip encoding is a feature request, and not a bug fix,
>and should be done after the upcoming release (not before). Good work.
Thank you. However, following Neal's directives, could one of you try it
and let me know, so I can close the bug?
Ciao ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|