From: Chuck P. (C. Inc.) <chu...@co...> - 2005-02-17 01:04:22

I've setup htdig 3.2.0b6 and 3.1.6 on both Solaris 8 and Redhat 9 trying to get use_doc_date to work with my CGI scripts that output HTML. I've applied the recommended patch to the 3.1.6 as described here:

http://www.htdig.org/mail/2000/04/0156.html

I've been going nuts for two days and must be missing something simple or possibly misunderstanding the function of use_doc_date.

In my htdig.conf file I have the following lines:

# use meta date to determine a new page
use_doc_date: true

Within the head of each of the pages indexed I have something like this:

<META NAME="Date" CONTENT="2005-02-16">

I've tried using META name dc.date, dc.date.created and dc.date.modified. I've also tried dates in the format YYYYMMDD, YYMMDD, YYYY-MM-DD HH:MM:SS, YYYY MM DD, etc....

Running curl --head on the pages I'm indexing tells me that apache sets the mod date to now:

HTTP/1.1 200 OK
Date: Thu, 17 Feb 2005 00:48:58 GMT
Server: Apache/1.3.29 (Unix) mod_perl/1.29
Content-Type: text/html
X-Cache: MISS from nbcmv4.console.net

Thanks in advance!

-Chuck

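The header dump above carries a Date header, which is the time of the response, but no Last-Modified header. A minimal sketch in Python for checking that mechanically, assuming the output of curl --head has been captured as a string; the function name has_last_modified and the SAMPLE_HEAD constant are made up for illustration and are not part of htdig or curl.

```python
from email.parser import HeaderParser

# Response head copied from the curl --head output quoted in the message.
SAMPLE_HEAD = (
    "HTTP/1.1 200 OK\n"
    "Date: Thu, 17 Feb 2005 00:48:58 GMT\n"
    "Server: Apache/1.3.29 (Unix) mod_perl/1.29\n"
    "Content-Type: text/html\n"
    "X-Cache: MISS from nbcmv4.console.net\n"
)


def has_last_modified(raw_response_head: str) -> bool:
    """Report whether a raw HTTP response head includes Last-Modified."""
    # The first line is the status line, not a header, so split it off.
    _status_line, _, header_block = raw_response_head.partition("\n")
    headers = HeaderParser().parsestr(header_block)
    # Header lookup on the parsed message is case-insensitive.
    return headers.get("Last-Modified") is not None


if __name__ == "__main__":
    print(has_last_modified(SAMPLE_HEAD))
```

For the sample above the function returns False, since only the Date header is present.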
From: Jim <li...@yg...> - 2005-02-17 03:07:01

On Wed, 16 Feb 2005, Chuck Phillips (Console, Inc.) wrote:

> I've setup htdig 3.2.0b6 and 3.1.6 on both Solaris 8 and Redhat 9 trying to
> get use_doc_date to work with my CGI scripts that output HTML. I've applied
> the recommended patch to the 3.1.6 as described here:
> http://www.htdig.org/mail/2000/04/0156.html

I don't know much about this patch, but it looks like it was for 3.1.5. It is possible that it is not applicable to 3.1.6.

There is a more important patch for 3.1.6. There is a parsing problem with the code in the default distribution that is corrected with the following patch.

ftp://ftp.ccsf.org/htdig-patches/3.1.6/metadate.0

> I've been going nuts for two days and must be missing something simple or
> possibly misunderstanding the function of use_doc_date.
>
> In my htdig.conf file I have the following lines:
> # use meta date to determine a new page
> use_doc_date: true
>
> Within the head of each of the pages indexed I have something like this:
> <META NAME="Date" CONTENT="2005-02-16">
>
> I've tried using META name dc.date, dc.date.created and dc.date.modified.
> I've also tried dates in the format YYYYMMDD, YYMMDD, YYYY-MM-DD HH:MM:SS,
> YYYY MM DD, etc....

You are reindexing after each change aren't you?

Jim

From: Chuck P. (C. Inc.) <chu...@co...> - 2005-02-17 05:56:18

>> I've setup htdig 3.2.0b6 and 3.1.6 on both Solaris 8 and Redhat 9
>> trying to get use_doc_date to work with my CGI scripts that output
>> HTML. I've applied the recommended patch to the 3.1.6 as described
>> here: http://www.htdig.org/mail/2000/04/0156.html
>
> I don't know much about this patch, but it looks like it was for
> 3.1.5. It is possible that it is not applicable to 3.1.6.
>
> There is a more important patch for 3.1.6. There is a parsing problem
> with the code in the default distribution that is corrected with the
> following patch.
>
> ftp://ftp.ccsf.org/htdig-patches/3.1.6/metadate.0

My apologies, that is the patch we used. I referred to the wrong archived mail-list message. Still no change.

>> I've been going nuts for two days and must be missing something
>> simple or possibly misunderstanding the function of use_doc_date.
>>
>> In my htdig.conf file I have the following lines:
>> # use meta date to determine a new page
>> use_doc_date: true
>>
>> Within the head of each of the pages indexed I have something like
>> this:
>> <META NAME="Date" CONTENT="2005-02-16">
>>
>> I've tried using META name dc.date, dc.date.created and
>> dc.date.modified. I've also tried dates in the format YYYYMMDD,
>> YYMMDD, YYYY-MM-DD HH:MM:SS, YYYY MM DD, etc....
>
> You are reindexing after each change aren't you?

I'm reindexing after each change with a modified version of rundig that doesn't use the -i flag for htdig (I've also disabled the -a flag for the modified rundig). I've tried using the default rundig as well. No luck.

From: Jim <li...@yg...> - 2005-02-17 06:58:42

On Wed, 16 Feb 2005, Chuck Phillips (Console, Inc.) wrote:

>>> I've been going nuts for two days and must be missing something simple
>>> or possibly misunderstanding the function of use_doc_date.
>>>
>>> In my htdig.conf file I have the following lines:
>>> # use meta date to determine a new page
>>> use_doc_date: true
>>>
>>> Within the head of each of the pages indexed I have something like
>>> this:
>>> <META NAME="Date" CONTENT="2005-02-16">
>>>
>>> I've tried using META name dc.date, dc.date.created and
>>> dc.date.modified. I've also tried dates in the format YYYYMMDD, YYMMDD,
>>> YYYY-MM-DD HH:MM:SS, YYYY MM DD, etc....
>>
>> You are reindexing after each change aren't you?
>>
> I'm reindexing after each change with a modified version of rundig that
> doesn't use the -i flag for htdig (I've also disabled the -a flag for the
> modified rundig). I've tried using the default rundig as well. No luck.

You might want to try running with the -i (or manually removing the database) in order to verify that everything is being retrieved and indexed from scratch.

Using the meta tag you list above, exactly as you have typed it (except with a date that is not today), gives me the results I would expect when I index a page with use_doc_date enabled.

What exactly are you expecting when you enable the use_doc_date attribute and in what way are the results differing from your expectations?

Btw, if you run with a couple -v options tacked on, and use_doc_date is working as expected, you should see some evidence in the debug output; there should be a line that says something like 'time: 2000-01-02'.

Jim

From: Chuck P. (C. Inc.) <chu...@co...> - 2005-02-17 17:45:31

>>>> I've been going nuts for two days and must be missing something
>>>> simple or possibly misunderstanding the function of use_doc_date.
>>>> In my htdig.conf file I have the following lines:
>>>> # use meta date to determine a new page
>>>> use_doc_date: true
>>>> Within the head of each of the pages indexed I have something like
>>>> this:
>>>> <META NAME="Date" CONTENT="2005-02-16">
>>>> I've tried using META name dc.date, dc.date.created and
>>>> dc.date.modified. I've also tried dates in the format YYYYMMDD,
>>>> YYMMDD, YYYY-MM-DD HH:MM:SS, YYYY MM DD, etc....
>>> You are reindexing after each change aren't you?
>> I'm reindexing after each change with a modified version of rundig
>> that doesn't use the -i flag for htdig (I've also disabled the -a
>> flag for the modified rundig). I've tried using the default rundig as
>> well. No luck.
>
> You might want to try running with the -i (or manually removing the
> database) in order to verify that everything is being retrieved and
> indexed from scratch.
>
> Using the meta tag you list above, exactly as you have typed it (except
> with a date that is not today), gives me the results I would expect
> when I index a page with use_doc_date enabled.
>
> What exactly are you expecting when you enable the use_doc_date
> attribute and in what way are the results differing from your
> expectations?
>
> Btw, if you run with a couple -v options tacked on, and use_doc_date is
> working as expected, you should see some evidence in the debug output;
> there should be a line that says something like 'time: 2000-01-02'.

I tried manually deleting the database as well as running with -i. No change.

I expected that enabling use_doc_date would make my modified rundig (no -i, no -a) only update the index for pages that have newer meta dates.

This page for example has the following meta date tag:

<META NAME="Date" CONTENT="2005-02-16">

but it always gets reindexed.

5:28:1:http://nbcmv4.console.net/release_detail.dev.nbc/
nbcuniversalcable-20050216000000-academy-awardnomin.html: (changed)
------------------------ size = 13255

Running rundig and grepping out the times matched gives me the following:

homeplate local/htdig/bin $ sudo ./rundig.new -vvvv | grep time:
time: 2005-02-16
time: 2005-02-16
time: 2005-02-16
time: 2005-02-16
time: 2005-02-16
time: 2005-02-16

I read in an archive that this output isn't confirmation that htdig is using the meta date, that if it fails and defaults to now it will do so silently.

Thanks again,
Chuck

From: Jim <li...@yg...> - 2005-02-18 04:59:39

On Thu, 17 Feb 2005, Chuck Phillips (Console, Inc.) wrote:

> I expected that enabling use_doc_date would make my modified rundig (no -i,
> no -a) only update the index for pages that have newer meta dates.

I don't think that use_doc_date is intended to be used in this way. By the time htdig knows anything about the date/time in the meta tag it has already checked the response header, retrieved the document, and started parsing out content. In many cases I am not sure there is a lot to be gained by bailing out at that point. Not claiming that is true in your case, but in the general case you have already paid the price for hitting the network, setting up the structures for the document, and performing some amount of text parsing.

As far as I know only the date returned in the response header is considered when determining whether to reindex based on modification time.

> I read in an archive that this output isn't confirmation that htdig is using
> the meta date, that if it fails and defaults to now it will do so silently.

True. But I think that can only happen if the code fails to parse the string in the content attribute. The value of your content attribute looked fine. If what you see in the time: output is in the correct format, then I think it is safe to assume that the date was associated with the document when indexed.

Jim

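Since only the response header date is consulted, one option for CGI-generated pages may be to send a Last-Modified header built from the same date that goes into the META tag. A rough sketch in Python, not taken from htdig or from the scripts discussed here: the function name last_modified_header is hypothetical, the input is assumed to be in YYYY-MM-DD form, and the date is assumed to mean midnight UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(meta_date: str) -> str:
    """Turn a YYYY-MM-DD meta date into an HTTP Last-Modified header line."""
    # Assumption: the document date is interpreted as midnight UTC.
    stamp = datetime.strptime(meta_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    # usegmt=True yields the "GMT" suffix used in HTTP date headers.
    return "Last-Modified: " + format_datetime(stamp, usegmt=True)


if __name__ == "__main__":
    # A CGI script would print this line along with its other headers,
    # before the blank line that separates headers from the HTML body.
    print(last_modified_header("2005-02-16"))
    # prints: Last-Modified: Wed, 16 Feb 2005 00:00:00 GMT
```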
From: Janine S. <ja...@fu...> - 2005-02-18 23:32:38

On Feb 17, 2005, at 8:58 PM, Jim wrote:

> On Thu, 17 Feb 2005, Chuck Phillips (Console, Inc.) wrote:
>
>> I expected that enabling use_doc_date would make my modified rundig
>> (no -i, no -a) only update the index for pages that have newer meta
>> dates.
>
> I don't think that use_doc_date is intended to be used in this way.

So then what is the "correct" way to do this? I have a site that takes about 30 hours to fully index, so obviously I'd like to just do an update most of the time, but it sounds like this isn't going to be as easy as dropping the -i and -a. I've run that a couple of times on a subset of my pages and it looks like they are all being processed each time.

thanks,

janine

From: Jim <li...@yg...> - 2005-02-24 20:52:59

On Fri, 18 Feb 2005, Janine Sisk wrote:

> On Feb 17, 2005, at 8:58 PM, Jim wrote:
>
>> On Thu, 17 Feb 2005, Chuck Phillips (Console, Inc.) wrote:
>>
>>> I expected that enabling use_doc_date would make my modified rundig (no
>>> -i, no -a) only update the index for pages that have newer meta dates.
>>
>> I don't think that use_doc_date is intended to be used in this way.
>
> So then what is the "correct" way to do this? I have a site that takes about
> 30 hours to fully index, so obviously I'd like to just do an update most of
> the time, but it sounds like this isn't going to be as easy as dropping the
> -i and -a. I've run that a couple of times on a subset of my pages and it
> looks like they are all being processed each time.

The technique htdig uses to determine whether a document needs to be reindexed involves the 'If-Modified-Since:' header. If the server you are contacting respects this header and returns correct last-modified dates when documents are retrieved, then dropping the -i option should prevent unmodified documents from being reindexed.

If the server is not configured to meet these conditions, or the pages are dynamically generated in a manner that results in there being no associated date of last-modification, then htdig assumes that the document needs to be reindexed.

Jim
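
For dynamically generated pages, the server-side half of the 'If-Modified-Since:' exchange has to be handled by the script itself. A minimal sketch of that decision in Python, assuming the script knows when its content last changed; this is not htdig code, and the function name needs_full_response is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_full_response(if_modified_since, last_modified):
    """Return True to send the page (200), False to answer 304 Not Modified.

    if_modified_since: raw header value sent by the client, or None.
    last_modified: timezone-aware datetime of the document's last change.
    """
    if not if_modified_since:
        # No conditional header: always send the full document.
        return True
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable date: treat the request as unconditional.
        return True
    if since.tzinfo is None:
        # Assumption: a date without a usable zone is taken as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # Only changes made after the client's copy require a full response.
    return last_modified > since


if __name__ == "__main__":
    doc_changed = datetime(2005, 2, 16, tzinfo=timezone.utc)
    print(needs_full_response("Thu, 17 Feb 2005 00:48:58 GMT", doc_changed))
```

When the function returns False, the script would reply with status 304 and no body, which is the case in which an indexer relying on this header can skip the document.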