From: Mick W. <mic...@gm...> - 2005-07-18 06:35:16

I'm just wondering what the status is on the development of htdig. I
haven't seen much going on on the list for a while.

Best Regards,

- Mick

From: Neal R. <ne...@ri...> - 2005-07-19 22:11:45

3.2b6 is either done or dead. I suppose that either Geoff or I need to
finish it off, call it 3.2, and update the website. I've called for a
pseudo-vote on this several times, with silence being the general
response.

On a more positive note, Anthony Arnone (Montana State Univ. grad
student) and I have started active development of HtDig 4.0. It will be
a merge of HtDig + CLucene, with a significant amount of code for the
existing Berkeley DB based WordDb being flushed.

The main impetus for this is Unicode support and a speed and index size
improvement.

We expect to produce a decently detailed refactoring document next week
and create a 4.0 CVS branch then.

Thanks

On Mon, 18 Jul 2005, Mick Weiss wrote:
> I'm just wondering what the status is on the development of htdig. I
> haven't seen much going on on the list for a while.
>
> Best Regards,
>
> - Mick

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Manuel L. <ml...@ac...> - 2005-07-19 22:45:36

Hello,

on 07/19/2005 07:14 PM Neal Richter said the following:
> 3.2b6 is either done or dead. I suppose that either Geoff or I need to
> finish it off, call it 3.2, and update the website. I've called for a
> pseudo-vote on this several times, with silence being the general
> response.

The lack of visible activity, and the fact that many of us use htdig as
is without problems, made people believe that voting would be
irrelevant. Anyway, if you still think you need votes to go ahead, here
is a +1 on my behalf.

> On a more positive note, Anthony Arnone (Montana State Univ. grad
> student) and I have started active development of HtDig 4.0. It will
> be a merge of HtDig + CLucene, with a significant amount of code for
> the existing Berkeley DB based WordDb being flushed.
>
> The main impetus for this is Unicode support and a speed and index
> size improvement.
>
> We expect to produce a decently detailed refactoring document next
> week and create a 4.0 CVS branch then.

Great. I hope that will allow us to do things like making HtDig crawl
individual pages and only update their entries in the index. That is
what I miss most in the current HtDig version.

I make htdig crawl the static version of my site every day, but that is
not very efficient and often it is too late.

I can keep track of all pages that change and need to be reindexed, but
it is odd to make HtDig crawl the whole site just because a few pages
changed. I would be more satisfied if I could just tell htdig once an
hour to reindex a limited list of pages that changed.

--
Regards,
Manuel Lemos

PHP Classes - Free ready-to-use OOP components written in PHP
http://www.phpclasses.org/

PHP Reviews - Reviews of PHP books and other products
http://www.phpclasses.org/reviews/

Metastorage - Data object relational mapping layer generator
http://www.meta-language.net/metastorage.html

From: Christopher M. <chr...@mc...> - 2005-07-19 23:10:27

On Tue, 2005-07-19 at 19:45 -0300, Manuel Lemos wrote:
> Great. I hope that will allow us to do things like making HtDig crawl
> individual pages and only update their entries in the index. That is
> what I miss most in the current HtDig version.

I'm using htdig 3.2 for doing incremental indexing right now and it
seems to be working fine. What sort of problems are you having?

To remove a list of URLs:

    htpurge -c conf_file.conf -u list_of_urls.txt

To do an incremental index:

    echo URL_list.txt | htdig -m foo -c conf_file.conf -

(notice the trailing '-'). Making this work wasn't obvious, but I had a
bit of help from the list, and it's all working for me now.

Cheers,

Chris

--
Christopher Murtagh
Enterprise Systems Administrator
ISR / Web Service Group
McGill University
Montreal, Quebec
Canada

Tel.: (514) 398-3122
Fax: (514) 398-2017

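Chris's two commands can be combined into a small wrapper for the hourly
reindex Manuel asked about. The sketch below is illustrative only: the
function name `reindex_changed` and its arguments are made-up names, and
the `-m foo` argument and trailing `-` are copied verbatim from the
command quoted above rather than from any htdig documentation.

```shell
# Hypothetical hourly incremental-reindex wrapper (all names are
# placeholders; adjust the config and URL-list paths to your install).
reindex_changed() {
    conf=$1
    url_list=$2

    # Drop the changed URLs from the existing index first.
    htpurge -c "$conf" -u "$url_list"

    # Re-dig only the listed URLs. -m keeps htdig from re-queueing the
    # full db.docs list; the '-m foo' argument and trailing '-' mirror
    # the command quoted above.
    echo "$url_list" | htdig -m foo -c "$conf" -
}
```

A cron entry along the lines of `0 * * * * /path/to/reindex_changed.sh
conf_file.conf url_list.txt` (path assumed) would then run it against
the list of changed pages once an hour.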
From: Neal R. <ne...@ri...> - 2005-07-21 16:32:46

On Tue, 19 Jul 2005, Manuel Lemos wrote:
> > We expect to produce a decently detailed refactoring document next
> > week and create a 4.0 CVS branch then.
>
> Great. I hope that will allow us to do things like making HtDig crawl
> individual pages and only update their entries in the index. That is
> what I miss most in the current HtDig version.
>
> I make htdig crawl the static version of my site every day, but that
> is not very efficient and often it is too late.
>
> I can keep track of all pages that change and need to be reindexed,
> but it is odd to make HtDig crawl the whole site just because a few
> pages changed. I would be more satisfied if I could just tell htdig
> once an hour to reindex a limited list of pages that changed.

This should be exactly what Chris Murtagh's command does:

    echo URL_list.txt | htdig -m foo -c conf_file.conf -

-m suppresses the addition of the full list of URLs in db.docs to the
'to be requested' queue in the spider.

Please reply back if this is not addressing your needs...

Thanks

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

From: Manuel L. <ml...@ac...> - 2005-07-22 04:08:02

Hello,

on 07/19/2005 08:10 PM Christopher Murtagh said the following:
> I'm using htdig 3.2 for doing incremental indexing right now and it
> seems to be working fine. What sort of problems are you having?
>
> To remove a list of URLs:
>
>     htpurge -c conf_file.conf -u list_of_urls.txt
>
> To do an incremental index:
>
>     echo URL_list.txt | htdig -m foo -c conf_file.conf -
>
> (notice the trailing '-'). Making this work wasn't obvious, but I had
> a bit of help from the list, and it's all working for me now.

Hmmm... I had the impression, from a message posted to this list, that
when you do incremental indexing HtDig will still traverse all pages
but just performs HEAD requests to verify whether other pages were
updated. Is that what happens, or did I misunderstand the point of
this?

Another thing that confuses me about the example above is the parameter
that follows the -m switch. If it is supposed to read from STDIN, why
foo and not just - ?

Other than that, if I want to update existing index database files,
letting users search the current databases while htdig finishes, will
adding the -a switch to the htdig command line work OK when just
updating a few URLs as you suggest?

Should I follow the htdig command with the usual htmerge and htfuzzy
command calls, as in a full reindex?

If this works, that will solve my problem. If so, I plan to update my
HTDIG PHP interface class and release a new version soon, for the
benefit of all who use HTDIG with PHP.

http://www.phpclasses.org/htdiginterface

--
Regards,
Manuel Lemos

PHP Classes - Free ready-to-use OOP components written in PHP
http://www.phpclasses.org/

PHP Reviews - Reviews of PHP books and other products
http://www.phpclasses.org/reviews/

Metastorage - Data object relational mapping layer generator
http://www.meta-language.net/metastorage.html

From: Neal R. <ne...@ri...> - 2005-07-25 04:15:53

On Fri, 22 Jul 2005, Manuel Lemos wrote:
> > (notice the trailing '-'). Making this work wasn't obvious, but I
> > had a bit of help from the list, and it's all working for me now.
>
> Hmmm... I had the impression, from a message posted to this list,
> that when you do incremental indexing HtDig will still traverse all
> pages but just performs HEAD requests to verify whether other pages
> were updated. Is that what happens, or did I misunderstand the point
> of this?

It does exactly that unless you give it the -m switch, at which point
it just adds/updates the pages you specify.

> Another thing that confuses me about the example above is the
> parameter that follows the -m switch. If it is supposed to read from
> STDIN, why foo and not just - ?

I'll have to read the code to see why; it looks like a command-line
parsing problem.

> Other than that, if I want to update existing index database files,
> letting users search the current databases while htdig finishes, will
> adding the -a switch to the htdig command line work OK when just
> updating a few URLs as you suggest?

The 'alt work files' command-line option (-a) is another way to do what
you described above (with the -m switch).

> Should I follow the htdig command with the usual htmerge and htfuzzy
> command calls, as in a full reindex?

Htmerge is only necessary when you are merging two databases... the
above scenario doesn't need it.

> If this works, that will solve my problem. If so, I plan to update my
> HTDIG PHP interface class and release a new version soon, for the
> benefit of all who use HTDIG with PHP.
>
> http://www.phpclasses.org/htdiginterface

There are 'libhtdig' & 'libhtdigphp' directories in the 3.2b6 snapshot
which would enable you to directly call the htdig search functions
without a PHP wrapper.

Thanks

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

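Neal's 'alt work files' suggestion can be sketched as follows. This is
an assumption-heavy sketch: the function name `update_live`, the
rotation of `db.*.work` files, and the idea that `-a` composes with the
`-m foo`/trailing-`-` invocation quoted earlier are all inferred from
this thread, not verified against the htdig source, so check the
behavior on your own install before relying on it.

```shell
# Sketch: update a few URLs with -a (alternate work files) so searches
# against the live databases keep working while the dig runs.
# update_live and all file names here are placeholders/assumptions.
update_live() {
    conf=$1
    url_list=$2
    db_dir=$3    # your database_dir setting

    # -a is assumed to write db.*.work files instead of touching the
    # live databases; the rest mirrors the incremental command above.
    echo "$url_list" | htdig -a -m foo -c "$conf" -

    # Once the dig finishes, rotate the .work copies into place.
    for f in "$db_dir"/db.*.work; do
        [ -e "$f" ] && mv "$f" "${f%.work}"
    done
}
```

The design point is simply that searches never see a half-written
database: htsearch reads the live files until the atomic rename swaps
in the finished copies.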
From: Gustave T. Stresen-R. <ted...@ma...> - 2005-07-25 06:44:46

On Jul 25, 2005, at 5:17 AM, Neal Richter wrote:
> There are 'libhtdig' & 'libhtdigphp' directories in the 3.2b6
> snapshot which would enable you to directly call the htdig search
> functions without a PHP wrapper.

How are these libraries built? Do I just navigate to the directory and
run ./configure; make; make install? Doing just the basic ./configure;
make; make install from the root directory doesn't seem to build the
"binaries".

Also, does PHP need to be built with these libraries, or can they be
accessed by adding lines to php.ini (or at runtime using dl())?
Finally, do these libraries come with any documentation?

Thanks in advance... And please count on me providing a final Mac OS X
package when this release is ready to go.

Ted Stresen-Reuter

From: Neal R. <ne...@ri...> - 2005-07-25 17:16:46

On Mon, 25 Jul 2005, Gustave T. Stresen-Reuter wrote:
> On Jul 25, 2005, at 5:17 AM, Neal Richter wrote:
> > There are 'libhtdig' & 'libhtdigphp' directories in the 3.2b6
> > snapshot which would enable you to directly call the htdig search
> > functions without a PHP wrapper.
>
> How are these libraries built? Do I just navigate to the directory
> and run ./configure; make; make install? Doing just the basic
> ./configure; make; make install from the root directory doesn't seem
> to build the "binaries".

A 'make' in libhtdig (after ./configure in the main directory) should
give you a libhtdig.so.

In libhtdigphp you will have to ./configure (with the appropriate PHP
system config flags passed to configure), then run 'make' (which will
fail to link), and then run 'relink.sh'.

Since these are auxiliary libraries, I have not bothered to integrate
them into the base htdig make system. htdig 4.0 will probably be
structured a bit differently....

> Also, does PHP need to be built with these libraries, or can they be
> accessed by adding lines to php.ini (or at runtime using dl())?
> Finally, do these libraries come with any documentation?

At runtime, using dl().

Not much documentation, but libhtdig_api.h is fairly instructive. There
is an example PHP file in libhtdigphp.

> Thanks in advance... And please count on me providing a final Mac OS
> X package when this release is ready to go.

How has it been working for you? Any issues?

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

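Neal's build steps, collected into one sequence. This is a sketch of
what he describes, not a tested recipe: the function name, the source
path argument, and the PHP configure flags are placeholders you will
need to adjust for your system.

```shell
# Sketch of the libhtdig/libhtdigphp build sequence described above.
# build_libhtdig and the source path are hypothetical names.
build_libhtdig() {
    src=$1                  # path to the unpacked 3.2b6 snapshot
    cd "$src" || return 1

    ./configure             # top-level configure first
    make -C libhtdig        # should produce libhtdig.so

    cd libhtdigphp || return 1
    ./configure             # add your PHP system config flags here
    make                    # per the thread, this fails at the link step
    sh relink.sh            # finishes the link by hand
}
```

The resulting extension is then loaded at runtime with PHP's dl(), as
Neal notes, rather than being compiled into PHP itself.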
From: Gustave T. Stresen-R. <ted...@ma...> - 2005-07-25 17:26:49

> > Thanks in advance... And please count on me providing a final Mac
> > OS X package when this release is ready to go.
>
> How has it been working for you? Any issues?

Actually, it's been working fine, but I'm not using it in production
(and I'm only testing it on the OS X client version). However, I did
start a dig that dug thousands of pages (over the course of a couple of
days) and it went just fine, no problems at all. It's only a little
slower than a standard Linux machine of equal horsepower (but that's
more a reflection of the Mac subsystems than of the htdig source).

I built it on Panther (10.3), but it is my understanding that Tiger
(10.4) ships with a new version of the gcc compiler (not sure which),
so I don't know whether the new compiler might be causing problems for
others; my guess is not. I haven't updated yet and don't have any plans
to, so the final package will be built on 10.3.9.

Thanks for the replies to the other inquiries. I'll probably release
binaries of these packages as part of the main package as well.

Ted

From: Christopher M. <chr...@mc...> - 2005-07-26 17:55:27

On Fri, 2005-07-22 at 01:08 -0300, Manuel Lemos wrote:
> on 07/19/2005 08:10 PM Christopher Murtagh said the following:
> > To do an incremental index:
> >
> >     echo URL_list.txt | htdig -m foo -c conf_file.conf -
> >
> > (notice the trailing '-'). Making this work wasn't obvious, but I
> > had a bit of help from the list, and it's all working for me now.
>
> Hmmm... I had the impression, from a message posted to this list,
> that when you do incremental indexing HtDig will still traverse all
> pages but just performs HEAD requests to verify whether other pages
> were updated. Is that what happens, or did I misunderstand the point
> of this?
>
> Another thing that confuses me about the example above is the
> parameter that follows the -m switch. If it is supposed to read from
> STDIN, why foo and not just - ?

Yeah, I can't remember exactly why, other than that it didn't work if I
didn't do it. Sorry, it was a while ago when I set things up. A smarter
person would have documented what I did, but I was swamped and didn't.
:-)

> Other than that, if I want to update existing index database files,
> letting users search the current databases while htdig finishes, will
> adding the -a switch to the htdig command line work OK when just
> updating a few URLs as you suggest?

I use htdig for several things, including indexing the results of
PostgreSQL queries and joins. For example, if you go to:

    http://www.mcgill.ca/classified/

the search tool uses htdig, embedded inside PostgreSQL (via stored
procedures that call htdig). The same goes for:

    http://www.mcgill.ca/search/

Just about everything there uses htdig, inside PostgreSQL and with a
PHP wrapper.

Cheers,

Chris
