htdig-dev Mailing List for ht://Dig (Page 98)

On Tue, 29 Jan 2002, Neal Richter wrote:

> defined.. virtual void functions, etc.  The current Retriever is highly
> built around the idea of webpages and HTTP (of course)...

The Retriever class isn't really built around much of anything IMHO. It
requires that documents have a URL and that the URLs can be grouped into
Server objects. 

In the 3.1 code, the Document class was bound to webpages and HTTP. In the
3.2 code, the Document class is more of a "branch point," picking
Transport and Parser objects as needed.

> you could write a retriever to get docs directly out of a database,
> from a file, scp, POP, via parameters, etc.

The distinction I drew when working on indexing towards the beginning of
the 3.2 code was between the Retriever (the spider itself) and the
Transport. The latter basically handles a specific URL scheme. So there's
now HtHTTP, HtFile, HtNNTP and External transport classes. It sounds like
you're talking more about Transport-type concepts. 
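
To make the Retriever/Transport split concrete, here is a minimal C++ sketch of picking a transport from the URL scheme. It is an illustration only, not the actual ht://Dig code: the class names (SketchTransport and its subclasses), the helpers scheme_of() and pick_transport(), and the rule that any unrecognized scheme falls through to an external handler are all assumptions made for this example.

```cpp
#include <cctype>
#include <memory>
#include <string>

// Hypothetical stand-ins that loosely mirror the transport classes named
// above; these are NOT the real ht://Dig class definitions.
struct SketchTransport {
    virtual ~SketchTransport() = default;
    virtual std::string name() const = 0;
};
struct SketchHttp : SketchTransport {
    std::string name() const override { return "http"; }
};
struct SketchFile : SketchTransport {
    std::string name() const override { return "file"; }
};
struct SketchNntp : SketchTransport {
    std::string name() const override { return "nntp"; }
};
struct SketchExternal : SketchTransport {
    std::string name() const override { return "external"; }
};

// Return the lower-cased text before the first ':' or "" if there is none.
std::string scheme_of(const std::string& url) {
    const std::string::size_type pos = url.find(':');
    if (pos == std::string::npos) return "";
    std::string scheme = url.substr(0, pos);
    for (char& c : scheme) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return scheme;
}

// Choose a transport for the URL. Assumption for this sketch: schemes that
// are not handled internally are delegated to an external handler, and a
// string with no scheme at all yields a null pointer.
std::unique_ptr<SketchTransport> pick_transport(const std::string& url) {
    const std::string scheme = scheme_of(url);
    if (scheme.empty()) return nullptr;
    if (scheme == "http") return std::unique_ptr<SketchTransport>(new SketchHttp());
    if (scheme == "file") return std::unique_ptr<SketchTransport>(new SketchFile());
    if (scheme == "news" || scheme == "nntp") {
        return std::unique_ptr<SketchTransport>(new SketchNntp());
    }
    return std::unique_ptr<SketchTransport>(new SketchExternal());
}
```

Under these assumptions, a spider loop would only ever call pick_transport(url) and never needs to know which protocol is behind a given document, so a POP or database source becomes one more transport rather than a new retriever.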

Again, I think it's the URL that's the critical point. Otherwise how are
the search results useful? How do you "jump to" a particular result from
the output? The databases tie the URL to the DocID which is used
internally, but this doesn't seem particularly useful to the outside
world. Maybe I'm misunderstanding you.
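
The URL/DocID relationship mentioned above can be sketched as a small two-way lookup. This is a hedged illustration, not the ht://Dig database layout: the class name UrlIndex, its methods id_for() and url_for(), and the choice of sequential integer IDs are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical two-way mapping between URLs and internal document IDs.
class UrlIndex {
public:
    // Return the existing ID for this URL, or assign the next free one.
    int id_for(const std::string& url) {
        const auto it = ids_.find(url);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(urls_.size());
        urls_.push_back(url);
        ids_.emplace(url, id);
        return id;
    }

    // Map an internal ID back to its URL; returns "" for an unknown ID.
    std::string url_for(int id) const {
        if (id < 0 || static_cast<std::size_t>(id) >= urls_.size()) return "";
        return urls_[static_cast<std::size_t>(id)];
    }

private:
    std::unordered_map<std::string, int> ids_;  // URL -> DocID
    std::vector<std::string> urls_;             // DocID -> URL
};
```

The point of the sketch is that the ID is purely internal: a search result is only useful to the outside world once url_for() turns it back into something a user can jump to.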

> unix-style mail files, 
> XML files (given a spec.. see XSLT), 
> ...
> other document formats..

Certainly all of these would work fine within the current Parser class
with perhaps some additional revision. The current trend has been to cut
out Parser classes in favor of the external parsers and external
converters. Either approach can work *if* the parsers are
maintainable. (The previous PostScript and PDF parsers weren't.)

If it seems that the HTML parser is somehow "special" it's simply that the
other remaining parser classes are much simpler. There's very little left
in the 3.2 code that cares whether it's HTML or XML or whatever.

> 	Again, as I write and test this stuff I'll forward .tgz files with
> a script to do the setup and diff-ing.  Feel free to use it or pipe it to
> /dev/null if all you want is a web-crawling search engine. ;-)

Your work is appreciated. I'm just trying to point out a few things as
someone who's been around for a while.

1) We've been moving in this direction with 3.2 and for most purposes,
IMHO it's already there. Certainly if you have other suggestions, feel
free to contribute.

2) It's better not to reinvent the wheel. The less code that needs to be
maintained, generally the better. Do we really need new Retriever classes,
or do we need to refactor what we have?

3) There are differing philosophies on the Parser class and what should be
internal ht://Dig code and what should be plugged through the external
parsers and converters.

(As for #3, I'm personally all for new Parser subclasses if they won't
become headaches the way the old PDF.cc did.)

-Geoff

htdig-dev — Developer Discussion for the ht://Dig project