From: Geoff H. <ghu...@ws...> - 2002-02-07 05:09:22
At 8:58 PM -0700 2/6/02, Neal Richter wrote:
> Not sure. Anyone? Could be some ability to fetch documents over
> samba connections be usefull?

That seems like file:// or some other external Transport mechanism,
but not necessarily a new Retriever.

> One idea worth consideration along the lines of the
> "Google cached" document feature would be to offload all spidering duties
> to code like 'httrack', ...
> It's pretty well maintaned and well thought of, maybe you'd rather
> leave the maintenance of spidering code to that project instead.

The catch is that some parsing requires "spidering." The full HTML
4.0 specification includes the ability to link to metadata in a
document, e.g.

<LINK rel="DC.identifier" type="text/plain"
      href="http://www.ietf.org/rfc/rfc1866.txt">

There's also a "longdesc" attribute for some tags:

<IMG src="sitemap.gif" alt="HP Labs Site Map" longdesc="sitemap.html">

I think there's also much to be said for handling the spidering along
with the indexing in the network-centric case. It allows some
balancing of server load, keeping track of things like backlinks,
link text, etc.

-Geoff
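To illustrate the point about parsing requiring spidering, here is a minimal sketch (in Python, using the standard-library HTML parser; not the project's actual code, and the class name and `to_fetch` queue are made up for this example). It shows how a parse of the page itself surfaces new URLs, the <LINK> href and the longdesc target, that the spider must then fetch:

```python
# Sketch only: why parsing and spidering intertwine.
# A parser that honors HTML 4.0 <LINK> metadata and "longdesc"
# attributes discovers URLs that must go back to the spider's
# fetch queue -- an external spider like httrack would not know
# the indexer wants them.
from html.parser import HTMLParser

class MetadataLinkParser(HTMLParser):
    """Collect URLs that only a full parse of the page reveals."""
    def __init__(self):
        super().__init__()
        self.to_fetch = []  # URLs the spider should retrieve next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "link" and "href" in attrs:
            # e.g. rel="DC.identifier" pointing at external metadata
            self.to_fetch.append(attrs["href"])
        if "longdesc" in attrs:
            # e.g. <IMG longdesc="sitemap.html">
            self.to_fetch.append(attrs["longdesc"])

doc = ('<LINK rel="DC.identifier" type="text/plain" '
       'href="http://www.ietf.org/rfc/rfc1866.txt">'
       '<IMG src="sitemap.gif" alt="HP Labs Site Map" '
       'longdesc="sitemap.html">')
p = MetadataLinkParser()
p.feed(doc)
print(p.to_fetch)
# -> ['http://www.ietf.org/rfc/rfc1866.txt', 'sitemap.html']
```

In a combined spider/indexer these discovered URLs would feed straight into the crawl queue, which is exactly the coupling that makes delegating spidering to a separate tool awkward.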