From: Neal R. <ne...@ri...> - 2002-02-07 04:02:09
> Can we come up with other types of Retriever classes beyond the
> "here's a document in memory, index it" and "here's a URL, fetch it,
> check status, index and spider" approaches?

Not sure. Anyone? Could some ability to fetch documents over Samba
connections be useful? I've got working BasicDocument & TextCollecter
classes I'll post soon.

> What do you present in the search results? How does a user select a
> particular document--is it a link to fetch the document based on the
> DocID? This may help for the people who've asked if htdig could not
> only fetch the document but leave a local copy, a la the Google
> "Cached Results" feature.

The search results will be fetched via another set of classes. I'm
adapting the current htsearch query & display classes to have a
per-document API. As each result is fetched, the 'URL' is in effect a
pointer to an XML document, which is parsed and displayed with PHP &
XSLT. The 'URL' as it stands is not usable as a separate entity, at
least for this application.

One idea worth considering, along the lines of the Google "Cached
Results" feature, would be to offload all spidering duties to code like
'httrack', then index the files in the database. With a log file
produced during httrack spidering, a second CACHED_URL could be filled
with the location of the local copy, while the source URL is preserved.
httrack is built to spider and save web pages so that everything needed
and linked to in a page is available locally, with relative links. It's
pretty well maintained and well thought of; maybe you'd rather leave
the maintenance of spidering code to that project instead.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
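P.S. A minimal sketch of the kind of Retriever abstraction discussed above, in C++ since that's what htdig is written in. The class and member names here (Retriever, BasicDocument, collect, and the MemoryRetriever subclass) are illustrative assumptions only, not the actual htdig API or the classes I have working:

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed shape: each fetched document carries its source location and
// raw contents; the location could later be paired with a CACHED_URL.
struct BasicDocument {
    std::string location;   // source URL or path
    std::string contents;   // raw document text
};

// A Retriever hands documents to the indexer through one interface,
// regardless of where they came from (memory, HTTP, Samba, ...).
class Retriever {
public:
    virtual ~Retriever() = default;
    // Fetch zero or more documents and append them to 'out'.
    virtual void collect(std::vector<BasicDocument>& out) = 0;
};

// The "here's a document in memory, index it" case.
class MemoryRetriever : public Retriever {
public:
    explicit MemoryRetriever(BasicDocument doc) : doc_(std::move(doc)) {}
    void collect(std::vector<BasicDocument>& out) override {
        out.push_back(doc_);
    }
private:
    BasicDocument doc_;
};
```

A URL-based or Samba-based retriever would then be another subclass that does its fetch-and-status-check inside collect(), leaving the indexing side unchanged.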