From: Neal R. <ne...@ri...> - 2002-02-07 04:02:09
> Can we come up with other types of Retriever classes beyond the
> "here's a document in memory, index it" and "here's a URL, fetch it,
> check status, index and spider" approaches?

Not sure. Anyone? Could some ability to fetch documents over Samba
connections be useful? I've got working BasicDocument & TextCollecter
classes I'll post soon.

> What do you present in the search results? How does a user select a
> particular document--is it a link to fetch the document based on the
> DocID? This may help for the people who've asked if htdig could not
> only fetch the document but leave a local copy, a la the Google
> "Cached Results" feature.

The search results will be fetched via another set of classes. I'm
adapting the current htsearch query & display classes to have a
per-document API. As each result is fetched, the 'URL' is in effect a
pointer to an XML document, which is parsed and displayed with PHP &
XSLT. The 'URL' as it stands is not usable as a separate entity, at
least for this application.

One idea worth considering, along the lines of the Google "Cached
Results" feature, would be to offload all spidering duties to code like
'httrack', then index the files in the database. With a log file
produced during httrack spidering, a second CACHED_URL could be filled
with the location of the local copy, while the source URL is preserved.
httrack is built to spider and save web pages so that everything needed
and linked to in a page is available locally, with relative links. It's
pretty well maintained and well thought of; maybe you'd rather leave
the maintenance of spidering code to that project instead.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
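P.S. A minimal sketch of the kind of Retriever abstraction discussed above, in C++ since that's what htdig is written in. The class and member names here (Retriever, BasicDocument, collect, and the MemoryRetriever subclass) are illustrative assumptions only, not the actual htdig API or the classes I have working:

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed shape: each fetched document carries its source location and
// raw contents; the location could later be paired with a CACHED_URL.
struct BasicDocument {
    std::string location;   // source URL or path
    std::string contents;   // raw document text
};

// A Retriever hands documents to the indexer through one interface,
// regardless of where they came from (memory, HTTP, Samba, ...).
class Retriever {
public:
    virtual ~Retriever() = default;
    // Fetch zero or more documents and append them to 'out'.
    virtual void collect(std::vector<BasicDocument>& out) = 0;
};

// The "here's a document in memory, index it" case.
class MemoryRetriever : public Retriever {
public:
    explicit MemoryRetriever(BasicDocument doc) : doc_(std::move(doc)) {}
    void collect(std::vector<BasicDocument>& out) override {
        out.push_back(doc_);
    }
private:
    BasicDocument doc_;
};
```

A URL-based or Samba-based retriever would then be another subclass that does its fetch-and-status-check inside collect(), leaving the indexing side unchanged.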