From: Antonio Merker <merker@mv...> - 2008-06-17 13:02:24
I've been looking into HE for a few days now, and have set up a crawler
configuration, because I need to index dynamically generated webpages as
well as static pages. A few questions remain:
1) Incremental updates
Incremental updates are possible with "estcmd gather" by using the
"-sd" argument. Is there such an option for the crawler, or any way
to avoid re-checking every page? After all, a Last-Modified header is
sent via HTTP, as far as I know.
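For comparison, this is roughly how I update a filesystem index incrementally (the "-cm" flag, which as I read the estcmd documentation skips files whose modification time has not changed, may need double-checking; "casket" and the document directory are placeholders):

```shell
# Incremental filesystem indexing: -sd records each file's modification
# date, -cm skips files whose mtime is unchanged since the last run,
# -cl cleans up regions of overwritten documents.
estcmd gather -cl -sd -cm casket /var/www/htdocs
```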
Example: After adding a new word to a static HTML page, running the
crawler with no arguments, with "-revisit", or with "-revcont" does not
seem to include the new word in the index. Only "-restart" works, but
I'd like to avoid re-indexing every page when most of them did not
change.
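For the record, the invocations I tried were along these lines (the crawler root path is a placeholder):

```shell
# Full re-crawl: picks up the changed page, but re-indexes everything.
estwaver crawl -restart /path/to/crawlerroot

# Revisit runs: neither of these picked up the change for me.
estwaver crawl -revisit /path/to/crawlerroot
estwaver crawl -revcont /path/to/crawlerroot
```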
2) Old entries / purging
Whenever I delete an HTML page and run the crawler with "-restart"
(see question 1), the page is still in the index. How can I remove old
pages from a crawled index?
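The only workaround I have found is removing known-dead URLs one by one with "estcmd out"; I am not sure this is the intended approach, and the "_index" path inside the crawler root is an assumption on my part:

```shell
# Remove a single document from the index by its URI; -cl also cleans
# up the regions freed by the deletion. The URL is just an example.
estcmd out -cl /path/to/crawlerroot/_index http://example.com/deleted-page.html
```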
Thanks a lot!