From: Neal R. <ne...@ri...> - 2004-11-05 16:46:42
Hi Manuel,

htdig -i forces a 'from scratch' recrawl.

By default, htdig traverses the existing index and issues HEAD requests
to see whether each page has changed, which is exactly what you described
below.

Please make sure you have 'head_before_get' enabled.

What version are you using?

Thanks

On Thu, 4 Nov 2004, Manuel Lemos wrote:

> Hello,
>
> I tried the general list but it seems nobody could help. Let's see if anybody
> can help here:
>
> I have been using htdig for years to crawl a site that now has over
> 10.000 pages. Since it may go through many changes in the pages I have been
> reindexing the whole site once on a daily basis.
>
> However this lazy indexing approach is taking too much resources.
> Therefore I am looking into a better approach of keeping a list of only
> the pages that have changed and just reindex those pages in much shorter
> cycle than what I am doing.
>
> My question is how can I reindex just a few pages at once and merge the
> crawled pages with a previously indexed site database? I mean, index
> only a few pages that I list and only follow links to site pages that
> were not yet indexed.
>
>

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
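
The update behaviour described in the reply (ask with HEAD first, fetch the
full page only if it changed) can be sketched roughly as below. This is an
illustration in Python of the general idea, not htdig's actual code: the
function name `needs_refetch` and the `indexed_at` timestamp are hypothetical,
and it assumes the server sends a Last-Modified header in its HEAD response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_refetch(last_modified: Optional[str], indexed_at: datetime) -> bool:
    """Decide, from a HEAD response, whether a full GET is worth issuing.

    last_modified: raw Last-Modified header value, or None if absent.
    indexed_at:    timezone-aware time the page was last indexed
                   (hypothetical; a crawler would keep this in its database).
    """
    if not last_modified:
        # No header to compare against: fetch to be safe.
        return True
    try:
        modified_at = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable date: treat the page as possibly changed.
        return True
    if modified_at.tzinfo is None:
        # Dates given with a "-0000" offset parse as naive; treat them as UTC.
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    # Refetch only when the page is newer than our indexed copy.
    return modified_at > indexed_at
```

A crawler following this pattern would call the check once per known URL and
skip the GET (and the reparse) whenever it returns False, which is where the
savings over a full 'from scratch' run come from.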