Re: [htdig] htdig 3.1.6 won't stay on site

Actually, I *think* we've found the answer.  The problem is that when
updating an existing database, htdig first re-follows ALL the links already
stored in it, and only *then* goes back to index anything else on our
original list, staying on the specified sites.
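
To illustrate the ordering I mean, here is a toy Python model.  It is not
htdig's actual code: the function names and example URLs are made up, and
the plain substring test only stands in for whatever matching htdig really
does for limit_urls_to.

```python
def in_limits(url, limits):
    """Toy stand-in for limit_urls_to: accept a URL containing any pattern."""
    return any(pattern in url for pattern in limits)


def update_queue(existing_urls, start_urls, limits):
    """Build the visit order for an update dig, as described above.

    Assumption: URLs already in the database are queued first and are NOT
    re-checked against the limits; only the start list is filtered.
    """
    queue = list(existing_urls)   # revisited regardless of limits
    seen = set(queue)
    for url in start_urls:        # then the original list, with limits applied
        if url not in seen and in_limits(url, limits):
            queue.append(url)
            seen.add(url)
    return queue


start = ["http://a.example/", "http://b.example/"]
existing = ["http://a.example/", "http://offsite.example/page"]
order = update_queue(existing, start, limits=start)
print(order)
```

In this model the off-site URL already in the database gets revisited on
every update, while an initial dig (empty database) only ever sees the
start list.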

I can overcome this by running htdig with the -i option every time.  While
that works, it's really inefficient, since it will still download
everything, rather than simply checking timestamps first on existing info.

In digging through the lists, it seems that's considered a "feature".
Personally, I don't see it as a feature, since I never crawled those sites
with anything except the limit_urls_to option set, but since I'm not
coding it, I suppose there's not much I can do about it.

Thanks.

> That's weird.  We had problems like this in the older 3.2 betas, where the
> limit_urls_to pattern got crammed into a very large regular expression,
> which failed when the expression got too large.  The 3.1 code, on the
> other hand, uses the StringMatch class to handle limit_urls_to, and I
> don't know of any problems with really large patterns in StringMatch.
> Indeed, it's supposed to allocate a pattern table big enough to handle
> the worst case scenario for the size of string it's given.  Still,
> I suppose it's not impossible that it chokes on really big patterns.
> Can you find out what the breaking point is, after which it stops limiting
> htdig to the list of URLs you want?
>
> > I've been using htdig for a little while, and I've recently been alerted
> > to an indexing "issue", which I'm hoping someone might be able to help with.
> > We have a list of about 800 sites we need to index.  If I run a small
> > subset (10 or 20 sites), they index fine.  However, when I index the full
> > 800, I find that htdig no longer stays on the site - that is, it seems to
> > crawl off-site links as well (which is definitely a problem for us).
> >
> > I have "limit_urls_to: ${start_url}" set in both my htdig.conf and a
> > separate scitechdb.conf (science & technology database) file.  I'm
> > actually using a multidig configuration (we index a few other small sites
> > on the same server as different databases), which otherwise works well.
> >
> > I'm wondering if there is an issue with indexing large amounts of data - a
>