From: Geoff H. <ghu...@ws...> - 2001-12-29 15:13:24
On Fri, 28 Dec 2001, Daniel Escobar wrote:

> I'm building a db on two of my domains, each containing app. 2000
> files.

The size obviously depends not just on the number of files (4000 total),
but on the size of each file. If each file is several megabytes, then
it's certainly possible to take up a lot of space.

> According to what the FAQs say, it should be approximately 48 MB,
> but not over a gig!

The FAQ and the requirements page use rough estimates of typical page
sizes. Here are some things that could be going wrong:

1) Your indexing is covering more than just those two servers.
   (Check your limit_urls_to directive.)

2) You started indexing with databases that already contained other
   URLs, rather than indexing "from scratch". (If htdig is updating a
   db, it will start by checking all the URLs currently in the db and
   then ADD the new URLs to this.)

3) You have a variety of "duplicate" URLs:
   <http://www.htdig.org/FAQ.html#q4.24>

4) You have some sort of URL "loop" (this is actually one cause of #3)
   that makes it appear there are many, many more documents than there
   really are.

But let's take a look at what you sent:

  703M  ./db.docdb
  1.2M  ./db.docdb.work
   84k  ./db.docs.index
  1.1G  ./db.wordlist
  2.0M  ./db.wordlist.work
  8.8M  ./db.words.db

I don't know how you indexed this, but your .work files and your
regular files are quite different in size. This implies that you have
two different indexing runs here (normally, after indexing, the script
would move the .work files to replace the others). So which files
correspond to the "huge" ones? And how did the other set come about?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
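[Editor's note: points 1 and 2 above translate to a config attribute and a
command-line flag. A minimal sketch follows; limit_urls_to and htdig's -i
and -c options are real ht://Dig features, but the URLs and the config path
are placeholders, not Daniel's actual setup.]

```shell
# In htdig.conf, restrict the crawl to exactly the two servers:
#
#   limit_urls_to:  http://www.example1.com/ http://www.example2.com/
#
# Then index "from scratch" so URLs already in an old db are not
# re-checked and carried over (-i = initial index, erase old db):
htdig -i -c /path/to/htdig.conf
```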
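[Editor's note: a quick back-of-the-envelope check of the numbers above.
With the 4000 documents Daniel mentions, the 703 MB db.docdb works out to
roughly 180 KB of stored content per document, far above a typical page
size, which supports the "extra URLs or a loop" diagnoses:]

```shell
# Rough per-document size implied by Daniel's listing.
docs=4000               # app. 2000 files on each of two domains
docdb_kb=$((703 * 1024))  # 703M db.docdb, expressed in KB

# Integer KB per document:
echo $((docdb_kb / docs))   # prints 179 (i.e. ~180 KB per document)
```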