Re: [htdig-dev] binary document-database format questions

On Wed, 4 Sep 2002, Walantis Giosis wrote:

> The ID bytes for length information (excerpt length, document size, URL
> length) vary. Say we have a document size of less than 100h bytes.
> Then the ID byte has the value 44h for that information. The size
> needs only one byte. If the size is 100h bytes or more (it needs two or
> more bytes) then the ID byte has the value 84h. What's the logic
> behind this? Is it only to determine the byte count for the size? At the
> moment I've handled it using a switch/case statement.

Hans-Peter Nilsson rewrote the Serialize/Deserialize routines very
carefully, so I can't speak authoritatively.  I think he was trying to
save as much space as possible. AFAICT, the ID byte carries a marker
indicating how many bytes the value that follows it occupies.

Take a look at htcommon/DocumentRef.cc::Serialize() to see the code.
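
A minimal sketch of how such a marker could be decoded, going only by the
two values quoted above (44h and 84h for the same field). It assumes that
bit 40h means "value fits in one byte", bit 80h means "value fits in two
bytes", no marker means a full four-byte value, and the low six bits hold
the field number. The names (kCharSizeBit, splitIdByte, readValue) and the
little-endian byte order are hypothetical and not taken from
DocumentRef.cc, so check them against Serialize() before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout of the ID byte, inferred from the values 44h and 84h.
// These constants are illustrative, not copied from DocumentRef.cc.
constexpr std::uint8_t kCharSizeBit  = 0x40;  // value stored in 1 byte
constexpr std::uint8_t kShortSizeBit = 0x80;  // value stored in 2 bytes
constexpr std::uint8_t kFieldMask    = 0x3F;  // low six bits: field number

struct IdByte {
    unsigned    field;  // which record field follows
    std::size_t width;  // how many bytes the value occupies
};

// Split an ID byte into its field number and the width of the value.
inline IdByte splitIdByte(std::uint8_t id) {
    IdByte result;
    result.field = id & kFieldMask;
    if (id & kCharSizeBit) {
        result.width = 1;
    } else if (id & kShortSizeBit) {
        result.width = 2;
    } else {
        result.width = 4;  // assumption: unmarked values are 4 bytes wide
    }
    return result;
}

// Read a value of the given width (at most 4 bytes) from buf.
// Little-endian order is an assumption made for this sketch only.
inline std::uint32_t readValue(const unsigned char* buf, std::size_t width) {
    assert(buf != nullptr);
    assert(width <= 4);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < width; ++i) {
        value |= static_cast<std::uint32_t>(buf[i]) << (8 * i);
    }
    return value;
}
```

Under these assumptions, splitIdByte(0x44) and splitIdByte(0x84) both
report field 4, with widths 1 and 2 respectively, which would replace the
switch/case over every possible ID value with two bit tests.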

> And why is the document size information stored twice in the database ?

They should be different. See htcommon/DocumentRef.[cc,h], which deal with
the document DB records. In particular, there's the text size of the
document itself and, optionally, the size of the document including all
of its images.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/