Documentation to supplement the primary source, which is
Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes. 2nd edition. Morgan Kaufmann Publishers, San Francisco.
Further suggestions of sources to integrate / emulate (thanks to Serge Heiden for these):
For the various index files of CQP, to start I would recommend:
> IMS Corpus Workbench "CQP Corpus Administrator's Manual", Oliver Christ, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung, 1994 (see p. 14 for a partial overview of the index architecture). A copy is here:
It would be nice to produce documents like the ones Stefan Evert has written for the NXT Search engine:
A) a CQP object model justifying a detailed description of the index file architecture (like the schema on p. 14 of the "CQP Corpus Administrator's Manual", but with real file names to begin with)
Like this document:
Formal specification of the NITE Object Model, the abstract data model used by the NITE XML Toolkit.
B) a CQL formal specification
Like this document:
Formal specification of NiteQL, the query language that operates over data conforming to the NITE Object Model.
I once started a list of all the CQL syntax features I know of in a Google Doc, but it hasn't evolved into anything readable:
These binary formats are now obsolete, so we won't waste time documenting them. We'll document the new formats as they are written.
The new index format for v4, by the way, won't follow Witten et al. in detail.