Hello, everybody,
I've taken first steps to actually get rid of invlib.c in the medium term
and replace it by a real database layer. I used Berkeley DB to do it.
I gave up on the original plan of properly separating invlib.c from the
rest of the code to allow for easy plug-n-play replacement of the search
layer implementation. For now. The current find.c and invlib.c are just
too tangled together make that effort worthwile. Efforts to make the DB
contents cross-platform compatible have not begun yet.
So I'm now eating up invlib.c from inside, adding DB code everywhere.
Eventually, I'll put #ifdef's around the original invlib.c, and if it
turns out to work, it'll be turned off, and even later yet, removed. Or
maybe there'll be two source files, one of which is chosen for the build
by autoconf, like we already do it with fscanner.l <-> scanner.l.
In a first prototype, I replaced only *.in.out, the actual inverted index,
by a DB. The *.po.out is really just an on-disk binary array of raw data,
and I left it alone for this. I used DB type BTree so we can have ordered
travel through the index, which is needed for regular expression queries
instead of fixed names. The DB is designed like this:
record['name'] = {first_posting ; n_postings}
where first_posting, n_postings index into the *.po.out file just like
invlib did. Results were quite promising: it could reproduce all the
important query results (checked against invlib). The DB was the same
size as *.in.out, to within a few percent.
But that only helps with reading the database. In the longer run, the
creation process should be re-written, too. In particular, I think
build() and invmake() should stop using the external "sort" command. The
temporary file written by build(), sorted, and then passed to invmake(),
will go away, along with the funny encoding of numbers in base 96 (128 - '
') or base-224 (256 - ' ') to make them sortable. I expect the database to
deal with that, instead. BTrees are inherently sorted, after all.
Which lead to the next (and current) prototype: it still uses the
externally sorted file for input, but the data are now
record['name'] = {struct POSTING}
with more than one record for the same name allowed. Searches through
this can walk either from 'name' to 'name', or through all the POSTING
structs stored for a given 'name', in order of insertion. Which currently
means order of the input file. This also worked well after some tweaking.
But it showed the first potential drawback, too: the database is rather
larger than the combined size of *.in.out and *.po.out. E.g. for cscope's
own sources, I had about 220 KB of *.in.out, and 600KB of *.po.out. The
*.db.out file was 1.7 MB, i.e. more than twice as large as the two
combined. Inspection of the DB file hints that the way BerkeleyDB handles
long list data for the same key ('name') is the culprit. At some point,
it'll move those to separate pages in the file, which will hardly ever be
even nearly full. Thus the overhead.
One way around this would be to switch to a single solid array of POSTING
structs per 'name'. Another might be to go back to the first prototype
version, but have a separate table for the POSTINGs to replace the
*.po.out file, using DB mode "RecNo", i.e. essentially just a disk-based
array.
Even the cscope.out file itself could be moved or at least mapped into the
DB, using a RecNo table backed by an external file. I.e. in the end
'cscope -q' may only need a single file, instead of three. This should
solve all possible complications raised by the need to keep the three file
names in synch, in all cases. On the back side, this would mean that the
DB is utterly incompatible with every other existing version of cscope.
So, opinions, please: how bad would it be if the cscope DBs had to become
at most a factor of 3 larger, for the benefit of having maintainable
invlib.c code? I have no idea yet how search speed will be influenced,
but wouldn't expect any large effects in either direction.
--
Hans-Bernhard Broeker (broeker@...)
Even if all the snow were burnt, ashes would remain.
|