[srvx-devel] next generation database approach
From: Entrope <en...@us...> - 2002-01-10 03:22:23
There was some discussion today about what kind of database srvx should use in its reincarnation. There are three questions as I see them: whether to use multiple databases (versus just one), what kind of database to use, and how to give other processes access to the data.

The question of multiple databases is mostly an issue if we have in-memory, write-all-at-once databases (which we do in srvx-1.x). Reading or writing all the databases for GamesNET at once can take a few seconds, even on a 1 GHz Athlon. If separate databases are used, it is easier for them to get out of sync. So this question mostly depends on the next one.

I've seen three suggestions for the kind of database to use:

- The current recdb code (or a faster version of it, such as saxdb in the CVS HEAD). This is a hierarchical, string-keyed, human-readable database with several types of data. The limitations we have seen with it are: long read and write times; all the data being in memory at once (which will not be much of a problem until we grow a lot); and the lack of any way to do IPC operations on the data. (Any other process would have to talk to it through IRC, which is ugly and unreliable.)

- An out-of-process relational database, such as an SQL server. This eliminates the read and write time issues, moves the problem of what data to keep in memory to another process, and easily allows others to access the data. However, it has drawbacks: any read or write becomes much more expensive, due to the context switch (or even network access); we need to add locking logic in many places; and our data model is constrained by the database's schema (or by how we talk to it). To be more precise, updating the database schema is hard.

- A flat database with a relational-information and serialization layer on top. This is approximately how my proposed "oodb" is structured; Berkeley DB (or something similar) could provide the flat database operations.
It solves the long read and write times (by writing back only some of the dirty data at a time, it amortizes storing data over time) and allows us to control how much data we keep in memory. Direct access to the database from other processes is probably harder than with SQL.

The third issue is how to give other processes access to the data. If we use a standard format (or protocol) to store the data, other processes can access it directly. If we use a proprietary format (or protocol), we would have to provide some sort of IPC bridge. This might be through IRC (eww), or through some other direct TCP connection. The drawbacks of giving other processes direct access to the data are that we need rigid (and correct) rules about how to keep the database consistent (this probably requires transactions) and about how to lock parts of the database (or the whole thing) while working on it, and that it is hard to extend the format. If we provide an IPC bridge, we can provide atomic operations and an extensible format through that interface. (SOAPy RPC over XML over HTTP! Zoom zooooom!!!</winer>)

If we look at what's practical, I see these choices:

1) Multi-file saxdb with IPC bridge - Status quo
2) One-file saxdb with IPC bridge - Probably too slow
3) SQL server with direct access - Buzzword compliant!
4) SQL server with IPC bridge - Almost no point in having the DB out of process
5) oodb on Berkeley DB with IPC bridge - Possible code reuse advantages; freedom of storage format
6) oodb on Berkeley DB with direct access - Possible code reuse advantages
7) oodb with a proprietary DB with IPC bridge - Best control over caching and writing back data, but much code

Right now we use (1), but I think we want to get rid of it, and we probably don't want (2) either. Did I leave anything out?

-- Entrope
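For concreteness, the recdb/saxdb option stores nested string keys in a human-readable text file. The fragment below is only a hypothetical illustration of that style; the keys and values are made up, not taken from a real srvx database:

```text
"ChanServ" {
    "#example" {
        "founder" "entrope";
        "registered" "1010632943";
    };
};
```

The whole tree has to be parsed into memory on startup and rewritten in full on save, which is where the long read and write times come from.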