xdbm Code

XDBM is an XQuery implementation on the W3C's EXI binary XML format.

Status: Planning

Brought to you by: mettw

Tree [r3] / History

HTTPS access

File	Date	Author	Commit
doc	2013-12-31	mettw	[r2] Deleted non-source files
include	2013-12-31	mettw	[r2] Deleted non-source files
lib	2013-12-31	mettw	[r2] Deleted non-source files
utils	2013-12-31	mettw	[r2] Deleted non-source files
AUTHORS	2013-12-31	mettw	[r1] Initial commit
COPYING	2013-12-31	mettw	[r1] Initial commit
INSTALL	2013-12-31	mettw	[r1] Initial commit
Makefile.am	2013-12-31	mettw	[r1] Initial commit
README	2013-12-31	mettw	[r3]
config.guess	2013-12-31	mettw	[r1] Initial commit
config.sub	2013-12-31	mettw	[r1] Initial commit
configure	2013-12-31	mettw	[r1] Initial commit
configure.in	2013-12-31	mettw	[r1] Initial commit
copyright.w	2013-12-31	mettw	[r1] Initial commit
cweb.el	2013-12-31	mettw	[r1] Initial commit
install-sh	2013-12-31	mettw	[r1] Initial commit
ltconfig	2013-12-31	mettw	[r1] Initial commit
ltmain.sh	2013-12-31	mettw	[r1] Initial commit
missing	2013-12-31	mettw	[r1] Initial commit
mkinstalldirs	2013-12-31	mettw	[r1] Initial commit
stamp-h.in	2013-12-31	mettw	[r1] Initial commit

Read Me

			XML DataBase Manager (XDBM)
			===========================

This library is both a validating XML parser (or will be) as well
as a database designed to hold XML data.  Because of the graph nature
of XML, pointers to the XML nodes are stored in a hashtable.  The 
XML data itself is stored in a pre-parsed format.
The advantage of this being: greater speed through not needing 
to parse the XML file; fast searching for elements; and less 
memory usage as only those parts that are needed are read into 
memory.  For compatibility with other XML programs, two utilities, 
xml2xdbm and xdbm2xml, are provided for converting XML data to and
from xdbm format. 

The library functions are documented in the doc/ directory,
I've included untangled .c files and tex files for those who don't
have CWEB installed.  If you want to contribute to XDBM however you
will have to use CWEB.  Don't be put off by this though, CWEB is really
very simple and is, in my opinion, the greatest advance in
programming since C. 

I was originally going to use the xmltok interface to expat but I've
had to use the xmlparse one instead.  Supposedly xmltok is documented
in the file xmltok.h - pig's arse! 

Utilities
=========

This library comes with three utilities:

xml2xdbm - converts text/xml to xdbm format.

xdbm2xml - convert an xdbm file to text/xml.

lessopen.sh - allows you to read xdbm files as plain text in less.

xdbm-mode.el - This is a derived mode of PSGML mode.  It converts an
	xdbm file to text/xml on opening and converts it back to
	xdbm format on saving.

Apache?? - Need an Apache module that converts xdbm files into text/xml
           for transmission.


XDBM file format
================

Requirements
------------

1)   Each node must maintain sibling order information as this can be critical
     for interpreting the document - this is defined in the DOM standard.

2)     The child nodes are in two forms:

2.a)   attribute nodes have no sibling information; 
2.a.i) All attribute nodes are leaf nodes.

2.b)   element nodes do contain sibling information.
2.b.i) Some element nodes (such as cdata) are always leaf nodes, some
       are always inner nodes (such as <ul> in HTML) and some may be
       either leaf or inner nodes and thier status as such may change.

2.c)   The only distinction DOM makes between attributes and elements
       is in where they hang off the parent node.

3)   Namespace collision will occur regularly, 

3.a) so each btree entry must store an array of pointers to the Nodes.

3.b) Must also store an array of correspoding parent nodes so that
     the path of the child can be determined without having to introduce
     the overhead of scaning each Node.

3.c) To determine if a Node has the correct parent, a comparison
     between the Nodes matching the correct parent name and the actual
     parents of the matching child Nodes must be done.  To speed this
     up all arrays in the btree nodes must be qsorted.

SEARCHING
---------

The ID/IDREF attributes turn XML into a graph rather than a tree,
but it still always holds a tree-like structure.  Most importantly,
all searches of the XML graph will have a tree like structure.  That is,
searches will consist of searching for a particular child of a 
parent.

We need to be able to search for name, value and type. 

Ther are three search lists:
	searchDatumList
		All nodes, sorted by nodePos
	searchTypeList
		Parent list is referenced by XDBMNodeType
		Child list is sorted by parentPos
	searchList
		Parent list is referenced by the node's name or value
		hash.
		Child list is sorted by parentPos.

Individual elements in these lists are pointers to an XDBMSearchDatum
structure

typedef struct __XDBMSearchDatum {
	struct __Node	*node;
	long		parentPos;
	long		nodePos;
	unsigned long	nameHash;
	unsigned long	valueHash;
	double dewey;
} XDBMSearchDatum;

the dewey member is used to sort nodes by tree traversal order.

ID/IDREF are accounted for by having a seperate SearchDatum for each
refering to the same node - only the parentPos member is different.

This should be small enough that it can be loaded into memory upon
opening the file, thereby also acheiving a significant performance
improvement on top of the fact that ID/IDREF attributes can be used.

So if you get an xml-ql [1] query of the form:

	WHERE
		<publisher>
		<book>
		<title>Data Structures and Algorithm Analysis in C</title>
		<author>$a</author>
		</book>
		</publisher>
	IN "www.foo.com/bibliography.xml"
	CONSTRUCT $a

Then to process the request you would:

	1) Do a Find for the key "Data Structures and Algorithm Analysis in C"
	2) Do Finds for the keys "book", "title", "author" and "publisher"
	3) Compare the fpos_t's for the node and thier parents to
	   find the appropriate entry.	

FILE FORMAT
-----------

To speed file accesses the file will be broken up into blocks of
64 bytes by default.  So we have

Magic = 1 block
each Node = 1+ block

A particular node may take up more than one block, but it is unlikely.

Magic (Block 0)
---------------
Every file starts with some magic to allow easy determination of file
type:

"XDBM%s\0", XDBM_FILE_VERSION

Following this is the XDBMDocument structure:

"%d%d%d%d", 
XDBMDocument.documentElementPos,
XDBMDocument.prologPos,
XDBMDocument.numberDatum,
first free block

The last entry in the magic block is a pointer to a block that
is not currently used.  All other free blocks are linked together in a
chain.

Node Blocks (Blocks 1+)
-----------------------

For simplicity all nodes that span more than one block have all of its
blocks in sequential order.  On UNIX systems this will usually result
in unfragmented blocks, but in MS-WIN systems there isn't really much you
can do to stop fragmentation.  This will increase file size as there will
be more empty blocks, but since most nodes will fit in a single block
this shouldn't be much of a problem.

unsigned int - number of blocks used by this node.
unsigned long Name hash
unsigned long value hash
double dewey
NodeType
long node.parentNode
long node.firstChild
long node.lastChild
long node.previousSibling
long node.nextSibling
long node.attributes.numberAttributes
long node.attributes.first.node.nodePos
long node.attributes.first.next.node.nodePos
...
size_t - the length of node.nodeName
Uchar* node.nodeName
size_t - the length of node.nodeValue
Uchar* node.nodeValue

The dewey is used to sort a list of nodes into tree traversal order.

Bibliography
============

[1] Deutsch, A., Fernandez, M., Florescu, D., Levy, A., 
    Suciu, D., "XML-QL: A Query Language For XML", 
    (http://www.w3.org/TR/NOTE-xml-ql)

xdbm Code

XDBM is an XQuery implementation on the W3C's EXI binary XML format.

Tree [r3] / Download Snapshot History

Read Me

Tree [r3] /

History