Menu

Tree [r3] /
 History

HTTPS access


File Date Author Commit
 doc 2013-12-31 mettw [r2] Deleted non-source files
 include 2013-12-31 mettw [r2] Deleted non-source files
 lib 2013-12-31 mettw [r2] Deleted non-source files
 utils 2013-12-31 mettw [r2] Deleted non-source files
 AUTHORS 2013-12-31 mettw [r1] Initial commit
 COPYING 2013-12-31 mettw [r1] Initial commit
 INSTALL 2013-12-31 mettw [r1] Initial commit
 Makefile.am 2013-12-31 mettw [r1] Initial commit
 README 2013-12-31 mettw [r3]
 config.guess 2013-12-31 mettw [r1] Initial commit
 config.sub 2013-12-31 mettw [r1] Initial commit
 configure 2013-12-31 mettw [r1] Initial commit
 configure.in 2013-12-31 mettw [r1] Initial commit
 copyright.w 2013-12-31 mettw [r1] Initial commit
 cweb.el 2013-12-31 mettw [r1] Initial commit
 install-sh 2013-12-31 mettw [r1] Initial commit
 ltconfig 2013-12-31 mettw [r1] Initial commit
 ltmain.sh 2013-12-31 mettw [r1] Initial commit
 missing 2013-12-31 mettw [r1] Initial commit
 mkinstalldirs 2013-12-31 mettw [r1] Initial commit
 stamp-h.in 2013-12-31 mettw [r1] Initial commit

Read Me

			XML DataBase Manager (XDBM)
			===========================

This library is both a validating XML parser (or will be) as well
as a database designed to hold XML data.  Because of the graph nature
of XML, pointers to the XML nodes are stored in a hashtable.  The 
XML data itself is stored in a pre-parsed format.
The advantage of this being: greater speed through not needing 
to parse the XML file; fast searching for elements; and less 
memory usage as only those parts that are needed are read into 
memory.  For compatibility with other XML programs, two utilities, 
xml2xdbm and xdbm2xml, are provided for converting XML data to and
from xdbm format. 

The library functions are documented in the doc/ directory,
I've included untangled .c files and tex files for those who don't
have CWEB installed.  If you want to contribute to XDBM however you
will have to use CWEB.  Don't be put off by this though, CWEB is really
very simple and is, in my opinion, the greatest advance in
programming since C. 

I was originally going to use the xmltok interface to expat but I've
had to use the xmlparse one instead.  Supposedly xmltok is documented
in the file xmltok.h - pig's arse! 

Utilities
=========

This library comes with three utilities:

xml2xdbm - converts text/xml to xdbm format.

xdbm2xml - convert an xdbm file to text/xml.

lessopen.sh - allows you to read xdbm files as plain text in less.

xdbm-mode.el - This is a derived mode of PSGML mode.  It converts an
	xdbm file to text/xml on opening and converts it back to
	xdbm format on saving.

Apache?? - Need an Apache module that converts xdbm files into text/xml
           for transmission.


XDBM file format
================

Requirements
------------

1)   Each node must maintain sibling order information as this can be critical
     for interpreting the document - this is defined in the DOM standard.

2)     The child nodes are in two forms:

2.a)   attribute nodes have no sibling information; 
2.a.i) All attribute nodes are leaf nodes.

2.b)   element nodes do contain sibling information.
2.b.i) Some element nodes (such as cdata) are always leaf nodes, some
       are always inner nodes (such as <ul> in HTML) and some may be
       either leaf or inner nodes and thier status as such may change.

2.c)   The only distinction DOM makes between attributes and elements
       is in where they hang off the parent node.

3)   Namespace collision will occur regularly, 

3.a) so each btree entry must store an array of pointers to the Nodes.

3.b) Must also store an array of correspoding parent nodes so that
     the path of the child can be determined without having to introduce
     the overhead of scaning each Node.

3.c) To determine if a Node has the correct parent, a comparison
     between the Nodes matching the correct parent name and the actual
     parents of the matching child Nodes must be done.  To speed this
     up all arrays in the btree nodes must be qsorted.

SEARCHING
---------

The ID/IDREF attributes turn XML into a graph rather than a tree,
but it still always holds a tree-like structure.  Most importantly,
all searches of the XML graph will have a tree like structure.  That is,
searches will consist of searching for a particular child of a 
parent.

We need to be able to search for name, value and type. 

Ther are three search lists:
	searchDatumList
		All nodes, sorted by nodePos
	searchTypeList
		Parent list is referenced by XDBMNodeType
		Child list is sorted by parentPos
	searchList
		Parent list is referenced by the node's name or value
		hash.
		Child list is sorted by parentPos.

Individual elements in these lists are pointers to an XDBMSearchDatum
structure

typedef struct __XDBMSearchDatum {
	struct __Node	*node;
	long		parentPos;
	long		nodePos;
	unsigned long	nameHash;
	unsigned long	valueHash;
	double dewey;
} XDBMSearchDatum;

the dewey member is used to sort nodes by tree traversal order.

ID/IDREF are accounted for by having a seperate SearchDatum for each
refering to the same node - only the parentPos member is different.

This should be small enough that it can be loaded into memory upon
opening the file, thereby also acheiving a significant performance
improvement on top of the fact that ID/IDREF attributes can be used.

So if you get an xml-ql [1] query of the form:

	WHERE
		<publisher>
		<book>
		<title>Data Structures and Algorithm Analysis in C</title>
		<author>$a</author>
		</book>
		</publisher>
	IN "www.foo.com/bibliography.xml"
	CONSTRUCT $a

Then to process the request you would:

	1) Do a Find for the key "Data Structures and Algorithm Analysis in C"
	2) Do Finds for the keys "book", "title", "author" and "publisher"
	3) Compare the fpos_t's for the node and thier parents to
	   find the appropriate entry.	

FILE FORMAT
-----------

To speed file accesses the file will be broken up into blocks of
64 bytes by default.  So we have

Magic = 1 block
each Node = 1+ block

A particular node may take up more than one block, but it is unlikely.

Magic (Block 0)
---------------
Every file starts with some magic to allow easy determination of file
type:

"XDBM%s\0", XDBM_FILE_VERSION

Following this is the XDBMDocument structure:

"%d%d%d%d", 
XDBMDocument.documentElementPos,
XDBMDocument.prologPos,
XDBMDocument.numberDatum,
first free block

The last entry in the magic block is a pointer to a block that
is not currently used.  All other free blocks are linked together in a
chain.

Node Blocks (Blocks 1+)
-----------------------

For simplicity all nodes that span more than one block have all of its
blocks in sequential order.  On UNIX systems this will usually result
in unfragmented blocks, but in MS-WIN systems there isn't really much you
can do to stop fragmentation.  This will increase file size as there will
be more empty blocks, but since most nodes will fit in a single block
this shouldn't be much of a problem.

unsigned int - number of blocks used by this node.
unsigned long Name hash
unsigned long value hash
double dewey
NodeType
long node.parentNode
long node.firstChild
long node.lastChild
long node.previousSibling
long node.nextSibling
long node.attributes.numberAttributes
long node.attributes.first.node.nodePos
long node.attributes.first.next.node.nodePos
...
size_t - the length of node.nodeName
Uchar* node.nodeName
size_t - the length of node.nodeValue
Uchar* node.nodeValue

The dewey is used to sort a list of nodes into tree traversal order.

Bibliography
============

[1] Deutsch, A., Fernandez, M., Florescu, D., Levy, A., 
    Suciu, D., "XML-QL: A Query Language For XML", 
    (http://www.w3.org/TR/NOTE-xml-ql)

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.