xdbm Code
XDBM is an XQuery implementation on the W3C's EXI binary XML format.
Status: Planning
Brought to you by:
mettw
XML DataBase Manager (XDBM) =========================== This library is both a validating XML parser (or will be) as well as a database designed to hold XML data. Because of the graph nature of XML, pointers to the XML nodes are stored in a hashtable. The XML data itself is stored in a pre-parsed format. The advantage of this being: greater speed through not needing to parse the XML file; fast searching for elements; and less memory usage as only those parts that are needed are read into memory. For compatibility with other XML programs, two utilities, xml2xdbm and xdbm2xml, are provided for converting XML data to and from xdbm format. The library functions are documented in the doc/ directory, I've included untangled .c files and tex files for those who don't have CWEB installed. If you want to contribute to XDBM however you will have to use CWEB. Don't be put off by this though, CWEB is really very simple and is, in my opinion, the greatest advance in programming since C. I was originally going to use the xmltok interface to expat but I've had to use the xmlparse one instead. Supposedly xmltok is documented in the file xmltok.h - pig's arse! Utilities ========= This library comes with three utilities: xml2xdbm - converts text/xml to xdbm format. xdbm2xml - convert an xdbm file to text/xml. lessopen.sh - allows you to read xdbm files as plain text in less. xdbm-mode.el - This is a derived mode of PSGML mode. It converts an xdbm file to text/xml on opening and converts it back to xdbm format on saving. Apache?? - Need an Apache module that converts xdbm files into text/xml for transmission. XDBM file format ================ Requirements ------------ 1) Each node must maintain sibling order information as this can be critical for interpreting the document - this is defined in the DOM standard. 2) The child nodes are in two forms: 2.a) attribute nodes have no sibling information; 2.a.i) All attribute nodes are leaf nodes. 2.b) element nodes do contain sibling information. 2.b.i) Some element nodes (such as cdata) are always leaf nodes, some are always inner nodes (such as <ul> in HTML) and some may be either leaf or inner nodes and thier status as such may change. 2.c) The only distinction DOM makes between attributes and elements is in where they hang off the parent node. 3) Namespace collision will occur regularly, 3.a) so each btree entry must store an array of pointers to the Nodes. 3.b) Must also store an array of correspoding parent nodes so that the path of the child can be determined without having to introduce the overhead of scaning each Node. 3.c) To determine if a Node has the correct parent, a comparison between the Nodes matching the correct parent name and the actual parents of the matching child Nodes must be done. To speed this up all arrays in the btree nodes must be qsorted. SEARCHING --------- The ID/IDREF attributes turn XML into a graph rather than a tree, but it still always holds a tree-like structure. Most importantly, all searches of the XML graph will have a tree like structure. That is, searches will consist of searching for a particular child of a parent. We need to be able to search for name, value and type. Ther are three search lists: searchDatumList All nodes, sorted by nodePos searchTypeList Parent list is referenced by XDBMNodeType Child list is sorted by parentPos searchList Parent list is referenced by the node's name or value hash. Child list is sorted by parentPos. Individual elements in these lists are pointers to an XDBMSearchDatum structure typedef struct __XDBMSearchDatum { struct __Node *node; long parentPos; long nodePos; unsigned long nameHash; unsigned long valueHash; double dewey; } XDBMSearchDatum; the dewey member is used to sort nodes by tree traversal order. ID/IDREF are accounted for by having a seperate SearchDatum for each refering to the same node - only the parentPos member is different. This should be small enough that it can be loaded into memory upon opening the file, thereby also acheiving a significant performance improvement on top of the fact that ID/IDREF attributes can be used. So if you get an xml-ql [1] query of the form: WHERE <publisher> <book> <title>Data Structures and Algorithm Analysis in C</title> <author>$a</author> </book> </publisher> IN "www.foo.com/bibliography.xml" CONSTRUCT $a Then to process the request you would: 1) Do a Find for the key "Data Structures and Algorithm Analysis in C" 2) Do Finds for the keys "book", "title", "author" and "publisher" 3) Compare the fpos_t's for the node and thier parents to find the appropriate entry. FILE FORMAT ----------- To speed file accesses the file will be broken up into blocks of 64 bytes by default. So we have Magic = 1 block each Node = 1+ block A particular node may take up more than one block, but it is unlikely. Magic (Block 0) --------------- Every file starts with some magic to allow easy determination of file type: "XDBM%s\0", XDBM_FILE_VERSION Following this is the XDBMDocument structure: "%d%d%d%d", XDBMDocument.documentElementPos, XDBMDocument.prologPos, XDBMDocument.numberDatum, first free block The last entry in the magic block is a pointer to a block that is not currently used. All other free blocks are linked together in a chain. Node Blocks (Blocks 1+) ----------------------- For simplicity all nodes that span more than one block have all of its blocks in sequential order. On UNIX systems this will usually result in unfragmented blocks, but in MS-WIN systems there isn't really much you can do to stop fragmentation. This will increase file size as there will be more empty blocks, but since most nodes will fit in a single block this shouldn't be much of a problem. unsigned int - number of blocks used by this node. unsigned long Name hash unsigned long value hash double dewey NodeType long node.parentNode long node.firstChild long node.lastChild long node.previousSibling long node.nextSibling long node.attributes.numberAttributes long node.attributes.first.node.nodePos long node.attributes.first.next.node.nodePos ... size_t - the length of node.nodeName Uchar* node.nodeName size_t - the length of node.nodeValue Uchar* node.nodeValue The dewey is used to sort a list of nodes into tree traversal order. Bibliography ============ [1] Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D., "XML-QL: A Query Language For XML", (http://www.w3.org/TR/NOTE-xml-ql)