From: Jim S. <ja...@ne...> - 2004-08-30 17:27:51
|
I've started sketching a utility, XLoad, to dump and load tables to and from XML files. I've had to make a large number of arbitrary (and reversable) decisions to get started. I'm just a few hours into it and nothing is set in jello, let alone concrete. I thought it would make good sense to expose what I've got as a foundation for where we want to go. The major questions revolve around the XML schema. The minor questions involve program structure, switch definitions, etc. The utility is built on the JDBC C++ binding defined in IscDbc. I'm using the original version of IscDbc that developed as part of the IBPhoenix ODBC driver. This version is maintained as part of Vulcan, though it has not yet been integrated in the automatic build procedure, which will have to wait until I have some time to devote ot it. For now, both XLoad and IscDbc are available in source form in the Vulcan CVS tree with project definitions for MSVC version 7 (aka Visual Studio .Not 2003). The guts of XLoad are in two classes XDump and XLoad. The former takes a database and some control information and creates an XML file. The latter takes a database and an XML file and populates the database. The intention is that XDump and XLoad will be usable in other contexts. XLoad is essentially driven by a ResultSet. One of these days I will extend it to take explicit table names (which will save all rows in the tables) and "all user tables", provided we can agree on what this means. My evolving XML schema looks like this: <?xml version="1.0" encoding="US-ASCII"?> <database> <metadata> <table name="MESSAGES"> <column name="TRANS_NOTES" type="blob"/> <column name="EXPLANATION" type="blob"/> <column name="ACTION" type="blob"/> <column name="TEXT" type="varchar" precision="118"/> <column name="CODE" type="int"/> <column name="FLAGS" type="smallint"/> <column name="NUMBER" type="smallint"/> <column name="FAC_CODE" type="smallint"/> <column name="SYMBOL" type="varchar" precision="32"/> <column name="ROUTINE" type="varchar" precision="32"/> <column name="MODULE" type="varchar" precision="32"/> </table> </metadata> <data> <rows table="MESSAGES"> <row TEXT="Do you want to roll back your updates?" CODE="10351" NUMBER="351" FAC_CODE="1" SYMBOL="" ROUTINE="process_statement"/> <row TEXT="gen_descriptor: dtype not recognized" CODE="10352" NUMBER="352" FAC_CODE="1" ROUTINE="gen_descriptor"/> <row TEXT="MOVQ_move: conversion not done" CODE="10047" NUMBER="47" FAC_CODE="1"/> <row TEXT="BLOB conversion is not supported" CODE="10048" NUMBER="48" FAC_CODE="1" SYMBOL="" ROUTINE=""/> <row TEXT="expected type" CODE="10000" NUMBER="0" FAC_CODE="1"/> </rows> </data> </database> The key questions, I think, are how data is presented. My starting point is: * A table row is represented as a single xml element * Each non-null column is presented by an xml attribute * Null columns are not represented * No special casing of blob values are supported * Column attributes are based on JDBC definitions Please keep in mind that this is not a replacement for gbak and is not intended to handle everything. In specific, very large databases are not supported. Both XLoad and XDump map between a generalized tree structure and XML and when the address space is blown, the address space is blown. Any application that finds this burdensome should not plan to use this. The metadata section obvious needs to be extended to represent primary and foreign keys, nullability, and many of the other good any valuable attributes. Since the original target problem is the message database, features required for messages will show up sooner than later. Features that aren't available through the Jdbc metadata objects will take even longer. (I'm sending this as both text and html for readability. For those morally opposed to html mail, get a life.) Comments? Suggestions? Criticisms? Brickbats? -- Jim Starkey Netfrastructure, Inc. 978 526-1376 |
From: Martijn T. <m.t...@up...> - 2004-08-31 09:54:19
|
Jim et all, How will binary blobs be handled? This goes for plain SQL as well -=20 We now can insert text-blobs via a simple (non-parameterized) SQL statement, but we cannot insert/update binary blobs yet. Other database systems can, for example, MS SQL Server: INSERT INTO ... VALUES (0x0011A301EF01 ... etc) With regards, Martijn Tonies Database Workbench - developer tool for InterBase, Firebird, MySQL & MS = SQL Server. Upscene Productions http://www.upscene.com My evolving XML schema looks like this: <?xml version=3D"1.0" encoding=3D"US-ASCII"?> <database> <metadata> <table name=3D"MESSAGES"> <column name=3D"TRANS_NOTES" type=3D"blob"/> <column name=3D"EXPLANATION" type=3D"blob"/> <column name=3D"ACTION" type=3D"blob"/> <column name=3D"TEXT" type=3D"varchar" precision=3D"118"/> <column name=3D"CODE" type=3D"int"/> <column name=3D"FLAGS" type=3D"smallint"/> <column name=3D"NUMBER" type=3D"smallint"/> <column name=3D"FAC_CODE" type=3D"smallint"/> <column name=3D"SYMBOL" type=3D"varchar" precision=3D"32"/> <column name=3D"ROUTINE" type=3D"varchar" precision=3D"32"/> <column name=3D"MODULE" type=3D"varchar" precision=3D"32"/> </table> </metadata> <data> <rows table=3D"MESSAGES"> <row TEXT=3D"Do you want to roll back your updates?" = CODE=3D"10351" NUMBER=3D"351" FAC_CODE=3D"1" SYMBOL=3D"" = ROUTINE=3D"process_statement"/> <row TEXT=3D"gen_descriptor: dtype not recognized" = CODE=3D"10352" NUMBER=3D"352" FAC_CODE=3D"1" = ROUTINE=3D"gen_descriptor"/> <row TEXT=3D"MOVQ_move: conversion not done" CODE=3D"10047" = NUMBER=3D"47" FAC_CODE=3D"1"/> <row TEXT=3D"BLOB conversion is not supported" CODE=3D"10048" = NUMBER=3D"48" FAC_CODE=3D"1" SYMBOL=3D"" ROUTINE=3D""/> <row TEXT=3D"expected type" CODE=3D"10000" NUMBER=3D"0" = FAC_CODE=3D"1"/> </rows> </data> </database> |
From: Jim S. <ja...@ne...> - 2004-08-31 14:04:09
|
Martijn Tonies wrote: > Jim et all, > > How will binary blobs be handled? > > This goes for plain SQL as well - > > We now can insert text-blobs via a simple (non-parameterized) SQL > statement, but we cannot insert/update binary blobs yet. > There's no problem on the database side; IscDbc takes care of it automatically. All it takes is a call to PreparedStatement::setBytes and IscDbc takes care of the rest. The harder part of the question is how a binary blob is represented in XML. The problem is not encoding -- base64 or anything similar suffices -- but specifying the encoding. Representing each value as an attribute on the "row" element doesn't leave much wiggle room. The solution is simple: Represent the binary blob as a separate "value" element under the "row" element like this: <row TEXT="Do you want to roll back your updates?" CODE="10351" NUMBER="351" > <column name="ICON" encoding="base64" value="<gook goes here>"/> </row> As escape mechanism it provides all sorts of opportunities for extensible mischief. -- Jim Starkey Netfrastructure, Inc. 978 526-1376 |
From: Arno B. <fir...@ab...> - 2004-08-31 15:16:10
|
Hi, Jim et all, How will binary blobs be handled? This goes for plain SQL as well -=20 We now can insert text-blobs via a simple (non-parameterized) SQL statement, but we cannot insert/update binary blobs yet. There's no problem on the database side; IscDbc takes care of it = automatically. All it takes is a call to PreparedStatement::setBytes = and IscDbc takes care of the rest. The harder part of the question is how a binary blob is represented in = XML. The problem is not encoding -- base64 or anything similar suffices = -- but specifying the encoding. Representing each value as an attribute = on the "row" element doesn't leave much wiggle room. The solution is = simple: Represent the binary blob as a separate "value" element under = the "row" element like this: <row TEXT=3D"Do you want to roll back your updates?" CODE=3D"10351" = NUMBER=3D"351" > <column name=3D"ICON" encoding=3D"base64" value=3D"<gook goes = here>"/> </row> =20 As escape mechanism it provides all sorts of opportunities for = extensible mischief. If you put blob in separate element i think you must put every field = inside separate element, but yes as already meantioned this will explode = xml file. for binary blobs use xml CDATA : http://www.w3schools.com/xml/xml_cdata.asp Regards, Arno Brinkman ABVisie -=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D- Firebird open source database (based on IB-OE) with many SQL-99 features = : http://www.firebirdsql.org http://www.firebirdsql.info http://www.fingerbird.de/ http://www.comunidade-firebird.org/ Support list for Interbase and Firebird users : fir...@ya... Nederlandse firebird nieuwsgroep : news://newsgroups.firebirdsql.info |
From: Jim S. <ja...@ne...> - 2004-08-31 15:49:30
|
Arno Brinkman wrote: > If you put blob in separate element i think you must put every field > inside separate element, but yes as already meantioned this will > explode xml file. I don't see why this is necessary or even desirable. Yes, there are two mechanism for representing values, one dense, one fluffy, but the rule is crystal clear: * If the column name exists as an attribute, use it. * If the column name doesn't exist as an attribute, look for a child element of type "column" and the attribute "name" * If there isn't an appropriately named child element either, the value is null The column element also handles the problem of field names that aren't valid XML attribute names. If we had to, we could easily go all column elements. Interestingly enough, the internal representation is almost identical -- both elements and attributes are represented by linked lists of Element objects. But I think the denser representation is more readable and, as you mention, much shorter. Performance, however, shouldn't be a significant consideration. > > for binary blobs use xml CDATA : Worth thinking about. The obvious problem is a field that contains the string "]]>" that can't be escaped in CDATA. Are multiple CDATA blocks generally allowed? Is CData allowed to have arbitrary gook? Do we really want to have text files with, say, raw JPEGs inside? Do you know what a text editor does with a JPEG? Or are you suggesting that CDATA be used only with an encoding? If so, what actual good is CDATA? Note: a string specified as an attribute value doesn't need quoting for anything except the quote character, which certainly contributes to readability. What are your thoughs about the relative merits of <column name='"HERRING" value="RED"/> and <column name='"HERRING">RED</column> -- Jim Starkey Netfrastructure, Inc. 978 526-1376 |
From: Olivier M. <om...@ti...> - 2004-08-31 21:06:14
|
Jim Starkey wrote: > What are your thoughs about the relative merits of > > <column name='"HERRING" value="RED"/> > > and > > <column name='"HERRING">RED</column> Both are incorrect. The opening single quote is a typo, I suppose. ;-) On an XML point of view, the second one is more logical. The 'column' element is meant to represent a column value, it has an attribute (its name) which qualifies it, and the 'value' comes naturally in the element text body. It imposes some more escapes ('<' for instance), that's all. Oh btw, '&' has to be escaped, both in an attribute value and in the element text. '&' is common for such escape. -- Olivier Mascia |
From: Jim S. <ja...@ne...> - 2004-08-31 21:55:55
|
Olivier Mascia wrote: > Jim Starkey wrote: > >> What are your thoughs about the relative merits of >> >> <column name='"HERRING" value="RED"/> >> >> and >> >> <column name='"HERRING">RED</column> > > > Both are incorrect. The opening single quote is a typo, I suppose. ;-) You have a future as a proof reader. On the other hand, I have no future as a typist. > > On an XML point of view, the second one is more logical. > The 'column' element is meant to represent a column value, it has an > attribute (its name) which qualifies it, and the 'value' comes > naturally in the element text body. It imposes some more escapes ('<' > for instance), that's all. Significant data between open and close tags makes sense when XML is used as a markup language, which, of course, we are not. While it is more or less arbitrary as to whether a data item is an attribute or enclosed between tags is pretty much arbitrary, using XML to represent a logical tree suggests that an attribute is more appropriate to my taste. Among other things, it makes the formatting of the XML file insignificant. And it puts us in the position of having to explain over and over that arbitrary white space is ok between some tags and not between others. I like to see my data safely enclosed in quotation marks looking like data. > Oh btw, '&' has to be escaped, both in an attribute value and in the > element text. '&' is common for such escape. And quotes, too, sir. -- Jim Starkey Netfrastructure, Inc. 978 526-1376 |
From: Todd F. <ta...@le...> - 2004-08-31 15:39:32
|
Jim Starkey wrote: > > > Martijn Tonies wrote: > >> Jim et all, >> >> How will binary blobs be handled? >> >> This goes for plain SQL as well - >> >> We now can insert text-blobs via a simple (non-parameterized) SQL >> statement, but we cannot insert/update binary blobs yet. >> > > There's no problem on the database side; IscDbc takes care of it > automatically. All it takes is a call to PreparedStatement::setBytes > and IscDbc takes care of the rest. > > The harder part of the question is how a binary blob is represented in > XML. The problem is not encoding -- base64 or anything similar > suffices -- but specifying the encoding. Representing each value as > an attribute on the "row" element doesn't leave much wiggle room. The > solution is simple: Represent the binary blob as a separate "value" > element under the "row" element like this: > ><row TEXT="Do you want to roll back your updates?" CODE="10351" NUMBER="351" > > <column name="ICON" encoding="base64" value="<gook goes here>"/> ></row> > > > As escape mechanism it provides all sorts of opportunities for > extensible mischief. Isn't this really just a simple matter of enclosing the blob in a <![CDATA[ binary data here ]]> block? todd |
From: Olivier M. <om...@ti...> - 2004-08-31 21:11:30
|
Jim Starkey wrote: > Comments? Suggestions? Criticisms? Brickbats? One must not forget about multiplicity of character sets found in a database. Many columns can stored strings using different character sets. I would suggest to map all individual character sets to Unicode (32 bits code point) and then ouput the whole xml file in UTF-8 encoding (which is the default anyway for an XML document). The <metadata/> could save the original character set, so that data can be mapped back to its original coding when processing the xml file. -- Olivier Mascia |
From: Jacqui C. <jac...@nt...> - 2004-08-31 21:57:49
|
Olivier Mascia wrote: > I would suggest to map all individual character sets to Unicode (32 bits > code point) and then ouput the whole xml file in UTF-8 encoding (which > is the default anyway for an XML document). The <metadata/> could save > the original character set, so that data can be mapped back to its > original coding when processing the xml file. This is a very sensible idea and avoids the mess. |
From: Jim S. <ja...@ne...> - 2004-08-31 22:00:41
|
Olivier Mascia wrote: > Jim Starkey wrote: > >> Comments? Suggestions? Criticisms? Brickbats? > > > One must not forget about multiplicity of character sets found in a > database. Many columns can stored strings using different character sets. > > I would suggest to map all individual character sets to Unicode (32 > bits code point) and then ouput the whole xml file in UTF-8 encoding > (which is the default anyway for an XML document). The <metadata/> > could save the original character set, so that data can be mapped back > to its original coding when processing the xml file. > But a database doesn't have a single encoding. UTF-8 does make sense, I guess. I don't know how friendly it is to editors, but anyone who uses stuff other than ascii has learned to cope with it. But I will need to teach IscDbc to coax UTF-8 out of Firebird. Ignorant question: is UTF-8 safely terminated with a null, or is a zero byte a legit byte in a multi-byte character? -- Jim Starkey Netfrastructure, Inc. 978 526-1376 |
From: Olivier M. <om...@ti...> - 2004-09-01 00:05:59
|
Jim Starkey wrote: > But a database doesn't have a single encoding. UTF-8 does make sense, I > guess. I don't know how friendly it is to editors, but anyone who uses > stuff other than ascii has learned to cope with it. But I will need to > teach IscDbc to coax UTF-8 out of Firebird. Ignorant question: is UTF-8 > safely terminated with a null, or is a zero byte a legit byte in a > multi-byte character? Safe. UTF-8 is clean. A character (32 bit value, but not all values are used, 'only' a bit more than 1.1 million) is encoded using 1, 2, 3 or 4 bytes. When more than 1 byte is used, not only none of them will ever be zero, but all will have MSB set. What's more by looking at the most significants bits of the first byte, you can immediately tell if you have a 2, 3 or 4 bytes sequence in front of you. ftp://ftp.rfc-editor.org/in-notes/std/std63.txt -- Olivier Mascia |
From: Derryck W. <der...@tn...> - 2004-09-01 12:21:25
|
About the encoding, this is something i use in a web-engine: function TSpecialChars.To_XmlValue_Characters(source: wideString): UTF8String; // encode content // This function replaces the following characters //http://www.wdvl.com/Authoring/HTML/Tutorial/special_characters.html // with their respective character references: // < < // > > // ? ? // & & // ' ' // " " // [ [ // ] ] // ` ` // ! ! (**************************************** Unused � -  Horizontal Tab 	 Line Feed Unused  -  Space   ! ! " " # # $ $ % % & & ' ' ( ( ) ) * * + + , , - - . . / / 0-9 0 - 9 : : ; ; < < = = > > ? ? @ @ A - Z A - Z [ [ \ \ ] ] ^ ^ _ _ ` ` a - z a - z { { | | } } ~ ~ Unused  -   ****************************) var i,l: integer; s: UTF8String; v:integer; begin result:= ''; s:=''; if source<>'' then try l:=length(source); for i:= 1 to l do begin v:=ord(source[i]); if ( v in [33,34,38,39,60,62,91,93,96]) then s:=s+'&#'+inttostr(v)+';' else if (i=l) and (v=0) then s := s + ' ' else s := s + source[i]; end; except S:='String Exception'; end; Result:= s; end; ----- Greetings, Derryck |