Build DTD from XML

Help
Peter
2007-01-21
2012-12-07
  • Peter

    Peter - 2007-01-21

    Do you have any plans to reverse engineer a DTD file from an XML file?

     
    • gnschmidt

      gnschmidt - 2007-01-21

      Yes, I think Matt has requested such a tool in the past.

      In one sense this already happens: if the program can't find a DTD, tag completion/validation-as-you-type and so on are based on the instance document.

      The data structures populated in this way are no different from those used when DTDs are parsed for this purpose (though they are only a shallow representation of the DTDs, built for editor speed rather than full DTD compliance).

      Do you think it would be useful to be able to write out a very basic, permissive DTD on this basis? A full DTD creator would require lots of flags (distinguish between parsed character data, numerals, boolean, white space, check for required attributes, and so on) and might be better served by a standalone library that can then be called by end-user applications. (If there's an open source project that already does this, I'd be very happy to use it!)
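
      A minimal sketch of what such a "very basic, permissive DTD" writer might look like, written in Python purely for illustration (it is not the editor's code). The function name `infer_permissive_dtd` is hypothetical. It assumes a document without namespaces, declares every attribute as optional CDATA, and allows the observed children in any order and number:

```python
import xml.etree.ElementTree as ET

def infer_permissive_dtd(xml_text):
    """Return a very loose DTD covering every element and attribute seen."""
    root = ET.fromstring(xml_text)
    children = {}    # element name -> child element names, in first-seen order
    attributes = {}  # element name -> attribute names, in first-seen order
    has_text = {}    # element name -> True if any instance carried text

    for elem in root.iter():
        kids = children.setdefault(elem.tag, [])
        attrs = attributes.setdefault(elem.tag, [])
        for child in elem:
            if child.tag not in kids:
                kids.append(child.tag)
        for name in elem.attrib:
            if name not in attrs:
                attrs.append(name)
        # Text can sit before the first child or after any child (its tail).
        text_parts = [elem.text] + [child.tail for child in elem]
        if any(part and part.strip() for part in text_parts):
            has_text[elem.tag] = True

    lines = []
    for name, kids in children.items():
        if kids and has_text.get(name):
            model = "(#PCDATA | " + " | ".join(kids) + ")*"  # mixed content
        elif kids:
            model = "(" + " | ".join(kids) + ")*"            # any order, any count
        else:
            model = "(#PCDATA)"                              # leaf element
        lines.append(f"<!ELEMENT {name} {model}>")
        for attr in attributes[name]:
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)
```

      None of the flags mentioned above (data types, required attributes and so on) are handled here; every document that uses the same element and attribute names would validate.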

      -Gerald

       
    • David Håsäther

      Hi Gerald.

      Trang (written in Java) from no other than James Clark does this.
      It converts between DTD, RELAX NG (XML and compact syntax), and XML Schema.
      It can also take an XML document as input document and infer a schema/DTD from that.

      See http://www.thaiopensource.com/relaxng/trang.html

       
    • Peter

      Peter - 2007-01-24

      Gerald,

      I would be interested to know what you think about the difference between DTD and XML Schema. Essentially they do the same thing, but from what I can see DTD will be deprecated in the next year or so, whereas the schema seems like it's here to stay. Either way, they're both something that can (and should) be derived automatically from well-formed XML. I know that's kinda backwards in that the schema is supposed to define the XML structure, not the XML data define the schema, but in the real world that's generally the way it goes. Model the XML and build the schema to suit.

      Have you given any thought to XSLT?

      XML as data is one thing, but as soon as you want to do anything with it in its native format, you've got to have a schema and stylesheets. Perhaps that's an area to explore. WYSIWYG XSLT. XML would translate into tables within tables and could then be manipulated from there.

      <this>
        <first>hello</first>
        <second>there</second>
        <third>...</third>
        <fourth>
          <fourth-child-one>good</fourth-child-one>
          <fourth-child-two>morning</fourth-child-two>
        </fourth>
        <fourth>
          <fourth-child-one>anyone for</fourth-child-one>
          <fourth-child-two>coffee?</fourth-child-two>
        </fourth>
      </this>

      This, for example, naturally lends itself to...

      <html>
      <style>
      table{border: 1px solid Silver;}
      td{border: 1px solid Black;}
      </style>
      <body>
      <table>
        <tr><td>hello</td></tr>
        <tr><td>there</td></tr>
        <tr><td>...</td></tr>
        <tr><td>
          <table>
          <tr><td>good</td></tr>
          <tr><td>morning</td></tr>
          </table>
        </td><td>
          <table>
          <tr><td>anyone for</td></tr>
          <tr><td>coffee?</td></tr>
          </table>
        </td></tr>
      </table>
      </body>
      </html>
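
      The "tables within tables" mapping can be sketched in a few lines of Python (an illustration only, not something the editor does). `element_to_table` is a hypothetical helper. As a simplification it gives every child its own row, so repeated elements are stacked rather than placed side by side, and it ignores attributes and mixed content:

```python
import xml.etree.ElementTree as ET
from html import escape

def element_to_table(elem):
    """Render one row per child; nest a table for children that have children."""
    rows = []
    for child in elem:
        if len(child):                      # child has element children: recurse
            cell = element_to_table(child)
        else:                               # leaf: show its (escaped) text
            cell = escape(child.text or "")
        rows.append(f"<tr><td>{cell}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"
```

      Calling `element_to_table(ET.fromstring(xml_text))` on a document like the one above returns the nested table markup as a single string.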

      Just a thought.............. I know it's easier said than done, a lot easier said than done.

      Cheers,
      Peter

       
    • Peter

      Peter - 2007-01-24

      ...dumb idea... just thinking out loud...

      I would be interested to hear what you think about DTD vs XML Schema though...

       
    • gnschmidt

      gnschmidt - 2007-01-24

      Not at all! I think this makes perfect sense.

      Personally I think DTDs will continue to be useful for many applications. I am not aware of a satisfactory alternative to catalogs, for example. They are fast and effective and cover most (though not all) of the things that schemas are used for.

      As for schemas, the main problem is that we have the two main schema languages: RELAX NG and XML Schema. Each can specify structures that the other can't (or with difficulty). RELAX NG has no standardised way of locating an attached schema, XML Schema does.

      As you know, the program supports DTD, RELAX NG and XML Schema. Validation-as-you-type for XML Schema is in the works; once that's done, adding RELAX NG shouldn't be too difficult.

      In my view, the main advantage of DTDs is that every validating parser knows how to handle them. XCE uses libxml for DTD/RELAX NG and Xerces-C for XML Schema (MSXML on Windows).

      As for XML to DTD conversions, David is right to say that James Clark's Trang already does this extremely well. My only problem is that it's a Java tool, and I am keen to keep the application in the C/C++ space for now.

      <Have you given any thought to XSLT?>
      How does the program's XSLT support work for you? It seems to produce correct output on my computer. The program uses libxslt.

      <WYSIWYG XSLT>
      I think this applies to all XML documents. Table views can help with many tasks. The reason I haven't implemented tree or table views in the past is that they don't seem to scale very well.

      Is what you have in mind an embedded browser window that displays custom HTML transformations (e.g. XSL to HTML tables or TEI to HTML)? That would be interesting.

      Thanks again for your suggestions!
      Gerald

       
      • David Håsäther

        <RELAX NG has no standardised way of locating an attached schema, XML Schema does.>

        Could XCE implement locating rules, the same as nxml-mode uses? It's basically just an XML file that tells you how to locate the corresponding schema file. So, for example, this rule would search for a RELAX NG compact schema in the same directory as the main file:

          <transformURI pathSuffix=".xml" replacePathSuffix=".rnc"/>

        So, for example, for "example.xml" the schema to use would be "example.rnc". You get the point.
        Can't find any official documentation (haven't tried much) but I found an example file (cached version): <http://72.14.203.104/search?q=cache%3Ainfohost.nmt.edu%2Ftcc%2Fhelp%2Fpubs%2Fnxml%2Fdefault-schemas.html>
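
        As a rough sketch of how such a rule could be applied (Python for illustration; `apply_transform_uri` is a hypothetical name), the following handles only the bare element quoted above. It ignores namespaces and the enclosing rules file, and does not check whether the resulting schema file exists, all of which a real implementation would presumably need:

```python
import xml.etree.ElementTree as ET

def apply_transform_uri(rule_xml, document_uri):
    """Return the schema URI the rule maps document_uri to, or None if no match."""
    rule = ET.fromstring(rule_xml)
    suffix = rule.get("pathSuffix", "")
    replacement = rule.get("replacePathSuffix", "")
    if rule.tag != "transformURI" or not document_uri.endswith(suffix):
        return None
    # Swap the matched suffix for the replacement suffix.
    return document_uri[: len(document_uri) - len(suffix)] + replacement
```

        With the rule above, "example.xml" maps to "example.rnc", while a file that does not end in ".xml" yields no match.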

        <Validation-as-you-type for XML Schema is in the works; once that's done, adding RELAX NG shouldn't be too difficult.>

        Sounds great, can't wait :-) Are you considering validation-as-you-type for RELAX NG compact syntax too? This would be a killer feature IMO. I get the impression that most people prefer the compact syntax over the XML syntax.

        <As for XML to DTD conversions, David is right to say that James Clark's Trang already does this extremely well. My only problem is that it's a Java tool, and I am keen to keep the application in the C/C++ space for now.>

        Would having a customizable command menu in XCE be an option, like the one SciTE has? So, for example, I could make my own command named "Infer DTD from XML" and define it something like the following (using the same syntax for variables that SciTE does):

          java -jar trang.jar $(FilePath) $(FilePath).dtd

        I realize these are not small suggestions, and I would not consider them high priority, but maybe eventually ;-)

        Gerald, if you want, I can start new threads for these suggestions, so that they can be discussed on their own, and for easier monitoring and so on.

         
        • gnschmidt

          gnschmidt - 2007-01-24

          Thanks David, yes: it would be a great help if you could log them under feature requests. A basic command interface like the one you describe shouldn't be too hard to write, and I agree it would be very useful.

          As it's the user's task to ensure the runtime environment and/or executable is installed on the system, this option could be used for any Java tools, Perl one-liners, xsltproc and lots of other useful applications.

          One possible option would be the ability to capture stdout output in a new document window. (Does that sound worthwhile to you? It's something I'd be keen to try myself.)
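
          A small Python sketch of the idea, for illustration only (the editor itself is C/C++, and `expand_command` and `run_and_capture` are hypothetical names). It assumes a single SciTE-style variable, `$(FilePath)`, and splits the template into arguments before substituting so that paths containing spaces stay in one piece:

```python
import shlex
import subprocess
import sys

def expand_command(template, file_path):
    """Split the template into arguments, then substitute $(FilePath) in each."""
    return [arg.replace("$(FilePath)", file_path) for arg in shlex.split(template)]

def run_and_capture(template, file_path):
    """Run the expanded command and return its stdout as text.

    Raises subprocess.CalledProcessError if the tool exits with an error,
    and FileNotFoundError if the executable is not installed.
    """
    argv = expand_command(template, file_path)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

          The returned text is what would go into the new document window. For example, `expand_command("java -jar trang.jar $(FilePath) $(FilePath).dtd", "/tmp/my doc.xml")` yields five arguments, the last two being the document path and the same path with `.dtd` appended (the location of `trang.jar` is an assumption).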

          PS Parsing rnc would be great, but I think I'd want libxml to do this for me. The reason I'm using Expat rather than libxml to parse the DTDs is that it's so easy to extract information from them (Expat fires all the relevant events). I haven't seen anything remotely as simple in XML Schema/RELAX NG processors, but perhaps I haven't looked hard enough.
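
          To illustrate the kind of events meant here, this sketch uses Python's standard `xml.parsers.expat` binding rather than the C API the editor uses. `collect_dtd_declarations` is a hypothetical helper, it only sees declarations in the internal DTD subset, and it records just the directly named children of each element (nested groups are not expanded), so it is a shallow model at best:

```python
import xml.parsers.expat

def collect_dtd_declarations(xml_text):
    """Return (elements, attributes) gathered from the internal DTD subset."""
    elements = {}    # element name -> names of directly declared children
    attributes = {}  # (element name, attribute name) -> (type, required)

    def on_element(name, model):
        # model is a (type, quantifier, name, children) tuple; children nest.
        elements[name] = [child[2] for child in model[3]]

    def on_attlist(elname, attname, type_, default, required):
        attributes[(elname, attname)] = (type_, bool(required))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element   # fired for each <!ELEMENT ...>
    parser.AttlistDeclHandler = on_attlist   # fired for each declared attribute
    parser.Parse(xml_text, True)
    return elements, attributes
```

          The application decides what to keep from each event, which is what makes it easy to build a custom data model for tag completion.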

           
          • David Håsäther

            <One possible option would be the ability to capture stdout output in a new document window. (Does that sound worthwhile to you? It's something I'd be keen to try myself.)>

            Yes, that is what I had in mind too. Sounds great :-)

            (I'll start new feature request threads later today; I don't have the time ATM.)

             
    • Roger Sperberg

      Roger Sperberg - 2007-01-24

      I want to respond to a couple points here (and they're very different and I guess might be re-established in Feature Requests):

      - RNC/RNG

      I think James Clark anticipates that tools will behave something like Stylus Studio in how you have larger-than-the-xslt-stylesheet objects called scenarios. In the scenario you specify what XML file you want to transform and which engine you want to use and what the output file should be named and where located. Scenarios attach to stylesheets, so all of them are available when you open that stylesheet. (I seem to recall Stylus Studio's asking permission to store the information as comments at the end of the file.)

      One scenario might specify, say, a test XML file with Saxon 6.5.5, another Saxon 8.7, and a third the full file you want to transform.

      I know that people may want to validate their document against more than one schema and against more than one schema type, perhaps at different points in a workflow. So there are good arguments for putting the association of XML document with RNG schema or W3C XML Schema or DTD outside the document.

      - XSLT 2.0

      Microsoft has said it will not be extending MSXSL to conform with XSLT 2.0. Sometimes I need 2.0's capabilities. Apart from Saxon (written in Java), the only XSLT 2.0 processor I've found is AltovaXML2007, which is free, permitted to be included in applications, and provides an XML parser, an XSLT 1.0 engine, a schema-aware XSLT 2.0 engine and an XQuery 1.0 engine (http://altova.com/altovaxml.html). (I don't have experience working with this.)

      So until there are more XSLT 2.0 engines, I think I would really want to be able to insert a user-customizable "tools" command that launches an outside process and optionally captures the output. (Similar to Tools in TextPad, I guess.)

      Roger

       
    • Peter

      Peter - 2007-01-26

      OK... in a perfect world... pie in the sky stuff... this is the way I think XSLT could be integrated...

      HTML -> CSS -> XSLT -> XML

      By now you're probably thinking "he's got it backwards, everyone knows you should start with your XML then your XSLT and from there move to your HTML like Stylus Studio and every other major editor out there has done for years," but perhaps this is where they've got it wrong. Perhaps they've made it overly complicated because they're approaching this in the wrong order.

      HTML is the shell in which XML is viewed. XML is the substance while HTML is the container with CSS providing the formatting.

      Imagine this. WYSIWYG HTML editor. You drag-drop something that looks like a table with a single row and three TD cells. Position it on the right, give it a title, etc. You style it with CSS, fancy border, whatever. Then you drag-drop a node from your XML tree structure (not necessarily your data) into the HTML table. The editor looks at the node and realizes it's got children and so automatically populates the table with an xsl:for-each or perhaps gives the user the choice of xsl:for-each or xsl:value-of. Maybe that node has four children so it populates four cells, etc. You can then rearrange them, etc, or drag in a text box and then link that to some xsl:value-of taking you back to your XML, etc.
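
      A rough Python sketch of what the editor might generate when a node is dropped into a table, just to make the idea concrete. `xslt_fragment_for` is a hypothetical name, the caller is assumed to supply the XPath of the dropped node, and the choice of one cell per distinct child element is only one possible reading of the above:

```python
import xml.etree.ElementTree as ET

def xslt_fragment_for(node, path):
    """Return an XSLT fragment for a node dropped into an HTML table.

    Leaf nodes become a single xsl:value-of cell; nodes with children become
    an xsl:for-each that emits one row with a cell per distinct child element.
    """
    child_tags = list(dict.fromkeys(child.tag for child in node))
    if not child_tags:
        return f'<td><xsl:value-of select="{path}"/></td>'
    cells = "".join(
        f'<td><xsl:value-of select="{tag}"/></td>' for tag in child_tags
    )
    return f'<xsl:for-each select="{path}"><tr>{cells}</tr></xsl:for-each>'
```

      The user could then rearrange or restyle the generated cells in the HTML view without touching the XSLT by hand.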

      In other words, a two-step process with the HTML first. The HTML then is the container in which the XML is displayed via XSLT.

      Like I said, pie in the sky stuff, but all the other editors out there have this back-to-front and, as a consequence, overly complicated. I just used this approach (manually), making up my HTML, formatting with CSS and then, once everything was refined and in place, coming through with the XSLT tags to link it all back to the XML.

      Food for thought...

      Cheers,
      Peter

       
    • Edward Terry

      Edward Terry - 2007-09-22

      In response to your update on the feature request (below):  That sounds great.  I just discovered XML Copy Editor and I like it a lot.  The other XML editors I've tried seem too complex, so I've been using GEdit, but of course it doesn't validate.  XML Copy Editor has a simple, clean interface similar to GEdit, makes it easy to work directly with the code, and has strong validation.  If it could generate a W3C XML Schema from an XML document and vice versa, it would be the best of all possible worlds.  :-)  I appreciate your hard work on this!

      ----------------------------------------------------------------------

      Comment By: gnschmidt (gnschmidt)
      Date: 2007-09-22 08:46

      Message:
      Logged In: YES
      user_id=1298822
      Originator: NO

      Thanks falthon, I agree. I'm currently in the process of moving the
      Windows build across to VS2005 which should allow me to ditch MSXML in
      favour of Xerces-C, which should allow us to support XML Schema much more
      fully than we've done before. (The MSXML automation parts of the code are
      stone-age automation from Borland BCC32  :-)  That said, adjusting the code
      base for MS's compiler will take a little while as I haven't got VS2005 at
      home.

      Anyway, sorry to go on about this at length: I'm hoping to address your
      and Matt's request in the not too distant future!

      Gerald

      ----------------------------------------------------------------------

      Comment By: Falthon (falthon)
      Date: 2007-09-22 00:56

      Message:
      Logged In: YES
      user_id=1740341
      Originator: NO

      It would be very useful to be able to generate a W3C XML schema from an
      XML document and generate an XML document from a schema.

       
    • gnschmidt

      gnschmidt - 2007-09-22

      Yes, I can see that both of these are important. Not wanting to make lame excuses, but I've still to find a parser that does for schemas what Expat (my preferred non-validating parser) does for DTDs as a matter of course: fire events as a schema is parsed so applications can create their own data models.

      Xerces-C is my best bet for this, so much depends on the conversion to Visual Studio 2005. I've wasted so many hours trying to get Xerces-C to work with g++ on Windows (the Linux port uses it without any trouble) that I can't summon up the motivation to try again. Everything compiles fine and then it crashes with the vaguest possible error traces...

      Thanks for your comments!
      Gerald

       
