
Get type info from an element?

Help
mhugo
2016-03-09
2016-03-17
  • mhugo

    mhugo - 2016-03-09

    Hi,

    I've just discovered PyXB and find it very interesting, thanks for the hard work.
    I am looking for a way to link the elements of an XML document to their type definitions in a schema.

    Since PyXB can parse XSD schemas and generate Python classes from them, I was thinking I could access the schema's internal representation, walk through it, and find the type information of a given element in an XML tree.

    Is there a way to do that with PyXB? From my understanding of the documentation and API, given a binding I can access its child "content" through orderedContent() (and _ElementMap?), but is there a way to access the ElementDeclaration of an arbitrary DOM node?

     
  • Peter A. Bigot

    Peter A. Bigot - 2016-03-09

    If you mean you have bindings for which everything could be resolved by PyXB then yes, start with the API, particularly the material in pyxb.binding. pyxb.binding.basis contains the classes that are the basis of instances, including simple and complex type definitions.

    orderedContent() produces lists that include members of type ElementContent(), which will tell you the corresponding ElementDeclaration. You can also use methods like _element() on most binding instances. Generally the element declaration will identify the type definition. All the binding classes have _ExpandedName and _XSDLocation that tell you where in which schema the information came from. That may be all you need; OTOH if you want the content model information, that'll have been transformed into a finite automaton with counters, so it'll be harder to figure out from the runtime structure.

    If by "arbitrary DOM node" you mean the special case where PyXB can't tell what the XML content is, so represents it in the generated binding by DOM nodes, then no. Best bet would be to look at the parent element that has a PyXB binding and figure out what's possible from some type information.

    If you really need a representation of the information set from the schema, you'd probably need to work with the parsed schema with the material in pyxb.xmlschema. Most of that content is not present in the bindings.

    In short, probably it's all there, and a lot of it's documented, but it wasn't intended for end-user use so you'll have to dig around a bit, both in the documentation and in the code, generated bindings, and test suite looking for examples.

     
    • mhugo

      mhugo - 2016-03-09

      If you mean you have bindings for which everything could be resolved by PyXB then yes, start with the API, particularly the material in pyxb.binding. pyxb.binding.basis contains the classes that are the basis of instances, including simple and complex type definitions.

      orderedContent() produces lists that include members of type ElementContent(), which will tell you the corresponding ElementDeclaration. You can also use methods like _element() on most binding instances. Generally the element declaration will identify the type definition. All the binding classes have _ExpandedName and _XSDLocation that tell you where in which schema the information came from. That may be all you need; OTOH if you want the content model information, that'll have been transformed into a finite automaton with counters, so it'll be harder to figure out from the runtime structure.

      Thanks for your answer.
      Yes, that's what I meant. But this only works one way: from a (Python) type I can access the element declaration. Now say I have an XML file <a>hello<b>12</b></a>, an XML node object representing the <b> element, and the corresponding schema. Is there a way to know that the value of <b> is of type (say) integer?

      From your answer it seems my best bet would be to find a way to instantiate something in pyxb.xmlschema from my schema and "walk through" it to find the type of <b>...

       

      Last edit: mhugo 2016-03-09
  • Peter A. Bigot

    Peter A. Bigot - 2016-03-09

    For that simple case:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="a">
            <xs:complexType mixed="true">
                <xs:sequence>
                    <xs:element name="b" type="xs:integer"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
    </xs:schema>
    

    build with pyxbgen -u x.xsd -m x and run:

    import x
    a = x.CreateFromDocument('<a>hello<b>12</b></a>')
    print(a.b)
    print(type(a.b))
    print(type(a.b)._ExpandedName)
    

    which produces:

    12
    <class 'pyxb.binding.datatypes.integer'>
    {http://www.w3.org/2001/XMLSchema}integer
    

    So you shouldn't need to instantiate the schema; you can get the type information from the instances themselves.

     
  • mhugo

    mhugo - 2016-03-10

    Thanks.
    Now with something like this:

    from pyxb.utils import domutils
    dom = domutils.StringToDOM('<a>hello<b>12</b></a>')
    elt = dom.childNodes[0].childNodes[1] # the 'b' element
    # how to get the type of 'elt' from the schema?
    
     
  • Peter A. Bigot

    Peter A. Bigot - 2016-03-10

    For DOM structures, you can't, at least not easily or completely. DOM is simply an expression of the XML document structure: it doesn't go through PyXB's parsing path so there's no information about its semantic structure. To do it reliably you would have to have loaded bindings for the schema for the document, and invoke CreateFromDOM on the DOM tree to get a bindings tree which carries the type information.

    All you can get from DOM is the name of the element. From that you could do what's below, but it's complicated because you have to dig for the local meaning of the child tagName (which may have an attached prefix you'd have to decode).

    import pyxb.utils.domutils
    import x
    
    dom = pyxb.utils.domutils.StringToDOM('<a>hello<b>12</b></a>')
    dn = dom.childNodes[0]
    
    # Retrieve the element associated with 'a' at the top level of
    # the namespace
    ea = x.Namespace.categoryMap('elementBinding')[dn.tagName]
    
    # Create an expanded name for the 'b' element (childNodes[1]; childNodes[0] is the text 'hello')
    ebn = x.Namespace.createExpandedName(dn.childNodes[1].tagName)
    
    # Find the element declaration associated with that name in the
    # context of the type of element 'a'
    edb = ea.typeDefinition()._UseForTag(ebn)
    
    # Get the element associated with the declaration, and show its type
    eb = edb.elementBinding()
    print(eb.typeDefinition())
    

    In more complex documents the name may be part of a substitution group, and in the end you'd have to end up doing most of what PyXB does either when the bindings are generated or when a document is parsed. This really isn't how PyXB's intended to be used, so you're on your own if you follow this path.
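
On the prefix point mentioned above: a namespace-aware DOM parser already splits a qualified tag name into its namespace URI and local name, so the prefix does not have to be decoded by hand. A minimal sketch using only the standard library's xml.dom.minidom (not PyXB's domutils); the urn:example namespace and the helper name expanded_name are made up for illustration:

```python
from xml.dom import minidom


def expanded_name(node):
    """Return (namespace URI, local name) for a DOM element node."""
    # minidom parses with namespace support, so the prefix carried in
    # tagName (e.g. 'p:b') is already resolved into these two attributes.
    return (node.namespaceURI, node.localName)


doc = minidom.parseString(
    '<p:a xmlns:p="urn:example">hello<p:b>12</p:b></p:a>')
a_node = doc.documentElement
b_node = a_node.childNodes[1]  # childNodes[0] is the text node 'hello'

print(b_node.tagName)          # p:b
print(expanded_name(b_node))   # ('urn:example', 'b')
```

The resulting pair is the kind of information an expanded name needs; how it is then mapped onto a binding's element declaration is the part that remains PyXB-specific.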

     
  • mhugo

    mhugo - 2016-03-10

    Thank you very much for this detailed explanation.

    In more complex documents the name may be part of a substitution group, and in the end you'd have to end up doing most of what PyXB does either when the bindings are generated or when a document is parsed. This really isn't how PyXB's intended to be used, so you're on your own if you follow this path.

    I see. I think I now have enough material to dig through the documentation and source for more, and to see whether following this path is worth it.

     
    • Peter A. Bigot

      Peter A. Bigot - 2016-03-10

      Re-reading one of your previous responses and now understanding more about what you want to do, I think your best options are either (a) generate bindings from the document and get type information from the bindings, or (b) use the pyxb.xmlschema material and dynamically interpret the DOM document by following links in it, as opposed to approach (c) above where you walk the abstracted schema that's embedded in the binding classes. (a) is the closest to how PyXB is intended to be used; (b) is appropriate as long as you don't care about validation and your schemas don't have a lot of mixed namespaces, substitution groups, abstract types, etc.; and (c) makes sense only if you need the bindings anyway but for some reason don't want to use them to process the document.

       
  • mhugo

    mhugo - 2016-03-17

    Yes, I was looking for solution (b). I know it's not the intended use, but PyXB is the only pure-Python module I found that can deal with complex schemas.

    Here is what I have so far, in case someone else is interested. It took me some time to understand that I had to call resolveDefinitions() on the namespace before walking through it. This is simplistic, but it may serve as a starting point for a type resolver, assuming the document is valid.

    from __future__ import print_function  # the print() calls below work on Python 2 and 3

    from pyxb.xmlschema.structures import (
        Schema, ElementDeclaration, ComplexTypeDefinition,
        Particle, ModelGroup, SimpleTypeDefinition)
    from pyxb.utils import domutils
    
    schema = Schema.CreateFromLocation(schema_location="x.xsd")
    ns = schema.targetNamespace()
    
    # must call resolve to have a walkable schema tree
    ns.resolveDefinitions()
    
    def print_tree(obj, lvl):
        # Print the component's class name, then recurse into what it links to
        print(" " * lvl, obj.__class__.__name__, end=" ")
        if isinstance(obj, ElementDeclaration):
            print("<" + obj.name() + ">", end=" ")
            if obj.typeDefinition():
                print("typeDefinition ->")
                print_tree(obj.typeDefinition(), lvl+2)
            else:
                print()
        elif isinstance(obj, ComplexTypeDefinition):
            contentType = obj.contentType()
            if contentType:
                print("contentType", contentType, "->")
                # contentType can be a tuple such as ('MIXED', <Particle>)
                if isinstance(contentType, tuple):
                    print_tree(contentType[1], lvl+2)
                else:
                    print_tree(contentType, lvl+2)
            else:
                print()  # end the line when there is nothing to recurse into
        elif isinstance(obj, SimpleTypeDefinition):
            print(obj.name())
        elif isinstance(obj, Particle):
            print(obj.minOccurs(), "-", obj.maxOccurs(), end=" ")
            if obj.term():
                print("term ->")
                print_tree(obj.term(), lvl+2)
            else:
                print()
        elif isinstance(obj, ModelGroup):
            for p in obj.particles():
                print("particle ->")
                print_tree(p, lvl+2)
    
    doc_str = '<a>hello<b>12</b></a>'
    dom = domutils.StringToDOM(doc_str)
    a_node = dom.childNodes[0]
    print_tree(ns.elementDeclarations()[a_node.tagName], 0)
    

    which produces:

     ElementDeclaration <a> typeDefinition ->
       ComplexTypeDefinition contentType ('MIXED', <pyxb.xmlschema.structures.Particle object at 0x7f1dc7909ad0>) ->
         Particle 1 - 1 term ->
           ModelGroup particle ->
             Particle 1 - 1 term ->
               ElementDeclaration <b> typeDefinition ->
                 SimpleTypeDefinition integer
    integer
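
For comparison, the idea of walking the schema in parallel with the document can be sketched without PyXB at all, using only the standard library. This is not the pyxb.xmlschema approach shown above. It reads the XSD as plain XML and only understands the shape used in this thread (a top-level element with an inline complexType containing a sequence). Named types, element references, substitution groups, choice/all groups and namespaces are all ignored. The helper names element_path and resolve_type are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="a">
        <xs:complexType mixed="true">
            <xs:sequence>
                <xs:element name="b" type="xs:integer"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>"""


def element_path(node):
    """Return the local names from the document root down to `node`."""
    names = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        names.append(node.localName)
        node = node.parentNode
    return list(reversed(names))


def resolve_type(schema_root, path):
    """Follow `path` through nested xs:element declarations.

    Returns the `type` attribute of the last declaration, or None when a
    step is not found or the declaration has an anonymous (inline) type.
    """
    scope = schema_root  # where to look for the next declaration
    decl = None
    for depth, name in enumerate(path):
        # Top-level declarations are direct children of xs:schema; nested
        # ones sit under xs:complexType/xs:sequence of their parent.
        if depth == 0:
            pattern = XS + "element"
        else:
            pattern = XS + "complexType/" + XS + "sequence/" + XS + "element"
        decl = next((e for e in scope.findall(pattern)
                     if e.get("name") == name), None)
        if decl is None:
            return None
        scope = decl
    return decl.get("type") if decl is not None else None


dom = minidom.parseString("<a>hello<b>12</b></a>")
b_node = dom.documentElement.childNodes[1]  # childNodes[0] is the text 'hello'
schema_root = ET.fromstring(SCHEMA)

print(element_path(b_node))                             # ['a', 'b']
print(resolve_type(schema_root, element_path(b_node)))  # xs:integer
```

The returned value is the raw QName string from the schema; resolving its prefix, and everything a real schema processor does beyond that, is exactly the work PyXB performs when it generates bindings.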
    
     
