I've just discovered PyXB and find it very interesting, thanks for the hard work.
I am looking for a way to link elements of an XML to their type definitions in a schema.
Since PyXB is able to parse XSD schemas and generate Python classes from them, I was thinking I could access the schema's internal representation, walk through it, and find the type information of a given element in an XML tree.
Is there a way to do that with PyXB? From my understanding of the documentation and API, it seems that, given a binding, I can access its child "content" through orderedContent() (and _ElementMap?), but is there a way to access the ElementDeclaration of an arbitrary DOM node?
If you mean you have bindings for which everything could be resolved by PyXB then yes, start with the API, particularly the material in pyxb.binding. pyxb.binding.basis contains the classes that are the basis of instances, including simple and complex type definitions.
orderedContent() produces lists that include members of type ElementContent, which will tell you the corresponding ElementDeclaration. You can also use methods like _element() on most binding instances. Generally the element declaration will identify the type definition. All the binding classes have _ExpandedName and _XSDLocation, which tell you where, and in which schema, the information came from. That may be all you need; on the other hand, if you want the content model information, that will have been transformed into a finite automaton with counters, so it'll be harder to figure out from the runtime structure.
If by "arbitrary DOM node" you mean the special case where PyXB can't tell what the XML content is, and so represents it in the generated binding by DOM nodes, then no. Your best bet would be to look at the parent element that has a PyXB binding and figure out what's possible from its type information.
If you really need a representation of the information set from the schema, you'd probably need to work with the parsed schema with the material in pyxb.xmlschema. Most of that content is not present in the bindings.
In short, probably it's all there, and a lot of it's documented, but it wasn't intended for end-user use so you'll have to dig around a bit, both in the documentation and in the code, generated bindings, and test suite looking for examples.
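To give a feel for why a content model expressed as a finite automaton with counters is harder to read than the schema it came from, here is a minimal, generic sketch in plain Python. It does not use PyXB and is not PyXB's implementation; the model format, `MODEL` and `accepts` are hypothetical names invented for the illustration, and it only handles a flat sequence of distinctly named elements.

```python
# Hypothetical content model: a sequence of (element name, minOccurs, maxOccurs).
# Here: between one and three <b>, then an optional <c>.
MODEL = [("b", 1, 3), ("c", 0, 1)]

def accepts(model, names):
    """Return True if the list of child element names satisfies the model."""
    state, count = 0, 0          # current position in the model, occurrences seen
    for name in names:
        while state < len(model):
            elt, lo, hi = model[state]
            if name == elt and count < hi:
                count += 1       # consume the name in the current state
                break
            if count < lo:
                return False     # required occurrences are missing
            state, count = state + 1, 0
        else:
            return False         # no remaining state can consume this name
    # Every state not yet left must already have met its minimum.
    while state < len(model):
        if count < model[state][1]:
            return False
        state, count = state + 1, 0
    return True

print(accepts(MODEL, ["b", "c"]))
print(accepts(MODEL, ["c"]))
```

The schema's sequence and occurrence constraints end up encoded as state transitions and counter checks, which is why they are hard to recover by inspecting the runtime structure.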
Thanks for your answer.
Yes, that's what I meant. But this only works one way: from a (Python) type I can access the element declaration. Now say I have an XML file <a>hello<b>12</b></a>, an XML node object representing the <b> element, and the corresponding schema. Is there a way to know that the value of <b> is of type (say) integer?
From your answer it seems my best bet would be to find a way to instantiate something in pyxb.xmlschema from my schema and "walk through" it to find the type of <b>...
Last edit: mhugo 2016-03-09
    from pyxb.utils import domutils

    dom = domutils.StringToDOM('<a>hello<b>12</b></a>')
    elt = dom.childNodes[0].childNodes[1]  # the 'b' element
    # how to get the type of 'elt' from the schema ?
For DOM structures, you can't, at least not easily or completely. DOM is simply an expression of the XML document structure: it doesn't go through PyXB's parsing path so there's no information about its semantic structure. To do it reliably you would have to have loaded bindings for the schema for the document, and invoke CreateFromDOM on the DOM tree to get a bindings tree which carries the type information.
All you can get from DOM is the name of the element. From that you could do what's below, but it's complicated because you have to dig for the local meaning of the child tagName (which may have an attached prefix you'd have to decode).
    import pyxb.utils.domutils
    import x

    dom = pyxb.utils.domutils.StringToDOM('<a>hello<b>12</b></a>')
    dn = dom.childNodes[0]
    # Retrieve the element associated with 'a' at the top level of
    # the namespace
    ea = x.Namespace.categoryMap('elementBinding')[dn.tagName]
    # Create a NSName for the first child element of the 'a' node
    # (childNodes[0] is the text node 'hello')
    ebn = x.Namespace.createExpandedName(dn.childNodes[1].tagName)
    # Find the element declaration associated with that name in the
    # context of the type of element 'a'
    edb = ea.typeDefinition()._UseForTag(ebn)
    # Get the element associated with the declaration, and show its type
    eb = edb.elementBinding()
    print(eb.typeDefinition())
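The prefix decoding mentioned above can be sketched with the standard library alone. This is a minimal illustration that assumes the DOM was built by a namespace-aware parser such as xml.dom.minidom; the helper name `expanded_name` and the namespace `urn:example:x` are made up for the example and are not part of PyXB.

```python
from xml.dom import minidom

def expanded_name(node):
    """Return (namespace URI, local name) for a DOM element node."""
    # A namespace-aware parser has already resolved the prefix against the
    # xmlns declarations in scope; fall back to splitting the tag name when
    # no local name was recorded.
    local = node.localName or node.tagName.split(":")[-1]
    return (node.namespaceURI, local)

doc = minidom.parseString(
    '<p:a xmlns:p="urn:example:x">hello<p:b>12</p:b></p:a>'
)
a_node = doc.documentElement
b_node = a_node.childNodes[1]   # childNodes[0] is the text node 'hello'

print(b_node.tagName)           # qualified name, prefix included
print(expanded_name(b_node))    # what a schema lookup actually needs
```

Note that nothing on the node says anything about its schema type: the name, in whichever form, is all the DOM carries.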
In more complex documents the name may be part of a substitution group, and in the end you'd have to end up doing most of what PyXB does either when the bindings are generated or when a document is parsed. This really isn't how PyXB's intended to be used, so you're on your own if you follow this path.
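As a rough idea of the extra bookkeeping substitution groups require, here is a standard-library sketch that collects, for each head element, the elements declared as substitutable for it. The schema is a made-up example, the helper name is hypothetical, and the code ignores namespaces and transitive membership, both of which a real resolver would have to handle.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, for illustration only.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shape" type="xs:string" abstract="true"/>
  <xs:element name="circle" type="xs:string" substitutionGroup="shape"/>
  <xs:element name="square" type="xs:string" substitutionGroup="shape"/>
</xs:schema>
"""

def substitution_groups(schema_root):
    """Map each head element name to the names declared substitutable for it."""
    groups = defaultdict(list)
    for elt in schema_root.findall(XS + "element"):   # global declarations only
        head = elt.get("substitutionGroup")
        if head is not None:
            # Drop any prefix; a real resolver must map it to a namespace.
            groups[head.split(":")[-1]].append(elt.get("name"))
    return dict(groups)

groups = substitution_groups(ET.fromstring(SCHEMA))
print(groups)
```

With such a map, a document element named circle found where the content model expects shape can be traced back to the right declaration.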
Thank you very much for this detailed explanation.
I see. I think I now have enough material to dig into the documentation and source for more, and to see whether following this path is worth it.
Re-reading one of your previous responses and now understanding more about what you want to do, I think your best options are either (a) generate bindings from the document and get type information from the bindings, or (b) use the pyxb.xmlschema material and dynamically interpret the DOM document by following links in it, as opposed to approach (c) above where you walk the abstracted schema that's embedded in the binding classes. (a) is the closest to how PyXB is intended to be used; (b) is appropriate as long as you don't care about validation and your schemas don't have a lot of mixed namespaces, substitution groups, abstract types, etc.; and (c) makes sense only if you need the bindings anyway but for some reason don't want to use them to process the document.
Yes, I was looking for solution (b). I know it's not the intended use, but PyXB is the only pure-Python module I found that is able to deal with complex schemas.
Here is what I have for now, in case someone else is interested. It took me some time to understand that I had to call resolveDefinitions() on the namespace before walking through it. This is simplistic, but it may be used as a starting point for a type resolver, assuming the document is valid.
    from __future__ import print_function  # so the same code runs on Python 2 and 3

    from pyxb.xmlschema.structures import (
        Schema, ElementDeclaration, ComplexTypeDefinition,
        Particle, ModelGroup, SimpleTypeDefinition)
    from pyxb.utils import domutils

    schema = Schema.CreateFromLocation(schema_location="x.xsd")
    ns = schema.targetNamespace()
    # must call resolve to have a walkable schema tree
    ns.resolveDefinitions()

    def print_tree(obj, lvl):
        """Recursively print the schema component tree rooted at obj."""
        print(" " * lvl, obj.__class__.__name__, end=" ")
        if isinstance(obj, ElementDeclaration):
            print("<" + obj.name() + ">", end=" ")
            if obj.typeDefinition():
                print("typeDefinition ->")
                print_tree(obj.typeDefinition(), lvl + 2)
            else:
                print()
        elif isinstance(obj, ComplexTypeDefinition):
            contentType = obj.contentType()
            if contentType:
                print("contentType", contentType, "->")
                # contentType may be a tuple whose second item is the particle
                if isinstance(contentType, tuple):
                    print_tree(contentType[1], lvl + 2)
                else:
                    print_tree(contentType, lvl + 2)
        elif isinstance(obj, SimpleTypeDefinition):
            print(obj.name())
        elif isinstance(obj, Particle):
            print(obj.minOccurs(), "-", obj.maxOccurs(), end=" ")
            if obj.term():
                print("term ->")
                print_tree(obj.term(), lvl + 2)
            else:
                print()
        elif isinstance(obj, ModelGroup):
            for p in obj.particles():
                print("particle ->")
                print_tree(p, lvl + 2)

    doc_str = '<a>hello<b>12</b></a>'
    dom = domutils.StringToDOM(doc_str)
    a_node = dom.childNodes[0]
    print_tree(ns.elementDeclarations()[a_node.tagName], 0)
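The walker above prints the schema tree; the step from there to an actual type lookup for a DOM node can be sketched as follows. To keep the example self-contained it uses only the standard library rather than pyxb.xmlschema, and it assumes a guess at what x.xsd might contain, since the thread never shows the schema. The names `X_XSD`, `path_to`, `child_decls` and `declared_type` are hypothetical, and the code ignores namespaces, named types, element references and substitution groups.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed content of x.xsd (not shown in the thread).
X_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="b" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def path_to(node):
    """Element names from the document root down to node."""
    names = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        names.insert(0, node.tagName)
        node = node.parentNode
    return names

def child_decls(node):
    """Yield element declarations directly in scope, skipping nested ones."""
    for child in node:
        if child.tag == XS + "element":
            yield child
        else:
            yield from child_decls(child)   # descend through sequence/choice/all

def declared_type(schema_root, names):
    """Follow names through local declarations; return the 'type' attribute."""
    decl = None
    scope = schema_root.findall(XS + "element")       # global declarations
    for name in names:
        decl = next((e for e in scope if e.get("name") == name), None)
        if decl is None:
            return None
        ctype = decl.find(XS + "complexType")          # anonymous type, if any
        scope = list(child_decls(ctype)) if ctype is not None else []
    return decl.get("type") if decl is not None else None

dom = minidom.parseString("<a>hello<b>12</b></a>")
b_node = dom.documentElement.childNodes[1]
print(declared_type(ET.fromstring(X_XSD), path_to(b_node)))
```

Like the walker, this assumes the document is valid; it resolves a declaration by position in the tree and does no validation of its own.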
For that simple case:
build with pyxbgen -u x.xsd -m x and run:
which produces:
So you shouldn't need to instantiate the schema; you can get the type information from the instances themselves.
Thanks.
Now with something like this: