PyXB: Python XML Schema Bindings / Discussion / Help: Partial serialization of elements (streaming)

Yury V. Zaytsev - 2016-02-11

Hi,

I'm trying to figure out whether I can stream elements to XML using PyXB. So far I have found only one serialization method toxml() which can be called on any element, not necessarily root element. This is exactly what I need, but how do I instruct PyXB to keep the element open? I couldn't find any such parameter to toxml().

To clarify a bit further, we are talking about XML documents of ~100G in size. I've been succesfully using CodeSynthesis XSD generated bindings to write them, and what they have there is a special streaming serializer, which can be told to keep the element open after serializing it. I can start with the root element, but keep it open, and then serialize ~5000 child elements one by one, and finally tell it to close all open elements.

Is there anything like this possible to achieve with PyXB?

Many thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peter A. Bigot - 2016-02-11

PyXB is primarily intended for validation. An incomplete element cannot be validated.

If you don't care as much about validation at the root level, you could generate the start tag of the root element to your stream manually, then build up the top-level elements, convert them to unicode with .toxml(), and send them to the stream, closing with the root element end tag when you're done.

(You could also validate during this process as long as the schema was deterministic and you managed the automaton state manually; some of the unit tests show this technique.)

Alternatively .toxml() is a wrapper around .toDOM() so you could generate a DOM instance and find a streaming DOM serializer.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Yury V. Zaytsev - 2016-02-11

Hi Peter,

Thanks for getting back to me!

If you don't care as much about validation at the root level, you could generate the start tag of the root element to your stream manually, then build up the top-level elements, convert them to unicode with .toxml(), and send them to the stream, closing with the root element end tag when you're done.

Yes, I guess this would work; indeed, I don't care too much about validation at the root level (it's just a bit annoying to do it by hand due to constants and such), but validation of the sub-elements would be nice to have (something I don't get from CodeSynthesis XSD). I was just wondering if there is a more elegant built-in approach in PyXB...

Also, in order to avoid writing root elements myself manually or with the help of something like lxml.objectify, I guess I could try PyXB + disable the validation, but is there any less horrible hack to keep the tags open other than doing .toprettyxml() and removing the last N lines?

Alternatively .toxml() is a wrapper around .toDOM() so you could generate a DOM instance and find a streaming DOM serializer.

I'm afraid this wouldn't solve my problem, because if I understood you correctly, I'd still have to hold the complete DOM (~ hundreds of gigabytes) in RAM, which is why I'm interested in streaming in the first place.

Many thanks once again for your feedback!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peter A. Bigot - 2016-02-11

Converting to DOM doesn't require the whole document. e.toDOM() will generate a fragment DOM for the element e. PyXB's e.toxml() is nothing more than e.toDOM().toxml().

There's no concept of "keeping tags open" in PyXB; the closest is that you can add acceptable content to any existing element at any time, but you can't serialize the element until it's complete enough to be valid.

You might try generating a root document with none of the children you'd be serializing, convert it to DOM, and see if the Python DOM API has some way to generate text representations of the start and end tags independently. That's outside of PyXB's scope.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Yury V. Zaytsev - 2016-02-11

Converting to DOM doesn't require the whole document. e.toDOM() will generate a fragment DOM for the element e. PyXB's e.toxml() is nothing more than e.toDOM().toxml().

Right, I appreciate that; I thought you meant to say that I could convert the whole document into DOM and then stream element by element to XML, instead of serializing the whole document into XML, but we are on the same page now.

You might try generating a root document with none of the children you'd be serializing, convert it to DOM, and see if the Python DOM API has some way to generate text representations of the start and end tags independently. That's outside of PyXB's scope.

I see, I will investigate this possibility. Otherwise, I can always go for a hack as described above.

On an unrelated note, is there an easy way to see the possible arguments to the constructors of types in ipython? foo.bar? shows the documentation from the XSD in the docstring, which is very handy, and foo.bar. can show me the properties, but I'm not sure of the order of arguments. Is the best practice then just to supply what you see in properties as keyword constructor arguments?

Also, it would be great to be able to see in the docstrings the XSD type of the property, its restrictions / default values, etc. so that one could have documentation for huge schemas right in the IDE. Am I missing something obvious?

Many thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peter A. Bigot - 2016-02-11

Objects can be constructed either with keyword arguments using the sub-element and attribute names, or as a sequence of content values that would be added in argument order and processed as if encountered in that order from a document. You can't mix the two approaches. You can also add content after the initial element has been constructed.

PyXB wasn't intended for IDE integration or to help learn the schema structure, so much of what you'd like for integrated help isn't available. There's some attempt to copy schema documentation into doc strings, but PyXB wouldn't synthesize anything about its own API into them. You really need to be familiar with the schema to use PyXB effectively.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Yury V. Zaytsev - 2016-02-12

Thanks for your clarifications, Peter! I'll continue exploring PyXB...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Partial serialization of elements (streaming)

Forums

Help

Partial serialization of elements (streaming) document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Partial serialization of elements (streaming)