From: Guenter M. <mi...@us...> - 2021-05-05 09:28:26
On 2021-05-04, Mariusz Wasiluk wrote:

> Thank you for the answer. Yes, I saw the release notes however I don't
> understand how this behavior is in line with what is stated in
> https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#escaping-mechanism
> "The backslash is removed from the output" suggests that there will be no
> character in the output, even NULL.

I agree that the documentation needs clarification. Escape characters are
removed from output documents by the Docutils "writers". Your example uses a
representation of the internal document tree (doctree).

> Currently, my example produces invalid XML (I get parsing error from lxml).

This should not happen and is probably a bug in the asdom() method for Text
nodes.

> So I wonder if my use case is invalid?

Currently, the XML produced by .asdom().toxml() is not tested. While not
invalid, you may consider it experimental/unsupported. Contributions
improving test coverage are welcome.

> Shall I remove \x00 from the XML output before further processing?

Depending on your needs, you could either restore or remove escaping not
"used up" by Docutils. Docutils provides the nodes.unescape() function for
this purpose (which allows either restoring or removal and also caters for
the special meaning of escaped whitespace).

When you don't want the escapes, you may also consider using the "xml"
writer instead of publish_doctree().

Fixing ...asdom() would also need to consider whether escapes should be
restored or removed.
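The restore-or-remove choice described above can be sketched with the standard library alone. This is only an illustration: strip_null_escapes() is a hypothetical stand-in written for this example, not the Docutils implementation (in real code, nodes.unescape() is the function to reach for), and the XML fragment is shortened to the paragraph element from the report.

```python
from xml.dom import minidom

# Docutils >= 0.16 keeps backslash escapes in the doctree as NULL characters.
NULL = "\x00"


def strip_null_escapes(text, restore_backslashes=False):
    """Hypothetical helper for illustration, not part of Docutils.

    Either turn NULL escape markers back into backslashes or drop them.
    """
    if restore_backslashes:
        return text.replace(NULL, "\\")
    # An escaped space or newline disappears together with its escape marker.
    for sep in (NULL + " ", NULL + "\n", NULL):
        text = text.replace(sep, "")
    return text


# Shortened form of the string reported in this thread:
# "Foo", a NULL escape marker, one literal backslash, "bar".
raw_xml = "<paragraph>Foo\x00\\bar</paragraph>"

removed = strip_null_escapes(raw_xml)
restored = strip_null_escapes(raw_xml, restore_backslashes=True)

print(repr(removed))   # '<paragraph>Foo\\bar</paragraph>'
print(repr(restored))  # '<paragraph>Foo\\\\bar</paragraph>'

# Without the NULL character the fragment parses as ordinary XML.
dom = minidom.parseString(removed)
print(dom.documentElement.firstChild.data)  # Foo\bar
```

Removing the markers reproduces the pre-0.16 text, while restoring them gives back the backslashes as typed in the reStructuredText source. Which one is appropriate depends on what the downstream processing expects.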
Günter

> On Tue, 4 May 2021 at 19:53, Guenter Milde via Docutils-users
> <doc...@li...> wrote:
>
>> On 2021-05-04, Mariusz Wasiluk wrote:
>>
>> > Hello,
>> > I have the following snippet:
>> >
>> > from docutils.core import publish_doctree
>> > dom = publish_doctree(r'Foo\\bar').asdom()
>> > print(repr(dom.toxml()))
>> >
>> > with docutils>=0.16, I get:
>> >
>> > u'<?xml version="1.0" ?><document
>> > source="<string>"><paragraph>Foo\x00\\bar</paragraph></document>'
>> >
>> > with previous versions I get:
>> >
>> > u'<?xml version="1.0" ?><document
>> > source="<string>"><paragraph>Foo\\bar</paragraph></document>'
>> >
>> > Why am I getting \x00 in the output with the newest docutils versions?
>>
>> This is an intended change:
>>
>> Until 0.16, backslashes were removed prior to storing a Text string in the
>> document tree. Since 0.16 they are stored as NULL.
>>
>> See the HISTORY.txt entry for 0.16:
>>
>>   - Keep `backslash escapes`__ in the document tree. Backslash characters
>>     in text are represented by NULL characters in the ``text`` attribute
>>     of Doctree nodes and removed in the writing stage by the node's
>>     ``astext()`` method.
>>
>>   __ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#escaping-mechanism
>>
>> This change was implemented in order to allow escaping "active characters"
>> also in transforms. The RELEASE_NOTES list one example:
>>
>>   [...] This allows, e.g., escaping of author-separators in
>>   `bibliographic fields`__.
>>
>>   __ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#escaping-mechanism
>>   __ docs/ref/rst/restructuredtext.html#bibliographic-fields
>>
>> Another usage is escaping of characters that would otherwise be
>> transformed by the smartquotes__ transform.
>>
>> __ https://docutils.sourceforge.io/docs/user/config.html#smart-quotes
>>
>> Günter