Re: [Python-markdown-discuss] GSoC ElementTree support

> cElementTree is ~13% faster than NanoDOM and uses 4.5 times less memory.
> lxml is ~4% faster than cElementTree, but cElementTree wins in memory
> usage (two times less).
> ElementTree is a little bit faster than NanoDOM, and ElementTree also wins in
> memory usage (2.5 times less).

That's good!

> Concerning the html/xhtml output, I discovered that this option is supported
> only by new versions of ElementTree (1.3) and lxml (2.0), so it won't be
> available for now on the standard Python 2.5 ElementTree. Maybe we can make it
> optional.

Again, I wouldn't worry too much about this.  If someone wants HTML
output, converting XHTML to HTML4 should be easy enough.

> There is one problem with lxml: the misc/boldlinks test causes this error:
>
>  File "etree.pyx", line 693, in etree._Element.text.__set__
>  File "apihelpers.pxi", line 344, in etree._setNodeText
>  File "apihelpers.pxi", line 648, in etree._utf8
> AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> I suppose that is because in this test we are trying to assign to el.text
> data that contains placeholders, and maybe for some reason lxml treats
> the placeholder values (u'\u0001' and u'\u0002') as neither Unicode nor ASCII.

We could re-think our choice of placeholders if we know that this is
the reason.  But it sounds like ElementTree is the way to go.
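
One way to check whether the placeholders are the culprit: XML 1.0 only
allows tab, newline and carriage return among the control characters, so
u'\u0001' and u'\u0002' are not legal XML characters, and that may be what
the lxml assertion is complaining about.  A minimal sketch, assuming
Python 3; is_xml_compatible is a hypothetical helper, not part of lxml or
markdown:

```python
import re

# Characters outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_compatible(text):
    """Return True if every character in text is a legal XML 1.0 character."""
    return _XML_ILLEGAL.search(text) is None


# Candidate placeholder values can be checked before committing to them.
for candidate in ("\u0001", "\u0002", "plain text"):
    print(repr(candidate), is_xml_compatible(candidate))
```

If the check confirms it, any replacement placeholders would have to come
from the legal ranges above.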

A few minor things.  The current version in git fails on non-ASCII
files (e.g., tests/misc/russian.txt).  That's because we end up
encoding the content too early: line 1889 writes etree to xml, utf8
encoded, after which we try to run textPostProcessors on it.  That's
not good.  This seems to fix it:

        xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")

(I am assuming that standard etree doesn't have an option of
serializing to non-encoded unicode.  If it does, use that instead.)
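
For comparison, a minimal sketch of both routes, assuming Python 3 and the
standard xml.etree.ElementTree module, where tostring() accepts
encoding="unicode" and returns text directly (this is not available in the
ElementTree that ships with Python 2.5).  The name to_text is hypothetical:

```python
import xml.etree.ElementTree as ET


def to_text(root):
    """Serialize an element to a text string, never to encoded bytes."""
    # Route 1: serialize to UTF-8 bytes, then decode right away.
    data = ET.tostring(root, encoding="utf-8")
    return data.decode("utf-8")


root = ET.Element("p")
root.text = "Привет"

decoded = to_text(root)
# Route 2: ask ElementTree for text directly (newer versions only).
direct = ET.tostring(root, encoding="unicode")
print(decoded == direct)
```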

Note that in my experience there is only one way to use Unicode right
with Python: assume that all strings are unicode.  So, for this
reason, I've been following the policy of decoding data when it comes
into my world and encoding it only when it goes out, without _ever_
passing encoded strings around.  Encoded strings are evil.
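
The policy above can be sketched roughly like this, assuming Python 3 and
UTF-8 input; process and convert are hypothetical names used only for
illustration:

```python
def process(text):
    """Stand-in for the real work; it only ever sees decoded text."""
    return text.upper()


def convert(raw_bytes, encoding="utf-8"):
    """Decode at the entry point, work on text, encode at the exit point."""
    text = raw_bytes.decode(encoding)   # bytes come in, get decoded once
    result = process(text)              # everything in between is text
    return result.encode(encoding)      # encoded only on the way out


incoming = "привет".encode("utf-8")
outgoing = convert(incoming)
print(outgoing.decode("utf-8"))
```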

Another thing: lots of tests seem to fail now because of whitespace
differences.  I am guessing that the way to solve it is to first
extend test-markdown.py to add an option of reflowing XHTML before
diffing.  Then, once we know that all tests pass except for whitespace
differences, we can change the expected output.
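
A rough sketch of what such a reflow option could do, assuming Python 3.
The reflow and diff_reflowed names are hypothetical and not part of
test-markdown.py.  Note that this version is deliberately crude: it also
collapses whitespace inside <pre> blocks and between inline tags, where
whitespace can be significant, so it is only good for spotting tests that
differ by more than whitespace:

```python
import difflib
import re


def reflow(xhtml):
    """Normalize whitespace and return one line per run between tag boundaries."""
    text = re.sub(r"\s+", " ", xhtml).strip()   # collapse all whitespace runs
    text = text.replace("> <", "><")            # drop gaps between adjacent tags
    return text.replace("><", ">\n<").split("\n")


def diff_reflowed(expected, actual):
    """Return unified diff lines computed on the reflowed markup."""
    return list(
        difflib.unified_diff(reflow(expected), reflow(actual), lineterm="")
    )


expected = "<p>a\n  b</p>\n\n<p>c</p>"
actual = "<p>a b</p><p>c</p>"
print(diff_reflowed(expected, actual))
```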

  - yuri

-- 
http://sputnik.freewisdom.org/