From: Yuri T. <qar...@gm...> - 2008-07-08 05:44:09
> cElementTree ~13% faster than NanoDOM, and uses memory in 4.5 times less.
> lxml is ~4% faster than cElementTree, but cElementTree wins in memory
> usage(two times less)
> ElementTree little bit faster then NanoDOM, and ElementTree also wins in
> memory usage(2.5 times less)

That's good!

> Concerning the html/xtml output, I discovered that this option supports
> only by new versions of ElementTree(1.3) and lxlm(2.0), so it won't be
> available for now on standard Python 2.5 ElementTree. Maybe we can do it
> optional.

Again, I wouldn't worry too much about this. If someone wants HTML
output, converting XHTML to HTML4 should be easy enough.

> There is one problem with lxml: misc/boldlinks test cause such error:
>
> File "etree.pyx", line 693, in etree._Element.text.__set__
> File "apihelpers.pxi", line 344, in etree._setNodeText
> File "apihelpers.pxi", line 648, in etree._utf8
> AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> I suppose that is because in this test we trying to assign to el.text
> data, that contains placeholders, and maybe by some reason lxlm treats
> placeholders values(u'\u0001' and u'\u0002') as not unicode or ascii.

We could rethink our choice of placeholders if we know that this is the
reason. But it sounds like ElementTree is the way to go.

A few minor things.

The current version in git fails on non-ASCII files (e.g.,
tests/misc/russian.txt). That's because we end up encoding the content
too early: line 1889 writes the etree to XML, UTF-8 encoded, after which
we try to run textPostProcessors on it. That's not good. This seems to
fix it:

    xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")

(I am assuming that standard etree doesn't have an option of serializing
to non-encoded unicode. If it does, use that instead.)

Note that in my experience there is only one way to use Unicode right
with Python: assume that all strings are unicode. So, for this reason,
I've been following the policy of decoding data when it comes into my
world and encoding it only when it goes out, without _ever_ passing
encoded strings around. Encoded strings are evil.

Another thing: lots of tests seem to fail now because of whitespace
differences. I am guessing that the way to solve it is to first extend
test-markdown.py to add an option of reflowing XHTML before diffing.
Then, once we know that all tests pass except for whitespace
differences, we can change the expected output.

- yuri

--
http://sputnik.freewisdom.org/
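
A minimal sketch of the decode-early, encode-late policy described above. It is written for current Python 3, not the Python 2.5 discussed in this thread, so it is an illustration and not the project's actual code. The function names `serialize_to_text`, `post_process`, and `encode_for_output` are hypothetical. The `encoding="unicode"` option is what Python 3's standard ElementTree offers for serializing to an unencoded string; whether the older library had an equivalent is the open question raised in the message.

```python
import xml.etree.ElementTree as etree


def serialize_to_text(root):
    """Serialize the tree to a text string (str), never to encoded bytes."""
    # In Python 3, encoding="unicode" makes tostring() return str.
    return etree.tostring(root, encoding="unicode")


def post_process(text):
    """Hypothetical stand-in for a text post-processor.

    It only ever sees decoded text, so non-ASCII content is safe here.
    """
    return text.strip() + "\n"


def encode_for_output(text):
    """Encode once, at the boundary where the data leaves the program."""
    return text.encode("utf-8")


# Build a tiny tree with non-ASCII content, as in tests/misc/russian.txt.
root = etree.Element("p")
root.text = "Привет, мир"

text = post_process(serialize_to_text(root))  # still decoded text
encoded = encode_for_output(text)             # bytes, only at the very end
```

The point of the layout is that every intermediate step takes and returns decoded text, and the single `encode` call sits at the edge of the program.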