From: Artem Y. <ne...@gm...> - 2008-07-07 22:25:30
I added ElementTree support. The results: cElementTree is ~13% faster than NanoDOM and uses 4.5 times less memory. lxml is ~4% faster than cElementTree, but cElementTree wins on memory usage (two times less). ElementTree is a little bit faster than NanoDOM, and ElementTree also wins on memory usage (2.5 times less).

Concerning the html/xhtml output, I discovered that this option is supported only by newer versions of ElementTree (1.3) and lxml (2.0), so it won't be available for now on the standard Python 2.5 ElementTree. Maybe we can make it optional.

There is one problem with lxml: the misc/boldlinks test causes this error:

  File "etree.pyx", line 693, in etree._Element.text.__set__
  File "apihelpers.pxi", line 344, in etree._setNodeText
  File "apihelpers.pxi", line 648, in etree._utf8
  AssertionError: All strings must be XML compatible, either Unicode or ASCII

I suppose that is because in this test we try to assign data that contains placeholders to el.text, and for some reason lxml treats the placeholder values (u'\u0001' and u'\u0002') as neither Unicode nor ASCII.

New markdown with ElementTree:

construction:0.000000:0.000000
amps-and-angle-encoding:0.070000:0.000000
auto-links:0.080000:0.000000
backlash-escapes:0.200000:0.000000
blockquotes-with-dode-blocks:0.030000:0.000000
hard-wrapped:0.010000:0.000000
horizontal-rules:0.160000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.140000:0.000000
links-inline:0.050000:0.000000
links-reference:0.070000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.440000:0.000000
markdown-syntax:1.980000:1908736.000000
nested-blockquotes:0.030000:0.000000
ordered-and-unordered-list:0.310000:0.000000
strong-and-em-together:0.040000:0.000000
tabs:0.040000:0.000000
tidyness:0.040000:0.000000

New markdown with cElementTree:

construction:0.000000:0.000000
amps-and-angle-encoding:0.050000:135168.000000
auto-links:0.070000:0.000000
backlash-escapes:0.190000:0.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.020000:0.000000
horizontal-rules:0.140000:0.000000
inline-html-advanced:0.020000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.120000:0.000000
links-inline:0.050000:0.000000
links-reference:0.060000:0.000000
literal-quotes:0.020000:0.000000
markdown-documentation-basics:0.410000:274432.000000
markdown-syntax:1.810000:1138688.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.260000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000

New markdown with lxml:

construction:0.000000:0.000000
amps-and-angle-encoding:0.060000:0.000000
auto-links:0.070000:147456.000000
backlash-escapes:0.170000:135168.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.010000:0.000000
horizontal-rules:0.140000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.120000:0.000000
links-inline:0.060000:0.000000
links-reference:0.080000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.370000:450560.000000
markdown-syntax:1.750000:2011136.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.250000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000

New markdown with NanoDOM:

construction:0.000000:0.000000
amps-and-angle-encoding:0.060000:0.000000
auto-links:0.070000:0.000000
backlash-escapes:0.220000:135168.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.020000:0.000000
horizontal-rules:0.150000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.140000:0.000000
links-inline:0.050000:0.000000
links-reference:0.080000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.450000:868352.000000
markdown-syntax:2.080000:5160960.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.290000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000
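[For what it's worth, the lxml failure can be reproduced outside markdown; a hypothetical minimal repro (not the actual test), assuming the cause is that u'\u0001' and u'\u0002' are control characters that XML 1.0 does not allow:

    # Hypothetical minimal reproduction: lxml validates assigned text,
    # and u'\u0001'/u'\u0002' are not legal XML 1.0 characters, so the
    # assignment is rejected before serialization ever happens.
    from lxml import etree

    el = etree.Element("p")
    try:
        el.text = u"placeholder: \u00010\u0002"
    except (AssertionError, ValueError), err:
        # lxml 2.0 raised AssertionError here; later versions use ValueError
        print "rejected:", err

Plain ElementTree, by contrast, accepts the string at assignment time, which would explain why only the lxml backend trips over the placeholders.]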
From: Yuri T. <qar...@gm...> - 2008-07-08 05:44:09
> cElementTree is ~13% faster than NanoDOM and uses 4.5 times less memory.
> lxml is ~4% faster than cElementTree, but cElementTree wins on memory
> usage (two times less). ElementTree is a little bit faster than NanoDOM,
> and ElementTree also wins on memory usage (2.5 times less).

That's good!

> Concerning the html/xhtml output, I discovered that this option is
> supported only by newer versions of ElementTree (1.3) and lxml (2.0), so
> it won't be available for now on the standard Python 2.5 ElementTree.
> Maybe we can make it optional.

Again, I wouldn't worry too much about this. If someone wants HTML output, converting XHTML to HTML4 should be easy enough.

> There is one problem with lxml: the misc/boldlinks test causes this error:
>
>   File "etree.pyx", line 693, in etree._Element.text.__set__
>   File "apihelpers.pxi", line 344, in etree._setNodeText
>   File "apihelpers.pxi", line 648, in etree._utf8
>   AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> I suppose that is because in this test we try to assign data that
> contains placeholders to el.text, and for some reason lxml treats the
> placeholder values (u'\u0001' and u'\u0002') as neither Unicode nor ASCII.

We could re-think our choice of placeholders if we know that this is the reason. But it sounds like ElementTree is the way to go.

A few minor things. The current version in git fails on non-ASCII files (e.g., tests/misc/russian.txt). That's because we end up encoding the content too early: line 1889 writes the etree to xml, utf8-encoded, after which we try to run textPostProcessors on it. That's not good. This seems to fix it:

    xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")

(I am assuming that standard etree doesn't have an option of serializing to non-encoded unicode. If it does, use that instead.)

Note that in my experience there is only one way to get Unicode right with Python: assume that all strings are unicode. So, for this reason, I've been following the policy of decoding data when it comes into my world and encoding it only when it goes out, without _ever_ passing encoded strings around. Encoded strings are evil.

Another thing: lots of tests seem to fail now because of whitespace differences. I am guessing that the way to solve it is to first extend test-markdown.py to add an option of reflowing XHTML before diffing. Then, once we know that all tests pass except for whitespace differences, we can change the expected output.

- yuri

--
http://sputnik.freewisdom.org/
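[For concreteness, a sketch of how that decode-early, encode-late policy fits around the serialization step; the function shape and the textPostProcessors interface are simplified assumptions, not the actual markdown.py code:

    # Sketch under assumptions: 'root' is the document tree and each
    # post-processor exposes run(text); both names are illustrative.
    import codecs
    import xml.etree.cElementTree as etree

    def serialize(root, text_postprocessors):
        # tostring() returns utf8-encoded bytes; decode immediately so
        # only unicode strings ever reach the post-processors
        xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")
        for pp in text_postprocessors:
            xml = pp.run(xml)
        return xml  # still unicode; encode once, at the output boundary
]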
From: Artem Y. <ne...@gm...> - 2008-07-09 00:59:12
Yuri Takhteyev wrote:
> We could re-think our choice of placeholders if we know that this is
> the reason. But it sounds like ElementTree is the way to go.

Yes, I agree.

> A few minor things. The current version in git fails on non-ASCII
> files (e.g., tests/misc/russian.txt). That's because we end up
> encoding the content too early: line 1889 writes the etree to xml,
> utf8-encoded, after which we try to run textPostProcessors on it.
> That's not good. This seems to fix it:
>
>     xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")

Strange, I didn't notice it on my version. Thanks for the fix.

> (I am assuming that standard etree doesn't have an option of
> serializing to non-encoded unicode. If it does, use that instead.)
>
> Note that in my experience there is only one way to get Unicode right
> with Python: assume that all strings are unicode. So, for this
> reason, I've been following the policy of decoding data when it comes
> into my world and encoding it only when it goes out, without _ever_
> passing encoded strings around. Encoded strings are evil.

Thanks for the advice.

> Another thing: lots of tests seem to fail now because of whitespace
> differences. I am guessing that the way to solve it is to first
> extend test-markdown.py to add an option of reflowing XHTML before
> diffing. Then, once we know that all tests pass except for whitespace
> differences, we can change the expected output.

Maybe we should worry about whitespace right away, since we'll need to fix the failing tests anyway.
From: Blake W. <bw...@la...> - 2008-07-08 11:40:24
Yuri Takhteyev wrote:
>> Concerning the html/xhtml output, I discovered that this option is
>> supported only by newer versions of ElementTree (1.3) and lxml (2.0),
>> so it won't be available for now on the standard Python 2.5
>> ElementTree. Maybe we can make it optional.
>
> Again, I wouldn't worry too much about this. If someone wants HTML
> output, converting XHTML to HTML4 should be easy enough.

Is this still true if you have inline not-necessarily-legal-XML blocks? (i.e. will it still be easy to convert:

    **Foo**
    <br>
    blah blah blah
    'bar'

?)

>> There is one problem with lxml: the misc/boldlinks test causes this error:
>>
>>   File "etree.pyx", line 693, in etree._Element.text.__set__
>>   File "apihelpers.pxi", line 344, in etree._setNodeText
>>   File "apihelpers.pxi", line 648, in etree._utf8
>>   AssertionError: All strings must be XML compatible, either Unicode or ASCII
>>
>> I suppose that is because in this test we try to assign data that
>> contains placeholders to el.text, and for some reason lxml treats the
>> placeholder values (u'\u0001' and u'\u0002') as neither Unicode nor ASCII.
>
> We could re-think our choice of placeholders if we know that this is
> the reason. But it sounds like ElementTree is the way to go.

What if we went with the BOM character (0xFEFF) as the replacement? It's legal unicode, and _extremely_ unlikely to occur in the middle of text. The only thing to watch out for is having it occur at the start of the file.

Later,
Blake.
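[A rough sketch of what the BOM scheme could look like; the stash structure and both function names are made up for illustration:

    # Illustrative only: hide raw snippets behind BOM-delimited markers.
    # The one caveat noted above: strip a file-level BOM before stashing.
    BOM = u'\ufeff'

    def stash(snippets, text):
        if text.startswith(BOM):
            text = text[1:]  # a leading BOM is byte-order metadata, drop it
        for i, snippet in enumerate(snippets):
            text = text.replace(snippet, u"%s%d%s" % (BOM, i, BOM))
        return text

    def unstash(snippets, text):
        # swap each BOM<i>BOM marker back for the stashed snippet
        for i, snippet in enumerate(snippets):
            text = text.replace(u"%s%d%s" % (BOM, i, BOM), snippet)
        return text
]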
From: Waylan L. <wa...@gm...> - 2008-07-08 13:52:13
On Tue, Jul 8, 2008 at 1:44 AM, Yuri Takhteyev <qar...@gm...> wrote:
> But it sounds like ElementTree is the way to go.

I agree. It doesn't appear that lxml adds any real value. Add in the trouble of installing it, and I doubt many would ever use it. I'd say leave it out for now. If things improve in the future, it won't be that hard to add it back in.

> Another thing: lots of tests seem to fail now because of whitespace
> differences. I am guessing that the way to solve it is to first
> extend test-markdown.py to add an option of reflowing XHTML before
> diffing. Then, once we know that all tests pass except for whitespace
> differences, we can change the expected output.

Interestingly, I had started work on this at some point, but never got very far. My intended approach was to feed the output and the expected output both into an x/html parser, normalize whitespace, and then diff the output of each. Thing is, I couldn't find a python tool that actually did that. Well, there always is BeautifulSoup, but that could very easily alter some of the html and hide bugs - defeating the purpose of testing. Considering that whitespace is insignificant in x/html and the number of x/html tools available in python, you'd think whitespace normalization would be a standard feature. Ah well.

I thought about doing a simple whitespace normalization on the string using string.replace or re.sub. But then we'd lose all linebreaks, so that the entire doc is on one line. That's kind of hard to diff. Looping through a dom and normalizing each string was more than I wanted to do.

I then found lxml's htmldiff tool [1], which provided an easy (better??) way to compare html docs, but it still hung up on some (not all) whitespace. Additionally, it didn't exactly provide an easily readable output to display in the test output. If you're interested, I can forward the code I have - that is, if I can find it.

What I'd consider doing is actually taking the most recent markdown with NanoDom, altering NanoDom's whitespace to match ET, and running a little script that loops through all the tests and outputs new expected html files. It shouldn't be all that hard.

[1]: http://codespeak.net/lxml/lxmlhtml.html#html-diff

--
----
Waylan Limberg
wa...@gm...
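[For what it's worth, the simple re.sub approach can keep the diff readable if the collapse step re-breaks lines between adjacent tags; a sketch, not tested against the actual suite:

    import re

    def normalize(html):
        # collapse every run of whitespace to a single space...
        html = re.sub(r'\s+', ' ', html.strip())
        # ...then restore a linebreak between adjacent tags so the
        # whole document doesn't end up on one line
        return re.sub(r'>\s+<', '>\n<', html)

Since both the actual and expected output go through the same normalization before diffing, the inter-tag whitespace this discards doesn't affect the comparison.]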
From: Yuri T. <qar...@gm...> - 2008-07-08 16:40:23
> Is this still true if you have inline not-necessarily-legal-XML blocks?
> (i.e. will it still be easy to convert:
>
>     **Foo**
>     <br>
>     blah blah blah
>     'bar'
>
> ?)

I meant a simple RE-based substitution. Correct me if I am wrong, but converting XHTML into HTML largely involves changing <$x/> and <$x></$x> to <$x> for certain values of $x.

> What if we went with the BOM character (0xFEFF) as the replacement?
> It's legal unicode, and _extremely_ unlikely to occur in the middle of
> text. The only thing to watch out for is having it occur at the start
> of the file.

First, my original intention was to use not \u0001 and \u0002 but rather \u0002 and \u0003 - "start of text" (STX) and "end of text" (ETX). The nice thing about them is that they come as a pair - start and end. Also, if we use BOM we'll have to worry about HTML, etc. occurring at the beginning of the text. But this is an option to keep in mind. Alternatively, we can look into the private-use ranges, though then we have to make sure that our use does not conflict with possible private uses by the caller.

- yuri

--
http://sputnik.freewisdom.org/
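[That substitution might look roughly like this, ignoring attributes for brevity; the tag list is an assumption and far from exhaustive:

    import re

    VOID_TAGS = 'br|hr'  # assumed subset; img, input, etc. need attribute handling

    def xhtml_to_html(text):
        # <br /> or <br/>  ->  <br>
        text = re.sub(r'<(%s)\s*/>' % VOID_TAGS, r'<\1>', text)
        # <br></br>  ->  <br>
        text = re.sub(r'<(%s)></\1>' % VOID_TAGS, r'<\1>', text)
        return text
]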
From: Waylan L. <wa...@gm...> - 2008-07-08 17:52:55
On Tue, Jul 8, 2008 at 12:40 PM, Yuri Takhteyev <qar...@gm...> wrote:
>> Is this still true if you have inline not-necessarily-legal-XML blocks?
>> (i.e. will it still be easy to convert:
>>
>>     **Foo**
>>     <br>
>>     blah blah blah
>>     'bar'
>>
>> ?)
>
> I meant a simple RE-based substitution. Correct me if I am wrong, but
> converting XHTML into HTML largely involves changing <$x/> and
> <$x></$x> to <$x> for certain values of $x.

Yeah, that *should* cover the basics. Of course, anyone could always pass Markdown's output into uTidylib [1] or ElementTree Tidy [2] if they want a solid conversion. Unfortunately, it will likely slow things down too much to offer that option in Markdown directly. However, it may not be a bad idea to have an extension for those who want it.

Hmm, now to get back on-subject - I wonder if either of those tools will do whitespace normalization only, without making any other changes to the output. It's worth exploring for the tests.

[1]: http://utidylib.berlios.de/
[2]: http://effbot.org/zone/element-tidylib.htm

> What if we went with the BOM character (0xFEFF) as the replacement?
> It's legal unicode, and _extremely_ unlikely to occur in the middle of
> text. The only thing to watch out for is having it occur at the start
> of the file.
>
> First, my original intention was to use not \u0001 and \u0002 but
> rather \u0002 and \u0003 - "start of text" (STX) and "end of text"
> (ETX). The nice thing about them is that they come as a pair - start
> and end. Also, if we use BOM we'll have to worry about HTML, etc.
> occurring at the beginning of the text. But this is an option to keep
> in mind. Alternatively, we can look into the private-use ranges, though
> then we have to make sure that our use does not conflict with possible
> private uses by the caller.

--
----
Waylan Limberg
wa...@gm...
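[If someone does take the tidy route, uTidylib's entry point is roughly the following; the exact option names are an assumption based on tidy's config keys:

    # Sketch: run markdown output through HTML Tidy via uTidylib.
    import tidy

    html = "<p>Some <em>markdown</em> output<br /></p>"  # stand-in for real output
    fixed = tidy.parseString(html, output_xhtml=1, indent=1, tidy_mark=0)
    print str(fixed)
]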
From: Artem Y. <ne...@gm...> - 2008-07-09 00:59:17
Waylan Limberg wrote:
> I then found lxml's htmldiff tool [1], which provided an easy
> (better??) way to compare html docs, but it still hung up on some (not
> all) whitespace. Additionally, it didn't exactly provide an easily
> readable output to display in the test output. If you're interested, I
> can forward the code I have - that is, if I can find it.

Yep, it would be interesting.

> What I'd consider doing is actually taking the most recent markdown
> with NanoDom, altering NanoDom's whitespace to match ET, and running a
> little script that loops through all the tests and outputs new
> expected html files. It shouldn't be all that hard.

Yes, for now that seems like a reasonable solution. Also, ET doesn't do any output indentation, so I wrote a function that does some indentation for ET. Another solution would be to tune this function to match the previous markdown output. I also tried loading the data from the tests' html files into ET and then serializing it, but there were some issues and I didn't succeed.
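[Such an indentation helper is likely close to the well-known effbot recipe; this is that recipe as a sketch, not Artem's actual function:

    def indent(elem, level=0):
        # pretty-print by rewriting .text/.tail whitespace in place
        i = "\n" + level * "  "
        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + "  "
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
            for child in elem:
                indent(child, level + 1)
            # the last child's tail closes out the parent's indentation
            if not child.tail or not child.tail.strip():
                child.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i
]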
From: Waylan L. <wa...@gm...> - 2008-07-09 01:52:07
Attachments:
differ.py
On Tue, Jul 8, 2008 at 9:00 PM, Artem Yunusov <ne...@gm...> wrote:
> Waylan Limberg wrote:
>> I then found lxml's htmldiff tool [1], which provided an easy
>> (better??) way to compare html docs, but it still hung up on some (not
>> all) whitespace. Additionally, it didn't exactly provide an easily
>> readable output to display in the test output. If you're interested, I
>> can forward the code I have - that is, if I can find it.
>
> Yep, it would be interesting.

Hmm, all I can find is a very simple little script that uses xmldiff [1]. I doubt this is very useful, but I've attached it anyway.

[1]: http://www.logilab.org/859

--
----
Waylan Limberg
wa...@gm...