From: Yuri T. <qar...@gm...> - 2008-07-08 05:44:09
|
> cElementTree ~13% faster than NanoDOM, and uses memory in 4.5 times less.
> lxml is ~4% faster than cElementTree, but cElementTree wins in memory
> usage(two times less)
> ElementTree little bit faster then NanoDOM, and ElementTree also wins in
> memory usage(2.5 times less)

That's good!

> Concerning the html/xhtml output, I discovered that this option supports
> only by new versions of ElementTree(1.3) and lxml(2.0), so it won't be
> available for now on standard Python 2.5 ElementTree. Maybe we can do it
> optional.

Again, I wouldn't worry too much about this. If someone wants HTML output, converting XHTML to HTML4 should be easy enough.

> There is one problem with lxml: misc/boldlinks test cause such error:
>
> File "etree.pyx", line 693, in etree._Element.text.__set__
> File "apihelpers.pxi", line 344, in etree._setNodeText
> File "apihelpers.pxi", line 648, in etree._utf8
> AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> I suppose that is because in this test we trying to assign to el.text
> data, that contains placeholders, and maybe by some reason lxml treats
> placeholders values(u'\u0001' and u'\u0002') as not unicode or ascii.

We could re-think our choice of placeholders if we know that this is the reason. But it sounds like ElementTree is the way to go.

A few minor things. The current version in git fails on non-ASCII files (e.g., tests/misc/russian.txt). That's because we end up encoding the content too early: line 1889 writes etree to xml, utf8 encoded, after which we try to run textPostProcessors on it. That's not good. This seems to fix it:

    xml = codecs.decode(etree.tostring(root, encoding="utf8"), "utf8")

(I am assuming that standard etree doesn't have an option of serializing to non-encoded unicode. If it does, use that instead.)

Note that in my experience there is only one way to use Unicode right with Python: assume that all strings are unicode. So, for this reason, I've been following the policy of decoding data when it comes into my world and encoding it only when it goes out, without _ever_ passing encoded strings around. Encoded strings are evil.

Another thing: lots of tests seem to fail now because of whitespace differences. I am guessing that the way to solve it is to first extend test-markdown.py to add an option of reflowing XHTML before diffing. Then, once we know that all tests pass except for whitespace differences, we can change the expected output.

- yuri

--
http://sputnik.freewisdom.org/
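A minimal sketch of the "decode on the way in, encode on the way out" policy described above. It assumes Python 3 and the standard-library xml.etree.ElementTree (the thread itself targets Python 2.x); the helper name serialize_to_text is hypothetical and not part of the markdown code base.

```python
import xml.etree.ElementTree as ET


def serialize_to_text(root):
    """Serialize an element tree and hand back decoded text, never bytes."""
    # tostring() with a byte encoding returns bytes; decode immediately so
    # that every later post-processing step only ever sees text.
    data = ET.tostring(root, encoding="utf-8")
    return data.decode("utf-8")


root = ET.Element("p")
root.text = "Привет, мир"  # non-ASCII content, as in tests/misc/russian.txt

xml_text = serialize_to_text(root)

# Post-processors now operate on text; encoding happens only at the final
# output boundary (for example when writing to a file or socket).
output_bytes = xml_text.encode("utf-8")
```

On Python 3, ET.tostring(root, encoding="unicode") returns text directly, which would make the explicit decode step unnecessary.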
From: Artem Y. <ne...@gm...> - 2008-07-07 22:25:30
|
I added ElementTree support, the results are:

cElementTree is ~13% faster than NanoDOM and uses 4.5 times less memory.
lxml is ~4% faster than cElementTree, but cElementTree wins in memory usage (two times less).
ElementTree is a little bit faster than NanoDOM, and ElementTree also wins in memory usage (2.5 times less).

Concerning the html/xhtml output, I discovered that this option is supported only by new versions of ElementTree (1.3) and lxml (2.0), so it won't be available for now on the standard Python 2.5 ElementTree. Maybe we can make it optional.

There is one problem with lxml: the misc/boldlinks test causes this error:

    File "etree.pyx", line 693, in etree._Element.text.__set__
    File "apihelpers.pxi", line 344, in etree._setNodeText
    File "apihelpers.pxi", line 648, in etree._utf8
    AssertionError: All strings must be XML compatible, either Unicode or ASCII

I suppose that is because in this test we are trying to assign to el.text data that contains placeholders, and maybe for some reason lxml treats placeholder values (u'\u0001' and u'\u0002') as neither unicode nor ASCII.

New markdown with ElementTree:

construction:0.000000:0.000000
amps-and-angle-encoding:0.070000:0.000000
auto-links:0.080000:0.000000
backlash-escapes:0.200000:0.000000
blockquotes-with-dode-blocks:0.030000:0.000000
hard-wrapped:0.010000:0.000000
horizontal-rules:0.160000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.140000:0.000000
links-inline:0.050000:0.000000
links-reference:0.070000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.440000:0.000000
markdown-syntax:1.980000:1908736.000000
nested-blockquotes:0.030000:0.000000
ordered-and-unordered-list:0.310000:0.000000
strong-and-em-together:0.040000:0.000000
tabs:0.040000:0.000000
tidyness:0.040000:0.000000

New markdown with cElementTree:

construction:0.000000:0.000000
amps-and-angle-encoding:0.050000:135168.000000
auto-links:0.070000:0.000000
backlash-escapes:0.190000:0.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.020000:0.000000
horizontal-rules:0.140000:0.000000
inline-html-advanced:0.020000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.120000:0.000000
links-inline:0.050000:0.000000
links-reference:0.060000:0.000000
literal-quotes:0.020000:0.000000
markdown-documentation-basics:0.410000:274432.000000
markdown-syntax:1.810000:1138688.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.260000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000

New markdown with lxml:

construction:0.000000:0.000000
amps-and-angle-encoding:0.060000:0.000000
auto-links:0.070000:147456.000000
backlash-escapes:0.170000:135168.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.010000:0.000000
horizontal-rules:0.140000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.120000:0.000000
links-inline:0.060000:0.000000
links-reference:0.080000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.370000:450560.000000
markdown-syntax:1.750000:2011136.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.250000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000

New markdown with NanoDOM:

construction:0.000000:0.000000
amps-and-angle-encoding:0.060000:0.000000
auto-links:0.070000:0.000000
backlash-escapes:0.220000:135168.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.020000:0.000000
horizontal-rules:0.150000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.140000:0.000000
links-inline:0.050000:0.000000
links-reference:0.080000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.450000:868352.000000
markdown-syntax:2.080000:5160960.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.290000:0.000000
strong-and-em-together:0.030000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000
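One possible explanation for the lxml assertion, not confirmed anywhere in the thread: the XML 1.0 specification does not allow most control characters, including U+0001 and U+0002, in document content. The sketch below (Python 3, with hypothetical helper names) checks a string against the XML 1.0 character ranges, which is one way to test whether a placeholder choice could be what a strict serializer objects to.

```python
def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text):
    """List (index, character) pairs that XML 1.0 does not permit."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml_char(ch)]


# Text containing the two placeholder values mentioned in the message.
sample = "a\u0001b\u0002"
problems = find_invalid_xml_chars(sample)
```

If this is indeed the cause, any placeholder built from such control characters would have to be kept out of element text when a validating backend is in use.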
From: David W. <wo...@cs...> - 2008-07-04 15:05:01
|
On 4-Jul-08, at 10:05 AM, Artem Yunusov wrote: > Yuri Takhteyev wrote: >> Interesting. It looks like lxml is way way faster than ElementTree. >> Also, the website for lxml seems to suggest that ElementTree has some >> serious problems in handling unicode >> (http://codespeak.net/lxml/compatibility.html, third bullet). This >> really worries me, more so than performance. This may not affect us, >> but we need to make sure that ElementTree can handle unicode properly >> if we would be using it. However, it looks like lxml is included >> with >> nothing at this point, and would require building stuff from C, which >> may raise the bar for using markdown... > lxml supports ElementTree API, so we could write something like this: > try: > from lxml import etree > print "running with lxml.etree" > ... > except ImportError: > print "Failed to import ElementTree from any known place" > > We can suggest to use lxml, but by default cElementTree will be > used on > python 2.5 I'd agree that this is the best way to go. From what I've read and heard, lxml is faster/better, but it's also not standard and I went through hell trying to install it about a month ago (I don't think I succeeded either...) Also, as far as ElementTree's handling of Unicode: in the twelve months I was working on DrProject, which uses ElementTree for all sorts of things, I can't remember any problems giving it Unicode (and part of my work was getting 100% Unicode support). |
From: Waylan L. <wa...@gm...> - 2008-07-04 14:06:28
|
On Fri, Jul 4, 2008 at 9:05 AM, Artem Yunusov <ne...@gm...> wrote:
>
> lxml supports ElementTree API, so we could write something like this:
>
> try:
>     from lxml import etree
>     print "running with lxml.etree"
> except ImportError:
>     try:
>         # Python 2.5
>         import xml.etree.cElementTree as etree
>         print "running with cElementTree on Python 2.5+"
>     except ImportError:
>         try:
>             # Python 2.5
>             import xml.etree.ElementTree as etree
>             print "running with ElementTree on Python 2.5+"
>         except ImportError:
>             try:
>                 # normal cElementTree install
>                 import cElementTree as etree
>                 print "running with cElementTree"
>             except ImportError:
>                 try:
>                     # normal ElementTree install
>                     import elementtree.ElementTree as etree
>                     print "running with ElementTree"
>                 except ImportError:
>                     print "Failed to import ElementTree from any known place"
>
> We can suggest to use lxml, but by default cElementTree will be used on
> python 2.5

I like it. However, we should check that lxml is actually making a noticeable difference before we commit to that in a release. Unless Yuri objects, go ahead and implement it in a branch and we'll see how it goes.

> I didn't get what the real problem with unicode is, there are some
> general words at lxml site, and I think if the problem had been quite
> serious, ElementTree wouldn't have included in standard Python library.
> I tried some test with russian unicode data - didn't find any problems
> yet, but I think this issue need more proper investigation.

I get the impression that those comments were referring to the parser, not the serializer. If I'm understanding that correctly, then this should be a non-issue. But we should make sure.

--
----
Waylan Limberg
wa...@gm...
From: Artem Y. <ne...@gm...> - 2008-07-04 13:05:23
|
Yuri Takhteyev wrote:
> Interesting. It looks like lxml is way way faster than ElementTree.
> Also, the website for lxml seems to suggest that ElementTree has some
> serious problems in handling unicode
> (http://codespeak.net/lxml/compatibility.html, third bullet). This
> really worries me, more so than performance. This may not affect us,
> but we need to make sure that ElementTree can handle unicode properly
> if we would be using it. However, it looks like lxml is included with
> nothing at this point, and would require building stuff from C, which
> may raise the bar for using markdown...
>

lxml supports the ElementTree API, so we could write something like this:

    try:
        from lxml import etree
        print "running with lxml.etree"
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.cElementTree as etree
            print "running with cElementTree on Python 2.5+"
        except ImportError:
            try:
                # Python 2.5
                import xml.etree.ElementTree as etree
                print "running with ElementTree on Python 2.5+"
            except ImportError:
                try:
                    # normal cElementTree install
                    import cElementTree as etree
                    print "running with cElementTree"
                except ImportError:
                    try:
                        # normal ElementTree install
                        import elementtree.ElementTree as etree
                        print "running with ElementTree"
                    except ImportError:
                        print "Failed to import ElementTree from any known place"

We can suggest using lxml, but by default cElementTree will be used on Python 2.5.

I didn't get what the real problem with unicode is. There are only some general words on the lxml site, and I think that if the problem had been serious, ElementTree wouldn't have been included in the standard Python library. I tried some tests with Russian unicode data and didn't find any problems yet, but I think this issue needs more thorough investigation.
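The nested try/except chain above can also be written as a loop over candidate module names. This is only a sketch under assumptions: it uses importlib (available from Python 2.7 and in Python 3, so later than the Python 2.3 to 2.5 range discussed in the thread), and the names CANDIDATES and load_etree are hypothetical.

```python
import importlib

# Candidate backends in order of preference, mirroring the chain above.
CANDIDATES = [
    "lxml.etree",
    "xml.etree.cElementTree",
    "xml.etree.ElementTree",
    "cElementTree",
    "elementtree.ElementTree",
]


def load_etree(candidates=CANDIDATES):
    """Return (module_name, module) for the first importable backend."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
    raise ImportError("Failed to import ElementTree from any known place")


backend_name, etree = load_etree()
```

Unlike the original chain, this version raises when nothing can be imported instead of only printing a message, so a missing backend is noticed at import time rather than at first use.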
From: Artem Y. <ne...@gm...> - 2008-07-04 13:04:49
|
Yuri Takhteyev wrote: > Either way, Artem, make a branch from what you have now, to make sure > that you can continue pushing relevant bug fixes into the > pre-ElementTree version. After that, perhaps the thing to do is to > try ElementTree to see what it gives us in terms of performance (for > python2.5), without worrying too much about the tests. If it's a > substantial boost, you can then work on making sure that the output is > actually the same. > Ok, got it :) |
From: Yuri T. <qar...@gm...> - 2008-07-04 03:21:13
|
> Btw, if anyone is interested in performance of html serializers and > parsers in python, here's a decent comparison: > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ Interesting. It looks like lxml is way way faster than ElementTree. Also, the website for lxml seems to suggest that ElementTree has some serious problems in handling unicode (http://codespeak.net/lxml/compatibility.html, third bullet). This really worries me, more so than performance. This may not affect us, but we need to make sure that ElementTree can handle unicode properly if we would be using it. However, it looks like lxml is included with nothing at this point, and would require building stuff from C, which may raise the bar for using markdown... - yuri -- http://sputnik.freewisdom.org/ |
From: Waylan L. <wa...@gm...> - 2008-07-04 03:00:14
|
On Thu, Jul 3, 2008 at 8:09 PM, Yuri Takhteyev <qar...@gm...> wrote: >> On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote: >>> As far as I understand all the HTML from input replacing by >>> placeholders, and then inserting back only after serialization. So, it >>> won't be a problem in this case. > > Yes, Artem is right, we are now not attempting to parse HTML submitted > by the user, we just pass it through. This is what most (all?) > markdown implementations do. This also means that if the user > supplies bad HTML (or HTML that is not XHTML), then they will get back > what they gave us. Garbage in, garbage out. > > The consensus on the markdown list seems to have been that policing > HTML submitted by the user (which would include looking out for XSS > attacks) should be left to the client, who should filter the output of > markdown. > >> For some reason I was under the impression that the "instant html/ >> xhtml output option" meant "html which includes html in the input". > Sorry if I misled anyone with that statement. I'm with Yuri (and the Markdown community at large). We don't fix bad input. In fact, the raw html never gets put into the DOM anyway. It's stored as plain text and added back in after the dom is converted into a string. Which means that we can't really pass the user a DOM object for them to do as they please because we aren't done with it yet. However, we could add a keyword to `Markdown.convert()` that specifies the output format of html or xhtml and pass that on to the DOM on serialization. Btw, if anyone is interested in performance of html serializers and parsers in python, here's a decent comparison: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ -- ---- Waylan Limberg wa...@gm... |
From: Yuri T. <qar...@gm...> - 2008-07-04 00:09:50
|
> On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote:
>> As far as I understand all the HTML from input replacing by
>> placeholders, and then inserting back only after serialization. So, it
>> won't be a problem in this case.

Yes, Artem is right, we are now not attempting to parse HTML submitted by the user, we just pass it through. This is what most (all?) markdown implementations do. This also means that if the user supplies bad HTML (or HTML that is not XHTML), then they will get back what they gave us. Garbage in, garbage out.

The consensus on the markdown list seems to have been that policing HTML submitted by the user (which would include looking out for XSS attacks) should be left to the client, who should filter the output of markdown.

> For some reason I was under the impression that the "instant html/
> xhtml output option" meant "html which includes html in the input".

Again, I think it would be nice to give the user an option for outputting an HTML4 version of the tags that we add. The user's HTML is their business, however. Clients who want to convert markdown text with pieces of XHTML to HTML4 should use markdown to generate XHTML (with their pieces of XHTML embedded verbatim) and then find a way to convert this to HTML4. (Or they can convert their chunks of XHTML to HTML4 first.)

That said, if we offer the client an option of just getting an ElementTree back from markdown (which they can then manipulate to their heart's content), then this option would surely be more attractive if we could also parse their (X)HTML.

Now, onto the question of performance. I would like to continue supporting python 2.3 and python 2.4, but I think we can compromise here. If ElementTree is easy to install for python 2.3 and 2.4 and if our script can fall back gracefully from cElementTree to ElementTree, then this seems like a reasonable option. I am ok with suffering a (small) reduction in performance on earlier versions of python if this will gain performance on python2.5. (As I mentioned before, I would prefer to offer 2.3 support as long as frameworks like django do so.)

Either way, Artem, make a branch from what you have now, to make sure that you can continue pushing relevant bug fixes into the pre-ElementTree version. After that, perhaps the thing to do is to try ElementTree to see what it gives us in terms of performance (for python2.5), without worrying too much about the tests. If it's a substantial boost, you can then work on making sure that the output is actually the same.

- yuri

--
http://sputnik.freewisdom.org/
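The pass-through behaviour described above (raw HTML is swapped for placeholders, kept out of the tree, and put back only after serialization) can be sketched roughly as follows. This is an illustration in Python 3, not the project's actual code: the class name RawHtmlStash, the placeholder layout, and the use of html.escape as a stand-in for the serializer are all assumptions.

```python
import html
import re

# Assumed placeholder delimiters; the real values are chosen elsewhere.
START, END = "\u0002", "\u0003"
PLACEHOLDER_RE = re.compile(START + r"html:(\d+)" + END)


class RawHtmlStash:
    """Keeps user-supplied HTML out of the way until output time."""

    def __init__(self):
        self.blocks = []

    def store(self, raw):
        """Remember a raw HTML chunk and return its placeholder."""
        self.blocks.append(raw)
        return "%shtml:%d%s" % (START, len(self.blocks) - 1, END)

    def restore(self, text):
        """Put the stored chunks back, byte for byte."""
        return PLACEHOLDER_RE.sub(lambda m: self.blocks[int(m.group(1))], text)


stash = RawHtmlStash()
source = "Some text " + stash.store("<b class=x>unclosed") + " more & more"

# Stand-in for tree building plus serialization: only our own text is escaped.
serialized = "<p>%s</p>" % html.escape(source, quote=False)
output = stash.restore(serialized)
```

The malformed chunk comes back exactly as it was supplied, which is the "garbage in, garbage out" behaviour, while the surrounding text is still escaped normally.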
From: Trent M. <tr...@gm...> - 2008-07-03 23:53:24
|
> As far as I can tell, the only way to get around that sort of problem > would be using BeautifulSoup to parse the input... But there goes any > hope of a significant speedup. Or html5lib: http://code.google.com/p/html5lib/wiki/UserDocumentation There is an example there that will build an ElementTree. I've had good success with html5lib parsing HTML that isn't XHTML. I'm not sure what the speed implications are though. Could perhaps fallback to html5lib (or BeautifulSoup) if there is an input XML parsing problem. Trent -- Trent Mick tr...@gm... |
From: David W. <wo...@cs...> - 2008-07-03 21:00:40
|
On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote: > As far as I understand all the HTML from input replacing by > placeholders, and then inserting back only after serialization. So, it > won't be a problem in this case. Ah, yes, right. For some reason I was under the impression that the "instant html/ xhtml output option" meant "html which includes html in the input". D'oh. |
From: Artem Y. <ne...@gm...> - 2008-07-03 20:39:37
|
David Wolever wrote: > On 3-Jul-08, at 4:57 PM, Waylan Limberg wrote: >> Just remember the c variation will not work on IronPython, Jython or >> other python implementations. Additionally, there may be (shared) web >> hosts which will allow a user to copy a pure python module to the >> server, but not compile a c module. So if we do switch, the import >> should be in a try...except block and import the non-c variation if c >> is not available. In such a situation, would ElementTree give us any >> advantage? > My marginally educated but entirely untested guess is that it would > still be a bit faster... And certainly within the same order of > magnitude. > > Now, the problem: ElementTree will get really, really upset about > invalid XHTML. > Really upset. > So you'll need to figure out how you'll handle bad input (unclosed > tags, improper quoting, etc): As far as I understand all the HTML from input replacing by placeholders, and then inserting back only after serialization. So, it won't be a problem in this case. |
From: Artem Y. <ne...@gm...> - 2008-07-03 20:31:30
|
Waylan Limberg wrote: >> cElementTree and ElementTree was includeded in Python std lib since >> Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later. >> >> > > Just remember the c variation will not work on IronPython, Jython or > other python implementations. Additionally, there may be (shared) web > hosts which will allow a user to copy a pure python module to the > server, but not compile a c module. So if we do switch, the import > should be in a try...except block and import the non-c variation if c > is not available. Sure, thanks for pointing that out. > In such a situation, would ElementTree give us any > advantage? > > Although, I believe that would give us an instant html/xhtml output > option so it may make sense regardless of any speed increase - that is > as long as it doesn't slow things down. > Yes, ElementTree suports html/xhtml output, so it'll be advantage, but I think pure python version, will be little bit slower than NanoDOM implementation. The best way to know exactly is to test it :) |
From: David W. <wo...@cs...> - 2008-07-03 20:30:13
|
On 3-Jul-08, at 4:57 PM, Waylan Limberg wrote:
> Just remember the c variation will not work on IronPython, Jython or
> other python implementations. Additionally, there may be (shared) web
> hosts which will allow a user to copy a pure python module to the
> server, but not compile a c module. So if we do switch, the import
> should be in a try...except block and import the non-c variation if c
> is not available. In such a situation, would ElementTree give us any
> advantage?

My marginally educated but entirely untested guess is that it would still be a bit faster... And certainly within the same order of magnitude.

Now, the problem: ElementTree will get really, really upset about invalid XHTML. Really upset. So you'll need to figure out how you'll handle bad input (unclosed tags, improper quoting, etc):

    >>> from elementtree import ElementTree as ET
    >>> from StringIO import StringIO as S
    >>> ET.parse(S("<a name=foo>bar</a>"))
    ...
    ExpatError: not well-formed (invalid token): line 1, column 8

Not happy :(

As far as I can tell, the only way to get around that sort of problem would be using BeautifulSoup to parse the input... But there goes any hope of a significant speedup.

(now, of course, something like this is always possible:

    try:
        doc = ET.parse(input)
    except ExpatError:
        doc = bs_to_et(BeautifulSoup(input))

But I can't speak of the advantages or disadvantages with any more authority than you could)
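The strict-parse-then-fallback idea above, restated as a small self-contained sketch for Python 3 and the standard-library xml.etree.ElementTree, where the error raised is ET.ParseError. The function name parse_with_fallback is hypothetical, and the lenient parser is passed in as a callable because the thread does not settle on one (bs_to_et in the message is itself a placeholder name).

```python
import xml.etree.ElementTree as ET


def parse_with_fallback(markup, fallback=None):
    """Try the strict XML parser first; hand bad input to a lenient one."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        if fallback is None:
            raise  # no lenient parser configured: surface the error
        return fallback(markup)


# Well-formed input goes through the fast, strict path.
good = parse_with_fallback('<a name="foo">bar</a>')

# The unquoted attribute from the message above takes the fallback path.
# The lambda is only a stand-in for a real lenient parser.
bad = parse_with_fallback("<a name=foo>bar</a>", fallback=lambda text: None)
```

Only malformed input pays the cost of the slower parser in this arrangement, so the common case keeps whatever speedup the strict parser provides.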
From: Waylan L. <wa...@gm...> - 2008-07-03 19:57:33
|
On Thu, Jul 3, 2008 at 3:28 PM, Artem Yunusov <ne...@gm...> wrote: > >>> I think we also could speed up markdown by switching to cElementTree >>> instead of NanoDOM. >>> >> >> Which version of python would this require? >> > > cElementTree and ElementTree was includeded in Python std lib since > Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later. > Just remember the c variation will not work on IronPython, Jython or other python implementations. Additionally, there may be (shared) web hosts which will allow a user to copy a pure python module to the server, but not compile a c module. So if we do switch, the import should be in a try...except block and import the non-c variation if c is not available. In such a situation, would ElementTree give us any advantage? Although, I believe that would give us an instant html/xhtml output option so it may make sense regardless of any speed increase - that is as long as it doesn't slow things down. -- ---- Waylan Limberg wa...@gm... |
From: Artem Y. <ne...@gm...> - 2008-07-03 19:28:55
|
Yuri Takhteyev wrote: >> markdown-syntax:2.120000:5246976.000000 >> markdown-syntax:2.500000:2031616.000000 >> > > So, that's, 26% reduction in time, right? Not bad. But are we now > using twice as much memory? Any reason for this? > > I'll try to investigate this. Maybe it's because we are putting all InlinePatterns nodes to dict. And I think two-step approach eats part of memory too. >> I think we also could speed up markdown by switching to cElementTree >> instead of NanoDOM. >> > > Which version of python would this require? > cElementTree and ElementTree was includeded in Python std lib since Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later. |
From: Yuri T. <qar...@gm...> - 2008-07-03 18:59:48
|
> So, all the tests from test suite are working now. Great. > markdown-syntax:2.120000:5246976.000000 > markdown-syntax:2.500000:2031616.000000 So, that's, 26% reduction in time, right? Not bad. But are we now using twice as much memory? Any reason for this? > I think we also could speed up markdown by switching to cElementTree > instead of NanoDOM. Which version of python would this require? - yuri -- http://sputnik.freewisdom.org/ |
From: Artem Y. <ne...@gm...> - 2008-07-03 18:04:01
|
So, all the tests from the test suite are working now. You can view it here:

http://gitorious.org/projects/python-markdown/repos/gsoc2008

Here are the timings for the new markdown implementation (tests/markdown-test):

construction:0.000000:0.000000
amps-and-angle-encoding:0.070000:131072.000000
auto-links:0.070000:0.000000
backlash-escapes:0.210000:516096.000000
blockquotes-with-dode-blocks:0.010000:0.000000
hard-wrapped:0.020000:0.000000
horizontal-rules:0.140000:0.000000
inline-html-advanced:0.020000:0.000000
inline-html-comments:0.030000:0.000000
inline-html-simple:0.160000:0.000000
links-inline:0.040000:0.000000
links-reference:0.080000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.430000:1265664.000000
markdown-syntax:2.120000:5246976.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.300000:0.000000
strong-and-em-together:0.040000:0.000000
tabs:0.030000:0.000000
tidyness:0.030000:0.000000

and for the old markdown implementation:

construction:0.000000:0.000000
amps-and-angle-encoding:0.080000:0.000000
auto-links:0.080000:0.000000
backlash-escapes:0.290000:397312.000000
blockquotes-with-dode-blocks:0.020000:0.000000
hard-wrapped:0.030000:0.000000
horizontal-rules:0.160000:0.000000
inline-html-advanced:0.030000:0.000000
inline-html-comments:0.040000:0.000000
inline-html-simple:0.150000:0.000000
links-inline:0.060000:0.000000
links-reference:0.090000:0.000000
literal-quotes:0.030000:0.000000
markdown-documentation-basics:0.520000:1384448.000000
markdown-syntax:2.500000:2031616.000000
nested-blockquotes:0.020000:0.000000
ordered-and-unordered-list:0.330000:0.000000
strong-and-em-together:0.020000:0.000000
tabs:0.040000:0.000000
tidyness:0.030000:0.000000

I think we could also speed up markdown by switching to cElementTree instead of NanoDOM.
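For comparing two such reports, a small parser helps. The sketch below assumes each entry has the form name:time:memory, which is how the replies in this thread read the numbers; the units are not stated in the message, and the function names are hypothetical. It is written for Python 3.

```python
def parse_timings(report):
    """Parse whitespace-separated 'name:time:memory' entries into a dict."""
    results = {}
    for entry in report.split():
        # Split from the right so a test name may itself contain colons.
        name, seconds, memory = entry.rsplit(":", 2)
        results[name] = (float(seconds), float(memory))
    return results


def relative_change(old, new):
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


# The markdown-syntax entries from the two reports above.
new = parse_timings("markdown-syntax:2.120000:5246976.000000")
old = parse_timings("markdown-syntax:2.500000:2031616.000000")

new_time, new_mem = new["markdown-syntax"]
old_time, old_mem = old["markdown-syntax"]

time_change = relative_change(old_time, new_time)
memory_ratio = new_mem / old_mem
```

Running the same comparison over every test name, rather than a single entry, gives a fuller picture, since most of the small tests report zero in the memory column.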
From: Artem <ne...@gm...> - 2008-07-02 23:15:18
|
Yuri Takhteyev wrote:
> After another second's thought, if we can stick with alphanumeric IDs
> for placeholders, perhaps this would be more parsimonious:
>
> START+<type>+":"+<id>+END
>
> (Type should also be alphanumeric.)
>
> http://gitorious.org/projects/python-markdown/repos/mainline/commits/2edd84e
>

Done :) Also fixed some bugs.

I also noticed that some tests in our test suite are wrong, for instance (stronintags):

    this is a [**test**](http://example.com/)
    this is a second **[test](http://example.com)**
    reference **[test][]** reference [**test**][]

The second link should be strong, but it is split into two ems:

    <p>this is a <a href="http://example.com/"><strong>test</strong></a> </p>
    <p>this is a second <em></em><a href="http://example.com">test</a><em></em> </p>
    <p>reference <strong>[test][]</strong> reference [<strong>test</strong>][] </p>
From: Yuri T. <qar...@gm...> - 2008-07-02 17:29:30
|
>> 2. I couldn´t find how to output a different format than XHTML. Is it >> possible to output HTML4? > > Unfortunately, this is currently not possible. Python-Markdown uses It's not currently supported, but it surely is _possible_. :) Can you provide a list of which specific tags cause problems for HTML4? Is it just a matter of skipping closing tags on empty elements? - yuri -- http://sputnik.freewisdom.org/ |
From: Yuri T. <qar...@gm...> - 2008-07-02 06:27:31
|
After another second's thought, if we can stick with alphanumeric IDs for placeholders, perhaps this would be more parsimonious:

    START+<type>+":"+<id>+END

(Type should also be alphanumeric.)

http://gitorious.org/projects/python-markdown/repos/mainline/commits/2edd84e

- yuri

> Perhaps you could use:
>
> self.prefix = START+"node"+NULL
> self.suffix = END+"node"+NULL
>
> We could more generally go for this pattern for placeholders of type <type>:
>
> START+<type>+NULL+<id>+END+<type>+NULL

--
http://sputnik.freewisdom.org/
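The START+<type>+":"+<id>+END scheme proposed above, together with the earlier suggestion of sanitizing user input from the delimiter characters, might look roughly like this. It is a sketch for Python 3; the concrete delimiter values and all function names are assumptions rather than what the linked commit does.

```python
import re

# Assumed delimiter values: unprintable control characters.
START, END = "\u0002", "\u0003"

ALNUM_RE = re.compile(r"[A-Za-z0-9]+")
PLACEHOLDER_RE = re.compile(
    "%s([A-Za-z0-9]+):([A-Za-z0-9]+)%s" % (START, END)
)


def sanitize(text):
    """Strip delimiter characters from user input so they cannot clash."""
    return text.replace(START, "").replace(END, "")


def make_placeholder(kind, ident):
    """Build START + type + ':' + id + END; both parts must be alphanumeric."""
    ident = str(ident)
    if not (ALNUM_RE.fullmatch(kind) and ALNUM_RE.fullmatch(ident)):
        raise ValueError("placeholder type and id must be alphanumeric")
    return START + kind + ":" + ident + END


def find_placeholders(text):
    """Return (type, id) pairs for every placeholder found in text."""
    return PLACEHOLDER_RE.findall(text)


user_text = sanitize("user \u0002input\u0003")
marked = user_text + " " + make_placeholder("node", 7)
found = find_placeholders(marked)
```

Because the type and id are restricted to alphanumerics and the colon is fixed, a single regular expression is enough to recover both parts, which appears to be the parsimony the message is after.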
From: Yuri T. <qar...@gm...> - 2008-07-02 06:22:26
|
> I solved this issue already, now all of those examples works:
>
> [*test*](http://example.com)
> *[test](http://example.com)*
> **[*test*](http://example.com)**
> __*[test](http://example.com)*__

This works too. You win.

But let's not go further into the land of weird character combinations as placeholders. I am of course myself guilty of starting this with my handling of HTML, but I was hoping to do less of this in the future, not more. How about using unprintable control characters instead of weird combinations of letters, and sanitizing user input from those characters? I tried doing this for the version in mainline for HTML placeholders:

http://gitorious.org/projects/python-markdown/repos/mainline/commits/bb00fc58

Perhaps you could use:

    self.prefix = START+"node"+NULL
    self.suffix = END+"node"+NULL

We could more generally go for this pattern for placeholders of type <type>:

    START+<type>+NULL+<id>+END+<type>+NULL

- yuri

--
http://sputnik.freewisdom.org/
From: Artem <ne...@gm...> - 2008-07-02 00:12:37
|
Yuri Takhteyev wrote: > > The profit is from switching to something simpler, faster and more > accurate, while maintaining the extensibility. My guess is that > changing inline patterns to be text-in-text-out will make them simpler > (some will just become regular expressions), easier to understand, > will likely be faster, and will allow us to handle them in the same > way as all other implementations do. (We might just borrow their > regular expressions in some cases.) > > But there is on big minus, we won't get valid DOM document. And I don't think that this will give big performance boost, since we already have(in new implementation) string(and not list) processing mechanism. > The current implementation (using DOM) creates a few serious issues > for inline patterns, which I don't think can be solved in any easy > way. We are relying on regular expressions to match patterns, but > those cannot span multiple nodes. This means, for instance, that we > can either make **[foo](/foo.html]** work, or [**foo**](foo.html), but > not both. If we run the link pattern first, then the first string > turns into ("**", <a dom node>, "**"). We now cannot run a regular > expression across this. If we apply the **...** pattern first, then > the second expression becomes ("[", <a dom node>, "](foo.html)") and > now we cannot match the link pattern. Which is why my suggestion > (which I haven't had time to implement) has been to switch to simple > text-in-text-out implementation of the patterns: > I solved this issue already, now all of those examples works: [*test*](http://example.com) *[test](http://example.com)* **[*test*](http://example.com)** __*[test](http://example.com)*__ And we still have valid DOM tree. You can try it from repository. |
From: Yuri T. <qar...@gm...> - 2008-07-01 23:44:21
|
> What is the profit of dropping current inline patterns? We still need
> some functions, that would create DOM element from regexp match object.
> Maybe we should just refactor current inline patterns regexps ?
> Now I'm fixing bugs, and some of them require Inline patterns changes.

The profit is from switching to something simpler, faster and more accurate, while maintaining the extensibility. My guess is that changing inline patterns to be text-in-text-out will make them simpler (some will just become regular expressions), easier to understand, will likely be faster, and will allow us to handle them in the same way as all other implementations do. (We might just borrow their regular expressions in some cases.)

The current implementation (using DOM) creates a few serious issues for inline patterns, which I don't think can be solved in any easy way. We are relying on regular expressions to match patterns, but those cannot span multiple nodes. This means, for instance, that we can either make **[foo](/foo.html)** work, or [**foo**](foo.html), but not both. If we run the link pattern first, then the first string turns into ("**", <a dom node>, "**"). We now cannot run a regular expression across this. If we apply the **...** pattern first, then the second expression becomes ("[", <a dom node>, "](foo.html)") and now we cannot match the link pattern. Which is why my suggestion (which I haven't had time to implement) has been to switch to a simple text-in-text-out implementation of the patterns:

1. link pattern turns "**[foo](/foo.html)**" into "**<a href='/foo.html'>foo</a>**"
2. bold pattern turns "**<a href='/foo.html'>foo</a>**" into "<b><a href='/foo.html'>foo</a></b>"

By only using this method for inline patterns I think we would get the best of both worlds. In particular, we retain the concept of the tree for high level elements and make it possible to write sophisticated extensions, which do certain changes to certain kinds of blocks. But we also will be handling basic tags like other implementations.

- yuri

--
http://sputnik.freewisdom.org/
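The two text-in-text-out steps listed above can be sketched with plain regular-expression substitutions. The patterns below are deliberately simplified illustrations written for Python 3 (no escaping, no reference-style links, no nested brackets); they are not taken from this project or from any other markdown implementation.

```python
import re

# Simplified inline patterns, for illustration only.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")   # [text](url)
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")               # **text**


def apply_inline(text):
    """Apply each inline pattern to the string produced by the previous one."""
    text = LINK_RE.sub(r"<a href='\2'>\1</a>", text)   # step 1: links
    text = BOLD_RE.sub(r"<b>\1</b>", text)             # step 2: bold
    return text


bold_around_link = apply_inline("**[foo](/foo.html)**")
bold_inside_link = apply_inline("[**foo**](/foo.html)")
```

Because every pattern sees a flat string, both nestings are handled by the same two substitutions, which is the case the node-based approach was described as unable to cover with regular expressions alone.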
From: Artem <ne...@gm...> - 2008-07-01 23:07:29
|
Yuri Takhteyev wrote: > Interesting. Not sure what exactly we would gain from it, but it > should be possible. Let's not worry about this for now, though. I > think the next step should be dropping all the current inline patterns > and replacing them with simple regular expression substitutions, > perhaps borrowed from Trent's code. In fact, we should see if > applyInlinePatterns could just use a chunk of Trent's code "as is". > What is the profit of dropping current inline pasterns? We still need some functions, that would create DOM element from regexp match object. Maybe we should just refactor current inline patterns regexps ? Now I'm fixing bugs, and some of them require Inline patterns changes. |