From: Artem <ne...@gm...> - 2008-06-30 23:46:58
|
Hi all,

The Markdown.convert method now uses a two-step approach:

1. It converts the markdown to a DOM tree, without applying inline patterns.
2. It walks the DOM tree and applies the inline patterns.

The main problem with inline patterns is solved. I re-implemented the inline patterns mechanism using a string with placeholders for DOM elements. Now all of these examples:

[*test*](http://example.com)
*[test](http://example.com)*
**[*test*](http://example.com)**
__*[test](http://example.com)*__

work fine:

<p><a href="http://example.com"><em>test</em></a>
<em><a href="http://example.com">test</a></em>
<strong><a href="http://example.com"><em>test</em></a></strong>
<strong><em><a href="http://example.com">test</a></em></strong>
</p>

The new version is 10-20% faster. Not all tests from the test suite pass yet, but I'm working on it.
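For readers who want to see the placeholder idea in miniature, here is a rough sketch. It is not the code from the repository: the three regular expressions, the `\x02node:N\x03` placeholder format and the function names are all assumptions made for illustration, and the standard library's ElementTree stands in for the DOM. Each pattern replaces its match with a placeholder string and stashes the element, so a later pattern can still match across it; the tree is assembled from the stash at the end.

```python
import re
import xml.etree.ElementTree as etree

# Assumed placeholder format; the real delimiters are not shown in this message.
PREFIX, SUFFIX = "\x02node:", "\x03"
PLACEHOLDER_RE = re.compile(PREFIX + r"(\d+)" + SUFFIX)

# Deliberately simplified patterns, for illustration only.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
STRONG_RE = re.compile(r"(\*\*|__)(.+?)\1")
EM_RE = re.compile(r"\*(.+?)\*")


def to_placeholders(text, stash):
    """Replace every inline match with a placeholder and stash its element."""
    def keep(tag, inner, **attrib):
        element = etree.Element(tag, attrib)
        # Inner markup is processed first, so it may contain placeholders too.
        element.text = to_placeholders(inner, stash)
        stash.append(element)
        return "%s%d%s" % (PREFIX, len(stash) - 1, SUFFIX)

    text = LINK_RE.sub(lambda m: keep("a", m.group(1), href=m.group(2)), text)
    text = STRONG_RE.sub(lambda m: keep("strong", m.group(2)), text)
    text = EM_RE.sub(lambda m: keep("em", m.group(1)), text)
    return text


def add_text(parent, last_child, chunk):
    """Put plain text where ElementTree expects it: .text or the sibling's .tail."""
    if not chunk:
        return
    if last_child is None:
        parent.text = (parent.text or "") + chunk
    else:
        last_child.tail = (last_child.tail or "") + chunk


def fill(parent, text, stash):
    """Attach text to parent, swapping placeholders for the stashed elements."""
    last_child, position = None, 0
    for match in PLACEHOLDER_RE.finditer(text):
        add_text(parent, last_child, text[position:match.start()])
        last_child = stash[int(match.group(1))]
        inner, last_child.text = last_child.text, None
        fill(last_child, inner, stash)
        parent.append(last_child)
        position = match.end()
    add_text(parent, last_child, text[position:])


def render(source):
    """Convert one line of inline markdown into a serialized <p> element."""
    stash = []
    paragraph = etree.Element("p")
    fill(paragraph, to_placeholders(source, stash), stash)
    return etree.tostring(paragraph, encoding="unicode")
```

Calling `render("**[*test*](http://example.com)**")` nests the elements as strong, then a, then em, which matches the third example above. URLs containing `*` or `_` would trip up these toy patterns.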
From: Waylan L. <wa...@gm...> - 2008-07-01 00:55:02
|
Cool. Just one question though. Why did you split convert into two methods (convert & markdownToTree) when all you needed to do was add one additional line to convert for the extra step to run inline patterns? We have enough methods as it is. In fact, I've considered combining convert and _transform, as they are never called any other way.

Which makes me think: if applyInlinePatterns operates on a DOM of the entire document, why not make it a postprocessor? That may be a little drastic, but it would open up a few more possibilities for overriding behavior. Just a thought.

Now to find the time to play with it...

--
----
Waylan Limberg
wa...@gm...
From: Artem <ne...@gm...> - 2008-07-01 23:07:19
|
Waylan Limberg wrote:
> Now to find the time to play with it...

Committed it to the repository.

The Markdown._processPlaceholders method is not very clear. I tried to implement it using an iterator object and using re.finditer, but those approaches were quite slow, so I decided to keep the current implementation.
From: Yuri T. <qar...@gm...> - 2008-07-01 09:12:35
|
> Just one question though. Why did you split convert into two methods
> (convert & markdownToTree) when all you needed to do was add one
> additional line to convert for the extra step to run inline patterns?
> We have enough methods as it is. In fact, I've considered combining
> convert and _transform as they are never called any different way.

This was my suggestion. We'll need to decide whether to consider the new markdownToTree() public or private, but I do think it will be useful to clearly separate the conversion into those two steps.

> Which, makes me think - if applyInlinePatterns operates on a DOM of
> the entire document, why not make it a postprocessor? That may be a
> little drastic, but it would open up a few more possibilities for
> overriding behavior. Just a thought.

Interesting. Not sure what exactly we would gain from it, but it should be possible. Let's not worry about this for now, though. I think the next step should be dropping all the current inline patterns and replacing them with simple regular expression substitutions, perhaps borrowed from Trent's code. In fact, we should see if applyInlinePatterns could just use a chunk of Trent's code "as is".

- yuri

--
http://sputnik.freewisdom.org/
From: Artem <ne...@gm...> - 2008-07-01 23:07:29
|
Yuri Takhteyev wrote:
> Interesting. Not sure what exactly we would gain from it, but it
> should be possible. Let's not worry about this for now, though. I
> think the next step should be dropping all the current inline patterns
> and replacing them with simple regular expression substitutions,
> perhaps borrowed from Trent's code. In fact, we should see if
> applyInlinePatterns could just use a chunk of Trent's code "as is".

What is the profit of dropping the current inline patterns? We still need some functions that create a DOM element from a regexp match object. Maybe we should just refactor the current inline pattern regexps?

I'm fixing bugs now, and some of them require changes to the inline patterns.
From: David W. <wo...@cs...> - 2008-07-03 20:30:13
|
On 3-Jul-08, at 4:57 PM, Waylan Limberg wrote:
> Just remember the c variation will not work on IronPython, Jython or
> other python implementations. Additionally, there may be (shared) web
> hosts which will allow a user to copy a pure python module to the
> server, but not compile a c module. So if we do switch, the import
> should be in a try...except block and import the non-c variation if c
> is not available. In such a situation, would ElementTree give us any
> advantage?

My marginally educated but entirely untested guess is that it would still be a bit faster... And certainly within the same order of magnitude.

Now, the problem: ElementTree will get really, really upset about invalid XHTML. Really upset. So you'll need to figure out how you'll handle bad input (unclosed tags, improper quoting, etc):

    >>> from elementtree import ElementTree as ET
    >>> from StringIO import StringIO as S
    >>> ET.parse(S("<a name=foo>bar</a>"))
    ...
    ExpatError: not well-formed (invalid token): line 1, column 8

Not happy :(

As far as I can tell, the only way to get around that sort of problem would be using BeautifulSoup to parse the input... But there goes any hope of a significant speedup.

(Now, of course, something like this is always possible:

    try:
        doc = ET.parse(input)
    except ExpatError:
        doc = bs_to_et(BeautifulSoup(input))

But I can't speak of the advantages or disadvantages with any more authority than you could.)
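The strict-parse-then-fall-back idea can be sketched against the ElementTree that ships with current Python, where the error surfaces as `ParseError` rather than the `ExpatError` shown in the session above. The function name `parse_with_fallback` and the `lenient_parser` argument are hypothetical; the lenient parser is left as a plain callable because no particular BeautifulSoup or html5lib bridge is being claimed here.

```python
import xml.etree.ElementTree as ET


def parse_with_fallback(markup, lenient_parser=None):
    """Try the strict XML parser first; hand bad input to a lenient parser.

    lenient_parser is any callable taking the markup string. It is a
    stand-in (an assumption) for a BeautifulSoup or html5lib based
    converter, which this sketch does not implement.
    """
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        if lenient_parser is None:
            raise
        return lenient_parser(markup)
```

Well-formed input such as `<a name='foo'>bar</a>` never reaches the lenient parser, so the slow path is only paid for on bad input.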
From: Trent M. <tr...@gm...> - 2008-07-03 23:53:24
|
> As far as I can tell, the only way to get around that sort of problem
> would be using BeautifulSoup to parse the input... But there goes any
> hope of a significant speedup.

Or html5lib:

http://code.google.com/p/html5lib/wiki/UserDocumentation

There is an example there that will build an ElementTree. I've had good success with html5lib parsing HTML that isn't XHTML. I'm not sure what the speed implications are though. We could perhaps fall back to html5lib (or BeautifulSoup) if there is an input XML parsing problem.

Trent

--
Trent Mick
tr...@gm...
From: Yuri T. <qar...@gm...> - 2008-07-01 23:44:21
|
> What is the profit of dropping current inline patterns? We still need
> some functions, that would create DOM element from regexp match object.
> Maybe we should just refactor current inline patterns regexps ?
> Now I'm fixing bugs, and some of them require Inline patterns changes.

The profit is from switching to something simpler, faster and more accurate, while maintaining the extensibility. My guess is that changing inline patterns to be text-in-text-out will make them simpler (some will just become regular expressions), easier to understand, will likely be faster, and will allow us to handle them in the same way as all other implementations do. (We might just borrow their regular expressions in some cases.)

The current implementation (using DOM) creates a few serious issues for inline patterns, which I don't think can be solved in any easy way. We are relying on regular expressions to match patterns, but those cannot span multiple nodes. This means, for instance, that we can either make **[foo](/foo.html)** work, or [**foo**](foo.html), but not both. If we run the link pattern first, then the first string turns into ("**", <a dom node>, "**"). We now cannot run a regular expression across this. If we apply the **...** pattern first, then the second expression becomes ("[", <b dom node>, "](foo.html)") and now we cannot match the link pattern.

Which is why my suggestion (which I haven't had time to implement) has been to switch to a simple text-in-text-out implementation of the patterns:

1. link pattern turns "**[foo](/foo.html)**" into "**<a href='/foo.html'>foo</a>**"
2. bold pattern turns "**<a href='/foo.html'>foo</a>**" into "<b><a href='/foo.html'>foo</a></b>"

By only using this method for inline patterns I think we would get the best of both worlds. In particular, we retain the concept of the tree for high level elements and make it possible to write sophisticated extensions, which do certain changes to certain kinds of blocks. But we also will be handling basic tags like other implementations.

- yuri

--
http://sputnik.freewisdom.org/
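The two text-in-text-out steps could look roughly like this. The regular expressions are simplified guesses written for this example, not ones borrowed from Trent's code or from any other implementation, and the function name is made up.

```python
import re

# Simplified patterns, assumed for illustration only.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")


def inline_to_html(text):
    """Apply each inline pattern as a plain string substitution."""
    # Step 1: links become <a> tags, the surrounding ** is left alone.
    text = LINK_RE.sub(r"<a href='\2'>\1</a>", text)
    # Step 2: bold runs over the result, including any tags already emitted.
    text = BOLD_RE.sub(r"<b>\1</b>", text)
    return text
```

Because every step sees and returns a plain string, both `**[foo](/foo.html)**` and `[**foo**](foo.html)` come out right with the same pattern order. The price is that the result is text, not a tree.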
From: Artem <ne...@gm...> - 2008-07-02 00:12:37
|
Yuri Takhteyev wrote:
> The profit is from switching to something simpler, faster and more
> accurate, while maintaining the extensibility. My guess is that
> changing inline patterns to be text-in-text-out will make them simpler
> (some will just become regular expressions), easier to understand,
> will likely be faster, and will allow us to handle them in the same
> way as all other implementations do. (We might just borrow their
> regular expressions in some cases.)

But there is one big minus: we won't get a valid DOM document. And I don't think this will give a big performance boost, since the new implementation already processes strings (not lists).

> The current implementation (using DOM) creates a few serious issues
> for inline patterns, which I don't think can be solved in any easy
> way. We are relying on regular expressions to match patterns, but
> those cannot span multiple nodes. This means, for instance, that we
> can either make **[foo](/foo.html)** work, or [**foo**](foo.html), but
> not both. If we run the link pattern first, then the first string
> turns into ("**", <a dom node>, "**"). We now cannot run a regular
> expression across this. If we apply the **...** pattern first, then
> the second expression becomes ("[", <b dom node>, "](foo.html)") and
> now we cannot match the link pattern. Which is why my suggestion
> (which I haven't had time to implement) has been to switch to simple
> text-in-text-out implementation of the patterns:

I solved this issue already. Now all of these examples work:

[*test*](http://example.com)
*[test](http://example.com)*
**[*test*](http://example.com)**
__*[test](http://example.com)*__

And we still have a valid DOM tree. You can try it from the repository.
From: David W. <wo...@cs...> - 2008-07-03 21:00:40
|
On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote:
> As far as I understand all the HTML from input replacing by
> placeholders, and then inserting back only after serialization. So, it
> won't be a problem in this case.

Ah, yes, right. For some reason I was under the impression that the "instant html/xhtml output option" meant "html which includes html in the input". D'oh.
From: Yuri T. <qar...@gm...> - 2008-07-04 00:09:50
|
> On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote:
>> As far as I understand all the HTML from input replacing by
>> placeholders, and then inserting back only after serialization. So, it
>> won't be a problem in this case.

Yes, Artem is right, we are now not attempting to parse HTML submitted by the user, we just pass it through. This is what most (all?) markdown implementations do. This also means that if the user supplies bad HTML (or HTML that is not XHTML), then they will get back what they gave us. Garbage in, garbage out.

The consensus on the markdown list seems to have been that policing HTML submitted by the user (which would include looking out for XSS attacks) should be left to the client, who should filter the output of markdown.

> For some reason I was under the impression that the "instant html/
> xhtml output option" meant "html which includes html in the input".

Again, I think it would be nice to give the user an option for outputting an HTML4 version of the tags that we add. The user's HTML is their business, however. (Clients who want to convert markdown text with pieces of XHTML to HTML4 should use markdown to generate XHTML, with their pieces of XHTML embedded verbatim, and then find a way to convert this to HTML4. Or they can convert their chunks of XHTML to HTML4 first.)

That said, if we offer the client an option of just getting an ElementTree back from markdown (which they can then manipulate to their heart's content), then this option would surely be more attractive if we could also parse their (X)HTML.

Now, onto the question of performance. I would like to continue supporting python 2.3 and python 2.4, but I think we can compromise here. If ElementTree is easy to install for python 2.3 and 2.4, and if our script can fall back gracefully from cElementTree to ElementTree, then this seems like a reasonable option. I am ok with suffering a (small) reduction in performance on earlier versions of python if this will gain performance on python 2.5. (As I mentioned before, I would prefer to offer 2.3 support as long as frameworks like django do so.)

Either way, Artem, make a branch from what you have now, to make sure that you can continue pushing relevant bug fixes into the pre-ElementTree version. After that, perhaps the thing to do is to try ElementTree to see what it gives us in terms of performance (for python 2.5), without worrying too much about the tests. If it's a substantial boost, you can then work on making sure that the output is actually the same.

- yuri

--
http://sputnik.freewisdom.org/
From: Artem Y. <ne...@gm...> - 2008-07-04 13:04:49
|
Yuri Takhteyev wrote:
> Either way, Artem, make a branch from what you have now, to make sure
> that you can continue pushing relevant bug fixes into the
> pre-ElementTree version. After that, perhaps the thing to do is to
> try ElementTree to see what it gives us in terms of performance (for
> python2.5), without worrying too much about the tests. If it's a
> substantial boost, you can then work on making sure that the output is
> actually the same.

Ok, got it :)
From: Yuri T. <qar...@gm...> - 2008-07-02 06:22:26
|
> I solved this issue already, now all of those examples works:
>
> [*test*](http://example.com)
> *[test](http://example.com)*
> **[*test*](http://example.com)**
> __*[test](http://example.com)*__

This works too. You win.

But let's not go further into the land of weird character combinations as placeholders. I am of course myself guilty of starting this with my handling of HTML, but I was hoping to do less of this in the future, not more.

How about using unprintable control characters instead of weird combinations of letters, and sanitizing user input from those characters? I tried doing this for the version in mainline for HTML placeholders:

http://gitorious.org/projects/python-markdown/repos/mainline/commits/bb00fc58

Perhaps you could use:

    self.prefix = START+"node"+NULL
    self.suffix = END+"node"+NULL

We could more generally go for this pattern for placeholders of type <type>:

    START+<type>+NULL+<id>+END+<type>+NULL

- yuri

--
http://sputnik.freewisdom.org/
From: Yuri T. <qar...@gm...> - 2008-07-02 06:27:31
|
After another second's thought, if we can stick with alphanumeric IDs for placeholders, perhaps this would be more parsimonious:

    START+<type>+":"+<id>+END

(Type should also be alphanumeric.)

http://gitorious.org/projects/python-markdown/repos/mainline/commits/2edd84e

- yuri

> Perhaps you could use:
>
> self.prefix = START+"node"+NULL
> self.suffix = END+"node"+NULL
>
> We could more generally go for this pattern for placeholders of type <type>:
>
> START+<type>+NULL+<id>+END+<type>+NULL

--
http://sputnik.freewisdom.org/
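As a rough sketch of the START+<type>+":"+<id>+END scheme: the thread does not say which control characters START and END stand for, so the STX and ETX code points used below are an assumption, and so are the helper names.

```python
import re

# Assumed values: STX and ETX. The thread leaves START and END unspecified.
START = "\u0002"
END = "\u0003"

# Type and id are both alphanumeric, as suggested above.
PLACEHOLDER_RE = re.compile(START + r"([A-Za-z0-9]+):([A-Za-z0-9]+)" + END)


def make_placeholder(kind, ident):
    """Build a placeholder such as START + "node" + ":" + "7" + END."""
    return "%s%s:%s%s" % (START, kind, ident, END)


def sanitize(user_text):
    """Strip the delimiter characters so user input cannot forge a placeholder."""
    return user_text.replace(START, "").replace(END, "")


def find_placeholders(text):
    """Return (type, id) pairs for every placeholder found in text."""
    return [match.groups() for match in PLACEHOLDER_RE.finditer(text)]
```

Running `sanitize` on the raw input before any placeholder is created is what makes the delimiters safe to rely on later: anything that still looks like a placeholder must have come from the converter itself.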
From: Artem <ne...@gm...> - 2008-07-02 23:15:18
|
Yuri Takhteyev wrote:
> After another second's thought, if we can stick with alphanumeric IDs
> for placeholders, perhaps this would be more parsimonious:
>
> START+<type>+":"+<id>+END
>
> (Type should also be alphanumeric.)
>
> http://gitorious.org/projects/python-markdown/repos/mainline/commits/2edd84e

Done :) Also fixed some bugs.

I also noticed that some tests in our test suite are wrong, for instance (stronintags):

this is a [**test**](http://example.com/)

this is a second **[test](http://example.com)**

reference **[test][]**
reference [**test**][]

The second link should be strong, but the expected output splits it into two ems:

<p>this is a <a href="http://example.com/"><strong>test</strong></a>
</p>
<p>this is a second <em></em><a href="http://example.com">test</a><em></em>
</p>
<p>reference <strong>[test][]</strong>
reference [<strong>test</strong>][]
</p>
From: Artem Y. <ne...@gm...> - 2008-07-03 18:04:01
|
So, all the tests from the test suite are passing now. You can view the code here:

http://gitorious.org/projects/python-markdown/repos/gsoc2008

Here are the timings for the new markdown implementation (tests/markdown-test):

    construction:0.000000:0.000000
    amps-and-angle-encoding:0.070000:131072.000000
    auto-links:0.070000:0.000000
    backlash-escapes:0.210000:516096.000000
    blockquotes-with-dode-blocks:0.010000:0.000000
    hard-wrapped:0.020000:0.000000
    horizontal-rules:0.140000:0.000000
    inline-html-advanced:0.020000:0.000000
    inline-html-comments:0.030000:0.000000
    inline-html-simple:0.160000:0.000000
    links-inline:0.040000:0.000000
    links-reference:0.080000:0.000000
    literal-quotes:0.030000:0.000000
    markdown-documentation-basics:0.430000:1265664.000000
    markdown-syntax:2.120000:5246976.000000
    nested-blockquotes:0.020000:0.000000
    ordered-and-unordered-list:0.300000:0.000000
    strong-and-em-together:0.040000:0.000000
    tabs:0.030000:0.000000
    tidyness:0.030000:0.000000

and for the old markdown implementation:

    construction:0.000000:0.000000
    amps-and-angle-encoding:0.080000:0.000000
    auto-links:0.080000:0.000000
    backlash-escapes:0.290000:397312.000000
    blockquotes-with-dode-blocks:0.020000:0.000000
    hard-wrapped:0.030000:0.000000
    horizontal-rules:0.160000:0.000000
    inline-html-advanced:0.030000:0.000000
    inline-html-comments:0.040000:0.000000
    inline-html-simple:0.150000:0.000000
    links-inline:0.060000:0.000000
    links-reference:0.090000:0.000000
    literal-quotes:0.030000:0.000000
    markdown-documentation-basics:0.520000:1384448.000000
    markdown-syntax:2.500000:2031616.000000
    nested-blockquotes:0.020000:0.000000
    ordered-and-unordered-list:0.330000:0.000000
    strong-and-em-together:0.020000:0.000000
    tabs:0.040000:0.000000
    tidyness:0.030000:0.000000

I think we could also speed up markdown by switching to cElementTree instead of NanoDOM.
From: Waylan L. <wa...@gm...> - 2008-07-04 03:00:14
|
On Thu, Jul 3, 2008 at 8:09 PM, Yuri Takhteyev <qar...@gm...> wrote:
>> On 3-Jul-08, at 5:39 PM, Artem Yunusov wrote:
>>> As far as I understand all the HTML from input replacing by
>>> placeholders, and then inserting back only after serialization. So, it
>>> won't be a problem in this case.
>
> Yes, Artem is right, we are now not attempting to parse HTML submitted
> by the user, we just pass it through. This is what most (all?)
> markdown implementations do. This also means that if the user
> supplies bad HTML (or HTML that is not XHTML), then they will get back
> what they gave us. Garbage in, garbage out.
>
> The consensus on the markdown list seems to have been that policing
> HTML submitted by the user (which would include looking out for XSS
> attacks) should be left to the client, who should filter the output of
> markdown.
>
>> For some reason I was under the impression that the "instant html/
>> xhtml output option" meant "html which includes html in the input".

Sorry if I misled anyone with that statement. I'm with Yuri (and the Markdown community at large). We don't fix bad input. In fact, the raw html never gets put into the DOM anyway. It's stored as plain text and added back in after the dom is converted into a string. Which means that we can't really pass the user a DOM object for them to do as they please because we aren't done with it yet.

However, we could add a keyword to `Markdown.convert()` that specifies the output format of html or xhtml and pass that on to the DOM on serialization.

Btw, if anyone is interested in performance of html serializers and parsers in python, here's a decent comparison:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

--
----
Waylan Limberg
wa...@gm...
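To make the "stored as plain text and added back in after serialization" step concrete, here is a small sketch. `RawHtmlStash`, `convert_with_raw_html` and the placeholder format are hypothetical names chosen for this example, not the project's actual classes, and the standard library's ElementTree stands in for the DOM.

```python
import xml.etree.ElementTree as etree

# Assumed control-character delimiters for the placeholder.
START, END = "\u0002", "\u0003"


class RawHtmlStash:
    """Hypothetical helper that keeps raw HTML out of the tree."""

    def __init__(self):
        self.blocks = []

    def store(self, raw_html):
        """Remember the raw HTML and return a placeholder to put in the tree."""
        self.blocks.append(raw_html)
        return "%shtml:%d%s" % (START, len(self.blocks) - 1, END)

    def restore(self, serialized):
        """Swap every placeholder in the serialized output for its raw HTML."""
        for index, raw_html in enumerate(self.blocks):
            placeholder = "%shtml:%d%s" % (START, index, END)
            serialized = serialized.replace(placeholder, raw_html)
        return serialized


def convert_with_raw_html(raw_html):
    """Build a tiny tree, serialize it, and only then put the raw HTML back."""
    stash = RawHtmlStash()
    root = etree.Element("div")
    etree.SubElement(root, "p").text = "hello"
    etree.SubElement(root, "p").text = stash.store(raw_html)
    serialized = etree.tostring(root, encoding="unicode")
    return stash.restore(serialized)
```

The malformed `<a name=foo>bar</a>` never reaches the XML machinery, so it comes back exactly as supplied. This is also why a tree handed to the caller before the restore step would still contain placeholders instead of the user's HTML.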
From: Yuri T. <qar...@gm...> - 2008-07-03 18:59:48
|
> So, all the tests from test suite are working now.

Great.

> markdown-syntax:2.120000:5246976.000000
> markdown-syntax:2.500000:2031616.000000

So, that's, 26% reduction in time, right? Not bad. But are we now using twice as much memory? Any reason for this?

> I think we also could speed up markdown by switching to cElementTree
> instead of NanoDOM.

Which version of python would this require?

- yuri

--
http://sputnik.freewisdom.org/
From: Artem Y. <ne...@gm...> - 2008-07-03 19:28:55
|
Yuri Takhteyev wrote:
>> markdown-syntax:2.120000:5246976.000000
>> markdown-syntax:2.500000:2031616.000000
>
> So, that's, 26% reduction in time, right? Not bad. But are we now
> using twice as much memory? Any reason for this?

I'll try to investigate this. Maybe it's because we are putting all the InlinePatterns nodes into a dict. And I think the two-step approach eats some memory too.

>> I think we also could speed up markdown by switching to cElementTree
>> instead of NanoDOM.
>
> Which version of python would this require?

cElementTree and ElementTree have been included in the Python standard library since Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later.
From: Waylan L. <wa...@gm...> - 2008-07-03 19:57:33
|
On Thu, Jul 3, 2008 at 3:28 PM, Artem Yunusov <ne...@gm...> wrote:
>>> I think we also could speed up markdown by switching to cElementTree
>>> instead of NanoDOM.
>>
>> Which version of python would this require?
>
> cElementTree and ElementTree was included in Python std lib since
> Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later.

Just remember the C variation will not work on IronPython, Jython or other python implementations. Additionally, there may be (shared) web hosts which will allow a user to copy a pure python module to the server, but not compile a C module. So if we do switch, the import should be in a try...except block and import the non-C variation if C is not available. In such a situation, would ElementTree give us any advantage?

Although, I believe that would give us an instant html/xhtml output option, so it may make sense regardless of any speed increase - that is, as long as it doesn't slow things down.

--
----
Waylan Limberg
wa...@gm...
From: Artem Y. <ne...@gm...> - 2008-07-03 20:31:30
|
Waylan Limberg wrote:
>> cElementTree and ElementTree was included in Python std lib since
>> Python 2.5, but cElementTree itself can be used with Python 1.5.2 and later.
>
> Just remember the c variation will not work on IronPython, Jython or
> other python implementations. Additionally, there may be (shared) web
> hosts which will allow a user to copy a pure python module to the
> server, but not compile a c module. So if we do switch, the import
> should be in a try...except block and import the non-c variation if c
> is not available.

Sure, thanks for pointing that out.

> In such a situation, would ElementTree give us any
> advantage?
>
> Although, I believe that would give us an instant html/xhtml output
> option so it may make sense regardless of any speed increase - that is
> as long as it doesn't slow things down.

Yes, ElementTree supports html/xhtml output, so that would be an advantage, but I think the pure python version will be a little slower than the NanoDOM implementation. The best way to know for sure is to test it :)
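As an illustration of the html/xhtml output option, the ElementTree bundled with current Python exposes it through the `method` argument of `tostring`. Whether the ElementTree release available at the time of this thread offered the same argument is not something this sketch assumes; the `serialize` wrapper and its `output_format` keyword are made up for the example.

```python
import xml.etree.ElementTree as etree


def serialize(output_format):
    """Serialize the same small tree as "html" or as "xml" (XHTML-style)."""
    paragraph = etree.Element("p")
    paragraph.text = "line"
    line_break = etree.SubElement(paragraph, "br")
    line_break.tail = "break"
    # method="html" writes void elements without a closing slash,
    # method="xml" writes them as self-closing tags.
    return etree.tostring(paragraph, encoding="unicode", method=output_format)
```

The tree is built once and only the serialization step changes, which is the appeal of passing an output-format keyword down to the serializer.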
From: Artem Y. <ne...@gm...> - 2008-07-03 20:39:37
|
David Wolever wrote:
> On 3-Jul-08, at 4:57 PM, Waylan Limberg wrote:
>> Just remember the c variation will not work on IronPython, Jython or
>> other python implementations. Additionally, there may be (shared) web
>> hosts which will allow a user to copy a pure python module to the
>> server, but not compile a c module. So if we do switch, the import
>> should be in a try...except block and import the non-c variation if c
>> is not available. In such a situation, would ElementTree give us any
>> advantage?
> My marginally educated but entirely untested guess is that it would
> still be a bit faster... And certainly within the same order of
> magnitude.
>
> Now, the problem: ElementTree will get really, really upset about
> invalid XHTML.
> Really upset.
> So you'll need to figure out how you'll handle bad input (unclosed
> tags, improper quoting, etc):

As far as I understand, all the HTML from the input is replaced by placeholders and inserted back only after serialization. So it won't be a problem in this case.
From: Yuri T. <qar...@gm...> - 2008-07-04 03:21:13
|
> Btw, if anyone is interested in performance of html serializers and
> parsers in python, here's a decent comparison:
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Interesting. It looks like lxml is way way faster than ElementTree. Also, the website for lxml seems to suggest that ElementTree has some serious problems in handling unicode (http://codespeak.net/lxml/compatibility.html, third bullet). This really worries me, more so than performance. This may not affect us, but we need to make sure that ElementTree can handle unicode properly if we would be using it.

However, it looks like lxml is included with nothing at this point, and would require building stuff from C, which may raise the bar for using markdown...

- yuri

--
http://sputnik.freewisdom.org/
From: Artem Y. <ne...@gm...> - 2008-07-04 13:05:23
|
Yuri Takhteyev wrote:
> Interesting. It looks like lxml is way way faster than ElementTree.
> Also, the website for lxml seems to suggest that ElementTree has some
> serious problems in handling unicode
> (http://codespeak.net/lxml/compatibility.html, third bullet). This
> really worries me, more so than performance. This may not affect us,
> but we need to make sure that ElementTree can handle unicode properly
> if we would be using it. However, it looks like lxml is included with
> nothing at this point, and would require building stuff from C, which
> may raise the bar for using markdown...

lxml supports the ElementTree API, so we could write something like this:

    try:
        from lxml import etree
        print "running with lxml.etree"
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.cElementTree as etree
            print "running with cElementTree on Python 2.5+"
        except ImportError:
            try:
                # Python 2.5
                import xml.etree.ElementTree as etree
                print "running with ElementTree on Python 2.5+"
            except ImportError:
                try:
                    # normal cElementTree install
                    import cElementTree as etree
                    print "running with cElementTree"
                except ImportError:
                    try:
                        # normal ElementTree install
                        import elementtree.ElementTree as etree
                        print "running with ElementTree"
                    except ImportError:
                        print "Failed to import ElementTree from any known place"

We can suggest using lxml, but by default cElementTree will be used on python 2.5.

I didn't get what the real problem with unicode is. There are only some general words on the lxml site, and I think that if the problem were really serious, ElementTree wouldn't have been included in the standard Python library. I tried some tests with Russian unicode data and didn't find any problems yet, but I think this issue needs more thorough investigation.
From: Waylan L. <wa...@gm...> - 2008-07-04 14:06:28
|
On Fri, Jul 4, 2008 at 9:05 AM, Artem Yunusov <ne...@gm...> wrote:
> lxml supports ElementTree API, so we could write something like this:
>
> try:
>     from lxml import etree
>     print "running with lxml.etree"
> except ImportError:
>     try:
>         # Python 2.5
>         import xml.etree.cElementTree as etree
>         print "running with cElementTree on Python 2.5+"
>     except ImportError:
>         try:
>             # Python 2.5
>             import xml.etree.ElementTree as etree
>             print "running with ElementTree on Python 2.5+"
>         except ImportError:
>             try:
>                 # normal cElementTree install
>                 import cElementTree as etree
>                 print "running with cElementTree"
>             except ImportError:
>                 try:
>                     # normal ElementTree install
>                     import elementtree.ElementTree as etree
>                     print "running with ElementTree"
>                 except ImportError:
>                     print "Failed to import ElementTree from any known place"
>
> We can suggest to use lxml, but by default cElementTree will be used on
> python 2.5

I like it. However, we should check that lxml is actually making a noticeable difference before we commit to that in a release. Unless Yuri objects, go ahead and implement it in a branch and we'll see how it goes.

> I didn't get what the real problem with unicode is, there are some
> general words at lxml site, and I think if the problem had been quite
> serious, ElementTree wouldn't have included in standard Python library.
> I tried some test with russian unicode data - didn't find any problems
> yet, but I think this issue need more proper investigation.

I get the impression that those comments were referring to the parser, not the serializer. If I'm understanding that correctly, then this should be a non-issue. But we should make sure.

--
----
Waylan Limberg
wa...@gm...