From: Waylan L. <wa...@gm...> - 2008-02-18 23:58:56
We have a few bugs in our tracker that highlight a limitation of the inlinePatterns. I'd like some feedback about which behavior is preferred or if we should look for a different way of doing things.

Currently, it's only possible for one of the following to work in python-markdown:

    A bold [**link**](http://example.com) currently works fine.
    A **bold [link](http://example.com)** currently does not work.

For those that care, here's why:

Markdown parses the first line and finds the link. The label `**link**` is then run through all remaining inlinePatterns and properly identified as bold text.

The second line is parsed and, as the link pattern is run first, the link is found and a link element is created. That line of text is now represented as the following list in python:

    ["A **bold ", <markdown.element>, "** currently does not work.\n"]

Any remaining patterns are then run against each string in that list. The problem should be obvious by now. The opening and closing `**` are in separate strings split by the link element, so no match is made. Finally, the list is looped through and any remaining strings are converted to textnodes and the entire thing is added to the dom inside a paragraph element.

The easy solution is to reverse the order of the inlinePatterns. But then we can't do the first example as the link syntax is broken up in the same way. Now, if no one ever uses that syntax, that would be fine. Of course, both should work, so we may need a new approach to the inlinePatterns. Any ideas?

--
----
Waylan Limberg
wa...@gm...
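The splitting problem described above can be reproduced with a minimal sketch. This is not the python-markdown source: the two regexes, the `Element` stand-in, and `apply_pattern` are all hypothetical simplifications, assumed only for illustration. It shows that once the link pattern has run, neither remaining string contains a complete `**...**` pair.

```python
import re

# Hypothetical, simplified patterns; the real inlinePatterns are more involved.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")


class Element:
    """Stand-in for a dom element (hypothetical, for illustration only)."""

    def __init__(self, tag, text, **attrs):
        self.tag, self.text, self.attrs = tag, text, attrs


def apply_pattern(chunks, pattern, build):
    """Run one pattern over every string chunk, splitting around each match."""
    out = []
    for chunk in chunks:
        if not isinstance(chunk, str):
            out.append(chunk)  # already an element; later patterns skip it
            continue
        pos = 0
        for m in pattern.finditer(chunk):
            out.append(chunk[pos:m.start()])
            out.append(build(m))
            pos = m.end()
        out.append(chunk[pos:])
    return out


def make_link(m):
    return Element("a", m.group(1), href=m.group(2))


def make_bold(m):
    return Element("strong", m.group(1))


line = "A **bold [link](http://example.com)** currently does not work.\n"
chunks = apply_pattern([line], LINK_RE, make_link)   # link pattern runs first
chunks = apply_pattern(chunks, BOLD_RE, make_bold)   # bold never sees a full pair
strings = [c for c in chunks if isinstance(c, str)]
```

After both passes, `strings` still holds the two halves with their unmatched `**` markers, which is the behavior reported in the tracker.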
From: Yuri T. <qar...@gm...> - 2008-02-19 08:07:20
> The easy solution is to reverse the order of the inlinePatterns. But
> then we can't do the first example as the link syntax is broken up in
> the same way. Now, if no one ever uses that syntax, that would be
> fine. Of course, both should work, so we may need a new approach to
> the inlinePatterns. Any ideas?

I've thought about this issue before and I think there are basically two solutions (apart from the zeroth solution of just dealing with it).

One, somewhat tricky, would be to implement some kind of data structure that contains a mixture of strings and dom nodes and works with REs. It's not impossible, and I got half-way there implementing it in 2006, but then didn't have time to finish. What I tried at the time was storing a string which uses a special Unicode character to mark the positions where the nodes are supposed to be included. I.e., if "⊙" is the special character, we could store something like:

    ["A **⊙** currently does not work.", <link>]

This would allow us to run REs (if we are careful) and still get the dom tree in the end.

Another possibility is to only use dom trees for high-level elements (lists, code blocks, quotes, etc.), and reduce inline patterns to simple REs (each run on one element of the larger tree at a time).

The second solution would break some old extensions, but I think it's overall simpler and better. To give credit where credit is due, this is basically Ben Wilson's suggestion from last summer:

https://sourceforge.net/mailarchive/forum.php?thread_name=cc6097050704100456x4daa81f0i9ca0137b6c484ba4%40mail.gmail.com&forum_name=python-markdown-discuss

I don't have time at the moment for such a major overhaul (this would basically be Python-Markdown 2.0), but if someone else does then I think this is the way to go. I am also pretty sure that this would give us a sizeable performance boost.

- yuri
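The placeholder idea from the first solution can be sketched roughly as follows. This is not the 2006 implementation, which is not shown in the thread; `stash_links`, the tuple used to represent a node, and both regexes are assumptions made up for illustration. The point is only that the text stays a single string, so a later RE can still match across the position where a node was removed.

```python
import re

PLACEHOLDER = "\u2299"  # the "⊙" marker; assumed never to occur in real input

# Hypothetical, simplified patterns.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")


def stash_links(text, nodes):
    """Swap each link for the placeholder and remember its node in order."""

    def repl(m):
        # A plain tuple stands in for a dom node here.
        nodes.append(("a", {"href": m.group(2)}, m.group(1)))
        return PLACEHOLDER

    return LINK_RE.sub(repl, text)


nodes = []
text = stash_links(
    "A **bold [link](http://example.com)** currently does not work.", nodes
)

# The text is still one string, so the bold pattern can now match.
bold = BOLD_RE.search(text)
```

Placeholders are resolved by position: the n-th marker in the string corresponds to `nodes[n]`. The "if we are careful" caveat applies, since any pattern that deletes or reorders markers would break that correspondence.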
From: Blake W. <bw...@la...> - 2008-02-19 15:29:24
Yuri Takhteyev wrote:
>> Of course, both should work, so we may need a new approach to
>> the inlinePatterns. Any ideas?
> What I tried at the time
> was storing a string which uses a special Unicode character to mark the
> positions where the nodes are supposed to be included. I.e., if "⊙"
> is the special character, we could store something like:
> ["A **⊙** currently does not work.", <link>]

If your Unicode "character" were instead "%s", you could put the doms in a list, and repeatedly string-interpolate them, i.e. you would end up with (using pretend dom syntax):

    values = ("A %s currently does not work",
              (dom("b", "%s"),
               dom("a", {'href': 'index.html'}, "foo")))

and then you could loop through, doing something like:

    template = values[0]
    substitutions = values[1]
    # each pass fills in the one %s left by the previous substitution
    for subs in substitutions:
        template %= subs

and, as long as your %s were escaped the correct number of times, you should be good to go.

(If you went down that road, I might suggest using a dictionary, so that it was easier to see what was going on. The data in that case (if you just used strings) would look more like:

    values = ("A %(bold)s currently does not work",
              ({'bold': "<b>%(code)s</b>"},
               {'code': "<a href='index.html'>foo</a>"}))

where each processor got to choose its own namespace.)

> This would allow us to run REs (if we are careful) and still get the
> dom tree in the end.

Hmmm... Thinking about that a little started me wondering... If you end up with stuff in the wrong order it still wouldn't work. Unless you ran the inline parsers on the data of each substitution, which is probably a good idea, come to think of it. (And then override that method in the CodeProcessor to not call Markdown on its internal data.)

> Another possibility is to only use dom trees for high-level elements
> (lists, code blocks, quotes, etc), and do reduce inline patterns to
> simple REs (each run on one element of the larger tree at a time).

The nice property that you lose here is that you can no longer guarantee you'll always generate valid html/xml. Of course, you might not care about that, since Markdown will include any old stuff from the user, but if you cared, using dom trees gives you that guarantee.

> I don't have time at the moment for such a major overhaul (this would
> basically be Python-Markdown 2.0), but if someone else does then I
> think this is the way to go. I am also pretty sure that this would
> give us a sizeable performance boost.

You've got to love performance boosts. :)

Later,
Blake.
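The "escaped the correct number of times" caveat in the interpolation idea is easy to get wrong, so here is a small runnable sketch using plain strings and named keys. The helper names `escape_literal` and `render` are hypothetical, and the rule that a literal `%` must be doubled once per remaining pass is an observation about Python's `%` formatting, not something taken from any python-markdown code.

```python
def escape_literal(text, passes):
    """Escape literal '%' so it survives `passes` rounds of %-interpolation.

    Each round turns '%%' into '%', so a literal percent sign needs
    2 ** passes copies up front.
    """
    return text.replace("%", "%" * (2 ** passes))


def render(template, substitutions):
    """Apply each substitution dict in turn, as in the loop above."""
    for subs in substitutions:
        template %= subs
    return template


substitutions = [
    {"bold": "<b>%(link)s</b>"},               # filled in by the first pass
    {"link": '<a href="index.html">foo</a>'},  # filled in by the second pass
]

template = escape_literal("100% of ", len(substitutions)) + "%(bold)s works"
html = render(template, substitutions)
```

Text introduced by a later pass would need fewer levels of escaping than text in the original template, which is part of why the bookkeeping becomes awkward as the number of processors grows.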
From: Yuri T. <qar...@gm...> - 2008-02-19 18:17:45
> The nice property that you lose here is that you can't guarantee you'll always
> generate valid html/xml. Of course, you might not care about that, since
> Markdown will include any old stuff from the user, but if you cared, using dom
> trees gives you that guarantee.

Well, as you noted, since Markdown allows you to insert anything that vaguely looks like html, you already have no guarantee of getting valid HTML in the end. (Unless you are using the "safe" option and discarding HTML.) Note that this approach should still give you valid HTML for any correct input.

- yuri
From: Waylan L. <wa...@gm...> - 2008-02-19 21:30:01
After posting this last night I spent some time playing with an idea I had. What if the inlinepatterns had two stages? In the first stage, each regex is run against the text and any resulting matches are stored for later retrieval. Throughout this process the text remains a single string. Then, only after all the patterns have run and all the matches have been found, do we modify the string by looping through the matches and calling the handleMatch method of each pattern.

The result is here [1]. It doesn't currently handle nesting well (or at all), but that should be fairly easy to add. The api for storage is ugly (really ugly) and it's probably terribly slow. I'm also not using the dom, but that should be easy to change. It certainly is not ready for public consumption. But what do you think? Is it worth further effort?

What I find most compelling is that, with dom support added back in, this continues to support the current extension api. In fact, there should be little to no need for adjustments to existing extensions.

In any event, there are some other good ideas here. Perhaps with a little from everyone, we'll see something that works. And I'm half inclined to drop the dom from inlinepatterns as well.

[1]: https://code.achinghead.com/browser/md_branches/inlinePatterns/patterns.py

--
----
Waylan Limberg
wa...@gm...
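A rough sketch of the two-stage idea, under stated assumptions: this is not the code in the linked branch, the `PATTERNS` table and both function names are hypothetical, and the handlers return strings rather than dom nodes. It also goes one step beyond what the message describes by handling nesting through recursion, which means matches are re-collected for each inner span instead of being stored once.

```python
import re

# Hypothetical pattern table: (regex, handler). Each handler receives the
# match plus the already-rendered inner text of group 1.
PATTERNS = [
    (re.compile(r"\[([^\]]+)\]\(([^)]+)\)"),
     lambda m, inner: '<a href="%s">%s</a>' % (m.group(2), inner)),
    (re.compile(r"\*\*(.+?)\*\*"),
     lambda m, inner: "<strong>%s</strong>" % inner),
]


def find_matches(text):
    """Stage one: record every match without modifying the text."""
    found = []
    for regex, handler in PATTERNS:
        for m in regex.finditer(text):
            found.append((m.start(), m.end(), m, handler))
    return found


def render(text):
    """Stage two: handle outermost matches left to right, recursing inward."""
    # Earliest start first; for equal starts, the longer (outer) match wins.
    found = sorted(find_matches(text), key=lambda f: (f[0], -f[1]))
    out, pos = [], 0
    for start, end, m, handler in found:
        if start < pos:
            continue  # lies inside a span that has already been handled
        out.append(text[pos:start])
        out.append(handler(m, render(m.group(1))))
        pos = end
    out.append(text[pos:])
    return "".join(out)


bold_outside = render("A **bold [link](http://example.com)** works")
bold_inside = render("A bold [**link**](http://example.com) works")
```

Because no pattern ever splits the string before the others have looked at it, pattern order no longer decides which of the two examples works. Partially overlapping matches are simply dropped here, which a real implementation would need to treat more carefully.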