From: Yuri T. <qar...@gm...> - 2007-06-13 03:40:32
This looks good. It is, however, tied to the question of how the
processors work, so those two issues need to be discussed together.
This implementation assumes that everything is text-in-text-out. While
it is possible to do it this way (that's how Markdown.pl works, if I
remember correctly), I think it will get pretty ugly if we try to do
structural markup this way. But looking at your code I am starting to
wonder if perhaps the thing to do is to strike a compromise and work
with a tree at the structural level, while using regexp substitution
for the low-level markup. This way, some handlers can return text but
others can return a tree node:

    "__...__"  -> returns "<strong>...</strong>"
    "## Title" -> returns a tree node for "H2", having applied the
                  remaining handlers recursively to the text node of
                  the child.

I will try to think about this more next weekend.

Another thing: Part of your code seems to implement a general
register-deregister-sort logic which would potentially be useful for
things other than markdown. Have you thought of wrapping it up into a
separate module? This way inside python-markdown one would simply use:

    import treeregistry   ## just making up a name for now

    r = treeregistry.Registry()
    r.register('fulltext','>_begin')
    r.register('split','>fulltext')
    ...
    r.register('[[', 'links', r'(\[\[\s*(.*?)\]\])(s?)', make_link)
    load_extension(r)
    processors = r.get_sorted()

Then from here on we just use a list of pre-sorted processors.

  - yuri

On 6/10/07, Ben Wilson <da...@gm...> wrote:
> Yuri,
>
> Here is code demonstrating what I am referring to. I created a file
> called 'src' which contained a snippet of marked up text, which was
> converted into HTML. Perhaps the merits are clearer, and you'll be
> able to adjust Markdown to use this processor organizer. Both are the
> same, but I believe the latter is optimized.
>
> http://dausha.net/parse.py.txt
> http://dausha.net/heap.py.txt
>
> Ben Wilson
>
>
> On 6/9/07, Yuri Takhteyev <qar...@gm...> wrote:
> > I am sorry I didn't follow up on this thread. It came at a time
> > when I was super busy and I then didn't get around to going back to
> > it, though it's been on the back of my mind.
> >
> > I am willing to discuss the question of how post and pre-processing is
> > organized, even if some of the solutions are not going to be backwards
> > compatible. I wouldn't want to make such changes on a whim, but we
> > can start thinking of version "2.0", which could potentially be quite
> > different. I am not sure I will attempt to do a radical redesign on
> > my own, but if there are other people interested, we could do it as a
> > community project.
> >
> > Ben, can you send us a more detailed explanation of your proposal?
> >
> > However, if we start talking about a radical change ("2.0"), then I
> > think we also need to talk about a more serious architectural problem,
> > which is the uncomfortable mix of regular expressions and DOM trees.
> > The current parser is based on regular expressions: once a regular
> > expression is applied we typically break the string in half, which
> > prevents us from matching later regular expressions. E.g.: we start
> > with "**[foo](x.html)**", and match the link pattern. This gives us a
> > list ["**", DOM_FRAGMENT, "**"]. We now can't match the "**...**".
> >
> > I've thought of a few possible solutions for it:
> >
> > 1. Ditch the DOM and just do a bunch of strings-to-strings
> > transformations. This might be the most straightforward solution, but
> > very un-pythonic and not something I would be interested in doing
> > personally.
> >
> > 2. Write a special data structure that can behave as a list or tree of
> > DOM fragments while also fitting with the current RE library.
> > One way to do that would be to represent the half-parsed document as
> > a string and a list of DOM nodes, where the string would have
> > placeholders for the DOM nodes. In this case, instead of
> > ["**", DOM_FRAGMENT, "**"] we would have an object with fields
> > str = "**\x0**", doms = [DOM_FRAGMENT]. We could then run doc.str
> > through a regular expression, check if any part of the match contains
> > the placeholders, then work out the details.
> >
> > 3. Switch to some other method of parsing. Maybe something from this
> > list: http://nedbatchelder.com/text/python-parsers.html
> >
> > Note that if we go for #3, then the whole preprocessors/postprocessors
> > thing would end up looking very different.
> >
> > - yuri
> >
> > On 6/8/07, Ben Wilson <da...@gm...> wrote:
> > > It's been a while since we discussed this (April), but I thought I'd
> > > come back. I've looked at how PmWiki organizes the various markups as
> > > compared to Markdown.
> > >
> > > In response to my statement that PmWiki had an elegant, ad-hoc method
> > > for adding new markup, Waylan said: "And not very pythonic. I remember
> > > the first time I realized how PmWiki did some very OO like things
> > > without OO code. For PHP it was amazing -
> > > and a pleasure to work with. Especially considering PHP's OO syntax. Ugh!"
> > >
> > > I've since taken the time to analyze how Patrick Michaud accomplished
> > > this. Quite simply, he uses a hash-of-hashes to organize markup
> > > relative to other markup (e.g., Strong before Emphasis). At
> > > parse-time, he then passes this H-o-H through a custom heap algorithm
> > > to divine the absolute parse order. I re-implemented his solution in
> > > Python. It is very Pythonic since his custom heap exists in Python's
> > > heapq library. This means the sorting is likely optimized in C.
> > > I think Waylan "failed to see the forest for all of the trees"
> > > because he allowed the confines of PHP to conceal the simple
> > > elegance of the solution.
> > >
> > > He also focused on the big picture, which was PmWiki, and did not see
> > > the small facet I was focusing on, which was markup management. What
> > > Patrick solved was how to allow a developer simply to insert new
> > > markup into a markup tree. Rather than extend the class, or mess with
> > > the internals of class Markdown, Patrick's solution allows flexibility
> > > in the class. The way Markdown is now, in order for me to add some
> > > behavior I wanted, I had to tinker with the Markdown class's internals.
> > > Now, to add markup, all I need to do is tell my parser that I want it
> > > to occur during inline, or even that it must occur before Emphasis.
> > > Thus, for a wiki engine that allows developers to insert/change markup
> > > by plug-in, the process is very OO. There's a reason Patrick is a PhD.
> > > While PHP is inelegant, and Patrick's code is sometimes confusing, I
> > > am constantly amazed at how he solves problems.
> > >
> > > I invite you to consider PmWiki's Markup engine (specifically function
> > > Markup(); and BuildMarkupRules();) The former instructs on how to
> > > extend markup ad-hoc. The latter instructs how to take the resulting
> > > heap and build a parse tree.
> > >
> > > The only problem is that implementing this would not be backward
> > > compatible. But, this is Pythonic as well, as the BDFL willingly
> > > disregards tradition when warranted. It is not backward compatible
> > > because it totally dismisses the present mechanism for ordering
> > > markup. However, I think the gains are worth the cost.
> > >
> > > Warm Regards,
> > > Ben Wilson
> > >
> > > On 4/10/07, Yuri Takhteyev <qar...@gm...> wrote:
> > > > Just wanted to let you guys know that I am reading this, but don't
> > > > have time to think about it seriously and respond this week.
> > > > However, from what I see so far, I think Ben identified a real
> > > > problem and I would love it if you guys could come up with a
> > > > solution that addresses most of the points that have been brought
> > > > up so far.
> > > >
> > > > Ideally, this solution would maintain backwards compatibility with
> > > > existing extensions. If not, we can still put it in, but we'll have
> > > > to think more carefully of when to release it and whether there should
> > > > be a more general upgrade of how the extension mechanism works.
> > > > (I.e., I think it's ok to change the extension framework once, but not
> > > > every day.)
> > > >
> > > > - yuri
> > > >
> > > > On 4/10/07, Waylan Limberg <wa...@gm...> wrote:
> > > > >
> > > > >
> > > > > Ben Wilson wrote:
> > > > > [snip]
> > > > > > PmWiki has a situation where markups may be added willy-nilly while
> > > > > > maintaining order. It would be rather radical to introduce to
> > > > > > Markdown().
> > > > >
> > > > > And not very pythonic. I remember the first time I realized how PmWiki
> > > > > did some very OO like things without OO code. For PHP it was amazing -
> > > > > and a pleasure to work with. Especially considering PHP's OO syntax. Ugh!
> > > > >
> > > > > But if one tried to use PmWiki's approach in python, it would probably
> > > > > be more work than it's worth. A subclass of dict which maintains order
> > > > > or a class wrapping a list of tuples would be much less effort -- and
> > > > > more pythonic. For that matter, it wouldn't be all that difficult to
> > > > > build a class from scratch for such a purpose.
> > > > >
> > > > > [snip]
> > > > > > want the conversion to occur before/after/during another item. I
> > > > > > mention PmWiki only because I'm very familiar with its approach and
> > > > > > know its author seeks ease-of-customization. Markdown() generally does
> > > > > > not mean to be as customizable as it follows the Markdown standard
> > > > > > format.
> > > > >
> > > > > Ahh, now I know why your name seemed so familiar. Although I've been out
> > > > > of the (PmWiki) loop for about a year now. It is true that Markdown does not
> > > > > come close to PmWiki. If you're looking for more power, perhaps you
> > > > > should look at reStructuredText [1]. It seems to be the python default
> > > > > for markup, is easily extendable [2], and will output LaTeX [3].
> > > > > Personally, I prefer Markdown for its simplicity, but you seem to want
> > > > > power which brings more complexity. Imo, using an established markup
> > > > > language (reST) is better than building your own custom creation.
> > > > >
> > > > > [1]: http://docutils.sourceforge.net/rst.html
> > > > > [2]: http://docutils.sourceforge.net/docs/howto/rst-directives.html
> > > > > [3]: http://docutils.sourceforge.net/docs/user/latex.html
> > > > >
> > > > > --
> > > > > Waylan Limberg
> > > > > wa...@gm...
> > > > >
> > > > > -------------------------------------------------------------------------
> > > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > > > > opinions on IT & business topics through brief surveys-and earn cash
> > > > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > > _______________________________________________
> > > > > Python-markdown-discuss mailing list
> > > > > Pyt...@li...
> > > > > https://lists.sourceforge.net/lists/listinfo/python-markdown-discuss
> > > > >
> > > >
> > > > --
> > > > Yuri Takhteyev
> > > > UC Berkeley School of Information
> > > > http://www.freewisdom.org/
> > > >
> > >
> > > --
> > > Ben Wilson
> > > "Words are the only thing which will last forever" Churchill
> > >
> >
> > --
> > Yuri Takhteyev
> > UC Berkeley School of Information
> > http://www.freewisdom.org/
> >
>
> --
> Ben Wilson
> "Words are the only thing which will last forever" Churchill
>

--
Yuri Takhteyev
UC Berkeley School of Information
http://www.freewisdom.org/
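
The register-and-sort idea discussed in this thread could be sketched
roughly as follows. This is only an illustration under assumptions, not
the code from parse.py/heap.py and not PmWiki's algorithm: the Registry
class, the two-argument register(name, where) form, the '<name' prefix
for "before", and the reserved "_begin" anchor are all hypothetical, and
the four-argument register() call from the message above is not
modelled. It uses a plain topological sort, with heapq only to break
ties by registration order.

```python
import heapq


class Registry:
    """Hypothetical registry that turns relative ordering into a flat list."""

    def __init__(self):
        self._order = {"_begin": 0}      # name -> registration index
        self._after = {"_begin": set()}  # name -> names that must come first

    def register(self, name, where):
        """Place `name` after ('>other') or before ('<other') another name."""
        if len(where) < 2 or where[0] not in ("<", ">"):
            raise ValueError("expected '>name' or '<name', got %r" % (where,))
        other = where[1:]
        self._order.setdefault(name, len(self._order))
        self._after.setdefault(name, set())
        if where[0] == ">":
            self._after[name].add(other)
        else:
            self._after.setdefault(other, set()).add(name)

    def get_sorted(self):
        """Return registered names in an order satisfying every constraint."""
        mentioned = set(self._after)
        for deps in self._after.values():
            mentioned |= deps
        unknown = mentioned - set(self._order)
        if unknown:
            raise ValueError("unregistered names: %s" % sorted(unknown))

        # Work on a copy so the registry can be sorted more than once.
        pending = {name: set(deps) for name, deps in self._after.items()}
        ready = [(self._order[n], n) for n, deps in pending.items() if not deps]
        heapq.heapify(ready)

        result = []
        while ready:
            _, name = heapq.heappop(ready)  # earliest-registered ready name
            result.append(name)
            for other, deps in pending.items():
                if name in deps:
                    deps.remove(name)
                    if not deps:
                        heapq.heappush(ready, (self._order[other], other))

        if len(result) != len(pending):
            raise ValueError("circular ordering constraints")
        return [name for name in result if name != "_begin"]


r = Registry()
r.register("fulltext", ">_begin")
r.register("split", ">fulltext")
r.register("links", ">split")
r.register("strong", "<emphasis")   # may refer to a name registered later
r.register("emphasis", ">links")
processors = r.get_sorted()
print(processors)  # ['fulltext', 'split', 'links', 'strong', 'emphasis']
```

An extension would only need to call register() with a constraint such
as '<emphasis' and would not have to know the absolute position of
anything else. Contradictory constraints show up as a ValueError when
the list is sorted.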