Thread: [Python-markdown-discuss] Limitations of inlinePatterns

[Python-markdown-discuss] Limitations of inlinePatterns

From: Waylan L. <wa...@gm...> - 2008-02-18 23:58:56

We have a few bugs in our tracker that highlight a limitation of the
inlinePatterns. I'd like some feedback about which behavior is
preferred or if we should look for a different way of doing things.

Currently, it's only possible for one of the following to work in
python-markdown:

    A bold [**link**](http://example.com) currently works fine.

    A **bold [link](http://example.com)**  currently does not work.

For those that care, here's why:

Markdown parses the first line and finds the link. The label
`**link**` is then run through all remaining inlinePatterns and
properly identified as bold text.

The second line is parsed and, as the link pattern is run first, the
link is found and a link element is created. That line of text is now
represented as the following list in python:

    ["A **bold ", <markdown.element>, "** currently does not work.\n"]

Any remaining patterns are then run against each string in that list.
The problem should be obvious by now. The opening and closing `**` are
in separate strings split by the link element, so no match is made.
Finally, the list is looped through and any remaining strings are
converted to textnodes and the entire thing is added to the dom inside
a paragraph element.
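
To make the failure concrete, here is a minimal sketch of the splitting behaviour described above. It is not the actual python-markdown code: the `Element` class, the `apply_pattern` helper, and both regular expressions are simplified stand-ins invented for illustration.

```python
import re

# Simplified, hypothetical patterns (the real inlinePatterns are more involved).
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
BOLD = re.compile(r"\*\*(.+?)\*\*")


class Element:
    """Hypothetical stand-in for a DOM node."""

    def __init__(self, tag, text):
        self.tag = tag
        self.text = text


def apply_pattern(parts, pattern, tag):
    """Run one pattern over every string in parts, splitting around matches."""
    out = []
    for part in parts:
        if not isinstance(part, str):
            out.append(part)  # already an element, patterns skip it
            continue
        pos = 0
        for m in pattern.finditer(part):
            if m.start() > pos:
                out.append(part[pos:m.start()])
            out.append(Element(tag, m.group(1)))
            pos = m.end()
        if pos < len(part):
            out.append(part[pos:])
    return out


line = "A **bold [link](http://example.com)** currently does not work."
parts = apply_pattern([line], LINK, "a")       # link pattern runs first
parts = apply_pattern(parts, BOLD, "strong")   # bold never sees both "**"
bold_found = any(isinstance(p, Element) and p.tag == "strong" for p in parts)
```

After both passes `parts` still holds two plain strings around the link element, and `bold_found` is false, because neither string contains a complete `**...**` pair.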

The easy solution is to reverse the order of the inlinePatterns. But
then we can't do the first example as the link syntax is broken up in
the same way. Now, if no one ever uses that syntax, that would be
fine. Of course, both should work, so we may need a new approach to
the inlinePatterns. Any ideas?

-- 
----
Waylan Limberg
wa...@gm...

Re: [Python-markdown-discuss] Limitations of inlinePatterns

From: Yuri T. <qar...@gm...> - 2008-02-19 08:07:20

>  The easy solution is to reverse the order of the inlinePatterns. But
>  then we can't do the first example as the link syntax is broken up in
>  the same way. Now, if no one ever uses that syntax, that would be
>  fine. Of course, both should work, so we may need a new approach to
>  the inlinePatterns. Any ideas?


I've thought about this issue before and I think there are basically
two solutions (apart from the zeroth solution of just dealing with
it).  One, somewhat tricky, would be to implement some kind of data
structure that contains a mixture of strings and dom nodes and works
with REs.  It's not impossible, and I got half-way there implementing
it in 2006, but then didn't have time to finish.  What I tried at the
time was storing a string which uses a special Unicode character to
mark the positions where the nodes are supposed to be included.  I.e.,
if "⊙" is the special character, we could store something like:

    ["A **bold ⊙**  currently does not work.", <link>]

This would allow us to run REs (if we are careful) and still get the
dom tree in the end.
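
A rough sketch of the placeholder idea, under the assumption that the special character never occurs in real input. The `stash_links` helper, the tuple used as a "node", and both regular expressions are hypothetical and only serve the illustration.

```python
import re

PLACEHOLDER = "\u2299"  # the "⊙" character used in the message above

# Simplified, hypothetical patterns.
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
BOLD = re.compile(r"\*\*(.+?)\*\*")


def stash_links(text):
    """Replace each link with the placeholder and keep the nodes on the side."""
    nodes = []

    def repl(m):
        # A tuple stands in for a real dom node here.
        nodes.append(("a", m.group(2), m.group(1)))
        return PLACEHOLDER

    return LINK.sub(repl, text), nodes


text, nodes = stash_links("A **bold [link](http://example.com)** now matches.")
bold = BOLD.search(text)  # the text is still one string, so "**...**" is intact
```

Because the text stays a single string, the bold pattern now sees both `**` markers, and the stashed node can be put back wherever the placeholder ends up.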

Another possibility is to only use dom trees for high-level elements
(lists, code blocks, quotes, etc), and reduce inline patterns to
simple REs (each run on one element of the larger tree at a time).
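
As an illustration only, this second approach could look roughly like the following, where inline patterns are plain regex substitutions applied to the text of one block element. The patterns and the `render_inline` name are assumptions, and the sketch ignores escaping and edge cases such as `**` inside a URL.

```python
import re

# Hypothetical inline patterns as (regex, replacement) pairs, applied in order.
INLINE_PATTERNS = [
    (re.compile(r"\*\*(.+?)\*\*"), r"<strong>\1</strong>"),
    (re.compile(r"\[([^\]]+)\]\(([^)]+)\)"), r'<a href="\2">\1</a>'),
]


def render_inline(text):
    """Apply every inline pattern to the text of a single block element."""
    for pattern, replacement in INLINE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


first = render_inline("A bold [**link**](http://example.com) works.")
second = render_inline("A **bold [link](http://example.com)** works.")
```

Since each substitution leaves a string behind, the next pattern can still match across the markup produced by the previous one, so both orderings of bold and link come out nested correctly in this small case.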

The second solution would break some old extensions, but I think it's
overall simpler and better.  To give credit where credit is due, this
is basically Ben Wilson's suggestion from last summer:

https://sourceforge.net/mailarchive/forum.php?thread_name=cc6097050704100456x4daa81f0i9ca0137b6c484ba4%40mail.gmail.com&forum_name=python-markdown-discuss

I don't have time at the moment for such a major overhaul (this would
basically be Python-Markdown 2.0), but if someone else does then I
think this is the way to go.  I am also pretty sure that this would
give us a sizeable performance boost.

  - yuri

Re: [Python-markdown-discuss] Limitations of inlinePatterns

From: Blake W. <bw...@la...> - 2008-02-19 15:29:24

Yuri Takhteyev wrote:
>>  Of course, both should work, so we may need a new approach to
>>  the inlinePatterns. Any ideas?
> What I tried at the time
> was storing a string which uses a special Unicode character to mark the
> positions where the nodes are supposed to be included.  I.e., if "⊙"
> is the special character, we could store something like:
>     ["A **bold ⊙**  currently does not work.", <link>]

If your Unicode "character" were instead "%s", you could put the doms in a list, 
and repeatedly string-interpolate them...

i.e. you would end up with (using pretend dom syntax):
    values = ("A %s currently does not work",
              (dom("b", "%s"),
               dom("a", {'href': 'index.html'}, "foo")))

and then you could loop through, doing something like:
    template = values[0]
    substitutions = values[1]
    for subs in substitutions:
        template %= subs

and, as long as your %s were escaped the correct number of times, you should be
good to go.  If you went down that road, I might suggest using a dictionary, so
that it was easier to see what was going on.  The data in that case (if you just
used strings) would look more like:

    values = ("A %(bold)s currently does not work",
              ({'bold': "<b>%(code)s</b>"},
               {'code': "<a href='index.html'>foo</a>"}))

where each processor gets to choose its own namespace.
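
Here is a small runnable sketch of the dictionary variant, using plain strings instead of dom nodes. The `escape_percent` and `interpolate` helpers are hypothetical names; the escaping rule shown (doubling each literal `%` once per remaining pass) is one way to satisfy the "escaped the correct number of times" condition.

```python
def escape_percent(text, passes):
    """Escape literal '%' so it survives the given number of % passes."""
    return text.replace("%", "%" * (2 ** passes))


def interpolate(template, substitutions):
    """Apply each substitution dict in turn, as in the loop above."""
    for subs in substitutions:
        template %= subs
    return template


substitutions = (
    {"bold": "<b>%(code)s</b>"},
    {"code": "<a href='index.html'>foo</a>"},
)

# Literal text is escaped once up front; the markers are added afterwards.
template = escape_percent("100% sure: ", len(substitutions)) + "A %(bold)s works"
result = interpolate(template, substitutions)
```

Each pass halves the run of `%` characters, so a literal percent sign needs `2 ** passes` copies going in. That bookkeeping is the main cost of this approach.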

> This would allow us to run REs (if we are careful) and still get the
> dom tree in the end.

Hmmm...  Thinking about that a little started me wondering...  If you end up 
with stuff in the wrong order it still wouldn't work.  Unless you ran the inline 
parsers on the data of each substitution, which is probably a good idea, come to 
think of it.  (And then override that method in the CodeProcessor to not call 
Markdown on its internal data.)

> Another possibility is to only use dom trees for high-level elements
> (lists, code blocks, quotes, etc), and reduce inline patterns to
> simple REs (each run on one element of the larger tree at a time).

The nice property that you lose here is that you can't guarantee you'll always 
generate valid html/xml.  Of course, you might not care about that, since 
Markdown will include any old stuff from the user, but if you cared, using dom 
trees gives you that guarantee.

> I don't have time at the moment for such a major overhaul (this would
> basically be Python-Markdown 2.0), but if someone else does then I
> think this is the way to go.  I am also pretty sure that this would
> give us a sizeable performance boost.

You've got to love performance boosts.  :)

Later,
Blake.

Re: [Python-markdown-discuss] Limitations of inlinePatterns

From: Yuri T. <qar...@gm...> - 2008-02-19 18:17:45

>  The nice property that you lose here is that you can't guarantee you'll always
>  generate valid html/xml.  Of course, you might not care about that, since
>  Markdown will include any old stuff from the user, but if you cared, using dom
>  trees gives you that guarantee.

Well, as you noted, since Markdown allows you to insert anything that
vaguely looks like html, you already have no guarantee of getting
valid HTML in the end.  (Unless you are using the "safe" option and
discarding HTML.)  Note that this approach should still give you valid
HTML for any correct input.

  - yuri

Re: [Python-markdown-discuss] Limitations of inlinePatterns

From: Waylan L. <wa...@gm...> - 2008-02-19 21:30:01

After posting this last night I spent some time playing with an idea I
had. What if the inlinepatterns had two stages? In the first stage,
each regex is run against the text and any resulting matches are
stored for later retrieval. Throughout this process the text remains a
single string. Then, only after all the patterns have run and all the
matches have been found, do we modify the string by looping through
the matches and calling the handleMatch method of each pattern.
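
The two stages might be sketched as follows. This is not the code from the linked branch: the patterns, the `collect`, `render`, and `convert` helpers, and the module-level `handle_match` function (standing in for each pattern's handleMatch method) are all hypothetical, and the recursion into group 1 is an added assumption about how nesting could be handled.

```python
import re

# Hypothetical patterns; group 1 is always the inner text.
PATTERNS = [
    ("link", re.compile(r"\[([^\]]+)\]\(([^)]+)\)")),
    ("bold", re.compile(r"\*\*(.+?)\*\*")),
]


def handle_match(name, m, inner):
    """Stand-in for a pattern's handleMatch; returns a string, not a dom node."""
    if name == "link":
        return '<a href="%s">%s</a>' % (m.group(2), inner)
    return "<strong>%s</strong>" % inner


def collect(text):
    """Stage 1: run every pattern on the untouched string, store the matches."""
    return [(name, m) for name, rx in PATTERNS for m in rx.finditer(text)]


def render(text, found, lo, hi):
    """Stage 2: build output for text[lo:hi], recursing into nested matches."""
    inside = [(n, m) for n, m in found if lo <= m.start() and m.end() <= hi]
    inside.sort(key=lambda nm: (nm[1].start(), -nm[1].end()))  # outermost first
    out, pos = [], lo
    for name, m in inside:
        if m.start() < pos:
            continue  # nested in an earlier match, handled by its recursion
        out.append(text[pos:m.start()])
        out.append(handle_match(name, m, render(text, found, *m.span(1))))
        pos = m.end()
    out.append(text[pos:hi])
    return "".join(out)


def convert(text):
    return render(text, collect(text), 0, len(text))


first = convert("A bold [**link**](http://example.com) works.")
second = convert("A **bold [link](http://example.com)** works.")
```

Because every pattern sees the original, unsplit string, the order of the patterns no longer decides which of the two examples works.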

The result is here [1]. It doesn't currently handle nesting well (or
at all), but that should be fairly easy to add. The api for storage is
ugly (really ugly) and it's probably terribly slow. I'm also not using
the dom, but that should be easy to change. It certainly is not ready
for public consumption. But what do you think? Is it worth further
effort?

What I find most compelling is that with dom support added back in,
this continues to support the current extension api. In fact, there
should be little to no need for adjustments to existing extensions.
In any event, there are some other good ideas here. Perhaps with a
little from everyone, we'll see something that works. And I'm half
inclined to drop the dom from inlinepatterns as well.

[1]: https://code.achinghead.com/browser/md_branches/inlinePatterns/patterns.py



-- 
----
Waylan Limberg
wa...@gm...