[Docutils-develop] Re: nested inline markup


David Goodger <go...@py...> writes:

> David Abrahams wrote:
>  > Here's another question: the way I've coded it, once problematic
>  > text is found, I stop trying to recursively find nested markup.
>  > Would you like it to warn about the problem as it does now, and then
>  > just continue to parse it for inline markup?
>
> Can you show examples of what your code does now?  Here is some input:
>
>      *emph **strong *prob ``literal``, end of strong**, end of emph*
>
> Ideally, I'd like this to parse to:
>
>      <paragraph>
>          <emphasis>
>              emph
>              <strong>
>                  strong
>                  <problematic ...>
>                      *
>                  prob
>                  <literal>
>                      literal
>                  , end of strong
>              , end of emph

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emph 
                <strong>
                    strong 
                    <problematic id="id2" refid="id1">
                        *
                    prob 
                    <literal>
                        literal
                    , end of strong
                , end of emph
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

> I don't know if that's possible or not though.  How can the parser
> know that the end-emphasis matches the first start-emphasis and not
> the second?

It can't.  I guess that's the interpretation which results in the
fewest errors, but I think we could probably construct cases where the
other interpretation would be more sensible:

      *emph *prob **strong ``literal``, end of strong**, end of emph*

It happens to work because we always match from outer to inner right
now, but I am beginning to realize that the algorithm's going to have
to change to handle other cases, and it's going to get tough if you
want to retain the behavior you're seeing above.  The problem arises
with situations like:

     *emphasis *within emphasis* and such*

or, worse:

    :emphasis:`foo :strong:`bar` baz`

which right now are ending up as:

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emphasis 
                <problematic id="id2" refid="id1">
                    *
                within emphasis
             and such*
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

  <document source="<stdin>">
      <paragraph>
          <emphasis>
              foo :strong:
              <problematic id="id2" refid="id1">
                  `
              bar
           baz`
      <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
          <paragraph>
              Inline interpreted text or phrase reference start-string without end-string.

The problem is that we're greedy in searching for end-strings (call
this problem 1).

> You've probably picked up on this already, but the rules for
> start-string and end-string may have to change to allow nested inline
> markup.  For example, in "*emph **strong***" the end-strings are
> adjacent.  Or do we have to disambiguate with "\ " (as in
> "*emph **strong**\ *")?

Right now, the rule for matching emphasis end-strings is that you can
match the last in an odd-length string of stars.

That's kind of a hack, and it clearly doesn't allow this (problem 2)
to work, either:

       *emphasis within *emphasis**

Of course you can disambiguate the latter as

       *emphasis within *emphasis*\ *

So I don't consider that to be serious.
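
To make the odd-length rule concrete, here is a minimal sketch of it in isolation. This is not the code in states.py; the name emphasis_end is made up for illustration, and the real rule has further conditions (such as what may precede and follow the end-string) that are ignored here.

```python
import re

STAR_RUN = re.compile(r"\*+")


def emphasis_end(text, pos=0):
    """Return the index of the emphasis end-string at or after pos, or -1.

    Hypothetical helper: the end-string is taken to be the last star of the
    first odd-length run of stars.
    """
    for run in STAR_RUN.finditer(text, pos):
        if len(run.group()) % 2 == 1:
            return run.end() - 1
    return -1
```

With "emph **strong***" the even run is skipped and the last star of the "***" run is chosen. With the tail of problem 2, "emphasis**", the only run has even length, so no end-string is found at all.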

I think I understand what the algorithm for solving problem 1 must be
roughly:

def parse(remaining):
    search for start string
    if found:
        parse2(start, remaining)

def parse2(start, remaining):
    children = []
    messages = []
    while 1:
        search *simultaneously* for end(start) and for all starts.
        if another_start is found first:
            children += text(remaining[:position(another_start)])
            n, m, remaining = parse2(another_start, remaining)
            children += n
            messages += m
        elif end(start) is found:
            # text before the end-string belongs to this node
            children += text(remaining[:position(end(start))])
            remaining = remaining[end_pos(end(start)):]
            return [ some_node(..., *children) ], messages, remaining
        else:
            error

This algorithm would address problem 2 as well.
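
For what it's worth, the pseudocode above can be turned into a small runnable sketch. Everything in it is an assumption made for illustration: the MARKUP table, the function names, the tuple-based node representation, and the heavily simplified start-string and end-string rules (no roles, no backslash escapes). It is not the states.py implementation. When a start-string has no end-string, the sketch records a warning, emits a "problematic" node, and resumes parsing right after the start-string, which can cost a lot of re-parsing on pathological input.

```python
# Hypothetical, simplified table of start-strings (longest first) -> node name.
MARKUP = [("**", "strong"), ("``", "literal"), ("*", "emphasis")]


def is_end(text, i, start, lo):
    """End-string: matches `start`, lies past the span's first character,
    and follows a non-whitespace character (simplified rule)."""
    return i > lo and text.startswith(start, i) and not text[i - 1].isspace()


def find_start(text, i, lo):
    """Return (start_string, node_name) if markup may start at i, else None.

    Simplified rule: the start-string sits at the beginning of the enclosing
    span or after whitespace, and is followed by non-whitespace.
    """
    if i > lo and not text[i - 1].isspace():
        return None
    for start, name in MARKUP:
        after = text[i + len(start):i + len(start) + 1]
        if text.startswith(start, i) and after and not after.isspace():
            return start, name
    return None


def parse_span(text, pos, start=None):
    """Parse text[pos:] up to the end-string matching `start`.

    Returns (children, messages, new_pos, closed).  Children are plain
    strings or (node_name, children) tuples.
    """
    children, messages = [], []
    lo = i = pos  # lo: where this span's content begins; pos: unflushed text
    while i < len(text):
        # At each position, look for our own end-string and for any start.
        if start is not None and is_end(text, i, start, lo):
            if i > pos:
                children.append(text[pos:i])
            return children, messages, i + len(start), True
        # No nested markup inside literals.
        found = None if start == "``" else find_start(text, i, lo)
        if found is None:
            i += 1
            continue
        inner_start, name = found
        if i > pos:
            children.append(text[pos:i])
        body = i + len(inner_start)
        inner, inner_msgs, end, closed = parse_span(text, body, inner_start)
        if closed:
            children.append((name, inner))
            messages += inner_msgs
            i = pos = end
        else:
            # Warn, mark the start-string, and keep parsing after it.
            messages.append(f"Inline {name} start-string without end-string.")
            children.append(("problematic", [inner_start]))
            i = pos = body
    if pos < len(text):
        children.append(text[pos:])
    return children, messages, len(text), False


def parse_inline(text):
    """Parse a whole paragraph; returns (children, messages)."""
    children, messages, _, _ = parse_span(text, 0)
    return children, messages
```

Under these simplified rules, parse_inline("*emphasis *within emphasis* and such*") yields a single emphasis node containing a nested emphasis node and no warnings, and "*emphasis within *emphasis**" nests the same way without needing "\ ". For input with a genuinely unmatched start-string, the interpretation this sketch picks will not always be the one you'd prefer.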

Now, if you want the current behavior for your example in the
beginning of this message, we'll have exactly the problem you
anticipated, unless we bend over backwards to avoid it.  I can think
of ways to do that, but none of them are natural or pretty or
efficient.  IMO it's better to live with the fact that when there's
problematic text, the (mis)interpretation of the text you get may not
be the one you'd prefer.

BTW, even to get this far I had to do some semi-massive refactorings
and renamings.  It was just too hairy in there; I was losing track of
what things meant.  I hope you don't find my changes too odious.  I'll
send you a current diff for states.py so you can look at it, though it
clearly will take some substantial work from where the code is now to
get the above algorithm implemented, and I'm not sure I have
time... :-( But knowing me, I won't be able to resist solving the
problem ;-) so I'll spend the time I don't have :-|.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com