From: Mark N. <no...@so...> - 2004-01-19 20:41:15
David Abrahams wrote:

> David Goodger <go...@py...> writes:
>
> > Another off-list message from Mark, in case you don't already have
> > it.
>
> You really should ask his permission to forward these to me.

FWIW, I hereby give permission to forward any private message with
[Docutils-develop] in the subject line to the docutils-develop list
unless the contents of the message specifically state otherwise.

> > Thanks. I (and my parser) disagree with your output for the third
> > test, namely::
> >
> >     *emphasis **strong**
> >
> > Given that a problematic has to occur somewhere in this, I think
> > it's more predictable (and easier to parse/explain) to match the
> > very first and last asterisks and generate the problematic on the
> > strong start-string::
> >
> >     <paragraph>
> >         <emphasis>
> >             emphasis \n\
> >             <problematic id="id2" refid="id1">
> >                 **
> >             strong*
> >     <system_message backrefs="id2" id="id1" level="2" line="1"
> >      source="nested_inline03.rst" type="WARNING">
> >         <paragraph>
> >             Inline strong start-string without end-string.
>
> I disagree strongly with the assessment that it's easier to parse and
> explain that way.
>
> David, I hope you do, too, because I don't have a stratagem for
> producing that result, and I wouldn't want to.

First, let me start by saying that I appreciate the opportunity to
discuss the issues! It really livens up my day and helps keep me sharp
to have someone to debate these things with.

OK, let me see if I can explain my reasoning. The assumption that my
parser makes is that the first matching start-string will always get
paired if there is any valid end-string for it. That way, the parsing
with nesting turned on matches the parsing with nesting turned off (I
have a switch in my parser for compatibility with the python parser)
on the outer level.

If you parse the above string while not allowing nesting (use the
current docutils), it gives::

    <emphasis>
        emphasis **strong*

Turning on nesting simply identifies the "**" as an unmatched strong
start-string. As for an algorithm, see the pseudo-code in my previous
message (which wasn't sent yet when you penned your reply).

--Mark
From: David A. <da...@bo...> - 2004-01-21 12:27:21
David Goodger <go...@py...> writes:

> I believe the algorithm that David Abrahams described off-list does
> it:
>
>     always search for the end-strings of *all* pending open strings
>     in inward-to-outward order along with all possible openers
>
> Or as Beni Cherniavsky wrote on docutils-users:
>
>     All we need then is to look simultaneously for any (nested)
>     start-string OR the pending end-string (we will need a stack of
>     them).
>
> And I believe that's what David Abrahams is implementing (but I have
> yet to look at the in-progress code he sent me yesterday, off-list).

What I sent you yesterday was the (non-working) state the code was in
before I realized what the right algorithm was, so while it was
similar it was not quite doing that. I'm still hoping to carve out a
few hours to implement it for real in the next couple of days.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-02-15 17:06:07
David Abrahams wrote:

> So, Dave G., do you have a verdict on this stuff? It was hard work;
> I'd rather it didn't languish. I'm willing to polish!

I don't have a "verdict" yet, because I haven't been able to get a
good grasp on it yet. I've tried, but the code was and is complex and
I haven't been able to wrap my mind around it -- perhaps a sign that
some major refactoring is required to make it easier to grok. I invite
anyone who is interested to take a look at the code, on the "nesting"
branch. I *will* get to it, but I don't know when.

--
David Goodger
From: David A. <da...@bo...> - 2004-02-15 18:14:28
David Goodger <go...@py...> writes:

> David Abrahams wrote:
>
>> So, Dave G., do you have a verdict on this stuff? It was hard work;
>> I'd rather it didn't languish. I'm willing to polish!
>
> I don't have a "verdict" yet, because I haven't been able to get a
> good grasp on it yet. I've tried, but the code was and is complex
> and I haven't been able to wrap my mind around it -- perhaps a sign
> that some major refactoring is required to make it easier to grok.

Maybe so. I thought I was simplifying what you already had there, but
there's no doubt about it: handling nesting added complexity of its
own. If you have any specific questions I'll be happy to try to answer
them, and to try to make the answers more self-evident in the code.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-01-20 05:06:14
Mark Nodine wrote:

> OK, let me see if I can explain my reasoning. The assumption that
> my parser makes is that the first matching start-string will always
> get paired if there is any valid end-string for it. That way, the
> parsing with nesting turned on matches the parsing with nesting
> turned off (I have a switch in my parser for compatibility with
> the python parser) on the outer level.

If we make nested inline markup a part of the reStructuredText spec,
it will be a big change, and it may break some markup. But I think we
should make a clean break, and not carry around backwards-compatibility
baggage. In other words, don't try to be backwards-compatible at that
level.

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: Mark N. <no...@so...> - 2004-01-20 16:08:34
David Goodger wrote:

> Mark Nodine wrote:
>
> > OK, let me see if I can explain my reasoning. The assumption that
> > my parser makes is that the first matching start-string will always
> > get paired if there is any valid end-string for it. That way, the
> > parsing with nesting turned on matches the parsing with nesting
> > turned off (I have a switch in my parser for compatibility with
> > the python parser) on the outer level.
>
> If we make nested inline markup a part of the reStructuredText spec,
> it will be a big change, and it may break some markup. But I think
> we should make a clean break, and not carry around
> backwards-compatibility baggage. In other words, don't try to be
> backwards-compatible at that level.

I can always change my parser to whatever we think best, but we need
to spec out precisely the principles on which it's based so that
someone looking at the markup should be able to figure out what will
happen. (Note: this specification is only an issue for malformed
markup.)

The rules I used are:

1. Match the first start-string s1 for which an end-string e1 is
   found. (This is the only rule the user of rST needs to know; the
   other two rules simply specify how to deal with nested markup.)
2. If there is a new start-string s2 before end-string e1 AND
   the start-string s2 has an end-string s2 AND the original
   start-string s1 still has an end-string e3 (which may be e1 if
   e2 occurs before e1) after end-string e2, then extend the match
   of s1 to e3.
3. Repeat step 2 while there is still a new start string between
   e2 and e3.

It's a sort of greedy matching algorithm that in essence defers
unmatched markup to the inner nesting levels. One alternative would be
to try to do some kind of minimization scheme that minimizes the
number of unmatched start-strings. I would not favor such an approach
on the grounds that it is best to optimize code for the common case
(which should be well-formed start-string,end-string pairs) and not to
introduce extra complexity for error cases. There may be other
alternatives that you can suggest. I do consider it important that the
user be able to have some idea of what will happen.

I'm not sure I see any pressing reason why parsing ::

    *emphasis **strong**

as ::

    (*)emphasis ((**)strong(**))
    ^^^

is preferable to ::

    ((*)emphasis (**)strong*(*))
                 ^^^^

except possibly that the former parsing considers one character (*) as
an error and the latter considers two characters (**) as the error.
Please give me the rules by which we arrive at the former parsing as
long as it does not require any global optimization.

--Mark
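[Editorial note: to make the three rules above concrete, they might be
sketched in Python along the following lines, with '[' and ']'
standing in for start- and end-strings. This is an illustrative toy
with an invented function name, not Mark's actual parser.]

```python
def find_outer_end(text, s1=0):
    """Greedy matching in the spirit of rules 1-3 above: tentatively
    match the start-string at index s1 to the first end-string, then
    extend the match over any nestable inner pair. Returns the index
    of the end-string matched to s1, or None if s1 is unmatched."""
    e1 = text.find(']', s1 + 1)      # rule 1: first candidate end-string
    if e1 == -1:
        return None                  # no end-string for s1 at all
    scan = s1 + 1                    # where we look for new start-strings
    while True:
        s2 = text.find('[', scan)
        if s2 == -1 or s2 > e1:      # rule 3: no new start before e1
            return e1
        e2 = text.find(']', s2 + 1)  # rule 2: does s2 have an end-string?
        e3 = text.find(']', e2 + 1) if e2 != -1 else -1
        if e3 == -1:                 # s2 cannot nest; skip it (it will be
            scan = s2 + 1            # flagged as unmatched on recursion)
        else:
            e1, scan = e3, e2 + 1    # "rename e1 to e3": extend the match

print(find_outer_end("[emphasis [strong]"))                         # -> 17
print(find_outer_end("[text [nested markup] [unmatched markup ]"))  # -> 40
```

On `[emphasis [strong]` this returns the index of the final `]`: the
outer start-string gets matched, and the inner one is left to be
flagged on the recursive pass, which is the outer-match parse Mark
describes.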
From: David G. <go...@py...> - 2004-01-21 02:36:58
Mark Nodine wrote:

> I can always change my parser to whatever we think best, but
> we need to spec out precisely the principles on which it's based
> so that someone looking at the markup should be able to figure out
> what will happen. (Note: this specification is only an issue for
> malformed markup.)

I think I disagree with that. The spec has to describe what is correct
markup, and how correct markup is parsed, such that two independent
implementations will produce document-model-equivalent output. If
there's incorrect markup, I don't think it's necessary that the two
implementations produce identical *error* output. As long as they both
indicate that there *is* an error, I think the form of that indication
may differ.

Or am I misreading?

> The rules I used are:
>
> 1. Match the first start-string s1 for which an end-string e1 is
>    found. (This is the only rule the user of rST needs to know; the
>    other two rules simply specify how to deal with nested markup.)
> 2. If there is a new start-string s2 before end-string e1 AND
>    the start-string s2 has an end-string s2 AND the original
                                           ^^ s/b e2?
>    start-string s1 still has an end-string e3 (which may be e1 if
>    e2 occurs before e1) after end-string e2, then extend the match
>    of s1 to e3.
> 3. Repeat step 2 while there is still a new start string between
>    e2 and e3.
>
> It's a sort of greedy matching algorithm that in essence defers
> unmatched markup to the inner nesting levels.

I don't understand how the rules work. Could you provide an example?

> One alternative would be to try to do some kind of minimization
> scheme that minimizes the number of unmatched start-strings. I
> would not favor such an approach on the grounds that it is best to
> optimize code for the common case (which should be well-formed
> start-string,end-string pairs) and not to introduce extra complexity
> for error cases.

I concur: don't optimize for errors.

> There may be other alternatives that you can suggest. I do consider
> it important that the user be able to have some idea of what will
> happen.

The user ought to expect an error message, hopefully helpful, in case
of bad markup. Exactly what that message says is up to the
implementation, IMO.

> I'm not sure I see any pressing reason why parsing ::
>
>     *emphasis **strong**
>
> as ::
>
>     (*)emphasis ((**)strong(**))
>     ^^^
>
> is preferable to ::
>
>     ((*)emphasis (**)strong*(*))
>                  ^^^^

I don't know exactly why, but it seems obvious to me. For example,
replace the markup with brackets and parse:

    [emphasis [strong]

Which open-bracket is unmatched? It seems clear to me that the inner
matches take priority.

> except possibly that the former parsing considers one character (*)
> as an error and the latter considers two characters (**) as the
> error.

No, nothing to do with the length of the erroneous markup.

> Please give me the rules by which we arrive at the former
> parsing as long as it does not require any global optimization.

I believe the algorithm that David Abrahams described off-list does
it:

    always search for the end-strings of *all* pending open strings
    in inward-to-outward order along with all possible openers

Or as Beni Cherniavsky wrote on docutils-users:

    All we need then is to look simultaneously for any (nested)
    start-string OR the pending end-string (we will need a stack of
    them).

And I believe that's what David Abrahams is implementing (but I have
yet to look at the in-progress code he sent me yesterday, off-list).

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
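[Editorial note: the stack-based scan that Goodger and Cherniavsky
describe might be sketched like this. It is again a bracket toy of my
own devising, ignoring rST's real rules about what may precede or
follow start- and end-strings.]

```python
def parse_inline(text):
    """At each position look simultaneously for a new start-string or
    the end-string of the innermost pending start-string (kept on a
    stack). Brackets stand in for real start-/end-strings. Returns
    the matched (start, end) index pairs plus unmatched leftovers."""
    pending = []                 # stack of open start-string positions
    matched, unmatched = [], []
    for i, ch in enumerate(text):
        if ch == '[':
            pending.append(i)    # new (possibly nested) start-string
        elif ch == ']':
            if pending:
                matched.append((pending.pop(), i))  # innermost match wins
            else:
                unmatched.append(i)                 # stray end-string
    unmatched.extend(pending)    # leftover starts are flagged as errors
    return matched, unmatched

print(parse_inline("[emphasis [strong]"))  # -> ([(10, 17)], [0])
```

On `[emphasis [strong]` the inner pair matches and the *outer* `[` is
the one reported unmatched, which is exactly the priority Goodger
argues for.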
From: David A. <da...@bo...> - 2004-01-21 16:29:06
David Goodger <go...@py...> writes:

> I don't know exactly why, but it seems obvious to me. For example,
> replace the markup with brackets and parse:
>
>     [emphasis [strong]
>
> Which open-bracket is unmatched? It seems clear to me that the inner
> matches take priority.

Well, the example is more analogous to:

    [emphasis [[strong]]

To me, ']]' reads as a single token, especially after just having seen
the single token '[['. In the case of '**' and '**' the association is
even stronger, since they're lexically identical.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: Mark N. <no...@so...> - 2004-01-21 16:32:59
David Goodger wrote:

> Mark Nodine wrote:
>
> > I can always change my parser to whatever we think best, but
> > we need to spec out precisely the principles on which it's based
> > so that someone looking at the markup should be able to figure out
> > what will happen. (Note: this specification is only an issue for
> > malformed markup.)
>
> I think I disagree with that. The spec has to describe what is
> correct markup, and how correct markup is parsed, such that two
> independent implementations will produce document-model-equivalent
> output. If there's incorrect markup, I don't think it's necessary
> that the two implementations produce identical *error* output. As
> long as they both indicate that there *is* an error, I think the
> form of that indication may differ.
>
> Or am I misreading?

The reason I said what I did is that correctly nested markup is
well-defined, just as it is with parentheses, with one small change to
the spec. Right now an "end-string" has a certain set of defined
characters which it must precede in order to be considered an
end-string. However, as a technical point, nested markup requires that
we add a stipulation that an end-string can also immediately precede
the end-string of a nesting markup. So I think the only relevance of
the rules below is with incorrect markup.

> > The rules I used are:
> >
> > 1. Match the first start-string s1 for which an end-string e1 is
> >    found. (This is the only rule the user of rST needs to know;
> >    the other two rules simply specify how to deal with nested
> >    markup.)
> > 2. If there is a new start-string s2 before end-string e1 AND
> >    the start-string s2 has an end-string e2 AND the original
> >    start-string s1 still has an end-string e3 (which may be e1 if
> >    e2 occurs before e1) after end-string e2, then extend the match
> >    of s1 to e3 (rename e1 to e3).
> > 3. Repeat step 2 while there is still a new start string between
> >    e2 and e3.
> >
> > It's a sort of greedy matching algorithm that in essence defers
> > unmatched markup to the inner nesting levels.
>
> I don't understand how the rules work. Could you provide an example?

Sure. Sorry about the small mistake above, which I've corrected. I'll
use braces for start- and end-strings as you did below. The "*"
indicates the point from which we start looking for new start-strings.
Commentary appears below. ::

    Step 1 (rule 1):
    [text [nested markup] [unmatched markup ]
    ^*                  ^
    s1                  e1

    Step 2 (rule 2 before rename):
    [text [nested markup] [unmatched markup ]
    ^     ^             ^                   ^
    s1    s2            e1,e2               e3

    Step 3 (rule 2 after rename):
    [text [nested markup] [unmatched markup ]
    ^                    *                  ^
    s1                                      e1

    Step 4 (repeat rule 2):
    [text [nested markup] [unmatched markup ]
    ^                     ^                 ^
    s1                    s2                e1,e2

    Step 5 (continue rule 2 after rejecting s2):
    [text [nested markup] [unmatched markup ]
    ^                      *                ^
    s1                                      e1

At step 1, we tentatively identify the first close brace e1 as the
match to s1. In step 2, we notice that there is a valid start-string
s2 before e1, so we check to see if it has a matching end-string e2.
It does, and it happens to be the same as e1, so we check whether we
could still find an end mark e3 for s1 using s2-e2 as a nested markup.
We can, so we skip over s2-e2 and start looking for a new start-string
(step 3). In step 4, we find a new start-string s2 whose matching
end-string e2 coincides with our new e1. Since we cannot find a new e3
that would allow s2-e2 to be a nested markup, we ignore s2, which will
later be flagged as an error. In step 5, we look for another
start-string and do not find one before e1. We therefore return that
e1 is the matching end-string for s1. We will then parse ::

    text [nested markup] [unmatched markup

recursively, which will process the nested markup and report the
second bracket as an unmatched start-string.

> I don't know exactly why, but it seems obvious to me. For example,
> replace the markup with brackets and parse:
>
>     [emphasis [strong]
>
> Which open-bracket is unmatched? It seems clear to me that the inner
> matches take priority.

I guess this is a matter of preference, but I believe it only matters,
as I said, with incorrect markup.

> I believe the algorithm that David Abrahams described off-list does
> it:
>
>     always search for the end-strings of *all* pending open strings
>     in inward-to-outward order along with all possible openers
>
> Or as Beni Cherniavsky wrote on docutils-users:
>
>     All we need then is to look simultaneously for any (nested)
>     start-string OR the pending end-string (we will need a stack of
>     them).
>
> And I believe that's what David Abrahams is implementing (but I have
> yet to look at the in-progress code he sent me yesterday, off-list).

The reason I used the algorithm I did is that it fits neatly into a
recursive analysis, so my parse tree basically corresponds to the
nesting of my subroutine calls (recursive descent). David A.'s
algorithm would also work, but I don't see any reason to break up the
recursive-descent nature of my parser to use an inner-match preference
for the error cases.

--Mark
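[Editorial note: Mark's recursive-descent arrangement, in which the
parse tree mirrors the tree of subroutine calls, might be sketched as
follows. Same bracket toy as before; `outer_end` inlines the greedy
rules from earlier in the thread, and all names are my own, not
Mark's.]

```python
def parse(text):
    """Recursive descent with outer-match preference: find the outer
    end-string first, then recurse on the contents, so each nesting
    level is handled by one subroutine call. Brackets stand in for
    real start-/end-strings; an illustrative sketch only."""

    def outer_end(s1):
        # Greedy matching per the three rules earlier in the thread.
        e1, scan = text.find(']', s1 + 1), s1 + 1
        while e1 != -1:
            s2 = text.find('[', scan)
            if s2 == -1 or s2 > e1:
                return e1                # no new start before e1: done
            e2 = text.find(']', s2 + 1)
            e3 = text.find(']', e2 + 1) if e2 != -1 else -1
            if e3 == -1:
                scan = s2 + 1            # s2 stays unmatched
            else:
                e1, scan = e3, e2 + 1    # extend the outer match to e3
        return None

    nodes, i = [], 0
    while i < len(text):
        if text[i] == '[':
            end = outer_end(i)
            if end is None:
                nodes.append(('unmatched-start', i))
                i += 1
            else:                        # recurse on the span's contents
                nodes.append(('span', parse(text[i + 1:end])))
                i = end + 1
        else:                            # plain text up to the next start
            j = i
            while j < len(text) and text[j] != '[':
                j += 1
            nodes.append(('text', text[i:j]))
            i = j
    return nodes

print(parse("[text [nested markup] [unmatched markup ]"))
```

Parsing the example traced above yields one outer span whose children
include the nested span and an unmatched-start error node, matching
the walkthrough: the outer pair matches, and the second inner
start-string is the one reported as an error.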
From: David A. <da...@bo...> - 2004-01-22 13:20:25
Mark Nodine <no...@so...> writes:

> The reason I used the algorithm I did is that it fits neatly into
> a recursive analysis, so my parse tree basically corresponds to
> the nesting of my subroutine calls (recursive descent). David A.'s
> algorithm would also work, but I don't see any reason to break up
> the recursive-descent nature of my parser to use an inner-match
> preference for the error cases.

My algorithm would also use recursive descent and would not have N^2
complexity to boot (though I admit that's probably more of an
aesthetic concern in most real-life cases).

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-02-13 18:52:18
So, Dave G., do you have a verdict on this stuff? It was hard work;
I'd rather it didn't languish. I'm willing to polish!

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com