From: Mark N. <no...@so...> - 2004-01-19 20:41:15
David Abrahams wrote:

> David Goodger <go...@py...> writes:
>
> > Another off-list message from Mark, in case you don't already have
> > it.
>
> You really should ask his permission to forward these to me.

FWIW, I hereby give permission to forward any private message with
[Docutils-develop] in the subject line to the docutils-develop list
unless the contents of the message specifically state otherwise.

> > Thanks. I (and my parser) disagree with your output for the third
> > test, namely::
> >
> >     *emphasis **strong**
> >
> > Given that a problematic has to occur somewhere in this, I think
> > it's more predictable (and easier to parse/explain) to match the
> > very first and last asterisks and generate the problematic on the
> > strong start-string::
> >
> >     <paragraph>
> >         <emphasis>
> >             emphasis \n\
> >             <problematic id="id2" refid="id1">
> >                 **
> >             strong*
> >     <system_message backrefs="id2" id="id1" level="2" line="1"
> >      source="nested_inline03.rst" type="WARNING">
> >         <paragraph>
> >             Inline strong start-string without end-string.
>
> I disagree strongly with the assessment that it's easier to parse and
> explain that way.
>
> David, I hope you do, too, because I don't have a stratagem for
> producing that result, and I wouldn't want to.

First, let me start by saying that I appreciate the opportunity to
discuss the issues! It really livens up my day and helps keep me sharp
to have someone to debate these things with.

OK, let me see if I can explain my reasoning. The assumption that my
parser makes is that the first matching start-string will always get
paired if there is any valid end-string for it. That way, the parsing
with nesting turned on matches the parsing with nesting turned off (I
have a switch in my parser for compatibility with the python parser)
on the outer level.

If you parse the above string while not allowing nesting (use the
current docutils), it gives::

    <emphasis>
        emphasis **strong*

Turning on nesting simply identifies the "**" as an unmatched strong
start-string. As for an algorithm, see the pseudo-code in my previous
message (which wasn't sent yet when you penned your reply).

--Mark
From: David A. <da...@bo...> - 2004-01-21 12:27:21
David Goodger <go...@py...> writes:

> I believe the algorithm that David Abrahams described off-list does
> it:
>
>     always search for the end-strings of *all* pending open strings
>     in inward-to-outward order along with all possible openers
>
> Or as Beni Cherniavsky wrote on docutils-users:
>
>     All we need then is to look simultaneously for any (nested)
>     start-string OR the pending end-string (we will need a stack of
>     them).
>
> And I believe that's what David Abrahams is implementing (but I have
> yet to look at the in-progress code he sent me yesterday, off-list).

What I sent you yesterday was the (non-working) state the code was in
before I realized what the right algorithm was, so while it was
similar it was not quite doing that. I'm still hoping to carve out a
few hours to implement it for real in the next couple of days.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-02-15 17:06:07
David Abrahams wrote:

> So, Dave G., do you have a verdict on this stuff? It was hard work;
> I'd rather it didn't languish. I'm willing to polish!

I don't have a "verdict" yet, because I haven't been able to get a
good grasp on it yet. I've tried, but the code was and is complex and
I haven't been able to wrap my mind around it -- perhaps a sign that
some major refactoring is required to make it easier to grok. I invite
anyone who is interested to take a look at the code, on the "nesting"
branch. I *will* get to it, but I don't know when.

--
David Goodger
From: David A. <da...@bo...> - 2004-02-15 18:14:28
David Goodger <go...@py...> writes:

> David Abrahams wrote:
>
>> So, Dave G., do you have a verdict on this stuff? It was hard work;
>> I'd rather it didn't languish. I'm willing to polish!
>
> I don't have a "verdict" yet, because I haven't been able to get a
> good grasp on it yet. I've tried, but the code was and is complex
> and I haven't been able to wrap my mind around it -- perhaps a sign
> that some major refactoring is required to make it easier to grok.

Maybe so. I thought I was simplifying what you already had there, but
there's no doubt about it: handling nesting added complexity of its
own. If you have any specific questions I'll be happy to try to answer
them, and to try to make the answers more self-evident in the code.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-01-20 05:06:14
Mark Nodine wrote:

> OK, let me see if I can explain my reasoning. The assumption that
> my parser makes is that the first matching start-string will always
> get paired if there is any valid end-string for it. That way, the
> parsing with nesting turned on matches the parsing with nesting
> turned off (I have a switch in my parser for compatibility with
> the python parser) on the outer level.

If we make nested inline markup a part of the reStructuredText spec,
it will be a big change, and it may break some markup. But I think we
should make a clean break, and not carry around backwards-compatibility
baggage. In other words, don't try to be backwards-compatible at that
level.

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: Mark N. <no...@so...> - 2004-01-20 16:08:34
David Goodger wrote:

> Mark Nodine wrote:
>
> > OK, let me see if I can explain my reasoning. The assumption that
> > my parser makes is that the first matching start-string will always
> > get paired if there is any valid end-string for it. That way, the
> > parsing with nesting turned on matches the parsing with nesting
> > turned off (I have a switch in my parser for compatibility with
> > the python parser) on the outer level.
>
> If we make nested inline markup a part of the reStructuredText spec,
> it will be a big change, and it may break some markup. But I think
> we should make a clean break, and not carry around
> backwards-compatibility baggage. In other words, don't try to be
> backwards-compatible at that level.

I can always change my parser to whatever we think best, but we need
to spec out precisely the principles on which it's based so that
someone looking at the markup should be able to figure out what will
happen. (Note: this specification is only an issue for malformed
markup.)

The rules I used are:

1. Match the first start-string s1 for which an end-string e1 is
   found. (This is the only rule the user of rST needs to know; the
   other two rules simply specify how to deal with nested markup.)
2. If there is a new start-string s2 before end-string e1 AND
   the start-string s2 has an end-string s2 AND the original
   start-string s1 still has an end-string e3 (which may be e1 if
   e2 occurs before e1) after end-string e2, then extend the match
   of s1 to e3.
3. Repeat step 2 while there is still a new start string between
   e2 and e3.

It's a sort of greedy matching algorithm that in essence defers
unmatched markup to the inner nesting levels. One alternative would be
to try to do some kind of minimization scheme that minimizes the
number of unmatched start-strings. I would not favor such an approach
on the grounds that it is best to optimize code for the common case
(which should be well-formed start-string,end-string pairs) and not to
introduce extra complexity for error cases. There may be other
alternatives that you can suggest. I do consider it important that the
user be able to have some idea of what will happen.

I'm not sure I see any pressing reason why parsing ::

    *emphasis **strong**

as ::

    (*)emphasis ((**)strong(**))
    ^^^

is preferable to ::

    ((*)emphasis (**)strong*(*))
                 ^^^^

except possibly that the former parsing considers one character (*) as
an error and the latter considers two characters (**) as the error.
Please give me the rules by which we arrive at the former parsing as
long as it does not require any global optimization.

--Mark
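[Editorial note: to make the three rules above concrete, they might be
sketched in Python along the following lines, with '[' and ']'
standing in for start- and end-strings. This is an illustrative toy
with an invented function name, not Mark's actual parser.]

```python
def find_outer_end(text, s1=0):
    """Greedy matching in the spirit of rules 1-3 above: tentatively
    match the start-string at index s1 to the first end-string, then
    extend the match over any nestable inner pair. Returns the index
    of the end-string matched to s1, or None if s1 is unmatched."""
    e1 = text.find(']', s1 + 1)      # rule 1: first candidate end-string
    if e1 == -1:
        return None                  # no end-string for s1 at all
    scan = s1 + 1                    # where we look for new start-strings
    while True:
        s2 = text.find('[', scan)
        if s2 == -1 or s2 > e1:      # rule 3: no new start before e1
            return e1
        e2 = text.find(']', s2 + 1)  # rule 2: does s2 have an end-string?
        e3 = text.find(']', e2 + 1) if e2 != -1 else -1
        if e3 == -1:                 # s2 cannot nest; skip it (it will be
            scan = s2 + 1            # flagged as unmatched on recursion)
        else:
            e1, scan = e3, e2 + 1    # "rename e1 to e3": extend the match

print(find_outer_end("[emphasis [strong]"))                         # -> 17
print(find_outer_end("[text [nested markup] [unmatched markup ]"))  # -> 40
```

On `[emphasis [strong]` this returns the index of the final `]`: the
outer start-string gets matched, and the inner one is left to be
flagged on the recursive pass, which is the outer-match parse Mark
describes.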
From: David G. <go...@py...> - 2004-01-21 02:36:58
Mark Nodine wrote:

> I can always change my parser to whatever we think best, but
> we need to spec out precisely the principles on which it's based
> so that someone looking at the markup should be able to figure out
> what will happen. (Note: this specification is only an issue for
> malformed markup.)

I think I disagree with that. The spec has to describe what is correct
markup, and how correct markup is parsed, such that two independent
implementations will produce document-model-equivalent output. If
there's incorrect markup, I don't think it's necessary that the two
implementations produce identical *error* output. As long as they both
indicate that there *is* an error, I think the form of that indication
may differ.

Or am I misreading?

> The rules I used are:
>
> 1. Match the first start-string s1 for which an end-string e1 is
>    found. (This is the only rule the user of rST needs to know; the
>    other two rules simply specify how to deal with nested markup.)
> 2. If there is a new start-string s2 before end-string e1 AND
>    the start-string s2 has an end-string s2 AND the original
                                           ^^ s/b e2?
>    start-string s1 still has an end-string e3 (which may be e1 if
>    e2 occurs before e1) after end-string e2, then extend the match
>    of s1 to e3.
> 3. Repeat step 2 while there is still a new start string between
>    e2 and e3.
>
> It's a sort of greedy matching algorithm that in essence defers
> unmatched markup to the inner nesting levels.

I don't understand how the rules work. Could you provide an example?

> One alternative would be to try to do some kind of minimization
> scheme that minimizes the number of unmatched start-strings. I
> would not favor such an approach on the grounds that it is best to
> optimize code for the common case (which should be well-formed
> start-string,end-string pairs) and not to introduce extra complexity
> for error cases.

I concur: don't optimize for errors.

> There may be other alternatives that you can suggest. I do consider
> it important that the user be able to have some idea of what will
> happen.

The user ought to expect an error message, hopefully helpful, in case
of bad markup. Exactly what that message says is up to the
implementation, IMO.

> I'm not sure I see any pressing reason why parsing ::
>
>     *emphasis **strong**
>
> as ::
>
>     (*)emphasis ((**)strong(**))
>     ^^^
>
> is preferable to ::
>
>     ((*)emphasis (**)strong*(*))
>                  ^^^^

I don't know exactly why, but it seems obvious to me. For example,
replace the markup with brackets and parse:

    [emphasis [strong]

Which open-bracket is unmatched? It seems clear to me that the inner
matches take priority.

> except possibly that the former parsing considers one character (*)
> as an error and the latter considers two characters (**) as the
> error.

No, nothing to do with the length of the erroneous markup.

> Please give me the rules by which we arrive at the former
> parsing as long as it does not require any global optimization.

I believe the algorithm that David Abrahams described off-list does
it:

    always search for the end-strings of *all* pending open strings
    in inward-to-outward order along with all possible openers

Or as Beni Cherniavsky wrote on docutils-users:

    All we need then is to look simultaneously for any (nested)
    start-string OR the pending end-string (we will need a stack of
    them).

And I believe that's what David Abrahams is implementing (but I have
yet to look at the in-progress code he sent me yesterday, off-list).

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
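[Editorial note: the stack-based scan that Goodger and Cherniavsky
describe might be sketched like this. It is again a bracket toy of my
own devising, ignoring rST's real rules about what may precede or
follow start- and end-strings.]

```python
def parse_inline(text):
    """At each position look simultaneously for a new start-string or
    the end-string of the innermost pending start-string (kept on a
    stack). Brackets stand in for real start-/end-strings. Returns
    the matched (start, end) index pairs plus unmatched leftovers."""
    pending = []                 # stack of open start-string positions
    matched, unmatched = [], []
    for i, ch in enumerate(text):
        if ch == '[':
            pending.append(i)    # new (possibly nested) start-string
        elif ch == ']':
            if pending:
                matched.append((pending.pop(), i))  # innermost match wins
            else:
                unmatched.append(i)                 # stray end-string
    unmatched.extend(pending)    # leftover starts are flagged as errors
    return matched, unmatched

print(parse_inline("[emphasis [strong]"))  # -> ([(10, 17)], [0])
```

On `[emphasis [strong]` the inner pair matches and the *outer* `[` is
the one reported unmatched, which is exactly the priority Goodger
argues for.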
From: David A. <da...@bo...> - 2004-01-21 16:29:06
David Goodger <go...@py...> writes:

> I don't know exactly why, but it seems obvious to me. For example,
> replace the markup with brackets and parse:
>
>     [emphasis [strong]
>
> Which open-bracket is unmatched? It seems clear to me that the inner
> matches take priority.

Well, the example is more analogous to:

    [emphasis [[strong]]

To me, ']]' reads as a single token, especially after just having seen
the single token '[['. In the case of '**' and '**' the association is
even stronger, since they're lexically identical.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: Mark N. <no...@so...> - 2004-01-21 16:32:59
David Goodger wrote:

> Mark Nodine wrote:
>
> > I can always change my parser to whatever we think best, but
> > we need to spec out precisely the principles on which it's based
> > so that someone looking at the markup should be able to figure out
> > what will happen. (Note: this specification is only an issue for
> > malformed markup.)
>
> I think I disagree with that. The spec has to describe what is
> correct markup, and how correct markup is parsed, such that two
> independent implementations will produce document-model-equivalent
> output. If there's incorrect markup, I don't think it's necessary
> that the two implementations produce identical *error* output. As
> long as they both indicate that there *is* an error, I think the
> form of that indication may differ.
>
> Or am I misreading?

The reason I said what I did is that correctly nested markup is
well-defined, just as it is with parentheses, with one small change to
the spec. Right now an "end-string" has a certain set of defined
characters which it must precede in order to be considered an
end-string. However, as a technical point, nested markup requires that
we add a stipulation that an end-string can also immediately precede
the end-string of a nesting markup. So I think the only relevance of
the rules below is with incorrect markup.

> > The rules I used are:
> >
> > 1. Match the first start-string s1 for which an end-string e1 is
> >    found. (This is the only rule the user of rST needs to know;
> >    the other two rules simply specify how to deal with nested
> >    markup.)
> > 2. If there is a new start-string s2 before end-string e1 AND
> >    the start-string s2 has an end-string e2 AND the original
> >    start-string s1 still has an end-string e3 (which may be e1 if
> >    e2 occurs before e1) after end-string e2, then extend the match
> >    of s1 to e3 (rename e1 to e3).
> > 3. Repeat step 2 while there is still a new start string between
> >    e2 and e3.
> >
> > It's a sort of greedy matching algorithm that in essence defers
> > unmatched markup to the inner nesting levels.
>
> I don't understand how the rules work. Could you provide an example?

Sure. Sorry about the small mistake above, which I've corrected. I'll
use braces for start- and end-strings as you did below. The "*"
indicates the point from which we start looking for new start-strings.
Commentary appears below. ::

    Step 1 (rule 1):
    [text [nested markup] [unmatched markup ]
    ^*                  ^
    s1                  e1

    Step 2 (rule 2 before rename):
    [text [nested markup] [unmatched markup ]
    ^     ^             ^                   ^
    s1    s2            e1,e2               e3

    Step 3 (rule 2 after rename):
    [text [nested markup] [unmatched markup ]
    ^                    *                  ^
    s1                                      e1

    Step 4 (repeat rule 2):
    [text [nested markup] [unmatched markup ]
    ^                     ^                 ^
    s1                    s2                e1,e2

    Step 5 (continue rule 2 after rejecting s2):
    [text [nested markup] [unmatched markup ]
    ^                      *                ^
    s1                                      e1

At step 1, we tentatively identify the first close brace e1 as the
match to s1. In step 2, we notice that there is a valid start-string
s2 before e1, so we check to see if it has a matching end-string e2.
It does, and it happens to be the same as e1, so we check whether we
could still find an end mark e3 for s1 using s2-e2 as a nested markup.
We can, so we skip over s2-e2 and start looking for a new start-string
(step 3). In step 4, we find a new start-string s2 whose matching
end-string e2 coincides with our new e1. Since we cannot find a new e3
that would allow s2-e2 to be a nested markup, we ignore s2, which will
later be flagged as an error. In step 5, we look for another
start-string and do not find one before e1. We therefore return that
e1 is the matching end-string for s1. We will then parse ::

    text [nested markup] [unmatched markup

recursively, which will process the nested markup and report the
second bracket as an unmatched start-string.

> I don't know exactly why, but it seems obvious to me. For example,
> replace the markup with brackets and parse:
>
>     [emphasis [strong]
>
> Which open-bracket is unmatched? It seems clear to me that the inner
> matches take priority.

I guess this is a matter of preference, but I believe it only matters,
as I said, with incorrect markup.

> I believe the algorithm that David Abrahams described off-list does
> it:
>
>     always search for the end-strings of *all* pending open strings
>     in inward-to-outward order along with all possible openers
>
> Or as Beni Cherniavsky wrote on docutils-users:
>
>     All we need then is to look simultaneously for any (nested)
>     start-string OR the pending end-string (we will need a stack of
>     them).
>
> And I believe that's what David Abrahams is implementing (but I have
> yet to look at the in-progress code he sent me yesterday, off-list).

The reason I used the algorithm I did is that it fits neatly into a
recursive analysis, so my parse tree basically corresponds to the
nesting of my subroutine calls (recursive descent). David A.'s
algorithm would also work, but I don't see any reason to break up the
recursive-descent nature of my parser to use an inner-match preference
for the error cases.

--Mark
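[Editorial note: Mark's recursive-descent arrangement, in which the
parse tree mirrors the tree of subroutine calls, might be sketched as
follows. Same bracket toy as before; `outer_end` inlines the greedy
rules from earlier in the thread, and all names are my own, not
Mark's.]

```python
def parse(text):
    """Recursive descent with outer-match preference: find the outer
    end-string first, then recurse on the contents, so each nesting
    level is handled by one subroutine call. Brackets stand in for
    real start-/end-strings; an illustrative sketch only."""

    def outer_end(s1):
        # Greedy matching per the three rules earlier in the thread.
        e1, scan = text.find(']', s1 + 1), s1 + 1
        while e1 != -1:
            s2 = text.find('[', scan)
            if s2 == -1 or s2 > e1:
                return e1                # no new start before e1: done
            e2 = text.find(']', s2 + 1)
            e3 = text.find(']', e2 + 1) if e2 != -1 else -1
            if e3 == -1:
                scan = s2 + 1            # s2 stays unmatched
            else:
                e1, scan = e3, e2 + 1    # extend the outer match to e3
        return None

    nodes, i = [], 0
    while i < len(text):
        if text[i] == '[':
            end = outer_end(i)
            if end is None:
                nodes.append(('unmatched-start', i))
                i += 1
            else:                        # recurse on the span's contents
                nodes.append(('span', parse(text[i + 1:end])))
                i = end + 1
        else:                            # plain text up to the next start
            j = i
            while j < len(text) and text[j] != '[':
                j += 1
            nodes.append(('text', text[i:j]))
            i = j
    return nodes

print(parse("[text [nested markup] [unmatched markup ]"))
```

Parsing the example traced above yields one outer span whose children
include the nested span and an unmatched-start error node, matching
the walkthrough: the outer pair matches, and the second inner
start-string is the one reported as an error.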
From: David A. <da...@bo...> - 2004-01-22 13:20:25
Mark Nodine <no...@so...> writes:

> The reason I used the algorithm I did is that it fits neatly into
> a recursive analysis, so my parse tree basically corresponds to
> the nesting of my subroutine calls (recursive descent). David A.'s
> algorithm would also work, but I don't see any reason to break up
> the recursive-descent nature of my parser to use an inner-match
> preference for the error cases.

My algorithm would also use recursive descent and would not have N^2
complexity to boot (though I admit that's probably more of an
aesthetic concern in most real-life cases).

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-02-13 18:52:18
So, Dave G., do you have a verdict on this stuff? It was hard work;
I'd rather it didn't languish. I'm willing to polish!

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com