From: David A. <da...@bo...> - 2004-01-17 22:06:41
David Goodger <go...@py...> writes:

> David Abrahams wrote:
> > Here's another question: the way I've coded it, once problematic
> > text is found, I stop trying to recursively find nested markup.
> > Would you like it to warn about the problem as it does now, and then
> > just continue to parse it for inline markup?
>
> Can you show examples of what your code does now?  Here is some input:
>
>     *emph **strong *prob ``literal``, end of strong**, end of emph*
>
> Ideally, I'd like this to parse to:
>
>     <paragraph>
>         <emphasis>
>             emph
>             <strong>
>                 strong
>                 <problematic ...>
>                     *
>                 prob
>                 <literal>
>                     literal
>                 , end of strong
>             , end of emph

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emph
                <strong>
                    strong
                    <problematic id="id2" refid="id1">
                        *
                    prob
                    <literal>
                        literal
                    , end of strong
                , end of emph
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

> I don't know if that's possible or not though.  How can the parser
> know that the end-emphasis matches the first start-emphasis and not
> the second?

It can't.  I guess that's the interpretation which results in the
fewest errors, but I think we could probably construct cases where the
other interpretation would be more sensible:

    *emph *prob **strong ``literal``, end of strong**, end of emph*

It happens to work because we always match from outer to inner right
now, but I am beginning to realize that the algorithm's going to have
to change to handle other cases, and it's going to get tough if you
want to retain the behavior you're seeing above.
The problem arises with situations like:

    *emphasis *within emphasis* and such*

or, worse:

    :emphasis:`foo :strong:`bar` baz`

which right now are ending up as:

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emphasis
                <problematic id="id2" refid="id1">
                    *
                within emphasis
            and such*
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                foo :strong:
                <problematic id="id2" refid="id1">
                    `
                bar
            baz`
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline interpreted text or phrase reference start-string without end-string.

The problem is that we're greedy in searching for end-strings (call
this problem 1).

> You've probably picked up on this already, but the rules for
> start-string and end-string may have to change to allow nested inline
> markup.  For example, in "*emph **strong***" the end-strings are
> adjacent.  Or do we have to disambiguate with "\ " (as in
> "*emph **strong**\ *")?

Right now, the rule for matching emphasis end-strings is that you can
match the last in an odd-length string of stars.  That's kind of a
hack, and it clearly doesn't allow

    *emphasis within *emphasis**

(problem 2) to work, either.  Of course you can disambiguate the
latter as

    *emphasis within *emphasis*\ *

So I don't consider that to be serious.  I think I understand roughly
what the algorithm for solving problem 1 must be:

    def parse(remaining):
        search for start string
        if found:
            parse2(start, remaining)

    def parse2(start, remaining):
        children = []
        messages = []
        while 1:
            search *simultaneously* for end(start) and for all starts,
            taking whichever occurs first.
            if another_start is found:
                # text before the nested start belongs to this node
                children += text(remaining[:position(another_start)])
                n, m, remaining = parse2(another_start,
                                         remaining[end_pos(another_start):])
                children += n
                messages += m
            else if end(start) is found:
                # text before the end-string belongs to this node
                children += text(remaining[:position(end(start))])
                remaining = remaining[end_pos(end(start)):]
                return [ some_node(..., *children) ], messages, remaining
            else:
                error

This algorithm would address problem 2 as well.

Now, if you want the current behavior for your example in the
beginning of this message, we'll have exactly the problem you
anticipated, unless we bend over backwards to avoid it.  I can think
of ways to do that, but none of them are natural or pretty or
efficient.  IMO it's better to live with the fact that when there's
problematic text, the (mis)interpretation of the text you get may not
be the one you'd prefer.

BTW, even to get this far I had to do some semi-massive refactorings
and renamings.  It was just too hairy in there; I was losing track of
what things meant.  I hope you don't find my changes too odious.  I'll
send you a current diff for states.py so you can look at it, though it
clearly will take some substantial work from where the code is now to
get the above algorithm implemented, and I'm not sure I have
time... :-(

But knowing me, I won't be able to resist solving the problem ;-) so
I'll spend the time I don't have :-|.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
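
For anyone who wants to play with the parse2 idea above, here is a minimal runnable sketch in Python. It is not the states.py code and makes several simplifying assumptions. The names (MARKUP, Unterminated, parse, _parse_span) are hypothetical. A start-string must begin the text or follow whitespace and must be followed by non-whitespace. An end-string must follow non-whitespace. Literals do not nest. An unmatched start raises an exception rather than producing problematic and system_message nodes.

```python
# Hypothetical delimiter table: longer delimiters first so "**" wins over "*".
MARKUP = [("**", "strong"), ("``", "literal"), ("*", "emphasis")]


class Unterminated(Exception):
    """A start-string was found with no matching end-string."""


def _is_end(text, i, end):
    # Simplified end rule: the end-string must follow non-whitespace.
    return (end is not None and text.startswith(end, i)
            and i > 0 and not text[i - 1].isspace())


def _find_start(text, i):
    # Simplified start rule: begins the text or follows whitespace,
    # and is itself followed by a non-whitespace character.
    if i > 0 and not text[i - 1].isspace():
        return None
    for delim, name in MARKUP:
        after = i + len(delim)
        if (text.startswith(delim, i) and after < len(text)
                and not text[after].isspace()):
            return delim, name
    return None


def _parse_span(text, pos, end):
    """Parse from pos until `end` is matched; return (children, new_pos)."""
    children = []
    chunk_start = i = pos
    while i < len(text):
        # Look for our own end-string and for any new start-string at the
        # same position, so whichever comes first in the text wins.
        if _is_end(text, i, end):
            if i > chunk_start:
                children.append(text[chunk_start:i])
            return children, i + len(end)
        start = _find_start(text, i)
        if start and end != "``":  # no nested markup inside literals
            delim, name = start
            if i > chunk_start:
                children.append(text[chunk_start:i])
            inner, i = _parse_span(text, i + len(delim), delim)
            children.append((name, inner))
            chunk_start = i
        else:
            i += 1
    if end is not None:
        raise Unterminated(end)
    if chunk_start < len(text):
        children.append(text[chunk_start:])
    return children, len(text)


def parse(text):
    """Return a list of plain strings and (node_name, children) tuples."""
    nodes, _ = _parse_span(text, 0, None)
    return nodes
```

Under these rules, parse("*emphasis *within emphasis* and such*") yields one emphasis node containing a nested emphasis node, and parse("*emphasis within *emphasis**") nests as well without needing the "\ " escape. The example from the top of this message raises Unterminated("**") instead of giving the preferred tree, which is the kind of unwanted (mis)interpretation described above.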