From: David G. <go...@py...> - 2004-01-17 03:18:34
[Beni Cherniavsky]
> You also need to update the spec, DTD and writers of course...

The DTD already supports nested markup. The HTML writer shouldn't
have any trouble with it. I don't know about LaTeX. Updating the
spec is a small job.

[David Abrahams]
> I'm hoping to get some help from the community on those chores once
> I have done the "hard part". I've certainly seen changes go into
> docutils core without all the writers being updated by the same
> person.

Yes, don't wait for other parts to become compatible. That'll happen
in time.

[David Abrahams]
> OK, I have something implemented which seems to handle most of the
> logic (it's not really doing tokenization, but something that should
> be semantically equivalent, and doesn't result in a complete
> rewrite)

Cool. Can't wait to see the code. As this would be a major change to
the parser, please provide a patch rather than checking it in
directly. Thanks.

> but there are a few cases I need to get some feedback on.
>
> 1. *****
>
>    gets "tokenized", naturally, as
>
>        <**><*><**>
>
>    And when the inliner re-parses <*>, it complains about an inline
>    start-string without a corresponding end-string. So, the question
>    is, do we:
>
>    a. turn off complaints inside inline markup about unmatched
>       inline markup start-strings without corresponding end strings
>
>    b. Make that an error and force the user to write
>
>           **\***

I think (b) is the way to go here.

> 2. ``literal ``TeX quotes'' & \\backslash``
>
>    This one is currently parsed as though tokenized this way:
>
>        <``><literal ``TeX quotes'' & \\backslash><``>
>
>    But my code tokenizes inside markup and so sees:
>
>        <``><literal ><``>TeX quotes'' & \\backslash><``>
>
>    Again, I see two choices:
>
>    a. Turn off recognition of an inline markup start string within
>       regions already using that markup.
>
>    b. Make that an error and force the user to write
>
>           ``literal \``TeX quotes'' & \\backslash``
>
>    In both cases (a) is more backward-compatible but (b) is more
>    consistent, and, dare I say, Pythonic.

Here I disagree. The spec says:

    No markup interpretation (including backslash-escape
    interpretation) is done within inline literals.

The current parsing is correct and must remain. The double-backquotes
before "TeX" must not be interpreted as markup. Only an inline
literal end-string should be searched for upon encountering an inline
literal start-string.

So (b) is no good because backslashes must be left alone; they must
appear in the output. (a) is too general; for instance, nested inline
markup has to allow interpreted text within interpreted text::

    :role1:`interpreted :role2:`text``

> in-the-face-of-ambiguity-refuse-the-temptation-to-guess-ly y'rs,

Good policy ;-)

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
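The ``*****`` case above can be illustrated with a minimal sketch (hypothetical Python for illustration, not the docutils inliner): a strong start-string is matched first, the matching end-string is searched for, and the lone ``*`` left in between becomes content that the inliner then re-parses, triggering the "start-string without end-string" complaint discussed above.

```python
import re

# Hypothetical sketch (not the docutils inliner): match a strong
# start-string, then search for the matching end-string; whatever is
# left in between is content handed back for re-parsing.  On "*****"
# this yields the tokenization <**><*><**>.
def split_strong(text):
    match = re.match(r"\*\*(.*)\*\*$", text)
    if match is None:
        return None
    return ("**", match.group(1), "**")
```

For example, `split_strong("*****")` returns `("**", "*", "**")`.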
From: David A. <da...@bo...> - 2004-01-17 03:39:05
David Goodger <go...@py...> writes:

> [Beni Cherniavsky]
> > You also need to update the spec, DTD and writers of course...
>
> The DTD already supports nested markup. The HTML writer shouldn't
> have any trouble with it. I don't know about LaTeX. Updating the
> spec is a small job.

Great!

> [David Abrahams]
> > I'm hoping to get some help from the community on those chores once
> > I have done the "hard part". I've certainly seen changes go into
> > docutils core without all the writers being updated by the same
> > person.
>
> Yes, don't wait for other parts to become compatible. That'll happen
> in time.
>
> [David Abrahams]
> > OK, I have something implemented which seems to handle most of the
> > logic (it's not really doing tokenization, but something that should
> > be semantically equivalent, and doesn't result in a complete
> > rewrite)
>
> Cool. Can't wait to see the code.

I'll send you a preview.

> As this would be a major change to the parser, please provide a
> patch rather than checking it in directly. Thanks.

Of course, I would never dare stomp on the core codebase that way.

> > but there are a few cases I need to get some feedback on.
> >
> > 1. *****
> >
> >    gets "tokenized", naturally, as
> >
> >        <**><*><**>
> >
> >    And when the inliner re-parses <*>, it complains about an inline
> >    start-string without a corresponding end-string. So, the question
> >    is, do we:
> >
> >    a. turn off complaints inside inline markup about unmatched
> >       inline markup start-strings without corresponding end strings
> >
> >    b. Make that an error and force the user to write
> >
> >           **\***
>
> I think (b) is the way to go here.

Done. 'cause it already works that way ;^)

> > 2. ``literal ``TeX quotes'' & \\backslash``
> >
> >    This one is currently parsed as though tokenized this way:
> >
> >        <``><literal ``TeX quotes'' & \\backslash><``>
> >
> >    But my code tokenizes inside markup and so sees:
> >
> >        <``><literal ><``>TeX quotes'' & \\backslash><``>
> >
> >    Again, I see two choices:
> >
> >    a. Turn off recognition of an inline markup start string within
> >       regions already using that markup.
> >
> >    b. Make that an error and force the user to write
> >
> >           ``literal \``TeX quotes'' & \\backslash``
> >
> >    In both cases (a) is more backward-compatible but (b) is more
> >    consistent, and, dare I say, Pythonic.
>
> Here I disagree. The spec says:
>
>     No markup interpretation (including backslash-escape
>     interpretation) is done within inline literals.
>
> The current parsing is correct and must remain. The double-backquotes
> before "TeX" must not be interpreted as markup. Only an inline
> literal end-string should be searched for upon encountering an inline
> literal start-string.

Well, OK, it's easy to special-case literal markup. Imagine it wasn't
about literal text:

    *emphasized *emphasis* etcetera*

The question is: do we end up with nested emphasis, or do we go out
of our way to disable nested recognition of the same inline markup.
That is, do we parse it as:

    ((*)(emphasized *emphasis* etcetera)(*))

or

    ((*)(emphasized ((*)(emphasis)(*)) etcetera)(*))

??

> So (b) is no good because backslashes must be left alone; they must
> appear in the output. (a) is too general; for instance, nested inline
> markup has to allow interpreted text within interpreted text::
>
>     :role1:`interpreted :role2:`text``

I could *easily* do (b) for everything other than literal text.
That's what falls most naturally out of the code.

Please let me know; I'd love to have something I consider to be a
good first draft tonight.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-01-17 18:41:19
David Goodger <go...@py...> writes:

> I think that's correct. Watch out for cases like this though:
>
>     **strong with a wildcard a.* inside**
>
> That shouldn't cause any error (the "*" is not in start-string
> context).

That one's already working.

> > Please let me know; I'd love to have something I consider to be a
> > good first draft tonight.
>
> Sorry, went to bed after my reply last night (got the flu) and didn't
> see your message until now.

Please get better soon!

> > Could you perhaps help me out by writing up a few nesting tests
> > which you think should pass? I'm liable to be overlooking more
> > things...
>
> I've attached some unit tests in test_nested_inline_markup.py, with
> some interesting edge cases and relatively complex examples. I
> *think* the "expected output" is correct, but I may have made mistakes
> (can't test it yet ;-). There are probably other edge cases which
> I've missed, but we'll find them once we have code to exercise. See
> the top of the file for installation instructions.

Thanks!

> Please also run test/test_parsers/test_rst/test_inline_markup.py and
> report the results. I expect some tests to fail, but I'd like to see
> which ones and how.

OK, will do.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: Mark N. <no...@so...> - 2004-01-19 17:12:12
Mark Nodine wrote:

> David Abrahams wrote:
> >
> > I guess that's the interpretation which results in the fewest
> > errors, but I think we could probably construct cases where the
> > other interpretation would be more sensible:
> >
> >     *emph *prob **strong ``literal``, end of strong**, end of emph*
>
> However, this one did trip up my Perl parser :-(. I'll have to see
> what's going on.

Never mind. My parser did do something reasonable for this one. It
used the two asterisks in "strong**" to close the two levels of
emphasis "*emph" and "*prob" and reported an unmatched start-string
for the "**strong". The final "emph*" was kept as part of the string.

--Mark
From: David A. <da...@bo...> - 2004-01-19 18:58:59
Mark Nodine <no...@so...> writes:

> Mark Nodine wrote:
>>
>> David Abrahams wrote:
>> >
>> > I guess that's the interpretation which results in the fewest
>> > errors, but I think we could probably construct cases where the
>> > other interpretation would be more sensible:
>> >
>> >     *emph *prob **strong ``literal``, end of strong**, end of emph*
>>
>> However, this one did trip up my Perl parser :-(. I'll have to see
>> what's going on.
>
> Never mind. My parser did do something reasonable for this one.
> It used the two asterisks in "strong**" to close the two levels
> of emphasis "*emph" and "*prob" and reported an unmatched start-string
> for the "**strong". The final "emph*" was kept as part of the string.

That doesn't seem very reasonable to me, but my point is that once we
get into malformed text, there are any number of reasonable
interpretations. It doesn't seem to matter all that much which one we
choose.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-01-26 12:57:52
> David Goodger <go...@py...> writes:
> >
> > > Here's the promised attachment.
> > >
> > > -- David Goodger
> > > #! /usr/bin/env python
> > >
> > > # Copy this file to docutils/test/test_parsers/test_rst/ and do
> > > # ``chmod +x test_inline_markup.py``, then execute this file to test.
> > >
> > > # To be added (later) to
> > > # docutils/test/test_parsers/test_rst/test_inline_markup.py?
> >
> > <snip>
> >
> > I've checked in a new version of states.py on the "nesting" branch
> > which passes all of the nested inline tests Dave G. gave me. It also
> > passes all of the regular inline tests that I think should pass,
> > except for the ones involving embedded URIs.
> >
> > The regular expressions get quite hairy and I'm not sure how to solve
> > the embedded URI issue. I am out of time to work on it for several
> > weeks probably, but I would be happy to point anyone who wants to look
> > at it at the cause of the problem.
>
> I managed to fix everything so now there's only a single suspicious
> looking test result:
>
>     test_parsers\test_rst\<stdin>: totest['embedded_URIs'][0]; test_parser (DocutilsTestSupport.ParserTestCase)
>     input:
>     `phrase reference <http://example.com>`_
>
>     -: expected
>     +: output
>       <document source="test data">
>           <paragraph>
>               <reference refuri="http://example.com">
>                   phrase reference
>               <target id="phrase-reference" name="phrase reference" refuri="http://example.com">
>     +             phrase reference
>
> I think this is probably something trivial, but I'm not sure what's
> going on yet.

OK, that's fixed now. As far as I know, there's only one remaining
problem, and that's easily fixed:

    *emphasis ``literal``*

    -: expected
    +: output
      <document source="test data">
          <paragraph>
              <emphasis>
                  emphasis
    +             <problematic id="id2" refid="id1">
    +                 ``
    -             <literal>
    ? -               ^
    +             literal``
    ?                    ^^
    -                 literal
    +         <system_message backrefs="id2" id="id1" level="2" line="1" source="test data" type="WARNING">
    +             <paragraph>
    +                 Inline literal start-string without end-string.

The same issue will come up for

    *emphasis |substitution|*

But I'm out of time to work on this at the moment. I hope someone
else will be inspired to pick up the baton; I'd be happy to give
guidance.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-01-26 11:23:55
David Goodger <go...@py...> writes:

> Here's the promised attachment.
>
> -- David Goodger
> #! /usr/bin/env python
>
> # Copy this file to docutils/test/test_parsers/test_rst/ and do
> # ``chmod +x test_inline_markup.py``, then execute this file to test.
>
> # To be added (later) to
> # docutils/test/test_parsers/test_rst/test_inline_markup.py?

<snip>

I've checked in a new version of states.py on the "nesting" branch
which passes all of the nested inline tests Dave G. gave me. It also
passes all of the regular inline tests that I think should pass,
except for the ones involving embedded URIs.

The regular expressions get quite hairy and I'm not sure how to solve
the embedded URI issue. I am out of time to work on it for several
weeks probably, but I would be happy to point anyone who wants to look
at it at the cause of the problem.

Regards,
Dave A.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-01-26 12:41:33
> David Goodger <go...@py...> writes:
>
> > Here's the promised attachment.
> >
> > -- David Goodger
> > #! /usr/bin/env python
> >
> > # Copy this file to docutils/test/test_parsers/test_rst/ and do
> > # ``chmod +x test_inline_markup.py``, then execute this file to test.
> >
> > # To be added (later) to
> > # docutils/test/test_parsers/test_rst/test_inline_markup.py?
>
> <snip>
>
> I've checked in a new version of states.py on the "nesting" branch
> which passes all of the nested inline tests Dave G. gave me. It also
> passes all of the regular inline tests that I think should pass,
> except for the ones involving embedded URIs.
>
> The regular expressions get quite hairy and I'm not sure how to solve
> the embedded URI issue. I am out of time to work on it for several
> weeks probably, but I would be happy to point anyone who wants to look
> at it at the cause of the problem.

I managed to fix everything so now there's only a single suspicious
looking test result:

    test_parsers\test_rst\<stdin>: totest['embedded_URIs'][0]; test_parser (DocutilsTestSupport.ParserTestCase)
    input:
    `phrase reference <http://example.com>`_

    -: expected
    +: output
      <document source="test data">
          <paragraph>
              <reference refuri="http://example.com">
                  phrase reference
              <target id="phrase-reference" name="phrase reference" refuri="http://example.com">
    +             phrase reference

I think this is probably something trivial, but I'm not sure what's
going on yet.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David A. <da...@bo...> - 2004-01-17 04:24:16
David Abrahams <da...@bo...> writes:

>> The current parsing is correct and must remain. The double-backquotes
>> before "TeX" must not be interpreted as markup. Only an inline
>> literal end-string should be searched for upon encountering an inline
>> literal start-string.
>
> Well, OK, it's easy to special-case literal markup. Imagine it wasn't
> about literal text:
>
>     *emphasized *emphasis* etcetera*
>
> The question is: do we end up with nested emphasis, or do we go out
> of our way to disable nested recognition of the same inline markup.
> That is, do we parse it as:
>
>     ((*)(emphasized *emphasis* etcetera)(*))
>
> or
>
>     ((*)(emphasized ((*)(emphasis)(*)) etcetera)(*))
>
> ??
>
>> So (b) is no good because backslashes must be left alone; they must
>> appear in the output. (a) is too general; for instance, nested inline
>> markup has to allow interpreted text within interpreted text::
>>
>>     :role1:`interpreted :role2:`text``
>
> I could *easily* do (b) for everything other than literal text.
> That's what falls most naturally out of the code.

Actually I already have it working that way. I obviously didn't cover
all cases, though. I need to do something for your "role" example.
Should be a quick fix.

Could you perhaps help me out by writing up a few nesting tests which
you think should pass? I'm liable to be overlooking more things...

Thanks,

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-01-17 15:49:05
Attachments:
test_nested_inline_markup.py
Here's the promised attachment.

-- David Goodger
From: David G. <go...@py...> - 2004-01-17 15:47:17
[Sorry for the dupes. I recently restricted the docutils-users and
-develop lists to member-posting-only (once I determined that that did
*not* mean non-member posts are discarded, only held for approval;
MailMan's docs aren't the clearest on the subject). I approved the
post this morning, but David A. subscribed and re-posted last night.]

David Abrahams wrote:
> Well, OK, it's easy to special-case literal markup. Imagine it
> wasn't about literal text:
>
>     *emphasized *emphasis* etcetera*
>
> The question is: do we end up with nested emphasis, or do we go out
> of our way to disable nested recognition of the same inline markup.

I think we should allow nested emphasis (or anything else, *except*
inline literals), and leave it up to the docs and the author to avoid
potentially meaningless examples. Not that this case is really
meaningless: I've seen such examples in printed text before, where a
long passage of italics (e.g. a character's inner thoughts) contains a
bit of emphasis, which is displayed in roman (un-italicized). In
other words, the display of emphasis could be thought of as a toggle
switch: apply it twice and it turns off.

> That is, do we parse it as:
>
>     ((*)(emphasized *emphasis* etcetera)(*))
>
> or
>
>     ((*)(emphasized ((*)(emphasis)(*)) etcetera)(*))
>
> ??

The latter.

>> So (b) is no good because backslashes must be left alone; they must
>> appear in the output. (a) is too general; for instance, nested
>> inline markup has to allow interpreted text within interpreted
>> text::
>>
>>     :role1:`interpreted :role2:`text``
>
> I could *easily* do (b) for everything other than literal text.
> That's what falls most naturally out of the code.

I think that's correct. Watch out for cases like this though:

    **strong with a wildcard a.* inside**

That shouldn't cause any error (the "*" is not in start-string
context).

> Please let me know; I'd love to have something I consider to be a
> good first draft tonight.

Sorry, went to bed after my reply last night (got the flu) and didn't
see your message until now.

> Could you perhaps help me out by writing up a few nesting tests
> which you think should pass? I'm liable to be overlooking more
> things...

I've attached some unit tests in test_nested_inline_markup.py, with
some interesting edge cases and relatively complex examples. I
*think* the "expected output" is correct, but I may have made mistakes
(can't test it yet ;-). There are probably other edge cases which
I've missed, but we'll find them once we have code to exercise. See
the top of the file for installation instructions.

Please also run test/test_parsers/test_rst/test_inline_markup.py and
report the results. I expect some tests to fail, but I'd like to see
which ones and how.

Thanks!

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: David A. <da...@bo...> - 2004-01-17 18:52:19
David Goodger <go...@py...> writes:

> I think we should allow nested emphasis (or anything else, *except*
> inline literals), and leave it up to the docs and the author to avoid
> potentially meaningless examples. Not that this case is really
> meaningless: I've seen such examples in printed text before, where a
> long passage of italics (e.g. a character's inner thoughts) contains a
> bit of emphasis, which is displayed in roman (un-italicized). In
> other words, the display of emphasis could be thought of as a toggle
> switch: apply it twice and it turns off.

Good thinking.

Here's another question: the way I've coded it, once problematic text
is found, I stop trying to recursively find nested markup. Would you
like it to warn about the problem as it does now, and then just
continue to parse it for inline markup?

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-01-17 19:11:15
David Abrahams wrote:
> Here's another question: the way I've coded it, once problematic
> text is found, I stop trying to recursively find nested markup.
> Would you like it to warn about the problem as it does now, and then
> just continue to parse it for inline markup?

Can you show examples of what your code does now? Here is some input:

    *emph **strong *prob ``literal``, end of strong**, end of emph*

Ideally, I'd like this to parse to:

    <paragraph>
        <emphasis>
            emph
            <strong>
                strong
                <problematic ...>
                    *
                prob
                <literal>
                    literal
                , end of strong
            , end of emph

I don't know if that's possible or not though. How can the parser
know that the end-emphasis matches the first start-emphasis and not
the second?

You've probably picked up on this already, but the rules for
start-string and end-string may have to change to allow nested inline
markup. For example, in "*emph **strong***" the end-strings are
adjacent. Or do we have to disambiguate with "\ " (as in
"*emph **strong**\ *")?

--
David Goodger  http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: David A. <da...@bo...> - 2004-01-17 22:06:41
David Goodger <go...@py...> writes:

> David Abrahams wrote:
> > Here's another question: the way I've coded it, once problematic
> > text is found, I stop trying to recursively find nested markup.
> > Would you like it to warn about the problem as it does now, and then
> > just continue to parse it for inline markup?
>
> Can you show examples of what your code does now? Here is some input:
>
>     *emph **strong *prob ``literal``, end of strong**, end of emph*
>
> Ideally, I'd like this to parse to:
>
>     <paragraph>
>         <emphasis>
>             emph
>             <strong>
>                 strong
>                 <problematic ...>
>                     *
>                 prob
>                 <literal>
>                     literal
>                 , end of strong
>             , end of emph

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emph
                <strong>
                    strong
                    <problematic id="id2" refid="id1">
                        *
                    prob
                    <literal>
                        literal
                    , end of strong
                , end of emph
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

> I don't know if that's possible or not though. How can the parser
> know that the end-emphasis matches the first start-emphasis and not
> the second?

It can't. I guess that's the interpretation which results in the
fewest errors, but I think we could probably construct cases where the
other interpretation would be more sensible:

    *emph *prob **strong ``literal``, end of strong**, end of emph*

It happens to work because we always match from outer to inner right
now, but I am beginning to realize that the algorithm's going to have
to change to handle other cases, and it's going to get tough if you
want to retain the behavior you're seeing above. The problem arises
with situations like:

    *emphasis *within emphasis* and such*

or, worse:

    :emphasis:`foo :strong:`bar` baz`

which right now are ending up as:

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                emphasis
            <problematic id="id2" refid="id1">
                *
            within emphasis and such*
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline emphasis start-string without end-string.

    <document source="<stdin>">
        <paragraph>
            <emphasis>
                foo :strong:
            <problematic id="id2" refid="id1">
                `
            bar baz`
        <system_message backrefs="id2" id="id1" level="2" line="1" source="<stdin>" type="WARNING">
            <paragraph>
                Inline interpreted text or phrase reference start-string without end-string.

The problem is that we're greedy in searching for end-strings (call
this problem 1).

> You've probably picked up on this already, but the rules for
> start-string and end-string may have to change to allow nested inline
> markup. For example, in "*emph **strong***" the end-strings are
> adjacent. Or do we have to disambiguate with "\ " (as in
> "*emph **strong**\ *")?

Right now, the rule for matching emphasis end-strings is that you can
match the last in an odd-length string of stars. That's kind of a
hack, and it clearly doesn't allow

    *emphasis within *emphasis**

(problem 2) to work, either. Of course you can disambiguate the
latter as

    *emphasis within *emphasis*\ *

So I don't consider that to be serious.

I think I understand roughly what the algorithm for solving problem 1
must be:

    def parse(remaining):
        search for start string
        if found:
            parse2(start, remaining)

    def parse2(start, remaining):
        children = []
        messages = []
        while 1:
            search *simultaneously* for end(start) and for all starts
            if another_start is found:
                children += text(remaining[:position(another_start)])
                n, m, remaining = parse2(another_start, remaining)
                children += n
                messages += m
            elif end(start) is found:
                children += text(remaining[:position(end(start))])
                return [some_node(..., *children)], messages
            else:
                error

This algorithm would address problem 2 as well.

Now, if you want the current behavior for your example in the
beginning of this message, we'll have exactly the problem you
anticipated, unless we bend over backwards to avoid it. I can think
of ways to do that, but none of them are natural or pretty or
efficient. IMO it's better to live with the fact that when there's
problematic text, the (mis)interpretation of the text you get may not
be the one you'd prefer.

BTW, even to get this far I had to do some semi-massive refactorings
and renamings. It was just too hairy in there; I was losing track of
what things meant. I hope you don't find my changes too odious. I'll
send you a current diff for states.py so you can look at it, though it
clearly will take some substantial work from where the code is now to
get the above algorithm implemented, and I'm not sure I have
time... :-(

But knowing me, I won't be able to resist solving the problem ;-) so
I'll spend the time I don't have :-|.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
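The algorithm outlined above can be made concrete. The following is a minimal runnable rendering in Python under loudly simplified assumptions: only `*` (emphasis) and `**` (strong) delimiters, no backslash escapes or inline literals, unmatched start-strings raised as errors rather than turned into `problematic` nodes, and the "problem 2" ambiguity side-stepped (a tie between an end-string and an equally long nested start is resolved as the end-string) rather than solved. It is an illustration of the approach, not the docutils parser.

```python
# Sketch of the parse/parse2 idea: at each position, search
# simultaneously for the current end-string and for all start-strings,
# recursing when a nested start comes first.
STARTS = ("**", "*")  # longest first, so "**" is preferred over "*"

def parse_inline(text):
    """Parse into a list of plain strings and (delimiter, children) tuples."""
    nodes, _, _ = _parse_children(text, 0, end_delim=None)
    return nodes

def _find_start(text, pos):
    """Find the earliest start-string at or after pos (longest wins ties)."""
    best = None
    for delim in STARTS:
        i = text.find(delim, pos)
        if i != -1 and (best is None or i < best[0]):
            best = (i, delim)
    return best

def _parse_children(text, pos, end_delim):
    """Search simultaneously for the end-string and for nested starts."""
    children = []
    text_start = pos
    while pos < len(text):
        end = text.find(end_delim, pos) if end_delim else -1
        nested = _find_start(text, pos)
        nested_first = nested is not None and (
            end == -1 or nested[0] < end or
            (nested[0] == end and len(nested[1]) > len(end_delim)))
        if nested_first:
            start_pos, delim = nested
            if start_pos > text_start:
                children.append(text[text_start:start_pos])
            inner, pos, matched = _parse_children(
                text, start_pos + len(delim), delim)
            if not matched:
                raise ValueError(
                    "Inline %r start-string without end-string." % delim)
            children.append((delim, inner))
            text_start = pos
        elif end != -1:
            if end > text_start:
                children.append(text[text_start:end])
            return children, end + len(end_delim), True
        else:
            break
    if text_start < len(text):
        children.append(text[text_start:])
    return children, len(text), end_delim is None
```

For example, `parse_inline("*emph **strong** etc*")` yields `[("*", ["emph ", ("**", ["strong"]), " etc"])]`, i.e. strong nested inside emphasis.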
From: Mark N. <no...@so...> - 2004-01-19 16:43:23
David Abrahams wrote:
>
> David Goodger <go...@py...> writes:
>
> > David Abrahams wrote:
> > > Here's another question: the way I've coded it, once problematic
> > > text is found, I stop trying to recursively find nested markup.
> > > Would you like it to warn about the problem as it does now, and then
> > > just continue to parse it for inline markup?
> >
> > Can you show examples of what your code does now? Here is some input:
> >
> >     *emph **strong *prob ``literal``, end of strong**, end of emph*
> >
> > Ideally, I'd like this to parse to:
> >
> >     <paragraph>
> >         <emphasis>
> >             emph
> >             <strong>
> >                 strong
> >                 <problematic ...>
> >                     *
> >                 prob
> >                 <literal>
> >                     literal
> >                 , end of strong
> >             , end of emph
>
> It can't.

It can. The Perl parser gets what David G. wanted.

> I guess that's the interpretation which results in the fewest
> errors, but I think we could probably construct cases where the
> other interpretation would be more sensible:
>
>     *emph *prob **strong ``literal``, end of strong**, end of emph*

However, this one did trip up my Perl parser :-(. I'll have to see
what's going on.

> It happens to work because we always match from outer to inner right
> now, but I am beginning to realize that the algorithm's going to have
> to change to handle other cases, and it's going to get tough if you
> want to retain the behavior you're seeing above. The problem arises
> with situations like:
>
>     *emphasis *within emphasis* and such*
>
> or, worse:
>
>     :emphasis:`foo :strong:`bar` baz`

The Perl parser handled both of these correctly. The bottom line is
that you can't be greedy in matching end strings.

--Mark
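The greediness point above can be seen with a two-line experiment in plain `re` (unrelated to either parser's actual code): the greedy pattern runs to the last possible end-string, swallowing the nested markup, while the non-greedy one stops at the first candidate.

```python
import re

# Greedy vs. non-greedy end-string matching on the example above.
text = "*emphasis *within emphasis* and such*"

greedy = re.match(r"\*(.*)\*", text).group(1)
# greedy == "emphasis *within emphasis* and such" -- swallows nested markup

lazy = re.match(r"\*(.*?)\*", text).group(1)
# lazy == "emphasis " -- stops at the first candidate end-string
```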
From: David A. <da...@bo...> - 2004-01-19 18:51:51
Mark Nodine <no...@so...> writes:

> David Abrahams wrote:
>>
>> David Goodger <go...@py...> writes:
>>
>> > David Abrahams wrote:
>> > > Here's another question: the way I've coded it, once problematic
>> > > text is found, I stop trying to recursively find nested markup.
>> > > Would you like it to warn about the problem as it does now, and then
>> > > just continue to parse it for inline markup?
>> >
>> > Can you show examples of what your code does now? Here is some input:
>> >
>> >     *emph **strong *prob ``literal``, end of strong**, end of emph*
>> >
>> > Ideally, I'd like this to parse to:
>> >
>> >     <paragraph>
>> >         <emphasis>
>> >             emph
>> >             <strong>
>> >                 strong
>> >                 <problematic ...>
>> >                     *
>> >                 prob
>> >                 <literal>
>> >                     literal
>> >                 , end of strong
>> >             , end of emph
>>
>> It can't.
>
> It can. The Perl parser gets what David G. wanted.
>
>> I guess that's the interpretation which results in the fewest
>> errors, but I think we could probably construct cases where the
>> other interpretation would be more sensible:
>>
>>     *emph *prob **strong ``literal``, end of strong**, end of emph*
>
> However, this one did trip up my Perl parser :-(. I'll have to see
> what's going on.

It's not clear there's a "right answer". The algorithm I outlined in
private mail to you and David gets:

    (*)emph ((*)prob ((**)strong ((``)literal(``)), end of strong(**)), end of emph(*))
    ^^^
    unmatched

If you want:

    ((*)emph (*)prob ((**)strong ((``)literal(``)), end of strong(**)), end of emph(*))
             ^^^
             unmatched

You need a non-deterministic parser and a rule which prefers matching
earlier start strings over later ones (or you have to parse it
backwards ;->). Non-deterministic parsers are possible (I've built
them), but to do that would be more Perlish than Pythonic, IMO. It's
just a case of giving in to the temptation to guess.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
From: David G. <go...@py...> - 2004-01-19 19:27:48
|
David Abrahams wrote:
>>> *emph *prob **strong ``literal``, end of strong**, end of emph*
> 
> It's not clear there's a "right answer".  The algorithm I outlined
> in private mail to you and David gets:
> 
>     (*)emph ((*)prob ((**)strong ((``)literal(``)),
>     ^^^-------unmatched
>               end of strong(**)), end of emph(*))
> 
> If you want:
> 
>     ((*)emph (*)prob ((**)strong ((``)literal(``)),
>              ^^^-----unmatched
>               end of strong(**)), end of emph(*))
> 
> you need a non-deterministic parser and a rule which values matching
> earlier start strings more than later ones (or you have to parse it
> backwards ;->).  Non-deterministic parsers are possible (I've built
> them), but to do that would be more Perlish than Pythonic, IMO.
> It's just a case of giving in to the temptation to guess.

It's not worth the effort.  The earlier parse (first * unmatched) is
best IMO.  As you said, it doesn't really matter which bit of markup
gets flagged as problematic, as long as something does, and
consistently.

-- 
David Goodger                    http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: Mark N. <no...@so...> - 2004-01-19 23:40:42
|
Now that I'm making nested inline the default, I'm having trouble
figuring out what to do with one of the inline markup tests,
namely::

    `embedded URI with too much whitespace < http://example.com/ long/path
    /and /whitespace >`__

    `embedded URI with too much whitespace at end <http://example.com/
    long/path /and /whitespace >`__

    `embedded URI with no preceding whitespace<http://example.com>`__

    `escaped URI \<http://example.com>`__

In each of these cases, the malformed embedded URI parses recursively
as implicit markup, resulting in a reference within a reference.
Should we

(a) allow this embedding to occur?

(b) make a special case for not parsing implicit markup within a
    reference?

(c) something else?

    --Mark
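The well-formedness rule at issue can be condensed into a single pattern: whitespace is required between the reference text and the `<URI>` part, and no whitespace may appear inside the angle brackets. A hypothetical helper (not the actual docutils code) that returns `None` for the URI in exactly the malformed cases above:

```python
import re

# `text <uri>`__  -- whitespace required before '<', none inside it.
EMBEDDED_URI = re.compile(r"^(?P<text>\S.*?)\s+<(?P<uri>\S+)>$")

def split_embedded(ref_text):
    """Split the inside of a `...`__ reference into (text, uri).

    Returns (ref_text, None) when the embedded URI is malformed --
    which is exactly when a recursive parser would instead see
    implicit inline markup and build a reference within a reference.
    """
    m = EMBEDDED_URI.match(ref_text)
    if m:
        return m.group("text"), m.group("uri")
    return ref_text, None
```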
From: David G. <go...@py...> - 2004-01-20 05:12:49
|
Mark Nodine wrote:
> Now that I'm making nested inline the default, I'm having trouble
> figuring out what to do with one of the inline markup tests,
> namely::
...
> In each of these cases, the malformed embedded URI parses
> recursively as implicit markup, resulting in a reference within a
> reference.  Should we
> 
> (a) allow this embedding to occur?

Perhaps.  The URL being visible is an indication to the author that
there's a problem.

> (b) make a special case for not parsing implicit markup within a
>     reference?

Perhaps.  There are pros and cons to each approach.

How about: parse it, but complain if a reference is created inside
another reference.  That may be best.

-- 
David Goodger                    http://python.net/~goodger
For hire: http://python.net/~goodger/cv
From: Mark N. <no...@so...> - 2004-01-20 16:10:30
|
David Goodger wrote:
> 
> Mark Nodine wrote:
> > Now that I'm making nested inline the default, I'm having trouble
> > figuring out what to do with one of the inline markup tests,
> > namely::
> ...
> > In each of these cases, the malformed embedded URI parses
> > recursively as implicit markup, resulting in a reference within a
> > reference.  Should we
> > 
> > (a) allow this embedding to occur?
> 
> Perhaps.  The URL being visible is an indication to the author that
> there's a problem.
> 
> > (b) make a special case for not parsing implicit markup within a
> > reference?
> 
> Perhaps.  There are pros and cons to each approach.
> 
> How about: parse it, but complain if a reference is created inside
> another reference.  That may be best.

Sounds reasonable.  Level 2 (Warning) message?

    --Mark
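For context, docutils system messages carry numeric severity levels (defined on `docutils.utils.Reporter`), so a "Level 2" message is indeed a warning:

```python
# docutils.utils.Reporter severity levels (level 0 is debug output).
SEVERITIES = {0: "DEBUG", 1: "INFO", 2: "WARNING", 3: "ERROR", 4: "SEVERE"}
```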
From: Mark N. <no...@so...> - 2004-01-20 17:25:27
|
David Goodger wrote:
> 
> Mark Nodine wrote:
> > Now that I'm making nested inline the default, I'm having trouble
> > figuring out what to do with one of the inline markup tests,
> > namely::
> ...
> > In each of these cases, the malformed embedded URI parses
> > recursively as implicit markup, resulting in a reference within a
> > reference.
> 
> How about: parse it, but complain if a reference is created inside
> another reference.  That may be best.

Is it worth checking all the way up the parse tree for another
reference, so we'd still catch a reference inside an emphasis inside
a reference?

    --Mark
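Walking the ancestry is cheap, so checking all the way up costs little. A sketch of the check, assuming docutils-style nodes with `parent` and `tagname` attributes (the `Node` stub here is only for illustration):

```python
class Node:
    """Minimal stand-in for a docutils node (illustration only)."""
    def __init__(self, tagname, parent=None):
        self.tagname = tagname
        self.parent = parent

def inside_reference(node):
    """True if any ancestor, not just the immediate parent, is a
    reference -- so a reference inside an emphasis inside a reference
    is still caught."""
    ancestor = node.parent
    while ancestor is not None:
        if ancestor.tagname == "reference":
            return True
        ancestor = ancestor.parent
    return False

# reference > emphasis > reference: the inner reference should be flagged.
outer = Node("reference")
emph = Node("emphasis", parent=outer)
inner = Node("reference", parent=emph)
```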
From: Aahz <aa...@py...> - 2004-01-18 19:27:43
|
On Fri, Jan 16, 2004, David Goodger wrote:
> [David Abrahams]
>> 
>> 2. ``literal ``TeX quotes'' & \\backslash``
>> 
>>    This one is currently parsed as though tokenized this way:
>> 
>>        <``><literal ``TeX quotes'' & \\backslash><``>
>> 
>>    But my code tokenizes inside markup and so sees:
>> 
>>        <``><literal ><``>TeX quotes'' & \\backslash><``>
>> 
>>    Again, I see two choices:
>> 
>>    a. Turn off recognition of an inline markup start string within
>>       regions already using that markup.
>> 
>>    b. Make that an error and force the user to write
>> 
>>           ``literal \``TeX quotes'' & \\backslash``
>> 
>>    In both cases (a) is more backward-compatible but (b) is more
>>    consistent, and, dare I say, Pythonic.
> 
> Here I disagree.  The spec says:
> 
>     No markup interpretation (including backslash-escape
>     interpretation) is done within inline literals.
> 
> The current parsing is correct and must remain.  The double-backquotes
> before "TeX" must not be interpreted as markup.  Only an inline
> literal end-string should be searched for upon encountering an inline
> literal start-string.
> 
> So (b) is no good because backslashes must be left alone; they must
> appear in the output.  (a) is too general; for instance, nested inline
> markup has to allow interpreted text within interpreted text::
> 
>     :role1:`interpreted :role2:`text``

In other words, if I want to have an inline parsed-literal (say, for
rendering variables in italic), I need to do either::

    ``if ``\ *``foo``*\ ``:``

or::

    :parsed-literal:`if *foo* :`

Correct?
-- 
Aahz (aa...@py...)           <*>         http://www.pythoncraft.com/

A: No.
Q: Is top-posting okay?
From: David G. <go...@py...> - 2004-01-18 19:47:18
|
Aahz wrote:
> In other words, if I want to have an inline parsed-literal (say, for
> rendering variables in italic), I need to do either::
> 
>     ``if ``\ *``foo``*\ ``:``
> 
> or::
> 
>     :parsed-literal:`if *foo* :`
> 
> Correct?

Yes, correct (once it's all implemented, of course).

-- 
David Goodger
From: David A. <da...@bo...> - 2004-01-19 00:13:42
|
David Goodger <go...@py...> writes:

> Aahz wrote:
>> In other words, if I want to have an inline parsed-literal (say, for
>> rendering variables in italic), I need to do either::
>> 
>>     ``if ``\ *``foo``*\ ``:``
>> 
>> or::
>> 
>>     :parsed-literal:`if *foo* :`
>> 
>> Correct?
> 
> Yes, correct (once it's all implemented, of course).

I want to suggest again that all literals written with :literal:`...`
be parsed.  I'll remind you that there's no way around it when the
suffix syntax (`...`:literal:) is used.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com