From: Beni C. <cb...@us...> - 2006-03-17 15:19:56
David Goodger wrote:

> [Jörg W Mittag]
>> I have discovered what I believe to be an inconsistency in the
>> handling of whitespace and punctuation by the |reST|_ parser: the
>> parser doesn't treat all forms of whitespace and/or punctuation
>> consistently. For example, a ``space`` is considered whitespace,
>> but a ``thin space`` or ``non-breaking space`` are not.
>
> The way the reST parser is now written, each character that is to be
> considered whitespace or punctuation must be explicitly supplied. To
> expand our definition to non-ASCII whitespace/punctuation, we would
> have to significantly expand our explicit list. This is certainly
> possible, but I don't know how desirable it is. It may slow things
> down a lot. (Then again, it might not. Somebody should try it.) If
> the parser is slowed significantly, a different implementation may be
> called for. In other words, this is uncharted territory.

Speed is irrelevant. reST is supposed to be i18n-friendly, so this is a
bug. Implementation details are our problem, not the users'. Anyway, I
don't think regexps with big character sets are noticeably slower, and
writing them might be easy: with the UNICODE flag, ``\s`` matches
``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
character properties database.

> Also, in the case of the "no-break space" character (U+00A0), the fact
> that it is a non-ASCII character is often used to *avoid* markup. For
> example,
>
> A. Einstein was a bright fellow.
>
> Without the no-break space after "A.", that single-line paragraph
> would be interpreted as a list item.

That's a rather dirty hack. It's *invisible* markup (or rather
markdown ;-). I see the usefulness, but I don't like it at all.

>> Anyway, I just wanted to mention this.
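The ``\s`` claim above is easy to check (a quick sketch using Python's
``re`` module; on Python 3 Unicode matching is already the default for
str patterns, so ``re.UNICODE`` is spelled out only for emphasis):

```python
import re

# With the UNICODE flag, \s matches everything the Unicode character
# database classifies as whitespace, not just the ASCII set
# [ \t\n\r\f\v].
ws = re.compile(r"\s", re.UNICODE)

print(bool(ws.match(" ")))       # ASCII space       -> True
print(bool(ws.match("\u00a0")))  # no-break space    -> True
print(bool(ws.match("\u2009")))  # thin space        -> True
```

So at least for whitespace, the regexps need no explicit list at all.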
>> I think the handling of
>> whitespace and punctuation should be consistent, so that *any*
>> kind of whitespace and punctuation can delimit a link or other
>> markup, not just ``space``, ``dot`` and a few others.
>
> Do you realize that once this step is taken, we would have to allow
> for whitespace and punctuation from all of Unicode (Chinese, Japanese,
> Korean, Arabic, Thai, ...)? I'm not saying that's bad, just that it
> may be a lot of work. We programmers tend to be a lazy bunch.

Yep, we are ;-). For a long time already, I have been using a version of
docutils locally patched to accept the Hebrew Maqaf as punctuation, but
I was too lazy to add generic support.

> I know someone who has memorized a bunch of Alt-numpad sequences on
> Windows. It's crazy. That's not what computers are for.

I did that too, until I discovered that AutoCorrect in Office fits my
needs.

> You have provided food for thought. I don't know when action will be
> taken, but I think it will.

If laziness delays you, let's decide on the correct behaviour and I'll
implement it. It scratches my itch for Hebrew.

--
Beni Cherniavsky <cb...@us...>, who can only read email on weekends.
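P.S. A minimal sketch of the generic support I mean: classify characters
by their Unicode general category instead of listing them one by one.
``unicodedata`` is in the standard library; the helper names are mine,
and where exactly this would plug into the docutils parser is left open.

```python
import unicodedata

def is_space(ch):
    # Zs/Zl/Zp are the Unicode "separator" categories.
    return unicodedata.category(ch).startswith("Z")

def is_punct(ch):
    # All punctuation categories start with "P"
    # (Pd, Ps, Pe, Pi, Pf, Pc, Po).
    return unicodedata.category(ch).startswith("P")

print(is_space("\u2009"))  # thin space              -> True
print(is_punct("\u05be"))  # Hebrew Maqaf (Pd)       -> True
print(is_punct("."))       # plain full stop (Po)    -> True
```

With this, the Maqaf (and CJK, Arabic, Thai punctuation) falls out for
free, with no explicit per-script lists to maintain.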