Re: [Docutils-users] Alternative punctuation and whitespace not recognized as such

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

David Goodger wrote:
> [J=C3=B6rg W Mittag]
>>
>> I have discovered what I believe to be an inconsistency in the
>> handling of whitespace and punctuation by the |reST|_ parser: the
>> parser doesn=E2=80=99t treat all forms of whitespace and/or punctuatio=
n
>> consistently.  For example, a ``space`` is considered whitespace,
>> but a ``thin space`` or ``non-breaking space`` are not.
>=20
> The way the reST parser is now written, each character that is to be
> considered whitespace or punctuation must be explicitly supplied.  To
> expand our definition to non-ASCII whitespace/punctuation, we would
> have to significantly expand our explicit list.  This is certainly
> possible, but I don't know how desirable it is.  It may slow things
> down a lot.  (Then again, it might not.  Somebody shoud try it.)  If
> the parser is slowed significantly, a different implementation may be
> called for.  In other words, this is uncharted territory.
>=20
Speed is irrelevant.  reST is supposed to be i18n-firendly so this is a=20
bug.  Implementation details are our problem, not users'.

Anyway, I don't think regexps with big character set are noticably
slower.  And writing them might be easy: with the UNICODE flag, ``\s``
matches [ \t\n\r\f\v] plus whatever is classified as space in the
Unicode character properties database;

> Also, in the case of the "no-break space" character (U+00A0), the fact
> that it is a non-ASCII character is often used to *avoid* markup.  For
> example,
>=20
>     A. Einstein was a bright fellow.
>=20
> Without the no-break space after "A.", that single-line paragraph
> would be interpreted as a list item.
>=20
That's a rather dirty hack.  It's *invisible* markup (or rather
markdown ;-).  I see the usefulness but I don't like it at all.

>> Anyway, I just wanted to mention this.  I think the handling of
>> whitespace and punctuation should be consistent, so that *any*
>> kind of whitespace and punctuation can delimit a link or other
>> markup, not just ``space``, ``dot`` and a few others.
>=20
> Do you realize that once this step is taken, we would have to allow
> for whitespace and punctuation from all of Unicode (Chinese, Japanese,
> Korean, Arabic, Thai, ...)?  I'm not saying that's bad, just that it
> may be a lot of work.  We programmer tend to be a lazy bunch.

Yep, we are ;-).  For a long time already, I'm using a version of=20
docutils locally patched to accept Hebrew Maqaf as punctuation but I
was too lazy to add generic support.

> I know someone who has memorized a bunch of Alt-numpad sequences on
> Windows.  It's crazy.  That's not what computers are for.
>=20
I did that too, until I discovered AutoCorrect in Office fits my needs.

> You have provided food for thought.  I don't know when action will be
> taken, but I think it will.
>=20
If laziness delays you, let's decide on the correct behaviour and I'll=20
implement it.  It scratches my itch for Hebrew.

--=20
Beni Cherniavsky <cb...@us...>, who can only read email on weekends=
.