Re: [Docutils-develop] [ docutils-Bugs-3166907 ] rst parser doesn't handle languages without spaces

On 2011-01-28, SourceForge.net wrote:

> Summary: rst parser doesn't handle languages without spaces

> Initial Comment: 

> When parsing a reStructuredText document in a language without standard
> spaces, such as Japanese, inlining does not work as it should. The
> problem relates to the unicode_delimiters and end_string_suffix
> variables of the Inliner class in parsers/rst/states.py. When
> inline_obj() in that file checks for an end-of-inline match 
> ("endmatch = end_pattern.search(string[matchend:])"), the RE fails
> because end_string_suffix doesn't handle the use of any character after
> the inlined string's suffix. 

This is the documented behaviour. Requiring whitespace or punctuation
around inline markup reduces the number of false positives (which
would occur if in-line markup were recognised inside words).
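
A minimal sketch of that rule, using a deliberately simplified pattern
rather than the actual one in states.py (the names END_SUFFIX,
LITERAL_END and closes_literal are made up for illustration): the
closing backquotes only count when they are followed by the end of the
string, an ASCII space or newline, or one of a few punctuation
characters.

```python
import re

# Simplified stand-in for the Inliner's end_string_suffix (assumption:
# the real pattern in states.py accepts more characters than this).
END_SUFFIX = r"(?=$|[-/:.,;!? \n])"
LITERAL_END = re.compile(r"``" + END_SUFFIX)


def closes_literal(text_after_start):
    """Return True if a closing `` with an accepted suffix is found."""
    return LITERAL_END.search(text_after_start) is not None


print(closes_literal("literal`` more text"))    # ASCII space follows
print(closes_literal("literal``日本語"))         # CJK character follows
print(closes_literal("literal``\u3000日本語"))   # full-width space follows
```

Only the first call finds an end-of-inline match; in the other two the
character after the closing backquotes is not in the accepted set.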

> Inline literals are demonstrated in the attached files. Even the
> full-width space used in East Asian languages such as Japanese and
> Chinese doesn't work (lines 10 and 12). Adding a new line or ASCII
> space before/after the inlined string allows it to be parsed normally.

The "official" workaround is to use a backslash-escaped space (``\ ``)
before and after the inline markup.

> Probably, either unicode_delimiters needs to be expanded to include the
> full character set from languages such as Japanese and Chinese (can it
> do a Unicode range?) or the patterns used to find the start and end of
> inlined strings need to be.

IMV, the full-width space (and maybe other space variants from "General
Punctuation") could be accepted around inline markup.
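
As a rough sketch of that idea (again with a simplified pattern and
hypothetical names, not a patch against states.py), the accepted
suffix set could be extended with U+3000 and the space characters
U+2000 to U+200A:

```python
import re

# Hypothetical extra delimiters: IDEOGRAPHIC SPACE (U+3000) plus the
# spaces U+2000..U+200A from the "General Punctuation" block.
EXTRA_SPACES = "\u3000" + "".join(chr(c) for c in range(0x2000, 0x200B))


def make_literal_end(extra=""):
    """Build a simplified end pattern; `extra` chars are accepted after ``."""
    suffix = r"(?=$|[-/:.,;!? \n" + re.escape(extra) + "])"
    return re.compile("``" + suffix)


strict = make_literal_end()
relaxed = make_literal_end(EXTRA_SPACES)

sample = "literal``\u3000日本語"
print(strict.search(sample) is not None)   # full-width space not accepted
print(relaxed.search(sample) is not None)  # full-width space accepted
```

With the relaxed pattern a full-width space ends the inline literal,
while a closing string directly followed by a CJK character is still
not recognised.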

I cannot say whether adding the full CJK character set would be an
improvement, as it would then require escaping the inline markup
characters inside a Chinese/Japanese/Korean word.

Günter