From: Dave K. <dku...@pa...> - 2013-05-03 02:24:36
This morning, I updated my docutils repository using SVN. Now I get this
traceback when I run rst2html.py::

    [snip]
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 461, in <module>
        class Inliner:
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 591, in Inliner
        initial=build_regexp(parts),
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 456, in build_regexp
        return re.compile(regexp, re.UNICODE)
      File "/usr/local/lib/python2.7/re.py", line 190, in compile
        return _compile(pattern, flags)
      File "/usr/local/lib/python2.7/re.py", line 242, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range

I suspect that this error is caused by the new word-boundary characters for
CJK languages that were recently updated in
docutils/utils/punctuation_chars.py. Am I right about that?

I removed my docutils installation, then installed the latest snapshot, and
the error did not occur. Is there something that I should do to avoid this
error? Perhaps some sort of locale setting? Or is it really a bug?

- Dave K

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

----- Original Message -----
> From: Guenter Milde <mi...@us...>
> To: doc...@li...
> Cc:
> Sent: Friday, April 26, 2013 2:23 AM
> Subject: Re: [Docutils-develop] [docutils:patches] #103 Recognize inline
>          markups without word boundaries
>
> On 2013-04-25, David Goodger wrote:
>
> ...
>
>> I agree with Ishimoto-san's position on this. It's easier and cleaner
>> to simply flip a switch that turns off delimiter detection than it is
>> to try to specify all the ranges of characters which should be
>> considered delimiters. It would be really hard to completely and
>> accurately specify all the ranges without including some characters
>> that shouldn't be included and leaving out some that should, let
>> alone dealing with the edge cases, and there will be many. I believe
>> that specifying ranges of delimiters would open us up to a slew of
>> bug reports & feature requests that would be difficult to satisfy.
>>
>> Ultimately, the issue here is that language and character sets are
>> ambiguous -- a simple global on/off rule would be simpler than
>> complicating the concept of delimiters (which is already too
>> complicated).
>
> I don't think so. I really like reStructuredText for the way it *just
> works* in the majority of "no-edge" cases. In my view, we can easily
> extend this important principle to the handling of CJK characters:
>
> By specifying just 9 character ranges, we can easily include CJK
> characters in the start_string_prefix and end_string_suffix regular
> expressions (see below for the patch).
>
> +1 for all users not using CJK, there is no change.
> +1 for CJK users, inline markup becomes simpler and less intrusive in
>    the majority of cases.
>
>> The ideal would be an inline switch, a "pragma" directive that would
>> tell the parser: treat content from here on (until the next switch)
>> in this way. Another such directive further on would change the
>> parser's state for content from then on. This would allow for
>> multi-language documents, such as an English document containing
>> passages in Japanese, or vice versa. A switch like this could be
>> useful for "spaced" languages as well.
>
> I don't see a "switch" directive as ideal.
>
> -1 explicit markup makes the document less "natural" in its source form
> -1 a switch also adds complexity: we have to define and document the
>    restrictions around inline markup characters for two cases.
>    (Should the switch also remove restrictions on the "inner side" of
>    inline markup?)
>
> The other alternative, a config setting, is not ideal either:
>
> -1 document parsing depends on settings not specified in the document,
>    reducing portability. (I know this is the case for other config
>    settings, too. Still, increasing this set is not good.)
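[Editorial aside: the ``bad character range`` failure in the traceback above can be reproduced in miniature. One plausible explanation (an assumption, not confirmed in this thread) is a "narrow" Python 2 build, where a non-BMP endpoint such as U+20000 from the proposed ``cjk`` ranges is stored as a surrogate pair, so a character class decomposes into a backwards range:]

```python
import re

# The regex compiler rejects a character class whose "range" runs
# backwards; this is the same error reported in the traceback.
try:
    re.compile(u'[z-a]')
except re.error as exc:
    print(exc)  # bad character range ...

# Hypothesis: on a "narrow" build, [\U00020000-\U0002FA1F] is seen as
# [\ud840\udc00-\ud87e\ude1f].  The inner "range" \udc00-\ud87e runs
# backwards (0xDC00 > 0xD87E), so compilation fails the same way.
surrogate_class = u'[\ud840\udc00-\ud87e\ude1f]'
try:
    re.compile(surrogate_class)
except re.error as exc:
    print(exc)
```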
>
>> Günter, beware the desire to simplify the treatment of CJK languages.
>> I have personal knowledge of the Japanese language [1], a complicated
>> writing system that grew organically over many centuries and continues
>> to change. A typical document will contain kanji (Chinese characters),
>> hiragana & katakana (phonetic characters), and Latin alphabetic
>> characters (including numerals & symbols), in both "half-width" and
>> "full-width" versions. I don't claim to have full knowledge of all the
>> idiosyncrasies of the Japanese writing system, and I wouldn't pretend
>> to understand those of Chinese or Korean.
>
> I would like to extend the specified goal of the inline markup rules:
>
>     The inline markup recognition rules were devised to allow 90% of
>     non-markup uses of "*", "`", "_", and "|" without escaping.
>
> in the opposite direction:
>
>     and allow 90% of markup uses of "*", "`", "_", and "|" without the
>     need for escaped whitespace or other workarounds.
>
> If this goal can be achieved by a simple addition to rules 1 and 4, I
> would prefer this very much over a switch between two sets of inline
> markup recognition rules.
>
> If this simple approach fails to make a significant improvement for
> CJK languages, I am ready to drop my proposal.
>
>> One thing I don't like about the proposal is the switch name.
>> "Skippable" is too vague. Perhaps "--no-inline-delimiters", but I'm
>> open to alternatives.
>
> How about "force inline markup"? (If you insist.)
>
> Günter
>
>
> Index: punctuation_chars.py
> ===================================================================
> +++ punctuation_chars.py  (Arbeitskopie)
>
> +# CJK ideographs
> +# --------------
> +#
> +# The East Asian languages Chinese, Japanese, and Korean use ideographs
> +# that do not require spaces between words. Inline markup should be
> +# recognized also directly following or preceding a CJK character.
> +# Unicode unifies under the term CJK characters the scripts Han,
> +# Bopomofo, Hiragana, Katakana, Hangul, and Yi.
> +#
> +# Sources for determination of the "CJK" property are the `Unicode
> +# standard Chapter 11 East Asian Scripts`__ describing the CJK
> +# unification and the Unicode data file ``Scripts.txt``.
> +#
> +# __ http://unicode.org/versions/Unicode4.0.0/ch11.pdf
>
> +cjk = (u'\u1100-\u11FF'  # 1100..11FF; Hangul Jamo
> +       u'\u2E80-\u4DBF'  # 2E80..2EFF; CJK Radicals Supplement
> +                         # 2F00..2FDF; Kangxi Radicals
> +                         # 2FF0..2FFF; Ideographic Description Characters
> +                         # 3000..303F; CJK Symbols and Punctuation
> +                         # 3040..309F; Hiragana
> +                         # 30A0..30FF; Katakana
> +                         # 3100..312F; Bopomofo
> +                         # 3130..318F; Hangul Compatibility Jamo
> +                         # 3190..319F; Kanbun
> +                         # 31A0..31BF; Bopomofo Extended
> +                         # 31C0..31EF; CJK Strokes
> +                         # 31F0..31FF; Katakana Phonetic Extensions
> +                         # 3200..32FF; Enclosed CJK Letters and Months
> +                         # 3300..33FF; CJK Compatibility
> +                         # 3400..4DBF; CJK Unified Ideographs Extension A
> +       u'\u4E00-\uA4CF'  # 4E00..9FFF; CJK Unified Ideographs
> +                         # A000..A48F; Yi Syllables
> +                         # A490..A4CF; Yi Radicals
> +       u'\uA960-\uA97F'  # A960..A97F; Hangul Jamo Extended-A
> +       u'\uAC00-\uD7FF'  # AC00..D7AF; Hangul Syllables
> +                         # D7B0..D7FF; Hangul Jamo Extended-B
> +       u'\uF900-\uFAFF'  # F900..FAFF; CJK Compatibility Ideographs
> +       u'\uFE30-\uFE4F'  # FE30..FE4F; CJK Compatibility Forms
> +       u'\uFF00-\uFFEF'  # FF00..FFEF; Halfwidth and Fullwidth Forms
> +       u'\U00020000-\U0002FA1F'
> +                         # 20000..2A6DF; CJK Unified Ideographs Ext. B
> +                         # 2A700..2B73F; CJK Unified Ideographs Ext. C
> +                         # 2B740..2B81F; CJK Unified Ideographs Ext. D
> +                         # 2F800..2FA1F; CJK Compatibility Ideographs
> +                         #               Supplement
> +       )
>
>
> Index: states.py
> ===================================================================
> --- states.py  (Revision 7640)
> +++ states.py  (Arbeitskopie)
> @@ -528,13 +528,15 @@
>      # Inline object recognition
>      # -------------------------
>      # lookahead and look-behind expressions for inline markup rules
> -    start_string_prefix = (u'(^|(?<=\\s|[%s%s]))' %
> +    start_string_prefix = (u'(^|(?<=\\s|[%s%s%s]))' %
>                             (punctuation_chars.openers,
> -                            punctuation_chars.delimiters))
> -    end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s]))' %
> +                            punctuation_chars.delimiters,
> +                            punctuation_chars.cjk))
> +    end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s%s]))' %
>                          (punctuation_chars.closing_delimiters,
>                           punctuation_chars.delimiters,
> -                         punctuation_chars.closers))
> +                         punctuation_chars.closers,
> +                         punctuation_chars.cjk))
>      # print start_string_prefix.encode('utf8')
>      # TODO: support non-ASCII whitespace in the following 4 patterns?
>      non_whitespace_before = r'(?<![ \n])'
>
> Index: restructuredtext.txt
> ===================================================================
> --- restructuredtext.txt  (Revision 7629)
> +++ restructuredtext.txt  (Arbeitskopie)
> @@ -2382,13 +2382,14 @@
>     immediately preceded by
>
>     * whitespace,
> -   * one of the ASCII characters ``- : / ' " < ( [ {`` or
> +   * one of the ASCII characters ``- : / ' " < ( [ {``,
>     * a non-ASCII punctuation character with `Unicode category`_
>       `Pd` (Dash),
>       `Po` (Other),
>       `Ps` (Open),
>       `Pi` (Initial quote), or
> -     `Pf` (Final quote) [#PiPf]_.
> +     `Pf` (Final quote) [#PiPf]_,
> +   * a CJK character.
>
>  2. Inline markup start-strings must be immediately followed by
>     non-whitespace.
>
> @@ -2400,13 +2401,14 @@
>     followed by
>
>     * whitespace,
> -   * one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >`` or
> +   * one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``,
>     * a non-ASCII punctuation character with `Unicode category`_
>       `Pd` (Dash),
>       `Po` (Other),
>       `Pe` (Close),
>       `Pf` (Final quote), or
> -     `Pi` (Initial quote) [#PiPf]_.
> +     `Pi` (Initial quote) [#PiPf]_,
> +   * a CJK character.
>
> 5.
>    If an inline markup start-string is immediately preceded by one of
>    the ASCII characters ``' " < ( [ {``, or a character with Unicode
>    character
>
> _______________________________________________
> Docutils-develop mailing list
> Doc...@li...
> https://lists.sourceforge.net/lists/listinfo/docutils-develop
>
> Please use "Reply All" to reply to the list.
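[Editorial aside: as an illustration of the patch quoted above, here is a small self-contained sketch. It builds simplified versions of start_string_prefix and end_string_suffix from a BMP-only subset of the proposed cjk ranges (the openers/closers/delimiters classes and the non-BMP planes are omitted for brevity) and shows emphasis being recognized directly between Japanese characters:]

```python
import re

# BMP-only subset of the cjk ranges from the punctuation_chars.py patch.
cjk = (u'\u1100-\u11FF'   # Hangul Jamo
       u'\u2E80-\u4DBF'   # CJK radicals, kana, bopomofo, ... Extension A
       u'\u4E00-\uA4CF'   # CJK Unified Ideographs, Yi
       u'\uAC00-\uD7FF'   # Hangul Syllables, Jamo Extended-B
       u'\uF900-\uFAFF'   # CJK Compatibility Ideographs
       u'\uFF00-\uFFEF')  # Halfwidth and Fullwidth Forms

# Simplified look-behind/look-ahead in the spirit of the states.py patch.
start_string_prefix = u'(^|(?<=\\s|[%s]))' % cjk
end_string_suffix = u'($|(?=\\s|[%s]))' % cjk
emphasis = re.compile(start_string_prefix + r'\*(\S.*?\S|\S)\*'
                      + end_string_suffix)

text = u'これは*強調*です'       # "this is *emphasis*" in Japanese
print(emphasis.search(text).group(2))  # 強調

# With the unpatched rules (whitespace and punctuation only, here reduced
# to whitespace), the same text is not recognized as inline markup:
plain = re.compile(u'(^|(?<=\\s))' + r'\*(\S.*?\S|\S)\*' + u'($|(?=\\s))')
print(plain.search(text))              # None
```

With the patched classes the markup is recognized without inserting ASCII spaces around it, which is the "+1 for CJK users" point made above.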