From: Dave K. <dku...@pa...> - 2013-05-03 02:24:36
This morning, I updated my docutils repository using SVN. Now I get this
traceback when I run rst2html.py::

    [snip]
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 461, in <module>
        class Inliner:
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 591, in Inliner
        initial=build_regexp(parts),
      File "/usr/local/lib/python2.7/site-packages/docutils/parsers/rst/states.py", line 456, in build_regexp
        return re.compile(regexp, re.UNICODE)
      File "/usr/local/lib/python2.7/re.py", line 190, in compile
        return _compile(pattern, flags)
      File "/usr/local/lib/python2.7/re.py", line 242, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range

I suspect that this error is caused by the new word-boundary characters for
CJK languages that were recently updated in
docutils/utils/punctuation_chars.py. Am I right about that?

I removed my docutils installation, then installed the latest snapshot, and
the error did not occur. Is there something that I should do to avoid this
error? Perhaps some sort of locale setting? Or is it really a bug?

- Dave K

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

----- Original Message -----
> From: Guenter Milde <mi...@us...>
> To: doc...@li...
> Cc:
> Sent: Friday, April 26, 2013 2:23 AM
> Subject: Re: [Docutils-develop] [docutils:patches] #103 Recognize inline
>          markups without word boundaries
>
> On 2013-04-25, David Goodger wrote:
>
> ...
>
>> I agree with Ishimoto-san's position on this. It's easier and cleaner
>> to simply flip a switch that turns off delimiter detection than it is
>> to try to specify all the ranges of characters which should be
>> considered delimiters. It would be really hard to completely and
>> accurately specify all the ranges without including some characters
>> that shouldn't be included and leaving out some that should, let
>> alone dealing with the edge cases, and there will be many. I believe
>> that specifying ranges of delimiters would open us up to a slew of
>> bug reports & feature requests that would be difficult to satisfy.
>>
>> Ultimately, the issue here is that language and character sets are
>> ambiguous -- a simple global on/off rule would be simpler than
>> complicating the concept of delimiters (which is already too
>> complicated).
>
> I don't think so. I really like reStructuredText for the way it *just
> works* in the majority of "no-edge" cases. In my view, we can easily
> extend this important principle to the handling of CJK characters:
>
> By specifying just 9 character ranges, we can easily include CJK
> characters in the start_string_prefix and end_string_suffix regular
> expressions (see below for the patch).
>
> +1 for all users not using CJK, there is no change.
> +1 for CJK users, inline markup becomes simpler and less intrusive in
>    the majority of cases.
>
>> The ideal would be an inline switch, a "pragma" directive that would
>> tell the parser: treat content from here on (until the next switch)
>> in this way. Another such directive further on would change the
>> parser's state for content from then on. This would allow for
>> multi-language documents, such as an English document containing
>> passages in Japanese, or vice versa. A switch like this could be
>> useful for "spaced" languages as well.
>
> I don't see a "switch" directive as ideal.
>
> -1 explicit markup makes the document less "natural" in its source form
> -1 a switch also adds complexity: we have to define and document the
>    restrictions around inline markup characters for two cases.
>    (Should the switch also remove restrictions on the "inner side" of
>    inline markup?)
>
> The other alternative, a config setting, is not ideal either:
>
> -1 document parsing depends on settings not specified in the document,
>    reducing portability. (I know this is the case for other config
>    settings, too. Still, increasing this set is not good.)
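[Editorial aside: the ``bad character range`` failure in the traceback above can be reproduced in miniature. One plausible explanation (an assumption, not confirmed in this thread) is a "narrow" Python 2 build, where a non-BMP endpoint such as U+20000 from the proposed ``cjk`` ranges is stored as a surrogate pair, so a character class decomposes into a backwards range:]

```python
import re

# The regex compiler rejects a character class whose "range" runs
# backwards; this is the same error reported in the traceback.
try:
    re.compile(u'[z-a]')
except re.error as exc:
    print(exc)  # bad character range ...

# Hypothesis: on a "narrow" build, [\U00020000-\U0002FA1F] is seen as
# [\ud840\udc00-\ud87e\ude1f].  The inner "range" \udc00-\ud87e runs
# backwards (0xDC00 > 0xD87E), so compilation fails the same way.
surrogate_class = u'[\ud840\udc00-\ud87e\ude1f]'
try:
    re.compile(surrogate_class)
except re.error as exc:
    print(exc)
```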
>
>> Günter, beware the desire to simplify the treatment of CJK languages.
>> I have personal knowledge of the Japanese language [1], a complicated
>> writing system that grew organically over many centuries and continues
>> to change. A typical document will contain kanji (Chinese characters),
>> hiragana & katakana (phonetic characters), and Latin alphabetic
>> characters (including numerals & symbols), in both "half-width" and
>> "full-width" versions. I don't claim to have full knowledge of all the
>> idiosyncrasies of the Japanese writing system, and I wouldn't pretend
>> to understand those of Chinese or Korean.
>
> I would like to extend the specified goal of the inline markup rules:
>
>     The inline markup recognition rules were devised to allow 90% of
>     non-markup uses of "*", "`", "_", and "|" without escaping.
>
> in the opposite direction:
>
>     and allow 90% of markup uses of "*", "`", "_", and "|" without the
>     need for escaped whitespace or other workarounds.
>
> If this goal can be achieved by a simple addition to rules 1 and 4, I
> would prefer this very much over a switch between two sets of inline
> markup recognition rules.
>
> If this simple approach fails to make a significant improvement for
> CJK languages, I am ready to drop my proposal.
>
>> One thing I don't like about the proposal is the switch name.
>> "Skippable" is too vague. Perhaps "--no-inline-delimiters", but I'm
>> open to alternatives.
>
> How about "force inline markup"? (If you insist.)
>
> Günter
>
>
> Index: punctuation_chars.py
> ===================================================================
> +++ punctuation_chars.py  (Arbeitskopie)
>
> +# CJK ideographs
> +# --------------
> +#
> +# The East Asian languages Chinese, Japanese, and Korean use ideographs
> +# that do not require spaces between words. Inline markup should be
> +# recognized also directly following or preceding a CJK character.
> +# Unicode unifies under the term CJK characters the scripts Han,
> +# Bopomofo, Hiragana, Katakana, Hangul, and Yi.
> +#
> +# Sources for determination of the "CJK" property are the `Unicode
> +# standard Chapter 11 East Asian Scripts`__ describing the CJK
> +# unification and the Unicode data file ``Scripts.txt``.
> +#
> +# __ http://unicode.org/versions/Unicode4.0.0/ch11.pdf
>
> +cjk = (u'\u1100-\u11FF'  # 1100..11FF; Hangul Jamo
> +       u'\u2E80-\u4DBF'  # 2E80..2EFF; CJK Radicals Supplement
> +                         # 2F00..2FDF; Kangxi Radicals
> +                         # 2FF0..2FFF; Ideographic Description Characters
> +                         # 3000..303F; CJK Symbols and Punctuation
> +                         # 3040..309F; Hiragana
> +                         # 30A0..30FF; Katakana
> +                         # 3100..312F; Bopomofo
> +                         # 3130..318F; Hangul Compatibility Jamo
> +                         # 3190..319F; Kanbun
> +                         # 31A0..31BF; Bopomofo Extended
> +                         # 31C0..31EF; CJK Strokes
> +                         # 31F0..31FF; Katakana Phonetic Extensions
> +                         # 3200..32FF; Enclosed CJK Letters and Months
> +                         # 3300..33FF; CJK Compatibility
> +                         # 3400..4DBF; CJK Unified Ideographs Extension A
> +       u'\u4E00-\uA4CF'  # 4E00..9FFF; CJK Unified Ideographs
> +                         # A000..A48F; Yi Syllables
> +                         # A490..A4CF; Yi Radicals
> +       u'\uA960-\uA97F'  # A960..A97F; Hangul Jamo Extended-A
> +       u'\uAC00-\uD7FF'  # AC00..D7AF; Hangul Syllables
> +                         # D7B0..D7FF; Hangul Jamo Extended-B
> +       u'\uF900-\uFAFF'  # F900..FAFF; CJK Compatibility Ideographs
> +       u'\uFE30-\uFE4F'  # FE30..FE4F; CJK Compatibility Forms
> +       u'\uFF00-\uFFEF'  # FF00..FFEF; Halfwidth and Fullwidth Forms
> +       u'\U00020000-\U0002FA1F'
> +                         # 20000..2A6DF; CJK Unified Ideographs Ext. B
> +                         # 2A700..2B73F; CJK Unified Ideographs Ext. C
> +                         # 2B740..2B81F; CJK Unified Ideographs Ext. D
> +                         # 2F800..2FA1F; CJK Compatibility Ideographs
> +                         #               Supplement
> +       )
>
>
> Index: states.py
> ===================================================================
> --- states.py  (Revision 7640)
> +++ states.py  (Arbeitskopie)
> @@ -528,13 +528,15 @@
>      # Inline object recognition
>      # -------------------------
>      # lookahead and look-behind expressions for inline markup rules
> -    start_string_prefix = (u'(^|(?<=\\s|[%s%s]))' %
> +    start_string_prefix = (u'(^|(?<=\\s|[%s%s%s]))' %
>                             (punctuation_chars.openers,
> -                            punctuation_chars.delimiters))
> -    end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s]))' %
> +                            punctuation_chars.delimiters,
> +                            punctuation_chars.cjk))
> +    end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s%s]))' %
>                          (punctuation_chars.closing_delimiters,
>                           punctuation_chars.delimiters,
> -                         punctuation_chars.closers))
> +                         punctuation_chars.closers,
> +                         punctuation_chars.cjk))
>      # print start_string_prefix.encode('utf8')
>      # TODO: support non-ASCII whitespace in the following 4 patterns?
>      non_whitespace_before = r'(?<![ \n])'
>
> Index: restructuredtext.txt
> ===================================================================
> --- restructuredtext.txt  (Revision 7629)
> +++ restructuredtext.txt  (Arbeitskopie)
> @@ -2382,13 +2382,14 @@
>     immediately preceded by
>
>     * whitespace,
> -   * one of the ASCII characters ``- : / ' " < ( [ {`` or
> +   * one of the ASCII characters ``- : / ' " < ( [ {``,
>     * a non-ASCII punctuation character with `Unicode category`_
>       `Pd` (Dash),
>       `Po` (Other),
>       `Ps` (Open),
>       `Pi` (Initial quote), or
> -     `Pf` (Final quote) [#PiPf]_.
> +     `Pf` (Final quote) [#PiPf]_,
> +   * a CJK character.
>
>  2. Inline markup start-strings must be immediately followed by
>     non-whitespace.
>
> @@ -2400,13 +2401,14 @@
>     followed by
>
>     * whitespace,
> -   * one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >`` or
> +   * one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``,
>     * a non-ASCII punctuation character with `Unicode category`_
>       `Pd` (Dash),
>       `Po` (Other),
>       `Pe` (Close),
>       `Pf` (Final quote), or
> -     `Pi` (Initial quote) [#PiPf]_.
> +     `Pi` (Initial quote) [#PiPf]_,
> +   * a CJK character.
>
> 5.
>    If an inline markup start-string is immediately preceded by one of
>    the ASCII characters ``' " < ( [ {``, or a character with Unicode
>    character
>
> _______________________________________________
> Docutils-develop mailing list
> Doc...@li...
> https://lists.sourceforge.net/lists/listinfo/docutils-develop
>
> Please use "Reply All" to reply to the list.
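[Editorial aside: as an illustration of the patch quoted above, here is a small self-contained sketch. It builds simplified versions of start_string_prefix and end_string_suffix from a BMP-only subset of the proposed cjk ranges (the openers/closers/delimiters classes and the non-BMP planes are omitted for brevity) and shows emphasis being recognized directly between Japanese characters:]

```python
import re

# BMP-only subset of the cjk ranges from the punctuation_chars.py patch.
cjk = (u'\u1100-\u11FF'   # Hangul Jamo
       u'\u2E80-\u4DBF'   # CJK radicals, kana, bopomofo, ... Extension A
       u'\u4E00-\uA4CF'   # CJK Unified Ideographs, Yi
       u'\uAC00-\uD7FF'   # Hangul Syllables, Jamo Extended-B
       u'\uF900-\uFAFF'   # CJK Compatibility Ideographs
       u'\uFF00-\uFFEF')  # Halfwidth and Fullwidth Forms

# Simplified look-behind/look-ahead in the spirit of the states.py patch.
start_string_prefix = u'(^|(?<=\\s|[%s]))' % cjk
end_string_suffix = u'($|(?=\\s|[%s]))' % cjk
emphasis = re.compile(start_string_prefix + r'\*(\S.*?\S|\S)\*'
                      + end_string_suffix)

text = u'これは*強調*です'       # "this is *emphasis*" in Japanese
print(emphasis.search(text).group(2))  # 強調

# With the unpatched rules (whitespace and punctuation only, here reduced
# to whitespace), the same text is not recognized as inline markup:
plain = re.compile(u'(^|(?<=\\s))' + r'\*(\S.*?\S|\S)\*' + u'($|(?=\\s))')
print(plain.search(text))              # None
```

With the patched classes the markup is recognized without inserting ASCII spaces around it, which is the "+1 for CJK users" point made above.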