
#103 Recognize inline markups without word boundaries

Milestone: None
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-07-03
Created: 2013-04-12
Creator: atsuo ishimoto
Private: No

Most East Asian languages, such as Japanese or Chinese, don't put white space between words.
The current inline markup rules don't fit these languages: we have to write a lot of '\ ' escapes to use inline markup.

In this patch, I added a new configuration setting to recognize inline markup even if word boundaries are not present around it.

e.g. rst2html --skippable-inline-boundaries japanese.rst
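
For illustration only (a simplified sketch, not the actual pattern or the patch in docutils/parsers/rst/states.py): the start-string rule requires whitespace or certain punctuation before the markup, and CJK characters satisfy neither condition:

    # Simplified stand-in for the inline-markup start-string rule: the start
    # string must begin the text block or follow whitespace / some punctuation.
    import re

    emphasis_start = re.compile(
        r"""(^|(?<=\s|['"([{<\-/:]))\*\S""", re.UNICODE)

    print(bool(emphasis_start.search(u'This is *emphasized* text.')))  # True
    print(bool(emphasis_start.search(u'これは*強調*です。')))  # False: no word boundary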


Discussion

  • Günter Milde
    2013-04-23

    Thanks for your patch.
    I wonder whether a different approach would be more suitable, though:

    Just like punctuation, CJK characters could be generally allowed around inline markup. This would spare us one more configuration setting.

    Of course, this is a backwards-incompatible change. It could be problematic if there are Japanese/Chinese/Korean documents that contain inline markup characters (*`_) in plain text: up to now these are not recognized as markup, but under the new rule they would have to be escaped or the containing text marked as literal.

     
  • atsuo ishimoto
    2013-04-23

    Thank you for your comment.

    I agree that a new configuration setting is not desirable. But I don't know how to distinguish the characters allowed around inline markup from other characters. CJK languages use thousands of characters, so there is no simple way to tell whether a character is CJK or not. Also, I'm sure other languages such as Vietnamese don't need white space around words.

    We may be able to construct complex rules or character tables to do the job, but I don't think reST/Sphinx users want to learn such complex rules. Python 3 has a definition of identifiers based on Unicode character data [1], but a definition like this will never be understood by non-technical people.

    I think the rule "all special characters should be escaped if they are not markup" is simple, easy, and good enough for most CJK documents. I don't use ASCII punctuation such as periods or quotes at all in Japanese documents; most of the non-alphanumeric characters I write will be reST markup. So the current inline markup rules do more harm than good for the documents I usually write.

    [1] http://docs.python.org/3/reference/lexical_analysis.html#identifiers

     
  • Günter Milde
    2013-04-25

    It seems the CJK (or CJKV if you want to include Vietnamese ideographs)
    characters are laid out in blocks, so that it is relatively easy to specify
    them with character ranges in the regular expression.

    As a simple test of the concept, I included the CJK Unified Ideographs:

    --- states.py   (Revision 7640)
    +++ states.py   (Arbeitskopie)
    @@ -528,7 +528,7 @@
         # Inline object recognition
         # -------------------------
         # lookahead and look-behind expressions for inline markup rules
    -    start_string_prefix = (u'(^|(?<=\\s|[%s%s]))' %
    +    start_string_prefix = (u'(^|(?<=\\s|[%s%s\u4E00-\uFFFF]))' %
                                (punctuation_chars.openers,
                                 punctuation_chars.delimiters))
         end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s]))' %
    

    and it seemed to work with the test string ⼂*emph*?.

    In a real implementation, the ranges (which could span several Unicode
    blocks) would be defined in a separate variable and included in both
    start_string_prefix and end_string_suffix.

    The Unicode Standard lists the relevant Unicode blocks, e.g. in
    http://unicode.org/versions/Unicode4.0.0/ch11.pdf

    The Scripts.txt Unicode data file may also be of interest (ranges of
    characters in Hiragana, Katakana, ..., Han, Yi).
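
    As a rough sketch of the idea (not part of any patch here, and assuming a
    local copy of Scripts.txt from the Unicode Character Database), the ranges
    could be collected mechanically along these lines:

        # Collect code-point ranges for the wanted scripts from Scripts.txt
        # and join them into a string usable inside a regexp character class.
        import re

        WANTED = set(['Han', 'Hiragana', 'Katakana', 'Hangul', 'Bopomofo', 'Yi'])
        ENTRY = re.compile(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)')

        def cjk_ranges(path='Scripts.txt'):
            ranges = []
            for line in open(path):
                m = ENTRY.match(line)
                if m and m.group(3) in WANTED:
                    start = int(m.group(1), 16)
                    end = int(m.group(2) or m.group(1), 16)
                    ranges.append((start, end))
            # e.g. u'\u3041-\u3096\u30A1-\u30FA...' for use inside [...]
            return u''.join(chr(s) + u'-' + chr(e) for s, e in sorted(ranges))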

    > We may be able to construct complex rules or character tables to do the
    > job, but I don't think reST/Sphinx users want to learn such complex
    > rules. Python 3 has a definition of identifiers based on Unicode
    > character data [1], but a definition like this will never be understood
    > by non-technical people.

    Adding CJK characters to the specification in docs/ref/rst/restructuredtext.txt
    seems easy to understand for the affected average user:

      1. Inline markup start-strings must start a text block or be
         immediately preceded by
    
         * whitespace,
         * one of the ASCII characters ``- : / ' " < ( [ {`` or
         * a non-ASCII punctuation character with `Unicode category`_
           `Pd` (Dash),
           `Po` (Other),
           `Ps` (Open),
           `Pi` (Initial quote), or
           `Pf` (Final quote) [#PiPf]_.
     +   * a CJK (East Asian ideographic) character
    

    (Please replace this suggestion with a description of the character
    set that is best understood by people using these scripts.) A similar extension would be added for characters after the end-string.

    > I think the rule "all special characters should be escaped if they are
    > not markup" is simple, easy, and good enough for most CJK documents. I
    > don't use ASCII punctuation such as periods or quotes at all in Japanese
    > documents; most of the non-alphanumeric characters I write will be reST
    > markup. So the current inline markup rules do more harm than good for
    > the documents I usually write.

    IMO, extending the inline recognition rules could provide a solution for all
    users and avoid documents that depend on a special config setting.

    However, when making such a not fully backwards-compatible extension, we
    need to take care of uses of reST markup characters as "normal" characters
    inside CJK text. Could you estimate the percentage of documents with a
    mixture of CJK characters and any of the ASCII characters *`|_ that work
    up to now but would need escaping under the new rule?

    Günter

     
  • David Goodger
    2013-04-25

    This is a long-standing issue that makes reST difficult to use in languages that don't use spaces as word delimiters. ReST was designed for languages that do use such delimiters, and has neglected languages like Japanese and Chinese (not so much modern Korean, which does use spaces).

    I agree with Ishimoto-san's position on this. It's easier and cleaner to simply flip a switch that turns off delimiter detection, than it is to try to specify all the ranges of characters which should be considered delimiters. It would be really hard to completely and accurately specify all the ranges, without including some characters that shouldn't be included, and leaving out some that should. Let alone dealing with the edge cases, and there will be many. I believe that specifying ranges of delimiters would open us up to a slew of bug reports & feature requests that would be difficult to satisfy.

    Ultimately, the issue here is that language and character sets are ambiguous -- a simple global on/off rule would be simpler than complicating the concept of delimiters (which is already too complicated).

    The ideal would be an inline switch, a "pragma" directive that would tell the parser: treat content from here on (until the next switch) in this way. Another such directive further on would change the parser's state for content from then on. This would allow for multi-language documents, such as an English document containing passages in Japanese, or vice-versa. A switch like this could be useful for "spaced" languages as well.
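
    (A purely hypothetical illustration of such a switch; the directive name and
    option below are invented for this sketch and do not exist in Docutils:)

        .. The "inline-boundaries" directive used below is hypothetical;
           it is shown for illustration only.

        .. inline-boundaries:: off

        日本語の文中でも*強調*や参照_のようなインラインマークアップが使える。

        .. inline-boundaries:: on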

    Günter, beware the desire to simplify the treatment of CJK languages. I have personal knowledge of the Japanese language [1], a complicated writing system that grew organically over many centuries and continues to change. A typical document will contain kanji (Chinese characters), hiragana & katakana (phonetic characters), and Latin alphabetic characters (including numerals & symbols), both "half-width" and "full-width" versions. I don't claim to have full knowledge of all the idiosyncrasies of the Japanese writing system, and I wouldn't pretend to understand those of Chinese or Korean.

    [1] I speak/read/write Japanese, lived in Japan for 7 years, married a Japanese woman and speak Japanese at home.

    One thing I don't like about the proposal is the switch name. "Skippable" is too vague. Perhaps "--no-inline-delimiters", but I'm open to alternatives.

     
  • atsuo ishimoto
    2013-04-26

    Günter,

    > It seems the CJK (or CJKV if you want to include Vietnamese ideographs) characters are laid out in blocks, so that it is relatively easy to specify them with character ranges in the regular expression.

    There are many more characters that would have to be included in the list, such as U+334D (SQUARE MEETORU).

    And we would have to decide what counts as a Japanese character. JIS (Japanese Industrial Standards) defines a Japanese character set, but the standard contains a lot of non-character symbols and foreign characters from languages such as English or Greek. These characters behave like "Japanese characters" in some cases, but not always.

    > Adding CJK characters to the specification in docs/ref/rst/restructuredtext.txt seems easy to understand for the affected average user:

    I'm skeptical that users will understand the specification, because I don't think we can describe what CJK characters are in a few sentences.

    Also, the current rule is already hard for me to understand. I don't know which characters are 'Ps', and I don't want to keep the UCD open while I'm writing reST docs.

    > Could you estimate the percentage of documents with a mixture of CJK characters and any of the ASCII characters *`|_ that work up to now but would need escaping under the new rule?

    This is a hard question to answer, but in my personal experience not many documents will break. We have a Python community site in Japan (http://www.python.jp) which is written in reST. I compiled the entire site, and the Japanese translation of the Python standard documentation (http://docs.python.jp), with a patched version of Docutils.

    Most of the errors came from underscores, as in 'mod_python': such names are recognized as a reference to 'mod', and an 'Unknown target name' error is issued. I think I can live with 'mod\_python', but it would be nice if we could fix it. Or this markup (WORD_) could be disabled when the new option is enabled.
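
    (For illustration, the kind of escaping this requires with the patch; the
    surrounding Japanese wording here is made up:)

        mod_python を使う場合    (with the patch, "mod_" is parsed as a reference to "mod")
        mod\_python を使う場合   (escaped underscore; renders as "mod_python")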

     
    Last edit: atsuo ishimoto 2013-04-26
    • Günter Milde
      2013-04-28

      If a patch of moderate complexity can solve the problem in 90% of the use cases, I'd prefer it over explicit markup or external configuration settings.

      > There are many more characters that would have to be included in the list.

      Of course. This was just a "proof of concept". The attached patch includes a complete list of Unicode characters with the script property Han, Bopomofo, Hiragana, Katakana, Hangul, or Yi.
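
      (Not the patch itself, but one quick way to sanity-check the concept is the
      third-party "regex" module, which understands Unicode script properties;
      the attached patch instead spells the ranges out explicitly, since Docutils
      only uses the standard library:)

          # Requires: pip install regex
          import regex

          CJK = r'\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}\p{Yi}'
          # same shape as start_string_prefix in states.py, with the CJK
          # scripts added to the characters allowed in the look-behind
          start = regex.compile(r"(^|(?<=\s|['\"([{<\-/:" + CJK + r"]))\*\S")

          print(bool(start.search('これは*強調*です')))  # True: Hiragana before "*"
          print(bool(start.search('abc*emph*')))  # False: Latin letter before "*"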

      > I don't think we can describe what CJK characters are in a few sentences.

      For the details, we can refer to the Unicode standard.

       
      Last edit: Günter Milde 2013-04-28
      • atsuo ishimoto
        2013-04-29

        I fail to see why you prefer a big table of characters over a setting. The current inline markup syntax was designed with western languages in mind, and it does not work for some other languages. So I think a configuration option to turn such syntax off for non-Latin languages would be a natural solution.

        I think a language cannot be determined character by character. As I wrote above, typical Japanese text contains a lot of characters from other languages. The meaning of a character should be determined by the context of the text, not by the character itself.

        > For the details, we can refer to the Unicode standard.

        I'm against this. The rule is too complicated if users need to consult the Unicode standard.

         
        • Günter Milde
          2013-04-29

          > I fail to see why you prefer a big table of characters over a setting.

          I like Docutils to "just work".

          The current (admittedly complex)
          inline recognition rules allow this in 90% of use cases in western
          languages. Extending the rules to reach this "success rate" with CJK
          languages as well seems "natural" to me. In normal use, the complexity
          of the rules does not get in the way: cases like 'mod_python' would
          not require special care with my patch. Corner cases get reported and
          can be solved (by escaping the offending character in non-markup use,
          or by escaped whitespace to force markup use).

          Switching between two sets of recognition rules does not seem less
          complex to me. You still need to specify which rules apply. (How
          about whitespace inside inline markup?) Also the code is not
          less complex than an extension of the existing character table.

          > I think a language cannot be determined character by character.
          > As I wrote above, typical Japanese text contains a lot of characters
          > from other languages.

          What is the typical use case for characters from other languages? Are,
          e.g., Latin letters commonly used without surrounding whitespace in
          Japanese text? How often would you use inline markup for the text
          bordering the Latin part? Remember that punctuation is allowed
          around inline markup anyway.

          In addition, or alternatively, we could also use the language setting.
          Both the document language and the language of text parts can be set,
          so inline markup recognition could be made language-sensitive without
          the need for a new setting. Of course, this would not solve cases like
          "mod_python" inside Japanese text without additional markup (either
          escaping, as in mod\_python, or something like :language-en:`mod_python`
          after defining a language-en role). It would, however, work out of the
          box for documents with the correct language setting.

          I will leave the decision on the right approach to the native speakers.
          For a complete patch, an update to the documentation (both
          ref/restructuredtext.txt and config.txt) is required.
          As for the name, I prefer "force inline markup" over "no inline
          delimiters".

           
          • atsuo ishimoto
            2013-05-09

            I'm sorry for my slow response.

            > Extending the rules to reach this "success rate" with CJK languages as well seems "natural" to me.

            I don"t think it natural. Such rule will not be easily extended to other languages without careful consideration.

            > Switching between two sets of recognition rules does not seem less complex to me.

            My proposal is not to switch rules, but to turn the rule off. Most Japanese writers would not need to learn the current inline markup rules at all.

            > You still need to specify which rules apply.

            I think using a global setting is much easier than worrying about every markup construct in a Japanese document.

            > Are, e.g., Latin letters commonly used without surrounding whitespace in Japanese text?

            Yes. This is pretty common.

            > How often would you use inline markup for the text bordering the Latin part?

            Very often, at least in my case. Markup like * or ` would not be used so much, but I use a lot of markup like :var:`spam`, and links are used a lot in the Python book I am writing.

             
  • atsuo ishimoto
    2013-04-26

    David,

    Thank you for your comment.

    I updated the patch with a new option name. The old patch did not detect backslash escapes correctly; the new patch fixes that.

    A "pragma" directive would be great. I'll try to implement the directive later.

     