
#1226 Extend Document::WordCharacterClass() to handle CJK symbols and punctuation

Milestone: Completed
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2019-01-26
Created: 2018-08-06
Creator: Zufu Liu
Private: No

Extend CharClassify::cc Document::WordCharacterClass(unsigned int ch) const to handle CJK symbols and punctuation, instead of returning CharClassify::ccWord for all DBCS characters.

Some links:
https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation
https://www.unicode.org/charts/PDF/U3000.pdf

Discussion

  • Neil Hodgson

    Neil Hodgson - 2018-08-06

    Document::WordCharacterClass does not return CharClassify::ccWord for all CJK characters when used in a UTF-8 document. For example, U+3004 〄 (JAPANESE INDUSTRIAL STANDARD SYMBOL) and U+3010 【 (LEFT BLACK LENTICULAR BRACKET) return CharClassify::ccPunctuation. This is based on the general category of the character.

    Character classification for other encodings like Big5 is difficult as it would require large tables of classification data for each encoding or conversion to Unicode. Conversion to Unicode is not as simple as it may seem due to historical incompatibilities particularly in Shift-JIS.
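
    As an illustration of classification by general category, here is a toy stand-in (the enum and function are simplified inventions, not Scintilla's actual CharClassify API, and only a handful of code points are hard-coded):

```cpp
#include <cassert>

// Toy stand-in for classification by Unicode general category.
enum class CharClass { space, word, punctuation };

// Hard-coded sample of general categories: Zs -> space, Ps/Pe/Po ->
// punctuation, everything else (e.g. CJK ideographs, category Lo) -> word.
CharClass ClassifyByGeneralCategory(unsigned int ch) {
    switch (ch) {
    case 0x3000:              // U+3000 IDEOGRAPHIC SPACE, category Zs
        return CharClass::space;
    case 0x3001: case 0x3002: // ideographic comma / full stop, category Po
    case 0x3010: case 0x3011: // lenticular brackets, categories Ps / Pe
        return CharClass::punctuation;
    default:
        return CharClass::word;
    }
}
```

    This is why U+3010 comes back as punctuation in a UTF-8 document while most CJK characters remain word characters.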

     
  • Zufu Liu

    Zufu Liu - 2018-08-06

    Just handling some commonly used punctuation from the U+3000 block and the fullwidth forms from the U+FF00 block (without converting to Unicode?) may not be hard.

    https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

    The hard part, as you said, is reading each encoding's documentation and finding the correct decoding / conversion.

     
  • Zufu Liu

    Zufu Liu - 2018-08-07

    Hi Neil, I found that there is no need to convert back to Unicode or to add a table/map: punctuation is encoded contiguously in DBCS encodings.

    The attached script can be used to generate these punctuation lists, lead bytes, invalid lead bytes and invalid tail bytes.

     
  • Zufu Liu

    Zufu Liu - 2018-08-07

    By the way, this script can be modified to generate a uint16_t table[256] (as previously discussed) for DBCSIsLeadByte()/IsDBCSLeadByteNoExcept(), IsDBCSLeadByteInvalid() and IsDBCSTrailByteInvalid() (the 5*3 attributes).

     

    Last edit: Zufu Liu 2018-08-07
  • Zufu Liu

    Zufu Liu - 2018-08-07

    Script updated: the PUA is ignored, and the result is printed to a file and is more readable.

    DBCSIsLeadByte() and IsDBCSLeadByteNoExcept() conflict with IsDBCSLeadByteInvalid().

     

    Last edit: Zufu Liu 2018-08-07
  • Zufu Liu

    Zufu Liu - 2018-08-07

    Script updated to display ranges of symbol and punctuation.

    The result for cp932:

    lead byte: [81, 84], [87, 9F], [E0, EA], [ED, EE], [F0, FC]
    tail byte: [40, 7E], [80, FC]
    invalid lead: [80, 80], [85, 86], [A0, DF], [EB, EC], [EF, EF], [FD, FF]
    invalid tail: [00, 3F], [7F, 7F], [FD, FF]
    ccPunctuation: [8141, 8151], [8156, 8156], [815C, 817E], [8180, 81AC], [81B8, 81BF], [81C8, 81CE], [81DA, 81E8], [81F1, 81F7], [81FC, 81FC], [849F, 84BE], [875F, 8775], [877E, 877E], [8780, 879C], [EEF9, EEFC], [FA54, FA5B]
    ccSpace: [8140, 8140]
    

    For cp936:

    lead byte: [81, FE]
    tail byte: [40, 7E], [80, FE]
    invalid lead: [80, 80], [FF, FF]
    invalid tail: [00, 3F], [7F, 7F], [FF, FF]
    ccPunctuation: [A1A2, A1A4], [A1A7, A1A8], [A1AA, A1FE], [A3A1, A3AF], [A3BA, A3C0], [A3DB, A3E0], [A3FB, A3FE], [A6E0, A6EB], [A6EE, A6F2], [A6F4, A6F5], [A842, A87E], [A880, A895], [A949, A957], [A959, A95A], [A95C, A95C], [A961, A962], [A968, A97E], [A980, A988], [A9A4, A9EF]
    ccSpace: [A1A1, A1A1]
    

    For cp949:

    lead byte: [81, C8], [CA, FD]
    tail byte: [41, 5A], [61, 7A], [81, FE]
    invalid lead: [80, 80], [C9, C9], [FE, FF]
    invalid tail: [00, 40], [5B, 60], [7B, 80], [FF, FF]
    ccPunctuation: [A1A2, A1A8], [A1AA, A1C9], [A1CB, A1FE], [A2A1, A2A6], [A2A8, A2AF], [A2B1, A2E7], [A3A1, A3AF], [A3BA, A3C0], [A3DB, A3E0], [A3FB, A3FE], [A6A1, A6E4], [A7A1, A7A3], [A7A5, A7D8], [A7DA, A7EF], [A8B1, A8E6], [A9B1, A9E6]
    ccSpace: [A1A1, A1A1], [A1A9, A1A9]
    

    For cp950:

    lead byte: [A1, C7], [C9, F9]
    tail byte: [40, 7E], [A1, FE]
    invalid lead: [80, A0], [C8, C8], [FA, FF]
    invalid tail: [00, 3F], [7F, A0], [FF, FF]
    ccPunctuation: [A141, A17E], [A1A1, A1C4], [A1C6, A1FE], [A240, A258], [A262, A27E], [A2A1, A2AE], [A3BB, A3BB], [A3E1, A3E1], [F9DD, F9FE]
    ccSpace: [A140, A140]
    

    For cp1361:

    lead byte: [84, D3], [D9, DE], [E0, F9]
    tail byte: [31, 7E], [81, FE]
    invalid lead: [80, 83], [D4, D8], [DF, DF], [FA, FF]
    invalid tail: [00, 30], [7F, 80], [FF, FF]
    ccPunctuation: [D932, D938], [D93A, D959], [D95B, D97E], [D991, D9A6], [D9A8, D9AF], [D9B1, D9E7], [DA31, DA3F], [DA4A, DA50], [DA6B, DA70], [DA9D, DAA0], [DBA1, DBE4], [DC31, DC33], [DC35, DC68], [DC6A, DC7E], [DC91, DC91], [DCB1, DCE6], [DD41, DD76]
    ccSpace: [8441, 8441], [D931, D931], [D939, D939]
    
     
  • Zufu Liu

    Zufu Liu - 2018-08-07

    Hi Neil, what's your opinion on making a 256*2 table for DBCS to simplify the current code, and extending WordCharacterClass() to handle CJK symbols, punctuation and space?

     
  • Neil Hodgson

    Neil Hodgson - 2018-08-09

    I don't understand the "256*2". The sets of bytes for lead/tail/invalid lead/invalid tail could be 256 element tables but the punctuation and space appear to either be lists or 256**2 tables. The DBCS character sets were originally based on 94x94 tables although I don't know whether all characters are still defined in that space.

    Current cp932 code has some differences from the script as it was based on multiple sources of data due to variant definitions. As well as the old definitions from IBM, Microsoft, NEC, and Apple, it would be best to avoid any unnecessary incompatibilities with JIS X 0213.

    Table-based implementations may be OK but only where the additional space is justified - 64K or even 8K tables need to produce a strong benefit. They should also demonstrate good post-initialization performance - some setup cost may be justified when changing encoding.

     
  • Zufu Liu

    Zufu Liu - 2018-08-10

    OK, 256*2 is for lead/invalid lead/invalid tail (5 code pages and 3 attributes, so uint16_t table[256]).

    For punctuation and space I will try to find some way.

     
  • Neil Hodgson

    Neil Hodgson - 2018-08-10

    Using an array indexed by byte for character attributes may be good for speed, but it's more difficult to understand or modify when it is expressed in code as an array initialization, especially when multiple flags are encoded as a bit pattern. It'd be better to construct the table(s) at runtime from a more easily modified expression.

     
  • Zufu Liu

    Zufu Liu - 2018-08-10

    That's why I wrote the DBCS.py script above. I would prefer to modify it (keeping it in the scripts folder) to generate the table (placed into DBCS.cxx) like the other Unicode tables, so the table doesn't need to be readable.

    The table looks like:

    cp    cp936  cp932
    bit   5 4 3  2 1 0
    

    Add some macros (0, 3, etc.) in DBCS.h to denote the offset in the 16-bit attribute value for each code page.

    Add some macros (1, 2, and 4 for the three attributes) in DBCS.h to denote each attribute.

    Add an inline function in DBCS.h to get the offset for each code page.

    Add a new field like dbcsTableOffset in Document. When the code page changes, call the inline function to update dbcsTableOffset.

    Then all three functions can be implemented as:

    return ((DBCSTable[static_cast<unsigned char>(ch)] >> dbcsTableOffset) & SomeAttrMaskMacro) != 0;
    

    Related for punctuation and space https://sourceforge.net/p/scintilla/feature-requests/1056/.
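
    A minimal sketch of this packed-table scheme (all identifiers here are illustrative, not Scintilla's; the real table contents would come from the script):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed attribute table: each byte maps to a uint16_t whose
// bits carry lead/invalid-lead/invalid-tail flags, 3 bits per code page.
constexpr uint16_t attrLead        = 1;
constexpr uint16_t attrInvalidLead = 2;
constexpr uint16_t attrInvalidTail = 4;
constexpr int offsetCP932 = 0;   // bits 0..2 hold the cp932 attributes
constexpr int offsetCP936 = 3;   // bits 3..5 hold the cp936 attributes

uint16_t DBCSTable[256] = {};    // would be generated by the DBCS.py script

// Fill in two demonstration entries: 0x81 is a lead byte in both code
// pages; 0x80 is an invalid lead byte in both.
const bool demoInit = [] {
    DBCSTable[0x81] = (attrLead << offsetCP932) | (attrLead << offsetCP936);
    DBCSTable[0x80] = (attrInvalidLead << offsetCP932) |
                      (attrInvalidLead << offsetCP936);
    return true;
}();

// All three predicates reduce to one masked lookup, as described above.
bool TestAttr(unsigned char ch, int dbcsTableOffset, uint16_t attrMask) {
    return ((DBCSTable[ch] >> dbcsTableOffset) & attrMask) != 0;
}
```

    Switching code pages then only changes dbcsTableOffset; the table itself is shared.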

     

    Last edit: Zufu Liu 2018-08-10
    • Neil Hodgson

      Neil Hodgson - 2018-08-11

      That's why I wrote the DBCS.py script above. I would prefer to modify it (keeping it in the scripts folder) to generate the table (placed into DBCS.cxx) like the other Unicode tables, so the table doesn't need to be readable.

      Using a script to generate inline numeric data is a technique that should be used only rarely, when there is no other way to achieve some goal. It makes the behaviour far more difficult to understand and modify: for a start, it requires maintainers to understand 2 programming languages. It is preferable to produce readable code in the main implementation language, C++.

      A more maintainable approach would be to use code like the current IsDBCSLeadByteInvalid (or an equivalent expressed as data) but use that to produce a more efficient data structure. I'd expect the fastest data structure would be a 256-element array of char containing 0 or 1, to avoid bit twiddling, although it's possible that std::bitset<256> might be a similar speed.
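
      A sketch of that approach, using the cp932 invalid-lead ranges the script reported earlier in this thread (the shipped cp932 code differs, as noted below; names are illustrative):

```cpp
#include <array>
#include <cassert>

// Readable predicate, kept as the single source of truth. Ranges are the
// cp932 invalid lead bytes listed by the script earlier in this thread.
bool IsDBCSLeadByteInvalidCP932(unsigned char ch) {
    return ch == 0x80 || (ch >= 0x85 && ch <= 0x86) ||
           (ch >= 0xA0 && ch <= 0xDF) || (ch >= 0xEB && ch <= 0xEC) ||
           ch == 0xEF || ch >= 0xFD;
}

// Run the predicate once, when the code page changes, to fill a flat
// 256-entry table so later lookups are a plain array index.
std::array<char, 256> BuildInvalidLeadTable() {
    std::array<char, 256> table{};
    for (int b = 0; b < 256; b++)
        table[b] = IsDBCSLeadByteInvalidCP932(static_cast<unsigned char>(b)) ? 1 : 0;
    return table;
}
```

      Lookups then stay a simple invalidLeadByte[static_cast<unsigned char>(ch)] with no bit twiddling.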

      This code is on an object that changes code page rarely so there is little benefit to static tables. The static table in UniConversion helps in that case since UniConversion defines functions rather than methods.

      This is what I want to avoid, although the cast is likely difficult to remove:

      return ((DBCSTable[static_cast<unsigned char>(ch)] >> dbcsTableOffset) & SomeAttrMaskMacro) != 0;

      Preferable:

      return invalidLeadByte[static_cast<unsigned char>(ch)];
      // or
      return dbcsCharacterClassification->invalidLeadByte[static_cast<unsigned char>(ch)];
      

      I looked at extending SCI_SET*CHARS to work well for Unicode and DBCS but the large sets of characters (up to 1,114,112 for Unicode) made it appear unwieldy with clients having to generate huge strings just to make what would often be a few tweaks. That's why CategoriseCharacter was used.

      If customizing character classes over large character sets is implemented, it should be done at a higher level with ways to use default classifications and then override with a few calls. For example, calls could add to a character class based on Unicode general category or a list of characters:

      SCI_ADDPUNCTUATIONSET(SCUCC_CCCF);
      SCI_ADDPUNCTUATIONCHARACTERS("〄〰〓 ");
      
       
      • Zufu Liu

        Zufu Liu - 2018-08-11

        std::bitset is fine (96 bytes for each code page, unlike my 512-byte static array in total), except that it requires dynamic setup and more code.

        Can the new API handle the DBCS branch (my original request)? See the screenshot.
        And how about the spaces: U+3000 IDEOGRAPHIC SPACE (category Zs) and U+00AD SOFT HYPHEN (category Cf, used in Korean text)?

        By the way, IsDBCSLeadByte in SciTE's GUIGTK.cxx is incomplete.
        IsDBCSLeadByte in SciTE's GUIWin.cxx and in Scintilla's LexCaml.cxx call the Win32 IsDBCSLeadByteEx.
        And SciTE's GUIWin.cxx duplicates code from Scintilla's UniConversion.

         
        • Neil Hodgson

          Neil Hodgson - 2018-08-12

          This feature request has strayed beyond its initial purpose. While the performance impact of any behavioural change to character classification is pertinent, changing the existing DBCS routines for clarity or performance benefits should be a separate issue.

          And how about the spaces: U+3000 IDEOGRAPHIC SPACE (category Zs) and U+00AD SOFT HYPHEN (category Cf, used in Korean text)?

          Are you asking for these characters to be specially treated in Unicode or DBCS?

          SciTE's IsDBCSLeadByte is only called from lexers written in Lua, which are quite rare, and those lexers being run over DBCS files is even more unusual. Performance here is not a worry, so system calls like IsDBCSLeadByteEx are fine.

          LexCaml's IsDBCSLeadByte is never called. It's there to stop link errors in highly unusual builds. Just ignore everything inside "#ifdef BUILD_AS_EXTERNAL_LEXER".

          Scintilla and SciTE are separate projects and must only interact over Scintilla's published API. Don't try to share code between them.

           
  • Zufu Liu

    Zufu Liu - 2018-08-10

    Another concern is whether to optimize the table to handle

    IsDBCSLeadByteNoExcept(cb.CharAt(pos)) ? 2 : 1;
    

    like UTF8BytesOfLead.
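
    A sketch of such a width table, analogous to UTF8BytesOfLead (cp932 lead-byte ranges taken from the script output above; names are illustrative):

```cpp
#include <array>
#include <cassert>

// cp932 lead-byte ranges as reported by the script earlier in this thread.
bool IsLeadByteCP932(unsigned char ch) {
    return (ch >= 0x81 && ch <= 0x84) || (ch >= 0x87 && ch <= 0x9F) ||
           (ch >= 0xE0 && ch <= 0xEA) || (ch >= 0xED && ch <= 0xEE) ||
           (ch >= 0xF0 && ch <= 0xFC);
}

// Precompute the character width for every byte so the branch
//     IsDBCSLeadByteNoExcept(cb.CharAt(pos)) ? 2 : 1;
// becomes a single table lookup.
std::array<unsigned char, 256> BuildBytesOfLead() {
    std::array<unsigned char, 256> widths{};
    for (int b = 0; b < 256; b++)
        widths[b] = IsLeadByteCP932(static_cast<unsigned char>(b)) ? 2 : 1;
    return widths;
}
```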

     
  • Zufu Liu

    Zufu Liu - 2018-08-12

    Hi Neil,

    This issue was originally about the "Asian DBCS" branch: handling CJK symbols and punctuation in DBCS encodings.

    The table for DBCS is a side effect of the (buggy) script that lists the symbols and punctuation in each DBCS encoding and code page.

    As you said in https://sourceforge.net/p/scintilla/bugs/1974/, DBCS (IsDBCSLeadByte) may be tabulated in the future. So I think I can generate the table as well.

    For DBCS tabularization, we can reopen https://sourceforge.net/p/scintilla/bugs/1974/ or open a new feature request (as 1974 was originally about duplication of IsDBCSLeadByte).

     
    • Neil Hodgson

      Neil Hodgson - 2018-08-12

      While a tabular approach to byte checks could be a simple array index, this does not appear reasonable for the punctuation / space checks as they are quite sparse - eyeballing it, maybe 2% of DBCS characters are punctuation. A better implementation for this may be an ordered list of ranges with corresponding character class which can be binary searched. Alternatively, a std::map could be used from character value to character class.
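
      A sketch of the binary-searched range list (types are illustrative; the three sample ranges are from the cp932 output above):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class CharClass { word, punctuation, space };

// Sorted, disjoint inclusive ranges [first, last] with a character class.
struct ClassRange {
    unsigned int first;
    unsigned int last;
    CharClass cls;
};

// A few cp932 entries from the script output earlier in this thread.
const std::vector<ClassRange> ranges = {
    {0x8140, 0x8140, CharClass::space},
    {0x8141, 0x8151, CharClass::punctuation},
    {0x8156, 0x8156, CharClass::punctuation},
};

CharClass Classify(unsigned int ch) {
    // Find the first range whose upper endpoint is >= ch; since ranges are
    // disjoint and sorted, ch is classified iff it also reaches r.first.
    auto it = std::lower_bound(ranges.begin(), ranges.end(), ch,
        [](const ClassRange &r, unsigned int value) { return r.last < value; });
    if (it != ranges.end() && ch >= it->first)
        return it->cls;
    return CharClass::word;   // everything outside the ranges defaults to word
}
```

      Because punctuation is sparse, the range list stays small while lookups are O(log n).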

      Reopening [#1974] should only be done if the initially proposed change or a very similar modification has demonstrable benefit. For the dbcs.diff on [#1974] showing that there is no performance regression and that the changed method compiles to inline code in the same circumstances would be worthwhile. A new item should be opened if a different technique is proposed. Motivation is an important part of change requests - what benefits would the change bring.

       

      Related

      Bugs: #1974


