Extend CharClassify::cc Document::WordCharacterClass(unsigned int ch) const to handle CJK symbols and punctuation, instead of returning CharClassify::ccWord for all DBCS characters.
Some links:
https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation
https://www.unicode.org/charts/PDF/U3000.pdf
Document::WordCharacterClass does not return CharClassify::ccWord for all CJK characters when used in a UTF-8 document. For example, U+3004 〄 (JAPANESE INDUSTRIAL STANDARD SYMBOL) and U+3010 【 (LEFT BLACK LENTICULAR BRACKET) return CharClassify::ccPunctuation. This is based on the general category of the character.
Character classification for other encodings like Big5 is difficult as it would require large tables of classification data for each encoding or conversion to Unicode. Conversion to Unicode is not as simple as it may seem due to historical incompatibilities particularly in Shift-JIS.
Just handling some commonly used punctuation from the U+3000 block and the fullwidth forms from the U+FF00 block (without converting to Unicode?) may not be hard.
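As a hedged illustration (not Scintilla's actual code), such range checks could look like the following. The ranges shown cover only the most common sub-blocks, and the enum and function names are invented for this sketch:

```cpp
#include <cassert>

// Sketch only: classify a few commonly used characters from the
// U+3000 CJK Symbols and Punctuation block and the U+FF00 Halfwidth
// and Fullwidth Forms block by simple range checks, with no tables
// and no conversion. Names here are hypothetical.
enum class CC { word, punctuation, space };

CC ClassifyCJK(unsigned int ch) {
	if (ch == 0x3000)                      // IDEOGRAPHIC SPACE
		return CC::space;
	if (ch >= 0x3001 && ch <= 0x303F)      // CJK symbols and punctuation
		return CC::punctuation;
	if (ch >= 0xFF01 && ch <= 0xFF0F)      // fullwidth ! through /
		return CC::punctuation;
	if (ch >= 0xFF1A && ch <= 0xFF20)      // fullwidth : through @
		return CC::punctuation;
	if (ch >= 0xFF3B && ch <= 0xFF40)      // fullwidth [ through `
		return CC::punctuation;
	if (ch >= 0xFF5B && ch <= 0xFF65)      // fullwidth braces, halfwidth 。「」、・
		return CC::punctuation;
	return CC::word;                       // digits, letters, kana, ideographs
}
```

Note that the fullwidth digits (U+FF10..U+FF19) and letters (U+FF21..U+FF3A, U+FF41..U+FF5A) fall between the punctuation runs and stay ccWord.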
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
The hard part, as you said, is to read each encoding's documentation and find the correct decoding / conversion.
Hi Neil, I found there is no need to convert back to Unicode or to add a table/map: punctuation is encoded contiguously in DBCS encodings.
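For example, in Shift-JIS the lead byte 0x81 covers JIS X 0208 rows 1 and 2, which contain only symbols and punctuation, so a punctuation test can be a contiguous range check on the two bytes rather than a table lookup or a conversion to Unicode. A sketch, assuming every defined character in those rows should count as punctuation (the function name is invented here):

```cpp
#include <cassert>

// Hypothetical sketch: Shift-JIS lead byte 0x81 maps to JIS X 0208
// rows 1-2, which hold only symbols and punctuation. Valid Shift-JIS
// trail bytes are 0x40..0x7E and 0x80..0xFC.
bool IsShiftJISPunctuation(unsigned char lead, unsigned char trail) {
	const bool validTrail = trail >= 0x40 && trail <= 0xFC && trail != 0x7F;
	return lead == 0x81 && validTrail;
}
```

For instance 0x81 0x41 is the ideographic comma 、 while 0x82 0xA0 is the hiragana あ, which is a word character.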
The attached script can be used to generate these punctuation lists, lead bytes, invalid lead bytes and invalid trail bytes.
By the way, this script can be modified to generate a uint16_t table[256] (as previously discussed) for DBCSIsLeadByte() / IsDBCSLeadByteNoExcept(), IsDBCSLeadByteInvalid() and IsDBCSTrailByteInvalid() (the 5*3 attributes).
Last edit: Zufu Liu 2018-08-07
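A minimal sketch of that packed layout, assuming hypothetical names and bit assignments: 3 attribute bits per code page, 5 code pages, so 15 of the 16 bits of each entry are used:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the packed-attribute idea (names and bit positions are
// assumptions, not Scintilla code): each of the 256 byte values gets a
// uint16_t whose 3-bit groups hold the lead / invalid-lead /
// invalid-trail flags for one DBCS code page.
enum DBCSAttr { attrLead = 1, attrInvalidLead = 2, attrInvalidTrail = 4 };

constexpr int OffsetForCodePage(int codePage) {
	switch (codePage) {
	case 932:  return 0;   // Shift-JIS
	case 936:  return 3;   // GBK
	case 949:  return 6;   // UHC
	case 950:  return 9;   // Big5
	case 1361: return 12;  // Johab
	default:   return -1;  // not a DBCS code page
	}
}

bool HasAttr(const uint16_t table[256], int codePage, unsigned char b, DBCSAttr attr) {
	const int offset = OffsetForCodePage(codePage);
	return offset >= 0 && ((table[b] >> offset) & attr) != 0;
}
```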
Script updated: PUA is ignored, and the result is printed to a file and is more readable.
DBCSIsLeadByte() and IsDBCSLeadByteNoExcept() conflict with IsDBCSLeadByteInvalid().
Last edit: Zufu Liu 2018-08-07
Script updated to display ranges of symbol and punctuation.
The result for cp932:
For cp936:
For cp949:
For cp950:
For cp1361:
Hi Neil, what's your opinion on making a 256*2 table for DBCS to simplify the current code, and extending WordCharacterClass() to handle CJK symbols, punctuation and space?
I don't understand the "256*2". The sets of bytes for lead/trail/invalid lead/invalid trail could be 256-element tables, but the punctuation and space appear to be either lists or 256**2 tables. The DBCS character sets were originally based on 94x94 tables, although I don't know whether all characters are still defined in that space.
Current cp932 code has some differences from the script as it was based on multiple sources of data due to variant definitions. As well as the old definitions from IBM, Microsoft, NEC, and Apple, it would be best to avoid any unnecessary incompatibilities with JIS X 0213.
Table-based implementations may be OK but only where the additional space is justified - 64K or even 8K tables need to produce a strong benefit. They should also demonstrate good post-initialization performance - some setup cost may be justified when changing encoding.
OK, 256*2 is for lead/invalid lead/invalid tail (5 code pages and 3 attributes, so uint16_t table[256]).
For punctuation and space I will try to find some way.
Using an array indexed by byte for character attributes may be good for speed, but it's more difficult to understand or modify when it is expressed in code as an array initialization, especially when multiple flags are encoded as a bit pattern. It would be better to construct the table(s) at runtime from a more easily modified expression.
That's why I wrote the DBCS.py script above. I would prefer to modify it (keeping it in the scripts folder) to generate the table (put into DBCS.cxx) like the other Unicode tables, so the table does not need to be readable.
The table looks like:
Add some macros (0, 3, etc.) in DBCS.h to denote the offsets in the 16-bit attribute value for each code page.
Add some macros (1, 2, and 4 for the three attributes) in DBCS.h to denote each attribute.
Add an inline function in DBCS.h to get the offset for each code page.
Add a new field like dbcsTableOffset in Document. When the code page changes, call the inline function to update dbcsTableOffset. Then all three functions can be implemented as:
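A hedged sketch of how the three byte checks might collapse into a single shifted table lookup under this scheme; the struct, member names and table contents here are all hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Assumed attribute bits, matching the 3-attributes-per-code-page idea.
enum DBCSAttr { attrLead = 1, attrInvalidLead = 2, attrInvalidTrail = 4 };

// Stand-in for Document: dbcsTableOffset is refreshed whenever the
// code page changes, so each check is one shift and one mask.
struct DocumentSketch {
	const uint16_t *dbcsTable = nullptr;
	int dbcsTableOffset = 0;

	bool IsDBCSLeadByteNoExcept(char ch) const noexcept {
		const unsigned char uch = ch;
		return ((dbcsTable[uch] >> dbcsTableOffset) & attrLead) != 0;
	}
	bool IsDBCSLeadByteInvalid(char ch) const noexcept {
		const unsigned char uch = ch;
		return ((dbcsTable[uch] >> dbcsTableOffset) & attrInvalidLead) != 0;
	}
	bool IsDBCSTrailByteInvalid(char ch) const noexcept {
		const unsigned char uch = ch;
		return ((dbcsTable[uch] >> dbcsTableOffset) & attrInvalidTrail) != 0;
	}
};
```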
Related for punctuation and space https://sourceforge.net/p/scintilla/feature-requests/1056/.
Last edit: Zufu Liu 2018-08-10
Using a script to generate inline numeric data is a technique that should only be used rarely, when there is no other way to achieve some goal. It makes it far more difficult to understand and modify the behaviour: for a start, it requires maintainers to understand two programming languages. It is preferable to produce readable code in the main implementation language, C++.
A more maintainable approach would be to use code like the current IsDBCSLeadByteInvalid (or an equivalent expressed as data) but use it to produce a more efficient data structure. I'd expect the fastest data structure to be a 256-element array of char containing 0 or 1, avoiding bit twiddling, although it's possible that std::bitset<256> might be a similar speed.
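A sketch of that construction: keep a readable expression-based check and run it once, when the code page changes, to fill a fast 256-entry table. The ranges below are illustrative only, loosely modelled on code page 932, not an authoritative definition:

```cpp
#include <array>
#include <cassert>

// Readable expression (illustrative values for code page 932): these
// byte values are assumed invalid as DBCS lead bytes for the sketch.
constexpr bool IsInvalidLeadByte932(unsigned char uch) {
	return uch == 0x80 || uch == 0xA0 || uch >= 0xFD;
}

// Build the fast lookup table once from the readable expression, so
// the per-byte check is a plain array index with no bit twiddling.
std::array<char, 256> MakeInvalidLeadTable() {
	std::array<char, 256> table{};
	for (int b = 0; b < 256; b++)
		table[b] = IsInvalidLeadByte932(static_cast<unsigned char>(b)) ? 1 : 0;
	return table;
}
```

The expression stays the single source of truth that maintainers edit; the table is a derived cache rebuilt at code-page-change time.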
This code is on an object that changes code page rarely so there is little benefit to static tables. The static table in UniConversion helps in that case since UniConversion defines functions rather than methods.
This is what I want to avoid, although the cast is likely difficult to remove:
Preferable:
I looked at extending SCI_SET*CHARS to work well for Unicode and DBCS but the large sets of characters (up to 1,114,112 for Unicode) made it appear unwieldy with clients having to generate huge strings just to make what would often be a few tweaks. That's why CategoriseCharacter was used.
If customizing character classes over large character sets is implemented, it should be done at a higher level with ways to use default classifications and then override with a few calls. For example, calls could add to a character class based on Unicode general category or a list of characters:
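A purely hypothetical sketch of what such an override API might look like; none of these names exist in Scintilla, and a real version would also fall back to the default classification rather than a bare set lookup:

```cpp
#include <cassert>
#include <initializer_list>
#include <set>

// Invented API sketch: start from default classifications, then
// override a character class with a few calls listing individual
// characters (a general-category overload could work similarly).
class CharClassOverrides {
	std::set<unsigned int> wordExtra;
public:
	void AddCharsToWordClass(std::initializer_list<unsigned int> chars) {
		wordExtra.insert(chars.begin(), chars.end());
	}
	bool IsWord(unsigned int ch) const {
		// Real code would consult the default classification here.
		return wordExtra.count(ch) != 0;
	}
};
```

The point is that a client makes a handful of override calls instead of shipping a string covering up to 1,114,112 code points.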
std::bitset is fine (96 bytes for each code page, unlike my 512-byte static array in total), except it requires dynamic setup, and more code is required.
Can the new API handle the DBCS branch (my original request)? See the screenshot.
And how about the spaces: U+3000 IDEOGRAPHIC SPACE (category Zs) and U+00AD SOFT HYPHEN (category Cf, used in Korean text)?
By the way, IsDBCSLeadByte in SciTE's GUIGTK.cxx is incomplete. IsDBCSLeadByte in SciTE's GUIWin.cxx and in Scintilla's LexCaml.cxx calls Win32's IsDBCSLeadByteEx. And SciTE's GUIWin.cxx contains code duplicated from Scintilla's UniConversion.
This feature request has strayed beyond its initial purpose. While the performance impact of any behavioural change to character classification is pertinent, changing the existing DBCS routines for clarity or performance benefits should be a separate issue.
Are you asking for these characters to be specially treated in Unicode or DBCS?
SciTE's IsDBCSLeadByte is only called from lexers written in Lua which are quite rare with those lexers run over DBCS files even more unusual. Performance here is not a worry so system calls like IsDBCSLeadByteEx are fine.
LexCaml's IsDBCSLeadByte is never called. It's there to stop link errors in highly unusual builds. Just ignore everything inside "#ifdef BUILD_AS_EXTERNAL_LEXER".
Scintilla and SciTE are separate projects and must only interact over Scintilla's published API. Don't try to share code between them.
Another concern is whether to optimize the table to handle …, like UTF8BytesOfLead.
Hi Neil,
This issue was originally about the "Asian DBCS" branch: handling CJK symbols and punctuation in DBCS encodings.
The table for DBCS is a side effect of the (buggy) script that lists symbols and punctuation in each DBCS encoding and code page.
As you said in https://sourceforge.net/p/scintilla/bugs/1974/, DBCS (IsDBCSLeadByte) may be tabulated in the future, so I think I can generate the table as well.
For DBCS tabularization, we can reopen https://sourceforge.net/p/scintilla/bugs/1974/ or make a new feature request (as 1974 was originally about duplication of IsDBCSLeadByte).
While a tabular approach to byte checks could be a simple array index, this does not appear reasonable for the punctuation / space checks as they are quite sparse - eyeballing it, maybe 2% of DBCS characters are punctuation. A better implementation for this may be an ordered list of ranges with corresponding character class which can be binary searched. Alternatively, a std::map could be used from character value to character class.
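A sketch of the ordered-range variant: ranges sorted by first code point and binary searched with the standard library. The values below are illustrative, not a real classification table:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// An ordered, non-overlapping list of [first, last] code point ranges,
// each mapped to a character class, binary searched per lookup.
struct ClassRange {
	unsigned int first;
	unsigned int last;
	int cc;   // character class for the range
};

int ClassFromRanges(const std::vector<ClassRange> &ranges, unsigned int ch, int ccDefault) {
	// Find the last range whose first code point is <= ch.
	auto it = std::upper_bound(ranges.begin(), ranges.end(), ch,
		[](unsigned int c, const ClassRange &r) { return c < r.first; });
	if (it != ranges.begin()) {
		--it;
		if (ch <= it->last)
			return it->cc;
	}
	return ccDefault;   // sparse: most characters fall between ranges
}
```

Since only a few percent of DBCS characters are punctuation, a short sorted vector like this stays small while keeping lookups O(log n).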
Reopening [#1974] should only be done if the initially proposed change or a very similar modification has demonstrable benefit. For the dbcs.diff on [#1974], showing that there is no performance regression and that the changed method compiles to inline code in the same circumstances would be worthwhile. A new item should be opened if a different technique is proposed. Motivation is an important part of change requests: what benefits would the change bring?
Related Bugs: [#1974]
I added a DBCSCharClassify class at https://github.com/zufuliu/notepad2/blob/master/scintilla/src/CharClassify.h#L40
with a 6.4K-byte (4.3K for standard Scintilla) character classification table for the five code pages (combined with Shift_JIS, Shift_JIS_2004, Shift_JISX0213, GBK, BIG5 and BIG5HKSCS).
https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCharacterCategory.py#L555
The DBCSIsLeadByte call in Surface's MeasureWidths() method is also replaced with a DBCSCharClassify instance.
Last edit: Zufu Liu 2019-01-31