
#612 Various Unicode-related speedups/robustness

closed-fixed
5
2012-01-22
2012-01-14
No

Some ideas for improvements, noted while working
on Unicode 6.1 support.
- The macro GetUniCharInfo() reads a byte from
pageMap and immediately shifts it left by
OFFSET_BITS. We could just as well store the
pre-shifted value in pageMap, which saves a bit
shift for each character handled.
- The macro GetDelta() is complicated because it
compensates for compiler differences in handling
sign extension. If we change the group[] array
to use Bits 8-31 for the case delta, then we have
enough bits for all possible Unicode characters,
so sign extension can be compensated for by
applying a mask of 0x1fffff or (when handling only
the Basic Plane) by casting to Tcl_UniChar.
- The tool uniParse.tcl makes assumptions
about the content of UnicodeData.txt, but
doesn't check those assumptions. Future
Unicode versions might need more
categories or case types, in which case the
tool should warn us.
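
The first two ideas can be sketched in C as follows. This is a minimal illustration only, not the code in tclUniData.c or tclUtf.c: the table contents, the 16-code-point toy alphabet, OFFSET_BITS = 2, and the names MAKE_INFO, pageMapRaw, pageMapShifted, GetInfoShiftEachTime, GetInfoPreShifted and ApplyDelta are all assumptions made up for the example. It shows that a pre-shifted page table gives the same lookup result without the per-character shift, and that a delta kept in bits 8-31 can be applied with a 0x1fffff mask regardless of how the compiler would sign-extend.

```c
#include <stdint.h>

/* Toy parameters, NOT the real Tcl values: 4 code points per page. */
#define OFFSET_BITS 2
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

/* Hypothetical helper: case delta in bits 8-31, flags in bits 0-7.
 * Multiplication is used because left-shifting a negative int is undefined. */
#define MAKE_INFO(delta, flags) ((int32_t)((delta) * 256 + (flags)))

static const int32_t groups[] = {
    MAKE_INFO(0, 0),    /* group 0: no case mapping */
    MAKE_INFO(8, 1),    /* group 1: "upper", partner is 8 higher */
    MAKE_INFO(-8, 2)    /* group 2: "lower", partner is 8 lower */
};

/* Three distinct blocks of 4 entries each. */
static const unsigned char groupMap[] = {
    0, 0, 0, 0,   1, 1, 1, 1,   2, 2, 2, 2
};

/* Page -> block number; must be shifted on every lookup. */
static const unsigned char pageMapRaw[] = { 0, 1, 0, 2 };

/* Page -> block number already multiplied by the block size. */
static const unsigned short pageMapShifted[] = { 0, 4, 0, 8 };

/* Lookup with a shift per character (ch must be below 16 here). */
static int32_t GetInfoShiftEachTime(uint32_t ch)
{
    return groups[groupMap[(pageMapRaw[ch >> OFFSET_BITS] << OFFSET_BITS)
                           | (ch & OFFSET_MASK)]];
}

/* Same lookup, but the shift was done when the table was generated. */
static int32_t GetInfoPreShifted(uint32_t ch)
{
    return groups[groupMap[pageMapShifted[ch >> OFFSET_BITS]
                           | (ch & OFFSET_MASK)]];
}

/* Apply the case delta. The unsigned shift leaves the delta as a 24-bit
 * two's complement value; the mask discards the carry after the addition,
 * so no compiler-specific sign extension is involved. */
static uint32_t ApplyDelta(uint32_t ch, int32_t info)
{
    uint32_t delta = (uint32_t)info >> 8;
    return (ch + delta) & 0x1fffffu;
}
```

In this toy layout, code points 4-7 map to 12-15 and back, and every other code point maps to itself. The pre-shifted table needs a wider element type (unsigned short here), which is the price of dropping the shift.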

In order to make Unicode-related table
merges between the core branches
possible, this is meant for all open core
branches.

Discussion

  • Jan Nijtmans

    Jan Nijtmans - 2012-01-14

    implemented in branch rfe-3473670

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-01-14

    Needless to say: all tests pass with this change

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2012-01-14

    Did you test on both BE and LE machines?

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-01-15

    >Did you test on both BE and LE machines?

    Good question! No, I only tested on Windows and Ubuntu.
    However, the affected macros and tables are only used for
    Unicode categories (like "string is control") or Unicode case
    handling (like "string tolower"); they are not used in the Unicode
    encoding or anything else that might be endian-dependent.
    So I am 100% confident that it will work on both LE and
    BE platforms. The affected functions in tclUtf.c all use
    Tcl_UniChar or int as internal data types, never (unsigned) char.

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-01-22

    Updated the tools such that characters > 0xffff are
    handled as well, with #if's around it such that in
    the Basic Plane case everything is as before.

    This - again - allows for the same Unicode tables
    to be used in all branches, which simplifies the
    merging across branches when a new Unicode
    version comes out.
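
A rough sketch of how one set of tables could serve both configurations through conditional compilation. The switch SKETCH_FULL_UNICODE, the type SketchUniChar, and the names SKETCH_WRAP and SketchAddDelta are hypothetical and chosen only for this illustration; the actual preprocessor conditions used in the Tcl sources are not shown in this ticket.

```c
#include <stdint.h>

/* Hypothetical build switch: 0 = Basic Plane only, 1 = all planes. */
#ifndef SKETCH_FULL_UNICODE
#define SKETCH_FULL_UNICODE 0
#endif

#if SKETCH_FULL_UNICODE
/* All planes: keep 21 bits, enough for every Unicode code point. */
typedef uint32_t SketchUniChar;
#define SKETCH_WRAP(x) ((SketchUniChar)((x) & 0x1fffffu))
#else
/* Basic Plane only: the cast to a 16-bit type drops the upper bits. */
typedef uint16_t SketchUniChar;
#define SKETCH_WRAP(x) ((SketchUniChar)(x))
#endif

/* Add a case delta given in two's complement form and wrap the result
 * to the range of the configured character type. */
static SketchUniChar SketchAddDelta(uint32_t ch, uint32_t delta)
{
    return SKETCH_WRAP(ch + delta);
}
```

For Basic Plane characters both configurations give the same answer, for example SketchAddDelta(0x61, 0xFFFFE0) yields 0x41 either way, so the generated tables themselves do not need to differ between branches.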

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-01-22
    • status: open --> closed-fixed
     
  • Jan Nijtmans

    Jan Nijtmans - 2012-01-22

    Applied to core-8-4-branch, core-8-5-branch and trunk.