Menu

#1259 Optimize Document::WordCharacterClass() and CategoriseCharacter()

Committed
closed
5
2019-04-16
2019-01-23
Zufu Liu
No

Document::WordCharacterClass() and CategoriseCharacter() can be optimized by using splitbins() from https://github.com/python/cpython/blob/master/Tools/unicode/makeunicodedata.py

but split index twice instead of once, result in 3 lookup tables:

# creation
indexA, indexB, shiftA = splitbins(index)
indexC, indexD, shiftC = splitbins(indexB)

# lookup for charcacter ch
maskA = (1 << shiftA) - 1
maskC = (1 << shiftC) - 1
i = (indexA[ch >> shiftA] << shiftA) + (ch & maskA)
i = (indexC[i >> shiftC] << shiftC) + (i & maskC)
value = indexD[i]

I have made CharClassify::ClassifyCharacter() at
https://github.com/zufuliu/notepad2/blob/master/scintilla/scripts/GenerateCharacterCategory.py#L311

to replace the switch block in Document::WordCharacterClass() (9K bytes for Scintilla, 10K bytes for me, because I treat CJK word as a distinct class). Which is about 180 times fast than old switch block for looking all charterer in [128, 0x10ffff] (through not a typical usage).

For CategoriseCharacter() 22K bytes is required (currently 14K bytes and average 12 lookup).

Discussion

1 2 3 > >> (Page 1 of 3)
  • Zufu Liu

    Zufu Liu - 2019-01-23

    By the way, I think call IsXidContinue() and IsXidStart() in LexPython for code page other than UTF-8 is wild.

    The following pieces of code is valid on Windows system with code page 932 (GBK).

    #!/usr/bin/env python3
    #-*- coding: ANSI -*-
    
    def 问候(名字):
        print(名字 + ",你好")
    
    问候("Bob")
    
     
    • Neil Hodgson

      Neil Hodgson - 2019-01-24

      This was likely contributed by a project that only supports UTF-8. The controlling property is called lexer.python.unicode.identifiers so I wouldn't expect it to work with other DBCS code pages. There could be an additional check so this is only executed for UTF-8.

       
  • Zufu Liu

    Zufu Liu - 2019-01-24

    How about the splitbins proposal? If it can be considered, I can adopt GenerateCharacterCategory.py in Notepad2 to generate tables for CategoriseCharacter() (or maybe standalone table for WordCharacterClass(), with which treating CJK word character as a distinct class can be implemented).

     
  • Zufu Liu

    Zufu Liu - 2019-01-25

    I find that, by split t2 inside splitbins(), index table for CategoriseCharacter() can reduced to 17.7 KB.
    attachments are the script, and changes in CharacterCategory.cxx.
    parameter type for CategoriseCharacter() is changed to unsigned to avoid one comparison.

     
  • Neil Hodgson

    Neil Hodgson - 2019-02-08

    To consider speed, I'd prefer to run CategoriseCharacter on every character in an example document that represents the sort of thing Scintilla is used for (source code and markup, not novels) so perhaps an East Asian UTF-8 HTML document such as the contents of https://www.asahi.com/ . The new code should be compared against the current code and the current code with a lookup table for 0..0xff as that would be a much simpler change.

    This code seems a little unclear as it appears that 2 indices were jammed into a single CharacterCategoryTable1 variable. The changed code is more complex than that which it is replacing so needs to be presented as clearly as possible. This needs to be maintainable and debuggable by people trying to understand problems.

    There should be a better explanation of what splitbins does and what it depends on. It expects that its input is organised as a sequence of equal length (power-of-two) subsequences where subsequences may be similar to other subsequences.

    Generated code should be limited to data tables where possible so CategoriseCharacter should be outside the generated block and depend on the table values and some declared constants. This minimizes the chance of behaviour changing unexpectedly when a script is run and allows maintainers to more easily modify the code.

    The signature of CategoriseCharacter has changed to take an unsigned int instead of an int. StyleContext::ch and related variablkes are int. CategoriseCharacter is called with int arguments in current lexers so this is a mismatch and differs from other APIs like IsXidContinue. While unsigned int or char32_t may have benefits changing to them has wide-ranging implications so should be avoided here.

    Using ch for the Unicode input as well as 2 different indices in CategoriseCharacter does not help understanding.

    Wide lines fail with many tools so the tables should be line-wrapped at either 1, 8, 10, or 16 elements.

    Code that is not used, such as updateCharClassifyTable should not be included.

    splitbins is copied from Python's source code so is licensed under Python's license and the conditions in that license need to be followed. That probably means including copyright notices and license conditions somewhere.
    https://github.com/python/cpython/blob/master/LICENSE

     
  • Zufu Liu

    Zufu Liu - 2019-02-08

    Added test source code.

    Adding a table for [0, 0xff] would make CategoriseCharacter faster for normal (code other than plain text) usage.

    The binary search method has drawback for Greek, Cyrillic and other languages
    https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane

     
  • Zufu Liu

    Zufu Liu - 2019-02-09

    Update test, adding test for WordCharacterClass and copyright notices for splitbins.

     
  • Neil Hodgson

    Neil Hodgson - 2019-02-09

    The MSVC compilation should use /O2 instead of /Ox as /Ox is discouraged by Microsoft. Examples include MSDN

    In general, specify /O2 (Maximize Speed) instead of /Ox, and /O1 (Minimize Size) instead of /Oxs

    and from STL (Stephan T. Lavavej, maintainer of the MSVC standard library)

    Please don't use /Ox with MSVC. The compiler backend devs regard it as an ancient abomination. Instead, use /O2 (which adds optimizations that are always desirable).

    Add /GL as this allows link-time inlining and which is most likely to benefit short simple functions like "new method".

    Attached a script which gets the pages which may help others.

    The result.txt shows speeds which appear quite slow - which compiler are these with and are the results from a laptop or virtual machine? A shorter format specifier like %6.0f would make the results more readable.

    splitbins could be in a separate .py file to more clearly delimit the code copied from Python.

     
  • Zufu Liu

    Zufu Liu - 2019-02-09

    Updated, the result of GCC seems unreliable.

    The time unit is us. Run from a Win10 desktop with AMD Athlon II X4 640.

     
    • Zufu Liu

      Zufu Liu - 2019-02-09

      Updated runTestCat.bat.

       
  • Neil Hodgson

    Neil Hodgson - 2019-02-10

    The Python license is complex and appears to need the actual license agreement text to be included in derivative works, not a reference to the license:

    provided, however, that PSF's License Agreement and ... are retained in Python alone or in any derivative version prepared by Licensee.
    

    The build command for Clang on both macOS and Linux is

    clang++ -O2 -flto -std=c++17 -Wall testCat.cpp src/UniConversion.cxx lexlib/CharacterCategory.cxx -Isrc -Ilexlib -o testCat
    

    On Linux (Ubuntu 18.10 64-bit in a VM), Clang 7 appears much faster than GCC 8.2, to the extent (maximum speed-up of 8:1) that I think something is wrong. VMs are sometimes weird but Clang behaves well. The middle column here is speed-up over 'old method'.

    GCC 8.2.0
    
    time for file web_greek.html [445481 -> 438917]:
        old method       8340.55           1.000      363.86
        latin + old      4145.83           2.012      365.14
        new method       4766.56           1.750      326.24
        latin + new      4173.22           1.999      314.71
    
    Clang 7.0.0
    
    time for file web_greek.html [445481 -> 438917]:
        old method       4867.22           1.000      324.25
        latin + old       457.50          10.639      332.06
        new method        570.90           8.526      393.39
        latin + new       481.05          10.118      255.64
    
     
    • Neil Hodgson

      Neil Hodgson - 2019-02-12

      A possible explanation for the bad GCC Linux times could be that this GCC includes mitigations for the Spectre security vulnerability which can target indirect branches like calling the CategoriseCharacterSig function in TestWordClass. Making TestWordClass a template over the function may avoid this and would remove indirect function overhead from measurements including other platforms.
      Spectre

       
      • Zufu Liu

        Zufu Liu - 2019-02-12

        Seems no difference on Windows (or the template is wrong?), at least Spectre mitigation is not enabled by default (requires extra flags and libs for MSVC).

         
        • Neil Hodgson

          Neil Hodgson - 2019-02-13

          I'm seeing some large speed ups from the template version. It seems to fix GCC on Linux although basic 'new method' is still quite slow -- maybe there is more indirection.

          GCC 8.2.0
          
          time for file web_greek.html [445481 -> 438917, 1.47]:
              old method       6235.78           1.000          379.49
              latin + old       456.33          13.665          494.91
              new method       2161.84           2.884          519.86
              latin + new       400.31          15.578          352.70
          

          On MSVC 64-bit, templating seems to remove some testing overhead on my setup. The functions being tested are so short (often 3 comparisons and an array index) that replacing an indirect call with a direct call or inlining (since indirect calls can't be inlined) could make a large difference.

          Preshift:
          
          MSVC 19.16.27026.1
          
          time for file src\Editor.cxx [256856 -> 256856]:
              old method       4223.15           1.000          168.81
              latin + old       342.12          12.344          157.42
              new method        435.10           9.706          177.02
              latin + new       497.96           8.481          164.68
          
          Template
          
          MSVC 19.16.27026.1
          
          time for file src\Editor.cxx [256856 -> 256856, 0.00]:
              old method       3545.61           1.000          140.35
              latin + old       261.62          13.553          144.33
              new method        422.17           8.399          140.73
              latin + new       210.76          16.823          160.43
          
           
          • Zufu Liu

            Zufu Liu - 2019-02-14

            Result on Ubuntu 18.10 VirtualBox 6.0.4 (Win10 i5-4570 host).

            with -O2 -flto -fuse-linker-plugin -finline-functions:

            template: true, loop: 100
            GCC 8.2.0
            
            time for file src/Editor.cxx [248583 -> 248583, 0.00]:
                old method      19507.80           1.000         1838.42
                latin + old      3077.94           6.338         1928.46
                new method       1239.70          15.736         1921.52
                latin + new      2175.20           8.968         1919.02
            
            template: false, loop: 100
            GCC 8.2.0
            
            time for file src/Editor.cxx [248583 -> 248583, 0.00]:
                old method      18636.62           1.000         1942.83
                latin + old      3007.50           6.197         1925.58
                new method       1361.78          13.685         1921.28
                latin + new      2144.87           8.689         1924.81
            

            but still slow than Clang:

            template: true, loop: 100
            Clang 7.0.0
            
            time for file src/Editor.cxx [248583 -> 248583, 0.00]:
                old method      28283.76           1.000         1883.65
                latin + old      2242.81          12.611         1948.66
                new method       1199.97          23.570         1965.05
                latin + new      2268.21          12.470         2020.46
            
            template: false, loop: 100
            Clang 7.0.0
            
            time for file src/Editor.cxx [248583 -> 248583, 0.00]:
                old method      29227.12           1.000         1875.16
                latin + old      2234.05          13.083         1956.60
                new method       1234.61          23.673         2058.96
                latin + new      2259.55          12.935         2036.34
            
             
            • Neil Hodgson

              Neil Hodgson - 2019-02-14

              These results are really weird and differ greatly from my results, with an i7-6700 (Win10 host) through VirtualBox 5.2.26. I'm fairly slow at updating to new major VirtualBox versions as they can take a while to shake out bugs.

              template: true, loop: 100
              GCC 8.2.0
              
              time for file src/Editor.cxx [256856 -> 256856, 0.00]:
                  old method       2851.11           1.000          133.68
                  latin + old       321.84           8.859          134.39
                  new method        323.54           8.812          134.55
                  latin + new       205.44          13.878          136.17
              
              template: true, loop: 100
              Clang 7.0.0
              
              time for file src/Editor.cxx [256856 -> 256856, 0.00]:
                  old method       2963.94           1.000          137.99
                  latin + old       210.44          14.084          215.08
                  new method        314.81           9.415          154.85
                  latin + new       208.50          14.216          210.85
              

              Your VM results are around 10 times slower than these and also slower than your Windows native results. Benchmarks show around a 30% advantage for the i7-6700 over the i5-4570 so the processor is nowhere near enough to produce this difference.

              Your results show "new method" as around twice as fast as "latin + new" for Editor.cxx which also makes little sense to me. For Editor.cxx, every byte is ASCII so both the "latin +" cases should only run the first section of their code, indexing once into the Latin table which should always be faster than the 3 look ups and extra calculations of "new method". There is one comparison removed for "new method" over "latin + new", but that shouldn't be significant and should always be correctly predicted for Editor.cxx. About the only possibility that makes some sense is that "new method" is a preferred candidate for inlining over "latin + new".

              It would also be an idea to include the bit-width of the build in the output. Something like

                  constexpr int bits = static_cast<int>(sizeof(void *)*8);
              #if defined(__clang__)
                  printf("Clang %d.%d.%d %dbit\n", __clang_major__, __clang_minor__, __clang_patchlevel__, bits);
              #elif defined(__GNUC__)
                  printf("GCC %d.%d.%d %dbit\n", __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__, bits);
              #elif defined(_MSC_VER)
                  printf("MSVC %2d.%2d.%5d.%d %dbit\n", (_MSC_VER / 100), (_MSC_VER % 100), (_MSC_FULL_VER % 100000), _MSC_BUILD, bits);
              #endif
              
               
              • Zufu Liu

                Zufu Liu - 2019-02-15

                added bits.

                 
              • Neil Hodgson

                Neil Hodgson - 2019-02-15

                Updated to VirtualBox 6.0.4 with no significant change to results. Here is my VirtualBox configuration. You seem to be missing PAE/NX although I wouldn't expect them to be a problem. Make sure you have installed the current version of the Guest Additions.

                 
                • Zufu Liu

                  Zufu Liu - 2019-02-15

                  Host (desktop) and VM (with PAE/NX enabled) is update to date. output by vbox-greeter:

                  ubuntu@Ubuntu:~$ vbox-greeter 
                  00:00:00.002089 main     vbox-greeter 6.0.4 r128413 (verbosity: 0) linux.amd64 (Jan 25 2019 18:34:19) release log
                  00:00:00.002098 main     Log opened 2019-02-15T04:56:57.364018000Z
                  00:00:00.002153 main     OS Product: Linux
                  00:00:00.002161 main     OS Release: 4.18.0-15-generic
                  00:00:00.002169 main     OS Version: #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019
                  00:00:00.002177 main     OS Service Pack: #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019
                  00:00:00.002186 main     Executable: /opt/VBoxGuestAdditions-6.0.4/sbin/vbox-greeter
                  00:00:00.002189 main     Process ID: 2435
                  00:00:00.002191 main     Package type: LINUX_64BITS_GENERIC
                  
                  ** (process:2435): WARNING **: 12:56:57.549: No LIGHTDM_TO_SERVER_FD environment variable
                  00:00:00.187481 main     vbox-lightdm-greeter: error: unable to connect to LightDM server, aborting
                  ubuntu@Ubuntu:~$ 
                  

                  attached the detail compiler output (./runTestCat.sh -v and ./runTestCat.sh clang -v )

                   
                  • Zufu Liu

                    Zufu Liu - 2019-02-15

                    Clang enabled -vectorize-loops -vectorize-slp on O2, enable -ftree-vectorize for GCC seems not affect speed on my VM and host.

                     
  • Zufu Liu

    Zufu Liu - 2019-02-11

    Update the test, changing + to |.
    result on i5-4700 Win10 x64. Clang 9.0.0 SVN r351376 from https://llvm.org/builds/

    Clang 9.0.0
    
    time for file web_greek.html [442946 -> 437550]:
        old method       5616.25           1.000          311.81
        latin + old       353.82          15.873          321.97
        new method        695.00           8.081          257.70
        latin + new       344.36          16.309          264.38
    
    MSVC 19.16.27026.1
    
    time for file web_greek.html [442946 -> 437550]:
        old method       7343.81           1.000          356.22
        latin + old       954.68           7.692          346.85
        new method       1106.87           6.635          270.44
        latin + new       877.17           8.372          272.02
    
     
  • Zufu Liu

    Zufu Liu - 2019-02-11

    With pre-shift for catTable2.

    Clang 9.0.0
    
    time for file web_ru.html [178720 -> 156321]:
        old method       2026.91           1.000          340.50
        latin + old       377.05           5.376          384.73
        new method        229.27           8.841          157.73
        latin + new       192.82          10.512          164.64
    
    time for file web_greek.html [442946 -> 437550]:
        old method       5602.68           1.000          305.05
        latin + old       353.18          15.863          315.69
        new method        639.62           8.759          259.62
        latin + new       382.85          14.634          264.72
    
    MSVC 19.16.27026.1
    
    time for file web_ru.html [178720 -> 156321]:
        old method       2781.68           1.000          479.67
        latin + old       646.47           4.303          506.95
        new method        355.24           7.831          183.96
        latin + new       363.52           7.652          200.21
    
    time for file web_greek.html [442946 -> 437550]:
        old method       7421.27           1.000          358.16
        latin + old       945.82           7.846          350.07
        new method       1016.68           7.299          274.71
        latin + new       891.86           8.321          274.06
    
     
  • Zufu Liu

    Zufu Liu - 2019-02-11

    pre shift catTabel1 and catTable2.
    result on MacBookPro (i5, Late 2011) using Clang 10 (Xcode 10.1).

    Clang 10.0.0
    
    time for file web_ru.html [178630 -> 156315]:
        old method       8289.49           1.000         1432.08
        latin + old      1374.64           6.030         1319.16
        new method        343.18          24.155          215.30
        latin + new       299.20          27.705          212.43
    
    time for file web_greek.html [442544 -> 437445]:
        old method      24183.10           1.000          728.91
        latin + old       899.30          26.891          680.42
        new method       1189.88          20.324          386.65
        latin + new       698.13          34.640          352.78
    
     
  • Zufu Liu

    Zufu Liu - 2019-02-11

    correct web_cn.html

    wget --no-verbose --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36" https://www.163.com/ --output-document=web_cn_gbk.html
    iconv -f gbk -t utf-8 web_cn_gbk.html > web_cn.html
    
     
  • Zufu Liu

    Zufu Liu - 2019-02-11

    Update test, it's seems when non-ASCII characters goes to 8% and above, new method will outperform latin+old.

     
1 2 3 > >> (Page 1 of 3)

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.