Scintilla / Feature Requests / #1259 Optimize Document::WordCharacterClass() and CategoriseCharacter()

Zufu Liu - 2019-01-23

By the way, I think call IsXidContinue() and IsXidStart() in LexPython for code page other than UTF-8 is wild.

The following pieces of code is valid on Windows system with code page 932 (GBK).

#!/usr/bin/env python3 #-*- coding: ANSI -*- def 问候(名字): print(名字 + "，你好") 问候("Bob")
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-01-24
  
  This was likely contributed by a project that only supports UTF-8. The controlling property is called lexer.python.unicode.identifiers so I wouldn't expect it to work with other DBCS code pages. There could be an additional check so this is only executed for UTF-8.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-01-24

How about the splitbins proposal? If it can be considered, I can adopt GenerateCharacterCategory.py in Notepad2 to generate tables for CategoriseCharacter() (or maybe standalone table for WordCharacterClass(), with which treating CJK word character as a distinct class can be implemented).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-01-25

I find that, by split t2 inside splitbins(), index table for CategoriseCharacter() can reduced to 17.7 KB.
attachments are the script, and changes in CharacterCategory.cxx.
parameter type for CategoriseCharacter() is changed to unsigned to avoid one comparison.

CharacterCategory.cxx

CharacterCategory.h

GenerateCharacterCategory.py

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-02-08

To consider speed, I'd prefer to run CategoriseCharacter on every character in an example document that represents the sort of thing Scintilla is used for (source code and markup, not novels) so perhaps an East Asian UTF-8 HTML document such as the contents of https://www.asahi.com/ . The new code should be compared against the current code and the current code with a lookup table for 0..0xff as that would be a much simpler change.

This code seems a little unclear as it appears that 2 indices were jammed into a single CharacterCategoryTable1 variable. The changed code is more complex than that which it is replacing so needs to be presented as clearly as possible. This needs to be maintainable and debuggable by people trying to understand problems.

There should be a better explanation of what splitbins does and what it depends on. It expects that its input is organised as a sequence of equal length (power-of-two) subsequences where subsequences may be similar to other subsequences.

Generated code should be limited to data tables where possible so CategoriseCharacter should be outside the generated block and depend on the table values and some declared constants. This minimizes the chance of behaviour changing unexpectedly when a script is run and allows maintainers to more easily modify the code.

The signature of CategoriseCharacter has changed to take an unsigned int instead of an int. StyleContext::ch and related variablkes are int. CategoriseCharacter is called with int arguments in current lexers so this is a mismatch and differs from other APIs like IsXidContinue. While unsigned int or char32_t may have benefits changing to them has wide-ranging implications so should be avoided here.

Using ch for the Unicode input as well as 2 different indices in CategoriseCharacter does not help understanding.

Wide lines fail with many tools so the tables should be line-wrapped at either 1, 8, 10, or 16 elements.

Code that is not used, such as updateCharClassifyTable should not be included.

splitbins is copied from Python's source code so is licensed under Python's license and the conditions in that license need to be followed. That probably means including copyright notices and license conditions somewhere.
https://github.com/python/cpython/blob/master/LICENSE

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-02-08

Added test source code.

Adding a table for [0, 0xff] would make CategoriseCharacter faster for normal (code other than plain text) usage.

The binary search method has drawback for Greek, Cyrillic and other languages
https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane

test-cat.zip

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-02-09

Update test, adding test for WordCharacterClass and copyright notices for splitbins.

test-cat2.zip

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-02-09

The MSVC compilation should use /O2 instead of /Ox as /Ox is discouraged by Microsoft. Examples include MSDN

In general, specify /O2 (Maximize Speed) instead of /Ox, and /O1 (Minimize Size) instead of /Oxs

and from STL (Stephan T. Lavavej, maintainer of the MSVC standard library)

Please don't use /Ox with MSVC. The compiler backend devs regard it as an ancient abomination. Instead, use /O2 (which adds optimizations that are always desirable).

Add /GL as this allows link-time inlining and which is most likely to benefit short simple functions like "new method".

Attached a script which gets the pages which may help others.

The result.txt shows speeds which appear quite slow - which compiler are these with and are the results from a laptop or virtual machine? A shorter format specifier like %6.0f would make the results more readable.

splitbins could be in a separate .py file to more clearly delimit the code copied from Python.

GrabPages.bat

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-02-09

Updated, the result of GCC seems unreliable.

The time unit is us. Run from a Win10 desktop with AMD Athlon II X4 640.

test-cat3.zip

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2019-02-09
  
  Updated runTestCat.bat.
  
  runTestCat.bat
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

The Python license is complex and appears to need the actual license agreement text to be included in derivative works, not a reference to the license:

provided, however, that PSF's License Agreement and ... are retained in Python alone or in any derivative version prepared by Licensee.

The build command for Clang on both macOS and Linux is

clang++ -O2 -flto -std=c++17 -Wall testCat.cpp src/UniConversion.cxx lexlib/CharacterCategory.cxx -Isrc -Ilexlib -o testCat

On Linux (Ubuntu 18.10 64-bit in a VM), Clang 7 appears much faster than GCC 8.2, to the extent (maximum speed-up of 8:1) that I think something is wrong. VMs are sometimes weird but Clang behaves well. The middle column here is speed-up over 'old method'.

GCC 8.2.0

time for file web_greek.html [445481 -> 438917]:
    old method       8340.55           1.000      363.86
    latin + old      4145.83           2.012      365.14
    new method       4766.56           1.750      326.24
    latin + new      4173.22           1.999      314.71


Clang 7.0.0

time for file web_greek.html [445481 -> 438917]:
    old method       4867.22           1.000      324.25
    latin + old       457.50          10.639      332.06
    new method        570.90           8.526      393.39
    latin + new       481.05          10.118      255.64

Neil Hodgson - 2019-02-12

A possible explanation for the bad GCC Linux times could be that this GCC includes mitigations for the Spectre security vulnerability which can target indirect branches like calling the CategoriseCharacterSig function in TestWordClass. Making TestWordClass a template over the function may avoid this and would remove indirect function overhead from measurements including other platforms.
Spectre

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2019-02-12
  
  Seems no difference on Windows (or the template is wrong?), at least Spectre mitigation is not enabled by default (requires extra flags and libs for MSVC).
  
  testCat.cpp
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2019-02-13
    
    I'm seeing some large speed ups from the template version. It seems to fix GCC on Linux although basic 'new method' is still quite slow -- maybe there is more indirection.
    
    GCC 8.2.0 time for file web_greek.html [445481 -> 438917, 1.47]: old method 6235.78 1.000 379.49 latin + old 456.33 13.665 494.91 new method 2161.84 2.884 519.86 latin + new 400.31 15.578 352.70
    
    On MSVC 64-bit, templating seems to remove some testing overhead on my setup. The functions being tested are so short (often 3 comparisons and an array index) that replacing an indirect call with a direct call or inlining (since indirect calls can't be inlined) could make a large difference.
    
    Preshift: MSVC 19.16.27026.1 time for file src\Editor.cxx [256856 -> 256856]: old method 4223.15 1.000 168.81 latin + old 342.12 12.344 157.42 new method 435.10 9.706 177.02 latin + new 497.96 8.481 164.68 Template MSVC 19.16.27026.1 time for file src\Editor.cxx [256856 -> 256856, 0.00]: old method 3545.61 1.000 140.35 latin + old 261.62 13.553 144.33 new method 422.17 8.399 140.73 latin + new 210.76 16.823 160.43
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Zufu Liu - 2019-02-14
      
      Result on Ubuntu 18.10 VirtualBox 6.0.4 (Win10 i5-4570 host).
      
      with -O2 -flto -fuse-linker-plugin -finline-functions:
      
      template: true, loop: 100 GCC 8.2.0 time for file src/Editor.cxx [248583 -> 248583, 0.00]: old method 19507.80 1.000 1838.42 latin + old 3077.94 6.338 1928.46 new method 1239.70 15.736 1921.52 latin + new 2175.20 8.968 1919.02 template: false, loop: 100 GCC 8.2.0 time for file src/Editor.cxx [248583 -> 248583, 0.00]: old method 18636.62 1.000 1942.83 latin + old 3007.50 6.197 1925.58 new method 1361.78 13.685 1921.28 latin + new 2144.87 8.689 1924.81
      
      but still slow than Clang:
      
      template: true, loop: 100 Clang 7.0.0 time for file src/Editor.cxx [248583 -> 248583, 0.00]: old method 28283.76 1.000 1883.65 latin + old 2242.81 12.611 1948.66 new method 1199.97 23.570 1965.05 latin + new 2268.21 12.470 2020.46 template: false, loop: 100 Clang 7.0.0 time for file src/Editor.cxx [248583 -> 248583, 0.00]: old method 29227.12 1.000 1875.16 latin + old 2234.05 13.083 1956.60 new method 1234.61 23.673 2058.96 latin + new 2259.55 12.935 2036.34
      
      test-cat-0214.zip
      
      vm-info.png
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Neil Hodgson - 2019-02-14
        
        These results are really weird and differ greatly from my results, with an i7-6700 (Win10 host) through VirtualBox 5.2.26. I'm fairly slow at updating to new major VirtualBox versions as they can take a while to shake out bugs.
        
        template: true, loop: 100 GCC 8.2.0 time for file src/Editor.cxx [256856 -> 256856, 0.00]: old method 2851.11 1.000 133.68 latin + old 321.84 8.859 134.39 new method 323.54 8.812 134.55 latin + new 205.44 13.878 136.17 template: true, loop: 100 Clang 7.0.0 time for file src/Editor.cxx [256856 -> 256856, 0.00]: old method 2963.94 1.000 137.99 latin + old 210.44 14.084 215.08 new method 314.81 9.415 154.85 latin + new 208.50 14.216 210.85
        
        Your VM results are around 10 times slower than these and also slower than your Windows native results. Benchmarks show around a 30% advantage for the i7-6700 over the i5-4570 so the processor is nowhere near enough to produce this difference.
        
        Your results show "new method" as around twice as fast as "latin + new" for Editor.cxx which also makes little sense to me. For Editor.cxx, every byte is ASCII so both the "latin +" cases should only run the first section of their code, indexing once into the Latin table which should always be faster than the 3 look ups and extra calculations of "new method". There is one comparison removed for "new method" over "latin + new", but that shouldn't be significant and should always be correctly predicted for Editor.cxx. About the only possibility that makes some sense is that "new method" is a preferred candidate for inlining over "latin + new".
        
        It would also be an idea to include the bit-width of the build in the output. Something like
        
        constexpr int bits = static_cast<int>(sizeof(void *)*8); #if defined(__clang__) printf("Clang %d.%d.%d %dbit\n", __clang_major__, __clang_minor__, __clang_patchlevel__, bits); #elif defined(__GNUC__) printf("GCC %d.%d.%d %dbit\n", __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__, bits); #elif defined(_MSC_VER) printf("MSVC %2d.%2d.%5d.%d %dbit\n", (_MSC_VER / 100), (_MSC_VER % 100), (_MSC_FULL_VER % 100000), _MSC_BUILD, bits); #endif
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Zufu Liu - 2019-02-15
        
        added bits.
        
        testCat.cpp
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Neil Hodgson - 2019-02-15
        
        Updated to VirtualBox 6.0.4 with no significant change to results. Here is my VirtualBox configuration. You seem to be missing PAE/NX although I wouldn't expect them to be a problem. Make sure you have installed the current version of the Guest Additions.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Zufu Liu - 2019-02-15
        
        Host (desktop) and VM (with PAE/NX enabled) is update to date. output by vbox-greeter:
        
        ubuntu@Ubuntu:~$ vbox-greeter 00:00:00.002089 main vbox-greeter 6.0.4 r128413 (verbosity: 0) linux.amd64 (Jan 25 2019 18:34:19) release log 00:00:00.002098 main Log opened 2019-02-15T04:56:57.364018000Z 00:00:00.002153 main OS Product: Linux 00:00:00.002161 main OS Release: 4.18.0-15-generic 00:00:00.002169 main OS Version: #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 00:00:00.002177 main OS Service Pack: #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 00:00:00.002186 main Executable: /opt/VBoxGuestAdditions-6.0.4/sbin/vbox-greeter 00:00:00.002189 main Process ID: 2435 00:00:00.002191 main Package type: LINUX_64BITS_GENERIC ** (process:2435): WARNING **: 12:56:57.549: No LIGHTDM_TO_SERVER_FD environment variable 00:00:00.187481 main vbox-lightdm-greeter: error: unable to connect to LightDM server, aborting ubuntu@Ubuntu:~$
        
        attached the detail compiler output (./runTestCat.sh -v and ./runTestCat.sh clang -v )
        
        clang-cmd.txt
        
        gcc-cmd.txt
        
        testCat-Clang-350.log
        
        testCat-GCC-22540.log
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Zufu Liu - 2019-02-15
        
        Clang enabled -vectorize-loops -vectorize-slp on O2, enable -ftree-vectorize for GCC seems not affect speed on my VM and host.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Update the test, changing + to |.
result on i5-4700 Win10 x64. Clang 9.0.0 SVN r351376 from https://llvm.org/builds/

Clang 9.0.0

time for file web_greek.html [442946 -> 437550]:
    old method       5616.25           1.000          311.81
    latin + old       353.82          15.873          321.97
    new method        695.00           8.081          257.70
    latin + new       344.36          16.309          264.38

MSVC 19.16.27026.1

time for file web_greek.html [442946 -> 437550]:
    old method       7343.81           1.000          356.22
    latin + old       954.68           7.692          346.85
    new method       1106.87           6.635          270.44
    latin + new       877.17           8.372          272.02

test-cat4.zip

With pre-shift for catTable2.

Clang 9.0.0

time for file web_ru.html [178720 -> 156321]:
    old method       2026.91           1.000          340.50
    latin + old       377.05           5.376          384.73
    new method        229.27           8.841          157.73
    latin + new       192.82          10.512          164.64

time for file web_greek.html [442946 -> 437550]:
    old method       5602.68           1.000          305.05
    latin + old       353.18          15.863          315.69
    new method        639.62           8.759          259.62
    latin + new       382.85          14.634          264.72

MSVC 19.16.27026.1

time for file web_ru.html [178720 -> 156321]:
    old method       2781.68           1.000          479.67
    latin + old       646.47           4.303          506.95
    new method        355.24           7.831          183.96
    latin + new       363.52           7.652          200.21

time for file web_greek.html [442946 -> 437550]:
    old method       7421.27           1.000          358.16
    latin + old       945.82           7.846          350.07
    new method       1016.68           7.299          274.71
    latin + new       891.86           8.321          274.06

test-cat5.zip

pre shift catTabel1 and catTable2.
result on MacBookPro (i5, Late 2011) using Clang 10 (Xcode 10.1).

Clang 10.0.0

time for file web_ru.html [178630 -> 156315]:
    old method       8289.49           1.000         1432.08
    latin + old      1374.64           6.030         1319.16
    new method        343.18          24.155          215.30
    latin + new       299.20          27.705          212.43

time for file web_greek.html [442544 -> 437445]:
    old method      24183.10           1.000          728.91
    latin + old       899.30          26.891          680.42
    new method       1189.88          20.324          386.65
    latin + new       698.13          34.640          352.78

test-cat-preshift.zip

correct web_cn.html

wget --no-verbose --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36" https://www.163.com/ --output-document=web_cn_gbk.html
iconv -f gbk -t utf-8 web_cn_gbk.html > web_cn.html

Zufu Liu - 2019-02-11

Update test, it's seems when non-ASCII characters goes to 8% and above, new method will outperform latin+old.

testCat.cpp

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Optimize Document::WordCharacterClass() and CategoriseCharacter()

Group

Searches

Help

#1259 Optimize Document::WordCharacterClass() and CategoriseCharacter()

Discussion