Menu

#1370 Optimize CellBuffer::BasicInsertString() for non UTF-8 line endings

Initial
open
None
5
2021-07-22
2020-08-09
Zufu Liu
No

This continue [feature-requests:#1347].

The entire loop is simplified, the table-lookup is optimized to use bit test for non UTF-8 line endings.

1 Attachments

Related

Feature Requests: #1347

Discussion

  • Zufu Liu

    Zufu Liu - 2020-08-11

    timing (milliseconds) on an i5 with 911 MB (20,468,259 lines) enwiki-20200801-pages-articles-multistream-index.txt from https://dumps.wikimedia.org/enwiki/20200801/

    method current with set init line count
    UTF-8 912 777
    This 848 669
    SSE2 365 188
     
  • Neil Hodgson

    Neil Hodgson - 2020-08-11

    Is a 7% improvement in speed (but only when file doesn't have to really be loaded) worth adding 20 lines of code? The more code there is, the more difficult it is to change or add features.

     
  • Neil Hodgson

    Neil Hodgson - 2020-08-12

    Extracted 'set init line count' and renamed to SCI_ALLOCATELINES as it is similar to SCI_ALLOCATE.

     
    • Zufu Liu

      Zufu Liu - 2020-08-12

      I don't figured out exactly why, but with body->ReAllocate(newSize + 1); in Partitioning::ReAllocate(), last plv->InsertLines() call in CellBuffer::BasicInsertString() may cause a reallocation for starts, change it to body->ReAllocate(newSize + 2);, the reallocation is always avoided.

      e.g. for enwiki-20200801-pages-articles-multistream1.xml-p1p30303, with SCI_ALLOCATELINES(4740896), last plv->InsertLines() will allocates starts as part1Length=4740866, gapLength=1048638, lengthBody=4740866, growSize=1048576.

       
      • Neil Hodgson

        Neil Hodgson - 2020-08-12

        It'd be best to track this down to be accurate in the code - its too easy to 'fix' things by fudging which can cause later issues.

        Perhaps the problem here is that there is code that thinks that the number of lines is the same as the number of line ends instead of one more than the number of line ends? This is the case even when the file ends with a line end as Scintilla treats this as having an empty last line.

        Another possibility is for a line to be added then removed such as when the block ends with a CR and there is an LF after the block but this shouldn't be the case when loading.

         
        • Neil Hodgson

          Neil Hodgson - 2020-08-12

          Its likely the underlying issue is that SplitVector::RoomFor expands storage if the new data would exactly fit in existing storage.

           
          • Zufu Liu

            Zufu Liu - 2020-08-13

            It's indeed expanded in SplitVector::RoomFor, a rough stack trace for following 260 lines file:

            doc = '\n'.join(str(i + 1) for i in range(260))
            f = open('260.txt','wb'); f.write(doc.encode('utf-8')); f.close();
            
            LineVector::InsertLines(257, 3, true)
                Partitioning::InsertPartitionsWithCast(257, 3)
                    SplitVector::InsertEmpty(257, 3)
                        SplitVector::RoomFor(3)
                            where gapLength = insertLength = 3
                            SplitVector::ReAllocate(520)
            
             
          • Neil Hodgson

            Neil Hodgson - 2020-08-13

            Due to upcoming major version, this will not be progressed until 5.x is stable.

             
  • Zufu Liu

    Zufu Liu - 2020-08-12

    The little improvement is because each line in enwiki-20200801-pages-articles-multistream-index.txt is very short (article titles), consider other two files on the same page:

    652 MB (4,740,896 lines) enwiki-20200801-pages-articles-multistream1.xml-p1p30303

    method current set init line count
    UTF-8 442 402
    This 314 261
    SSE2 138 88

    1.22 GB (14,220,042 lines) enwiki-20200801-pages-articles-multistream3.xml-p88445p200509

    method current set init line count
    UTF-8 946 840
    This 720 577
    SSE2 346 200
     
  • Zufu Liu

    Zufu Liu - 2020-08-14

    Attachment is source for the program to count ASCII bytes in files, following is result for the four wiki dumps.

    enwiki-20200801-pages-articles-multistream1.xml-p1p30303
        size=684289215, ascii=680620843, 99.46%
    enwiki-20200801-pages-articles-multistream3.xml-p88445p200509
        size=1318900811, ascii=1313733612, 99.61%
    enwiki-20200801-pages-articles-multistream-index.txt
        size=955976005, ascii=951667531, 99.55%
    zhwiki-20200801-pages-articles-multistream1.xml-p1p162886
        size=576108366, ascii=249904736, 43.38%
    

    The simplified loop avoided testing for most characters, except CR, LF and TAB for normal text files, which makes it faster.

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-15

    Patches updated for current repository state attached.

    With MSVC, the mask approach from BasicInsertString.diff was slower than testing for the range \n .. \r likely because the assembly is longer. This would let through VT and FF but these are rare and filtered out by the switch.

    /* BasicInsertString.diff */
                while (ptr < end && ((ch = *ptr++) > '\r' || ((mask >> ch) & 1) == 0)) {
                    ...
    00007FF75BE8669D  cmp         cl,0Dh  
    00007FF75BE866A0  ja          Scintilla::Internal::CellBuffer::BasicInsertString+432h (07FF75BE86692h)  
    00007FF75BE866A2  mov         eax,2400h  
    00007FF75BE866A7  shr         eax,cl  
    00007FF75BE866A9  test        al,1  
    00007FF75BE866AB  je          Scintilla::Internal::CellBuffer::BasicInsertString+432h (07FF75BE86692h)  
    
    /* Range */
                while (ptr < end && ((ch = *ptr++) > '\r' || (ch < '\n'))) {
                    ...
    00007FF699986693  cmp         al,0Dh  
    00007FF699986695  ja          Scintilla::Internal::CellBuffer::BasicInsertString+428h (07FF699986688h)  
    00007FF699986697  cmp         al,0Ah  
    00007FF699986699  jb          Scintilla::Internal::CellBuffer::BasicInsertString+428h (07FF699986688h)  
    
     
  • Zufu Liu

    Zufu Liu - 2021-07-15

    clang and gcc compiles (mask >> ch) & 1) == 0) into bit test other than shift and test.
    https://godbolt.org/z/fbjEh71sd
    Will measure range test (ch < '\n') tomorrow.

     
  • Neil Hodgson

    Neil Hodgson - 2021-07-16

    Committed SCI_ALLOCATELINES as [69fd19]. The previous change [5fbd71] may avoid extra reallocations.

     

    Related

    Commit: [5fbd71]
    Commit: [69fd19]

  • Neil Hodgson

    Neil Hodgson - 2021-07-16

    Edit: I was looking at assembly with the wrong optimization options.

    Benchmarks for storing and removing Editor.cxx (CR+LF line ends) from a Document many times.

    compiler original feature 1370 skip loop: or (ch < '\n') remove skip loop
    MSVC /O2 /Oi /GL 5335 5556 5127 5832
    g++ 9.2 -Os 13506 8718 8073 8220
    g++ 9.2 -O2 8213 8521 7516 7821

    If it wasn't for the -Os result, the original code would be fine.

     

    Last edit: Neil Hodgson 2021-07-16
  • Zufu Liu

    Zufu Liu - 2021-07-17

    I don't know why, on my very old AMD Athlon II X4 640 Win10, msvc skip block with ((mask >> ch) & 1) == 0 is much faster than with ch < '\n'. result for mahotilo's consumption.csv (https://sourceforge.net/p/scintilla/feature-requests/1381/#7237) for both LF and CR+LF) line endings.

    consumption.csv LF:

    Method MSVC 2019 GCC 10.3.0-5 llvm-mingw 12.0.0 clang-cl 12.0.1
    SSE2 56 54 52 54
    utf-8 table 277 324 323 305
    skip mask 243 321 332 399
    skip range 263 402 264 256

    consumption.csv CR+LF:

    Method MSVC 2019 GCC 10.3.0-5 llvm-mingw 12.0.0 clang-cl 12.0.1
    SSE2 54 52 47 50
    utf-8 table 277 324 321 305
    skip mask 243 321 332 399
    skip range 263 401 264 254
     

    Last edit: Zufu Liu 2021-07-20
    • Neil Hodgson

      Neil Hodgson - 2021-07-18

      With the (outside) range version, the compiler can use unsigned subtraction underflow to collapse 2 comparisons down to 1:

      lea     edx, [rax-10]
      // unsigned int edx = ch - 10
      // ch < \n -> underflow to large number
      // ch >= \n -> smaller number
      cmp     dl, 3
      // is dl > 3 means is ch > \r or ch < \n
      ja      .L3
      

      Maybe the Athlon 2 is slow at lea or branch prediction isn't working well in this case.

       
    • Neil Hodgson

      Neil Hodgson - 2021-07-20

      I see a benefit on consumption.csv with MSVC 19 but its something like 13% compared to the large difference in your results and outweighed by improvements with gcc and clang.

      Why the difference between files? consumption.csv has very long (70K average) lines. The difference could be that the mask version discards bytes > '\r' quicker but is then slower with handling '\r' and '\n'. That would make the speed depend on how often line ends occur.

       
  • Zufu Liu

    Zufu Liu - 2021-07-20

    I updated the result for msvc skip with mask: 170 => 243.

    compared to the plain version (where most characters will be compared twice, 4 instructions per byte):

    while (ptr < end) {
        switch (*ptr++) {
        case '\r':
            ...
        case '\n':
            ...
        }
    }
    

    The purpose of the skip block is to discard most commonly used characters except tab, so that most characters will only be compared once (2 instructions per byte).
    On x86, the range version is 3 instructions per byte.

    mask version for gcc and clang can be tuned with __builtin expect() (prevent clang merges flags for comparison and bit test then do a single jump), which is then non-portable unless use C++20 likely and unlikely attributes.

    for the mask version, if msvc can hoist mask outside the skip loop (like arm64, but performance on arm64 not tested) and do a bit test, it might even faster.

    ; msvc x86 range
        xor cl, cl
    $LL5@BasicInser:
        cmp rbx, r15
        jae SHORT $LN6@BasicInser
        movzx   ecx, BYTE PTR [rbx]
        inc rbx
        lea eax, DWORD PTR [rcx-10]
        cmp al, 3
        ja  SHORT $LL5@BasicInser
    
    ; msvc x86 mask
        xor cl, cl
    $LL5@BasicInser:
        cmp rbx, r15
        jae SHORT $LN6@BasicInser
        movzx   ecx, BYTE PTR [rbx]
        inc rbx
        cmp cl, 13
        ja  SHORT $LL5@BasicInser
        mov eax, 9216               ; 00002400H
        shr eax, cl
        test    al, 1
        je  SHORT $LL5@BasicInser
    
    ; msvc arm64 range
        mov         w10,#0
    |$LL5@BasicInser|
        cmp         x19,x21
        bhs         |$LN6@BasicInser|
        ldrb        w10,[x19]
        add         x19,x19,#1
        add         w8,w10,#0xF6
        uxtb        w9,w8
        cmp         w9,#3
        bhi         |$LL5@BasicInser|
    
    ; msvc arm64 mask
        mov         w9,#0
        mov         w27,#0x2400
    |$LL5@BasicInser|
        cmp         x19,x23
        bhs         |$LN6@BasicInser|
        ldrb        w9,[x19]
        add         x19,x19,#1
        cmp         w9,#0xD
        bhi         |$LL5@BasicInser|
        lsr         w8,w27,w9
        tbz         x8,#0,|$LL5@BasicInser|
    
     ; clang x86 mask
     .LBB40_49:                              
        movzx   eax, byte ptr [rbx]
        add rbx, 1
        cmp al, 14
        setb    cl
        bt  edi, eax
        setb    dl
        test    cl, dl
        jne .LBB40_51                            
        cmp rbx, rbp
        jb  .LBB40_49
    .LBB40_51:                             
    
    ; gcc x86 mask
    .L1335:
        movq    %rdi, %rdx
    .L1333:
        movzbl  (%rdi), %eax
        addq    $1, %rdi
        cmpb    $13, %al
        ja  .L1338
        movl    $9216, %ecx
        btl %eax, %ecx
        jc  .L1337
    .L1338:
        cmpq    %rbx, %rdi
        jne .L1335
    .L1337:
    
     
  • Zufu Liu

    Zufu Liu - 2021-07-22

    I found following simple code is faster for Clang and GCC, however the it's slower for MSVC (been compiled as if (ptr < end) do { ... } while (ptr < end);).

    while (ptr < end) {
        // skip to line end
        const uint8_t ch = *ptr++;
        constexpr uint32_t mask = ((1 << '\r') - 1) ^ (1 << '\n');
        if (ch > '\r' || ((mask >> ch) & 1) != 0) {
            continue;
        }
        if (ch == '\r' && *ptr == '\n') {
            ++ptr;
        }
        positions[nPositions++] = position + ptr - s;
        if (nPositions == PositionBlockSize) {
            plv->InsertLines(lineInsert, positions, nPositions, atLineStart);
            lineInsert += nPositions;
            nPositions = 0;
        }
    }
    

    MSVC for arm64 compiles following code to sub + cmp + jump:

    int ch = 0; // can not use char or uint8_t
    while (ptr < end && ((ch = *ptr++) > '\r' || ch < '\n')) {
        // nop
    }
    
     

Log in to post a comment.