Is a 7% improvement in speed (and only when the file does not actually have to be loaded) worth adding 20 lines of code? The more code there is, the more difficult it is to change or add features.
I haven't figured out exactly why, but with body->ReAllocate(newSize + 1); in Partitioning::ReAllocate(), the last plv->InsertLines() call in CellBuffer::BasicInsertString() may cause a reallocation for starts; after changing it to body->ReAllocate(newSize + 2);, the reallocation is always avoided.
E.g. for enwiki-20200801-pages-articles-multistream1.xml-p1p30303, with SCI_ALLOCATELINES(4740896), the last plv->InsertLines() allocates starts with part1Length=4740866, gapLength=1048638, lengthBody=4740866, growSize=1048576.
It'd be best to track this down to be accurate in the code - it's too easy to 'fix' things by fudging, which can cause later issues.
Perhaps the problem here is that there is code that thinks that the number of lines is the same as the number of line ends instead of one more than the number of line ends? This is the case even when the file ends with a line end as Scintilla treats this as having an empty last line.
Another possibility is for a line to be added then removed such as when the block ends with a CR and there is an LF after the block but this shouldn't be the case when loading.
The improvement is small because each line in enwiki-20200801-pages-articles-multistream-index.txt is very short (article titles); consider the other two files on the same page:
Patches updated for current repository state attached.
With MSVC, the mask approach from BasicInsertString.diff was slower than testing for the range \n .. \r likely because the assembly is longer. This would let through VT and FF but these are rare and filtered out by the switch.
clang and gcc compile ((mask >> ch) & 1) == 0 into a bit test rather than a shift and test: https://godbolt.org/z/fbjEh71sd
Will measure range test (ch < '\n') tomorrow.
I don't know why, but on my very old AMD Athlon II X4 640 (Win10), the msvc skip block with ((mask >> ch) & 1) == 0 is much faster than with ch < '\n'. Results for mahotilo's consumption.csv (https://sourceforge.net/p/scintilla/feature-requests/1381/#7237) for both LF and CR+LF line endings:
consumption.csv LF:

Method        MSVC 2019   GCC 10.3.0-5   llvm-mingw 12.0.0   clang-cl 12.0.1
SSE2              56           54              52                  54
utf-8 table      277          324             323                 305
skip mask        243          321             332                 399
skip range       263          402             264                 256
consumption.csv CR+LF:

Method        MSVC 2019   GCC 10.3.0-5   llvm-mingw 12.0.0   clang-cl 12.0.1
SSE2              54           52              47                  50
utf-8 table      277          324             321                 305
skip mask        243          321             332                 399
skip range       263          401             264                 254
Last edit: Zufu Liu 2021-07-20
With the (outside) range version, the compiler can use unsigned subtraction underflow to collapse 2 comparisons down to 1:
lea	edx, [rax-10]	// unsigned int edx = ch - 10
			// ch < \n -> underflow to large number
			// ch >= \n -> smaller number
cmp	dl, 3		// dl > 3 means ch > \r or ch < \n
ja	.L3
Maybe the Athlon 2 is slow at lea or branch prediction isn't working well in this case.
I see a benefit on consumption.csv with MSVC 19, but it's something like 13% compared to the large difference in your results, and it is outweighed by improvements with gcc and clang.
Why the difference between files? consumption.csv has very long (70K average) lines. The difference could be that the mask version discards bytes > '\r' quicker but is then slower with handling '\r' and '\n'. That would make the speed depend on how often line ends occur.
The purpose of the skip block is to discard the most commonly used characters except tab, so that most characters are only compared once (2 instructions per byte).
On x86, the range version is 3 instructions per byte.
The mask version for gcc and clang can be tuned with __builtin_expect() (to prevent clang from merging the flags for the comparison and the bit test into a single jump), but that is non-portable unless the C++20 likely and unlikely attributes are used.
For the mask version, if msvc could hoist the mask outside the skip loop (as it does on arm64, though arm64 performance was not tested) and do a bit test, it might be even faster.
; msvc x86 range
		xor	cl, cl
$LL5@BasicInser:
		cmp	rbx, r15
		jae	SHORT $LN6@BasicInser
		movzx	ecx, BYTE PTR [rbx]
		inc	rbx
		lea	eax, DWORD PTR [rcx-10]
		cmp	al, 3
		ja	SHORT $LL5@BasicInser

; msvc x86 mask
		xor	cl, cl
$LL5@BasicInser:
		cmp	rbx, r15
		jae	SHORT $LN6@BasicInser
		movzx	ecx, BYTE PTR [rbx]
		inc	rbx
		cmp	cl, 13
		ja	SHORT $LL5@BasicInser
		mov	eax, 9216	; 00002400H
		shr	eax, cl
		test	al, 1
		je	SHORT $LL5@BasicInser

; msvc arm64 range
		mov	w10, #0
|$LL5@BasicInser|
		cmp	x19, x21
		bhs	|$LN6@BasicInser|
		ldrb	w10, [x19]
		add	x19, x19, #1
		add	w8, w10, #0xF6
		uxtb	w9, w8
		cmp	w9, #3
		bhi	|$LL5@BasicInser|

; msvc arm64 mask
		mov	w9, #0
		mov	w27, #0x2400
|$LL5@BasicInser|
		cmp	x19, x23
		bhs	|$LN6@BasicInser|
		ldrb	w9, [x19]
		add	x19, x19, #1
		cmp	w9, #0xD
		bhi	|$LL5@BasicInser|
		lsr	w8, w27, w9
		tbz	x8, #0, |$LL5@BasicInser|

; clang x86 mask
.LBB40_49:
		movzx	eax, byte ptr [rbx]
		add	rbx, 1
		cmp	al, 14
		setb	cl
		bt	edi, eax
		setb	dl
		test	cl, dl
		jne	.LBB40_51
		cmp	rbx, rbp
		jb	.LBB40_49
.LBB40_51:

; gcc x86 mask
.L1335:
		movq	%rdi, %rdx
.L1333:
		movzbl	(%rdi), %eax
		addq	$1, %rdi
		cmpb	$13, %al
		ja	.L1338
		movl	$9216, %ecx
		btl	%eax, %ecx
		jc	.L1337
.L1338:
		cmpq	%rbx, %rdi
		jne	.L1335
.L1337:
I found the following simple code is faster for Clang and GCC; however, it's slower for MSVC (which compiles it as if (ptr < end) do { ... } while (ptr < end);).
while (ptr < end) {
	// skip to line end
	const uint8_t ch = *ptr++;
	constexpr uint32_t mask = ((1 << '\r') - 1) ^ (1 << '\n');
	if (ch > '\r' || ((mask >> ch) & 1) != 0) {
		continue;
	}
	if (ch == '\r' && *ptr == '\n') {
		++ptr;
	}
	positions[nPositions++] = position + ptr - s;
	if (nPositions == PositionBlockSize) {
		plv->InsertLines(lineInsert, positions, nPositions, atLineStart);
		lineInsert += nPositions;
		nPositions = 0;
	}
}
MSVC for arm64 compiles the following code to sub + cmp + jump:
int ch = 0;	// can not use char or uint8_t
while (ptr < end && ((ch = *ptr++) > '\r' || ch < '\n')) {
	// nop
}
Timing (milliseconds) on an i5 with the 911 MB (20,468,259 lines) enwiki-20200801-pages-articles-multistream-index.txt from https://dumps.wikimedia.org/enwiki/20200801/
Extracted 'set init line count' and renamed to SCI_ALLOCATELINES as it is similar to SCI_ALLOCATE.
It's likely the underlying issue is that SplitVector::RoomFor expands storage if the new data would exactly fit in existing storage.
It is indeed expanded in SplitVector::RoomFor; here is a rough stack trace for the following 260-line file:
Due to upcoming major version, this will not be progressed until 5.x is stable.
652 MB (4,740,896 lines) enwiki-20200801-pages-articles-multistream1.xml-p1p30303
1.22 GB (14,220,042 lines) enwiki-20200801-pages-articles-multistream3.xml-p88445p200509
The above enwiki files contain mostly 7-bit ASCII; consider another file from https://dumps.wikimedia.org/zhwiki/20200801/ (database backup dumps: https://dumps.wikimedia.org/backup-index.html):
549 MB (7,499,537 lines) zhwiki-20200801-pages-articles-multistream1.xml-p1p162886
The attachment is the source for a program to count ASCII bytes in files; the following is the result for the four wiki dumps.
The simplified loop avoids testing for most characters, except CR, LF and TAB for normal text files, which makes it faster.
Committed SCI_ALLOCATELINES as [69fd19]. The previous change [5fbd71] may avoid extra reallocations.
Related
Commit: [5fbd71]
Commit: [69fd19]
Edit: I was looking at assembly with the wrong optimization options.
Benchmarks for storing and removing Editor.cxx (CR+LF line ends) from a Document many times.
If it wasn't for the -Os result, the original code would be fine.
Last edit: Neil Hodgson 2021-07-16
I updated the result for msvc skip with mask: 170 => 243.
Compared to the plain version (where most characters are compared twice, 4 instructions per byte):