Is a 7% improvement in speed (and only when the file does not actually have to be loaded) worth adding 20 lines of code? The more code there is, the more difficult it is to change or add features.
I haven't figured out exactly why, but with body->ReAllocate(newSize + 1); in Partitioning::ReAllocate(), the last plv->InsertLines() call in CellBuffer::BasicInsertString() may cause a reallocation for starts; after changing it to body->ReAllocate(newSize + 2);, the reallocation is always avoided.
E.g. for enwiki-20200801-pages-articles-multistream1.xml-p1p30303, with SCI_ALLOCATELINES(4740896), the last plv->InsertLines() allocates starts with part1Length=4740866, gapLength=1048638, lengthBody=4740866, growSize=1048576.
It'd be best to track this down to be accurate in the code - it's too easy to 'fix' things by fudging, which can cause later issues.
Perhaps the problem here is that there is code that thinks that the number of lines is the same as the number of line ends instead of one more than the number of line ends? This is the case even when the file ends with a line end as Scintilla treats this as having an empty last line.
Another possibility is for a line to be added then removed such as when the block ends with a CR and there is an LF after the block but this shouldn't be the case when loading.
The improvement is small because each line in enwiki-20200801-pages-articles-multistream-index.txt is very short (article titles); consider the other two files on the same page:
Patches updated for current repository state attached.
With MSVC, the mask approach from BasicInsertString.diff was slower than testing for the range \n .. \r likely because the assembly is longer. This would let through VT and FF but these are rare and filtered out by the switch.
clang and gcc compile ((mask >> ch) & 1) == 0 into a bit test rather than a shift and test: https://godbolt.org/z/fbjEh71sd
Will measure range test (ch < '\n') tomorrow.
I don't know why, but on my very old AMD Athlon II X4 640 (Win10), the msvc skip block with ((mask >> ch) & 1) == 0 is much faster than with ch < '\n'. Results for mahotilo's consumption.csv (https://sourceforge.net/p/scintilla/feature-requests/1381/#7237) for both LF and CR+LF line endings:
consumption.csv LF:

Method        MSVC 2019   GCC 10.3.0-5   llvm-mingw 12.0.0   clang-cl 12.0.1
SSE2              56           54              52                  54
utf-8 table      277          324             323                 305
skip mask        243          321             332                 399
skip range       263          402             264                 256
consumption.csv CR+LF:

Method        MSVC 2019   GCC 10.3.0-5   llvm-mingw 12.0.0   clang-cl 12.0.1
SSE2              54           52              47                  50
utf-8 table      277          324             321                 305
skip mask        243          321             332                 399
skip range       263          401             264                 254
Last edit: Zufu Liu 2021-07-20
With the (outside) range version, the compiler can use unsigned subtraction underflow to collapse 2 comparisons down to 1:
lea	edx, [rax-10]	// unsigned int edx = ch - 10
			// ch < \n -> underflow to large number
			// ch >= \n -> smaller number
cmp	dl, 3		// dl > 3 means ch > \r or ch < \n
ja	.L3
Maybe the Athlon 2 is slow at lea or branch prediction isn't working well in this case.
I see a benefit on consumption.csv with MSVC 19, but it's something like 13% compared to the large difference in your results, and it is outweighed by improvements with gcc and clang.
Why the difference between files? consumption.csv has very long (70K average) lines. The difference could be that the mask version discards bytes > '\r' quicker but is then slower with handling '\r' and '\n'. That would make the speed depend on how often line ends occur.
The purpose of the skip block is to discard the most commonly used characters except tab, so that most characters are only compared once (2 instructions per byte).
On x86, the range version is 3 instructions per byte.
The mask version for gcc and clang can be tuned with __builtin_expect() (to prevent clang from merging the flags for the comparison and the bit test into a single jump), but that is non-portable unless the C++20 likely and unlikely attributes are used.
For the mask version, if msvc could hoist the mask outside the skip loop (as it does on arm64, though arm64 performance was not tested) and do a bit test, it might be even faster.
; msvc x86 range
		xor	cl, cl
$LL5@BasicInser:
		cmp	rbx, r15
		jae	SHORT $LN6@BasicInser
		movzx	ecx, BYTE PTR [rbx]
		inc	rbx
		lea	eax, DWORD PTR [rcx-10]
		cmp	al, 3
		ja	SHORT $LL5@BasicInser

; msvc x86 mask
		xor	cl, cl
$LL5@BasicInser:
		cmp	rbx, r15
		jae	SHORT $LN6@BasicInser
		movzx	ecx, BYTE PTR [rbx]
		inc	rbx
		cmp	cl, 13
		ja	SHORT $LL5@BasicInser
		mov	eax, 9216	; 00002400H
		shr	eax, cl
		test	al, 1
		je	SHORT $LL5@BasicInser

; msvc arm64 range
		mov	w10, #0
|$LL5@BasicInser|
		cmp	x19, x21
		bhs	|$LN6@BasicInser|
		ldrb	w10, [x19]
		add	x19, x19, #1
		add	w8, w10, #0xF6
		uxtb	w9, w8
		cmp	w9, #3
		bhi	|$LL5@BasicInser|

; msvc arm64 mask
		mov	w9, #0
		mov	w27, #0x2400
|$LL5@BasicInser|
		cmp	x19, x23
		bhs	|$LN6@BasicInser|
		ldrb	w9, [x19]
		add	x19, x19, #1
		cmp	w9, #0xD
		bhi	|$LL5@BasicInser|
		lsr	w8, w27, w9
		tbz	x8, #0, |$LL5@BasicInser|

; clang x86 mask
.LBB40_49:
		movzx	eax, byte ptr [rbx]
		add	rbx, 1
		cmp	al, 14
		setb	cl
		bt	edi, eax
		setb	dl
		test	cl, dl
		jne	.LBB40_51
		cmp	rbx, rbp
		jb	.LBB40_49
.LBB40_51:

; gcc x86 mask
.L1335:
		movq	%rdi, %rdx
.L1333:
		movzbl	(%rdi), %eax
		addq	$1, %rdi
		cmpb	$13, %al
		ja	.L1338
		movl	$9216, %ecx
		btl	%eax, %ecx
		jc	.L1337
.L1338:
		cmpq	%rbx, %rdi
		jne	.L1335
.L1337:
I found the following simple code is faster for Clang and GCC; however, it's slower for MSVC (which compiles it as if (ptr < end) do { ... } while (ptr < end);).
while (ptr < end) {
	// skip to line end
	const uint8_t ch = *ptr++;
	constexpr uint32_t mask = ((1 << '\r') - 1) ^ (1 << '\n');
	if (ch > '\r' || ((mask >> ch) & 1) != 0) {
		continue;
	}
	if (ch == '\r' && *ptr == '\n') {
		++ptr;
	}
	positions[nPositions++] = position + ptr - s;
	if (nPositions == PositionBlockSize) {
		plv->InsertLines(lineInsert, positions, nPositions, atLineStart);
		lineInsert += nPositions;
		nPositions = 0;
	}
}
MSVC for arm64 compiles the following code to sub + cmp + jump:
int ch = 0;	// can not use char or uint8_t
while (ptr < end && ((ch = *ptr++) > '\r' || ch < '\n')) {
	// nop
}
Timing (milliseconds) on an i5 with the 911 MB (20,468,259 lines) enwiki-20200801-pages-articles-multistream-index.txt from https://dumps.wikimedia.org/enwiki/20200801/
Extracted 'set init line count' and renamed to SCI_ALLOCATELINES as it is similar to SCI_ALLOCATE.
It's likely the underlying issue is that SplitVector::RoomFor expands storage if the new data would exactly fit in existing storage.
It is indeed expanded in SplitVector::RoomFor; here is a rough stack trace for the following 260-line file:
Due to upcoming major version, this will not be progressed until 5.x is stable.
652 MB (4,740,896 lines) enwiki-20200801-pages-articles-multistream1.xml-p1p30303
1.22 GB (14,220,042 lines) enwiki-20200801-pages-articles-multistream3.xml-p88445p200509
The above enwiki files contain mostly 7-bit ASCII; consider another file from https://dumps.wikimedia.org/zhwiki/20200801/ (database backup dumps: https://dumps.wikimedia.org/backup-index.html):
549 MB (7,499,537 lines) zhwiki-20200801-pages-articles-multistream1.xml-p1p162886
The attachment is the source for a program to count ASCII bytes in files; the following is the result for the four wiki dumps.
The simplified loop avoids testing for most characters, except CR, LF and TAB for normal text files, which makes it faster.
Committed SCI_ALLOCATELINES as [69fd19]. The previous change [5fbd71] may avoid extra reallocations.
Related
Commit: [5fbd71]
Commit: [69fd19]
Edit: I was looking at assembly with the wrong optimization options.
Benchmarks for storing and removing Editor.cxx (CR+LF line ends) from a Document many times.
If it wasn't for the -Os result, the original code would be fine.
Last edit: Neil Hodgson 2021-07-16
I updated the result for msvc skip with mask: 170 => 243.
Compared to the plain version (where most characters are compared twice, 4 instructions per byte):