Scintilla currently executes only on the main user interface thread, except for loading documents from background threads. Laying out text could be sped up by measuring text positions in parallel in some cases. Wide lines can cause user-perceived hangs and hitches due to the time required to measure the positions of text. Current mainstream processors contain multiple (2-16) cores and may also implement hyper-threading. On a 4 core 8 thread Intel i7-6700 Windows 10 system, the attached patch achieves a better than 4x speed up on long lines using Visual C++ 2022 and DirectWrite.
This was originally in response to [feature-requests:#1422] but is useful in other cases.
The patch has benefits but also has shortcomings, so a different approach may be better.
The core change is to separate the order-dependent and order-independent parts of the layout loop in EditView::LayoutLine and perform the order-independent code in parallel. The parallelized portion calls Surface::MeasureWidths*, either directly or indirectly through PositionCache, to discover the relative positions of characters in each measurement segment. This portion accounts for almost all (around 99%) of the runtime in most cases, with almost all of that in DirectWrite's IDWriteTextLayout::GetClusterMetrics. After all the segments are measured, the segment positions are made absolute and tabs are expanded in a sequential loop.
The C++17 for_each parallel algorithm is used to perform code in parallel using multiple threads. The C++ runtime library decides how many threads to use and coordinates them.

    std::for_each(std::execution::par, segments.begin(), segments.end(),
        [this, &surfacePool, &vstyle, &model, &ll](const TextSegment &ts) {
            // ... measure the characters in ts
        });

The same code can be run sequentially by replacing std::execution::par with std::execution::seq.
The MSVC implementation of parallel algorithms uses the Windows system thread pool to avoid excessive thread creation and contention.
With DirectWrite, an example of wide styled text (110 MB of SQL insert data with short segments and some repeated lexemes) on a single line was 4.4 times faster (9.85 versus 43.5 seconds). The example heavy line from [feature-requests:#1422] consisting of many non-ASCII characters with no spaces was 4.1 times faster (9.9 versus 41 seconds) on the test machine.
When the monospaced ASCII feature is strongly active (the text is almost all ASCII and only monospaced fonts are used) then DirectWrite is rarely called and the std::for_each(std::execution::par, ...) approach is slightly (7%) slower. This overhead could be (nearly) eliminated with better Surface management to avoid locking overhead.
GDI is also sped up by this code but it is not known if GDI text measurement is really thread-safe.
By default, #include <execution> fails compilation with g++. g++ implements std::execution::par by calling Intel's Threading Building Blocks (TBB) library but g++ does not install TBB. Requiring TBB will make it more difficult to build Scintilla for something that is just an optimization. The patch uses the preprocessor to avoid TBB inclusion and convert the parallel code to sequential for g++. Further preprocessor use could enable users with TBB to optionally allow it.
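The preprocessor fallback described here might look like the following sketch. The HAVE_PARALLEL_EXECUTION macro name and the doubling "measurement" are placeholders for illustration, not the patch's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

#if defined(HAVE_PARALLEL_EXECUTION)	// assumed macro name, set by the build
#include <execution>
#endif

// Stand-in for per-segment measurement work: double every width.
void MeasureAll(std::vector<int> &widths) {
#if defined(HAVE_PARALLEL_EXECUTION)
    std::for_each(std::execution::par, widths.begin(), widths.end(),
        [](int &w) { w *= 2; });
#else
    // Sequential fallback for toolchains without a usable <execution>
    // (g++ without TBB, Xcode's null header).
    std::for_each(widths.begin(), widths.end(), [](int &w) { w *= 2; });
#endif
}
```

Builds that have TBB (or MSVC) would define the macro; everyone else silently gets the sequential loop with identical results.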
Xcode has a null <execution> that implements neither std::execution::par nor std::execution::seq. So there would need to be a fallback for macOS.
GTK is, itself, not thread safe. The Pango library used for text measurement is thread-safe, except on Win32.
https://gitlab.gnome.org/GNOME/pango/-/issues/256
It's possible that, on Unix, the Pango resources required for text measurement could be created from their GTK contexts by single-threaded initialization code then used inside the parallel portion.
Haven't looked at Qt.
The application may have a better idea about whether to use threads (perhaps there are memory or power constraints or underlying libraries are not thread-safe) so there should be a runtime switch.
Since there are problems with <execution>, lower level threading could be used. SciTE has used the C++ standard library std::thread and std::mutex successfully for 2 years on Win32, Linux, and macOS. A simple design with std::thread would divide the segments for each long line up into n portions for n new threads, where n is a setting that can be chosen by the application. This will be more work and may be less efficient without the benefits of the Windows system thread pool or work stealing.
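A minimal sketch of that static-division design, with a hypothetical ParallelLayout helper and a stand-in computation in place of the real Surface::MeasureWidths call:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for measuring one segment's width.
static int MeasureSegment(size_t i) {
    return static_cast<int>(i) * 3;
}

// Divide the line's segments into nThreads contiguous portions, one per
// new std::thread; nThreads would be the application-chosen setting n.
void ParallelLayout(std::vector<int> &widths, size_t nThreads) {
    const size_t per = (widths.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> threads;
    for (size_t t = 0; t < nThreads; t++) {
        const size_t start = t * per;
        const size_t end = std::min(widths.size(), start + per);
        threads.emplace_back([&widths, start, end]() {
            for (size_t i = start; i < end; i++)
                widths[i] = MeasureSegment(i);
        });
    }
    for (std::thread &th : threads)
        th.join();
}
```

Each thread writes only its own range so no locking is needed for the results, but, as noted, a difficult range can leave one thread running alone at the end.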
EditView::LayoutLine doesn't have access to the window, technology or bidi setting so this is bodged in the patch. This data needs to be correctly transmitted down to this code to ensure that the layout is correct.
Since Surface is stateful, each thread needs an independent instance. As it could be expensive to create a new measurement surface for each text segment, a thread-safe pool is implemented in the patch as SurfacePool, but this is a minor issue (extra 0.25% time to create a new Surface for each segment) for DirectWrite.
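A hypothetical sketch of such a pool — not the patch's actual SurfacePool, and with a stand-in Surface struct in place of the platform surface — where threads borrow an instance and return it for reuse instead of recreating one per segment:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Surface {	// stand-in for the stateful platform measurement surface
    int timesUsed = 0;
};

class SurfacePool {
    std::mutex mutexPool;	// guards the vector of idle surfaces
    std::vector<std::unique_ptr<Surface>> surfaces;
public:
    // Hand out an idle surface, creating one only when the pool is empty.
    std::unique_ptr<Surface> Acquire() {
        std::lock_guard<std::mutex> guard(mutexPool);
        if (surfaces.empty())
            return std::make_unique<Surface>();
        std::unique_ptr<Surface> s = std::move(surfaces.back());
        surfaces.pop_back();
        return s;
    }
    // Return a surface so another thread (or segment) can reuse it.
    void Release(std::unique_ptr<Surface> s) {
        std::lock_guard<std::mutex> guard(mutexPool);
        surfaces.push_back(std::move(s));
    }
};
```

The lock is held only while moving a pointer in or out, which is why the pooling overhead stays small relative to the measurement work.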
mutexCache can be changed to std::shared_mutex (https://en.cppreference.com/w/cpp/thread/shared_mutex), which is a Slim Reader/Writer (SRW) Lock on Windows. std::shared_mutex should work well in this code but when I tried it performance was a little worse on the test file. That may be because the SQL test didn't effectively hit the cache; there's a comment about this on StackOverflow. Cases with a higher cache hit ratio should be common so a file with more repeated lexemes should be checked.
std::shared_mutex didn't work with the XP build, probably because SRW locks were introduced in Vista.
With parallel3.patch, seeing 17% better performance from shared_mutex once the file is lexed (the first number for each technique is unlexed):
mutex
10.1557 for 112974535 segments = 1235124
2.82384 for 112974535 segments = 14152623
shared_mutex
10.2396 for 112974535 segments = 1235124
2.41063 for 112974535 segments = 14152623
macOS 10.12 (2016) is required for std::shared_mutex, which is a jump from the currently supported 10.9.
OK, for the VC STL, std::mutex is an SRW lock for Win7+, and a critical section for XP and Vista, see https://github.com/microsoft/STL/blob/main/stl/src/primitives.hpp#L111.
Some ideas:
1. Use durationWrapOneByte to estimate whether the current line can be laid out within a small time (e.g. several hundred milliseconds) without UI blocking.
2. (to make all compilers work) A native thread pool in the platform layer for PLAT_XXX, e.g. Win32 QueueUserWorkItem() or CreateThreadpoolWork() (used by MSVC parallel algorithms), macOS dispatch_async(); or a naive thread pool (for pthread) with std::thread::hardware_concurrency().
3. A surface pool, or surface cache (based on style integer value or font; call SetFont() once and reuse the surface for MeasureWidths()).
4. A lock free work list.
Last edit: Zufu Liu 2022-01-09
(2) C++ has std::async with std::launch::async but, like <execution> and std::shared_mutex, there could be portability issues. There's a std::future returned at each call that should be waited for, which may take time.
https://en.cppreference.com/w/cpp/thread/async
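A minimal illustration of that pattern — one task and one std::future per segment, each of which must then be waited on — with a stand-in measurement function:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stand-in for measuring one segment; returns its width.
static int MeasureSegment(int length) {
    return length * 7;
}

// Launch one std::async task per segment and sum the results.
// With millions of segments, creating and waiting on this many
// futures is itself expensive, as noted later in the thread.
int MeasureAllSegments(const std::vector<int> &lengths) {
    std::vector<std::future<int>> futures;
    futures.reserve(lengths.size());
    for (const int len : lengths)
        futures.push_back(std::async(std::launch::async, MeasureSegment, len));
    int total = 0;
    for (std::future<int> &f : futures)
        total += f.get();	// every future must be waited for
    return total;
}
```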
(3) SurfaceD2D::SetFont() isn't prominent in profiles and much of its code is not needed for MeasureWidths as it affects the render target which is not referenced later. The core DirectWrite object for measuring positions that takes the time is IDWriteTextLayout which is created and used inside MeasureWidths and not remembered. It may be possible to make Surface implementations (particularly SurfaceD2D) less stateful (use a local for pTextFormat instead of a field) so MeasureWidths is reentrant and multiple threads don't interfere; then there is no need for multiple surfaces and a SurfacePool. As other platforms differ, there could be a 'reentrant MeasureWidths' feature support flag.
std::async is available on each platform (although only MSVC uses a thread pool) and performs OK with moderate numbers of calls, including the [feature-requests:#1422] case. However, the SQL example which was fast with std::execution::par was almost as slow as the original sequential implementation and used much memory. Creating all the (14 million) tasks took ~20 seconds and waiting for and closing their futures took about half as much. More real threads (19) were used than std::execution::par (8). Could be viable with some chunking of segments so that fewer tasks are created.
Related
Feature Requests: #1422
Patch to make SurfaceD2D measurement functions stateless, including the following changes:
1. Changed maxWidthMeasure and maxLenText in SurfaceGDI to static constexpr.
2. Removed null tests for pIDWriteFactory and pD2DFactory; when either of them is null, technology can't be D2D.
3. Removed codePageText, pTextFormat, yAscent, yDescent, yInternalLeading from SurfaceD2D.
4. Renamed SurfaceD2D::SetFont() to SurfaceD2D::SetFontQuality() and only call it once inside SurfaceD2D::DrawTextCommon(); other places don't need to touch pRenderTarget.
Some changes seem not worthwhile:
1. Reorder fields in SurfaceGDI and SurfaceD2D to reduce padding
2. Change Win32MapFontQuality() and DWriteMapFontQuality() to direct computation.
For parallel performance with the surface pool, how about moving surface->MeasureWidths() out of PositionCache::MeasureWidths(), and splitting the code before and after that line into a getter (returns bool and a probe, avoiding surface allocation/locking for monospaced ASCII or text already in the cache) and a setter?
I found another issue: SurfaceD2D is not aware of ClearType font settings due to the cached rendering params; maybe something can be added into WM_SETTINGCHANGE.
All the fonts and surfaces should be recreated after WM_SETTINGCHANGE so rendering params shouldn't be stale.
defaultRenderingParams and customClearTypeRenderingParams created inside LoadD2DOnce() are stale after WM_SETTINGCHANGE or when the window is moved to another monitor. There is sample code for the latter: https://docs.microsoft.com/en-us/windows/win32/DirectWrite/how-to-add-support-for-multiple-monitors
So, have to move defaultRenderingParams and customClearTypeRenderingParams into SurfaceD2D and set them from the monitor of the window in both SurfaceD2D::Inits.
Committed with [9502f2] and [9be3ea].
Related
Commit: [9502f2]
Commit: [9be3ea]
SetFontQuality can be renamed to SetTextRenderingParams.
Something came to mind for the slow SQL example: how about adding a field in ViewStyle to indicate that all fonts are monospaced, then scanning the whole line to find the first non-ASCII position, then going back to the last punctuation, space or control position; the text before this position becomes a simple segment, and the remaining text goes into BreakFinder.
Last edit: Zufu Liu 2022-01-10
I assume this is where all the styles use the same monospaced font.
This will be faster but should be measured to see if it is worth the complexity. Monospaced is already very fast at 130 MB / second.
Updated version that works out how many threads to use, then std::async over that many threads, then loops over multiple segments in each thread. If just 1 thread then use the main thread with launch::deferred. Since the threads are longer-lasting, they hold onto a surface and the SurfacePool is dropped. This is similar in performance to parallel for_each but is portable.
Since this is statically assigning ranges of the line to threads, it could end up with the last slow thread (due to a difficult range or bad scheduling luck) running with no concurrency. An alternative would be to distribute segments as needed but this would have more locking and there may also be memory locality and caching effects.
Lines that are simple enough to not be worth the cost of thread creation could set threads to 1 but this should be informed by measurement of thread creation and may differ between operating systems. std::thread could be used instead of std::async, but that doesn't get the benefit of the thread pool on Win32 and may need to repeat the code for the single threaded case as it doesn't have a launch::deferred option.
The core code looks like this:
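A sketch of the scheme as described, with illustrative names and a stand-in measurement — decide a thread count, launch that many std::async workers each covering a contiguous range of segments, and run deferred on the calling thread when only one thread is wanted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Lay out widths.size() segments using the requested number of workers.
void LayoutSegments(std::vector<int> &widths, size_t threads) {
    threads = std::clamp<size_t>(threads, 1, std::max<size_t>(1, widths.size()));
    // One thread: launch::deferred runs the work on the calling thread.
    const std::launch policy =
        (threads == 1) ? std::launch::deferred : std::launch::async;
    const size_t per = (widths.size() + threads - 1) / threads;
    std::vector<std::future<void>> futures;
    for (size_t t = 0; t < threads; t++) {
        const size_t start = t * per;
        const size_t end = std::min(widths.size(), start + per);
        futures.push_back(std::async(policy, [&widths, start, end]() {
            // A longer-lasting worker could hold one Surface for its range.
            for (size_t i = start; i < end; i++)
                widths[i] = static_cast<int>(i) * 2;	// stand-in measurement
        }));
    }
    for (std::future<void> &f : futures)
        f.get();	// deferred work executes here
}
```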
With std::mutex in PositionCache::MeasureWidths() and something similar to the following code, using Notepad2's current SurfaceD2D, parallel time for the 7,266,428 byte long line reduced from 9 seconds to 2.2 seconds (1.8 seconds with memory_order::seq_cst but that produces overlapped positions, which I haven't figured out why), while sequential time is 28 seconds with D2D and 14 seconds with GDI.
My current SurfaceD2D changes:
1. https://sourceforge.net/p/scintilla/feature-requests/1422/#8c46 increase lengthStartSubdivision and lengthEachSubdivision
2. https://sourceforge.net/p/scintilla/feature-requests/1427/#f884 SetFont() changes; after this change D2D1_FACTORY_TYPE_MULTI_THREADED is not needed since the measurement functions never use pD2DFactory or any object it creates (pRenderTarget). Only pIDWriteFactory is used, and DWRITE_FACTORY_TYPE_SHARED already grants thread safety.
Using an atomic index reduced the time, maybe because each thread runs on nearby segments and so runs in a similar total time compared with the task division approach or std::execution::par (where the total time depends on the hardest/most complex text part).
Also partial line layout can be implemented with the atomic index (replacing while (true)).
Memory order is the second argument to fetch_add so the call should be fetch_add(1, std::memory_order_acq_rel). std::memory_order_acq_rel is 4 on MSVC 2019 with C++17 (and a plain enum so poor type checking) so the code was stepping through the segments by 4 (0, 4, 8, ...) and not performing most of the work.
Performance is similar to parallel3.patch after changing the increment. That 4x speed up would have been great.
Using 16 threads could be faster in some circumstances but I see a slow-down over 8 threads on my 4 core 8 hardware thread machine.
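The corrected use of the atomic index can be sketched as follows (stand-in measurement, illustrative names): the increment is fetch_add's first argument and the memory order its second, so passing std::memory_order_acq_rel (4 on MSVC) as the first argument stepped by 4 and skipped most segments.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Threads claim the next unmeasured segment via a shared atomic counter.
void MeasureWithAtomicIndex(std::vector<int> &widths, unsigned nThreads) {
    std::atomic<size_t> next{0};
    auto worker = [&widths, &next]() {
        for (;;) {
            // Correct: advance by 1; the memory order is the SECOND argument.
            const size_t i = next.fetch_add(1, std::memory_order_acq_rel);
            if (i >= widths.size())
                break;
            widths[i] = static_cast<int>(i) + 1;	// stand-in measurement
        }
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; t++)
        threads.emplace_back(worker);
    for (std::thread &th : threads)
        th.join();
}
```

Each index is claimed exactly once, so every segment is measured by exactly one thread and nearby segments tend to go to the same thread.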
Last edit: Neil Hodgson 2022-01-12
Thanks for the correction, but the time on your i7 is slower than expected. The 9 seconds for me is on an i5 with 4 cores / 4 logical processors; I expected it to be 4 ~ 5 seconds on your 8 logical processor i7 using Notepad2's latest code (in src\Notepad2.c, uncomment the AttachConsole() block, then in EditView::LayoutLine() uncomment the measurement code LayoutWorker block; the end of Editor::WrapLines() has other timing logs).
Following are all my local changes (not much slower than Notepad2) to Scintilla's src and win32 sources (mostly from https://github.com/zufuliu/notepad2/commit/419cbafa60f7097f7ea8fea9c039db4b4801da93 and https://github.com/zufuliu/notepad2/commit/2ab9eda02d5c31eef6e09b0eef3c16da88279fde)
1. Independent part moved into the LayoutWorker class.
2. Increased lengthStartSubdivision and lengthEachSubdivision (the changes in PlatWin.cxx).
3. Added durationWrapOneThread to measure single thread performance, and use it to determine whether to use parallel layout for the current line. This fixes the slow ASCII long line case.
4. Empty lines are specially handled to not be laid out or wrapped.
5. Other small changes: changed ElapsedPeriod::Duration() to const noexcept; moved pdoc to be the first member of EditModel, as it's frequently accessed.
Partial line layout can be implemented by changing EditModel::UseParallelLayout() to return the length that can be processed in 10 milliseconds, and checking the time after processing that many bytes.
Unfortunately, hyper-threading is not as good as additional cores. Intel claimed 15-30% improvement and the 9% improvement above isn't unusual. Hyper-threading effectively uses some memory stalls but the tests may be (mostly) cache-resident and have few stalls to provide opportunities for the other thread.
https://en.wikipedia.org/wiki/Hyper-threading#Performance_claims
There may be negative effects of this if it decreases the number of segments on medium length lines and reduces the potential of parallelization. One of the main pieces needed for parallelization is measurements on each platform to determine when to go parallel and how many threads to use. It may be something as simple as threads = segments / 8 or linelength / 4000 capped by hardware concurrency, or it may be more complex if the different platforms have significantly different costs per thread, which I am expecting as only MSVC has a thread pool.
Added reentrant MeasureWidths and MeasureWidthsUTF8 for Cocoa with [2a972e].
The macOS laptop system tested has a 2.0 GHz i7-4750HQ with 4 cores and hyperthreading. It throttles each core more than a desktop system as more threads are active. The most important CTLine* and CTRun* calls are generally slower (scaling by processor speed) than DirectWrite for simpler ASCII text but faster for the 'unusual' non-ASCII text of the log example.
There's also some text not displayed where there are combining characters (2DEA [COMBINING CYRILLIC LETTER O] [ⷪ]) combined with characters not present in a font (2BEA [STAR WITH LEFT HALF BLACK] [⯪]).
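One form the thread-count heuristic mentioned above could take — the divisors 8 and 4000 are the guesses from this thread, not measured values:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Guess how many threads are worth using for one line, capped by the
// hardware concurrency; returns at least 1 (the sequential case).
size_t ThreadsForLine(size_t segments, size_t lineLength) {
    const size_t byWork = std::max(segments / 8, lineLength / 4000);
    const size_t hw = std::max<size_t>(1, std::thread::hardware_concurrency());
    return std::clamp<size_t>(byWork, 1, hw);
}
```

Per-platform measurement would replace the constants, since thread start-up cost differs where no thread pool backs the implementation.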
For the [feature-requests:#1422] non-ASCII log example:
The SQL file used for testing is at https://www.scintilla.org/enwik.zip and is based on the file linked from [feature-requests:#1373], but modified to have a single extremely long line so as to take a long time and produce good profiles.
On this setup, threading is important for the unlexed state where there is little locking as long lexemes don't go into cache. Once the text is lexed then the number of threads is not as beneficial, likely because of locking overhead.
An as yet unexplored issue is how well the performance will scale to 8 and 16 core processors. Using shared_mutex should benefit more cores as there will be more threads contending on access to the position cache. However, shared_mutex is never better in these tests and is dramatically worse at 8 threads. This could be a problem with the macOS implementation of shared_mutex not being as good as Win32's SRW locks.
https://turingcompl33t.github.io/RWLock-Benchmark/
Related
Feature Requests: #1373
Feature Requests: #1422
Commit: [2a972e]
Last edit: Neil Hodgson 2022-01-17
Profiling PositionCache::MeasureWidths on the Win32 8 thread version with enwik.sql shows that locking is an important contributor to time. 56.7% with std::mutex and 39.7% with std::shared_mutex.
The time for enwiki.sql is doubtful; using Notepad2 with 4 threads on an i5, the measure width block (with monospaced font and styling) takes about 1.2 seconds, while the wrap line block takes about 8 seconds.
This is with a proportional font. The purpose isn't to find the minimum time for that file but to work out any problems with the cache and how to scale parallelization. Similar files may contain non-ASCII text. For me, this is the main use case for threading: wide textual data files that do not benefit from the ASCII monospace feature. The log example is really a binary file (or combined text and binary file) so will be encountered much less often.
Profiles with 2 threads and 1 thread but with locking appear to show that the locks are slower when they are contended.
If the lock time is due to contention, then reducing contention should improve speed. A technique could be to divide the cache into sections, each section having its own lock.
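That sectioned-cache idea could be sketched as follows. The names are illustrative, and a real cache would also bound each section's size; each section has its own lock, chosen by a hash of the text, so threads measuring different lexemes rarely contend:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>

class StripedCache {
    static constexpr size_t sections = 16;
    struct Section {
        std::mutex mutexSection;	// one lock per section, not per cache
        std::map<std::string, int> widths;
    };
    std::array<Section, sections> cache;
    // Pick the section for a given text by hashing it.
    Section &SectionFor(const std::string &text) {
        return cache[std::hash<std::string>{}(text) % sections];
    }
public:
    void Insert(const std::string &text, int width) {
        Section &s = SectionFor(text);
        std::lock_guard<std::mutex> guard(s.mutexSection);
        s.widths[text] = width;
    }
    bool Find(const std::string &text, int &width) {
        Section &s = SectionFor(text);
        std::lock_guard<std::mutex> guard(s.mutexSection);
        const auto it = s.widths.find(text);
        if (it == s.widths.end())
            return false;
        width = it->second;
        return true;
    }
};
```

Two threads only contend when their texts hash to the same section, so contention drops roughly in proportion to the number of sections.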
I measured Notepad2 (cache size 2048, string length less than 64 are cached) for enwiki.sql with proportional font (can be toggled by the menu
Scheme -> Use Default Code Style
), the lock time (const CacheReadLock readLock;
andconst CacheWriteLock writeLock;
, both are SRWLockExclusive) for 4 threads is only about 4%, total LayoutWorker block is about 6.5 seconds.https://github.com/zufuliu/notepad2/blob/main/scintilla/src/PositionCache.cxx#L1140
I see around the same 33% locking cost with 4 threads and shared_mutex either with exclusive or shared locking over the read portion:
Are you isolating just the multi-threaded (intensive) section of the profile after the text has been lexed or is other code like loading and lexing included? 6.5 seconds sounds like the time to lex and layout.
std::shared_mutex could be adding costs over SRW but it appears thin to me. It's possible I am seeing a profiling artifact if the profiler behaves poorly with multiple threads.