Scintilla currently executes only on the main user interface thread, except for loading documents from background threads. Laying out text could be sped up by measuring text positions in parallel in some cases. Wide lines can cause user-perceived hangs and hitches due to the time required to measure the positions of text. Current mainstream processors contain multiple (2-16) cores and may also implement hyper-threading. On a 4 core 8 thread Intel i7-6700 Windows 10 system, the attached patch achieves a better than 4x speed up on long lines using Visual C++ 2022 and DirectWrite.
This was originally in response to [feature-requests:#1422] but is useful in other cases.
The patch has benefits but also has shortcomings, so a different approach may be better.
The core change is to separate the order-dependent and order-independent parts of the layout loop in EditView::LayoutLine and perform the order-independent code in parallel. The parallelized portion calls Surface::MeasureWidths*, either directly or indirectly through PositionCache, to discover the relative positions of characters in each measurement segment. This portion accounts for almost all (around 99%) of the runtime in most cases, with almost all of that in DirectWrite's IDWriteTextLayout::GetClusterMetrics. After all the segments are measured, the segment positions are made absolute and tabs are expanded in a sequential loop.
The C++17 for_each parallel algorithm is used to perform code in parallel using multiple threads. The C++ runtime library decides how many threads to use and coordinates them.

    std::for_each(std::execution::par, segments.begin(), segments.end(),
        [this, &surfacePool, &vstyle, &model, &ll](const TextSegment &ts) {
            // ... measure the characters in ts
        });

The same code can be run sequentially by replacing std::execution::par with std::execution::seq.
The MSVC implementation of parallel algorithms uses the Windows system thread pool to avoid excessive thread creation and contention.
With DirectWrite, an example of wide styled text (110 MB of SQL insert data with short segments and some repeated lexemes) on a single line was 4.4 times faster (9.85 versus 43.5 seconds). The example heavy line from [feature-requests:#1422] consisting of many non-ASCII characters with no spaces was 4.1 times faster (9.9 versus 41 seconds) on the test machine.
When the monospaced ASCII feature is strongly active (the text is almost all ASCII and only monospaced fonts are used) then DirectWrite is rarely called and the std::for_each(std::execution::par, ...) approach is slightly (7%) slower. This overhead could be (nearly) eliminated with better Surface management to avoid locking overhead.
GDI is also sped up by this code but it is not known if GDI text measurement is really thread-safe.
By default, #include <execution> fails compilation with g++. g++ implements std::execution::par by calling Intel's Threading Building Blocks (TBB) library but g++ does not install TBB. Requiring TBB will make it more difficult to build Scintilla for something that is just an optimization. The patch uses the preprocessor to avoid TBB inclusion and convert the parallel code to sequential for g++. Further preprocessor use could enable users with TBB to optionally allow it.
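The preprocessor fallback described here might look like the following sketch. The HAVE_PARALLEL_EXECUTION macro name and the doubling "measurement" are placeholders for illustration, not the patch's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

#if defined(HAVE_PARALLEL_EXECUTION)	// assumed macro name, set by the build
#include <execution>
#endif

// Stand-in for per-segment measurement work: double every width.
void MeasureAll(std::vector<int> &widths) {
#if defined(HAVE_PARALLEL_EXECUTION)
    std::for_each(std::execution::par, widths.begin(), widths.end(),
        [](int &w) { w *= 2; });
#else
    // Sequential fallback for toolchains without a usable <execution>
    // (g++ without TBB, Xcode's null header).
    std::for_each(widths.begin(), widths.end(), [](int &w) { w *= 2; });
#endif
}
```

Builds that have TBB (or MSVC) would define the macro; everyone else silently gets the sequential loop with identical results.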
Xcode has a null <execution> that implements neither std::execution::par nor std::execution::seq. So there would need to be a fallback for macOS.
GTK is, itself, not thread safe. The Pango library used for text measurement is thread-safe, except on Win32.
https://gitlab.gnome.org/GNOME/pango/-/issues/256
It's possible that, on Unix, the Pango resources required for text measurement could be created from their GTK contexts by single-threaded initialization code then used inside the parallel portion.
Haven't looked at Qt.
The application may have a better idea about whether to use threads (perhaps there are memory or power constraints or underlying libraries are not thread-safe) so there should be a runtime switch.
Since there are problems with <execution>, lower level threading could be used. SciTE has used the C++ standard library std::thread and std::mutex successfully for 2 years on Win32, Linux, and macOS. A simple design with std::thread would divide the segments for each long line up into n portions for n new threads, where n is a setting that can be chosen by the application. This will be more work and may be less efficient without the benefits of the Windows system thread pool or work stealing.
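A minimal sketch of that static-division design, with a hypothetical ParallelLayout helper and a stand-in computation in place of the real Surface::MeasureWidths call:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for measuring one segment's width.
static int MeasureSegment(size_t i) {
    return static_cast<int>(i) * 3;
}

// Divide the line's segments into nThreads contiguous portions, one per
// new std::thread; nThreads would be the application-chosen setting n.
void ParallelLayout(std::vector<int> &widths, size_t nThreads) {
    const size_t per = (widths.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> threads;
    for (size_t t = 0; t < nThreads; t++) {
        const size_t start = t * per;
        const size_t end = std::min(widths.size(), start + per);
        threads.emplace_back([&widths, start, end]() {
            for (size_t i = start; i < end; i++)
                widths[i] = MeasureSegment(i);
        });
    }
    for (std::thread &th : threads)
        th.join();
}
```

Each thread writes only its own range so no locking is needed for the results, but, as noted, a difficult range can leave one thread running alone at the end.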
EditView::LayoutLine doesn't have access to the window, technology or bidi setting so this is bodged in the patch. This data needs to be correctly transmitted down to this code to ensure that the layout is correct.
Since Surface is stateful, each thread needs an independent instance. As it could be expensive to create a new measurement surface for each text segment, a thread-safe pool is implemented in the patch as SurfacePool, but this is a minor issue (extra 0.25% time to create a new Surface for each segment) for DirectWrite.
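A hypothetical sketch of such a pool — not the patch's actual SurfacePool, and with a stand-in Surface struct in place of the platform surface — where threads borrow an instance and return it for reuse instead of recreating one per segment:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Surface {	// stand-in for the stateful platform measurement surface
    int timesUsed = 0;
};

class SurfacePool {
    std::mutex mutexPool;	// guards the vector of idle surfaces
    std::vector<std::unique_ptr<Surface>> surfaces;
public:
    // Hand out an idle surface, creating one only when the pool is empty.
    std::unique_ptr<Surface> Acquire() {
        std::lock_guard<std::mutex> guard(mutexPool);
        if (surfaces.empty())
            return std::make_unique<Surface>();
        std::unique_ptr<Surface> s = std::move(surfaces.back());
        surfaces.pop_back();
        return s;
    }
    // Return a surface so another thread (or segment) can reuse it.
    void Release(std::unique_ptr<Surface> s) {
        std::lock_guard<std::mutex> guard(mutexPool);
        surfaces.push_back(std::move(s));
    }
};
```

The lock is held only while moving a pointer in or out, which is why the pooling overhead stays small relative to the measurement work.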
mutexCache can be changed to std::shared_mutex (https://en.cppreference.com/w/cpp/thread/shared_mutex), which is a Slim Reader/Writer (SRW) Lock on Windows. std::shared_mutex should work well in this code but when I tried it performance was a little worse on the test file. That may be because the SQL test didn't effectively hit the cache; there's a comment about this on StackOverflow. Cases with a higher cache hit ratio should be common so a file with more repeated lexemes should be checked.
std::shared_mutex didn't work with the XP build, probably because SRW locks were introduced in Vista.
With parallel3.patch, seeing 17% better performance from shared_mutex once the file is lexed (the first number for each technique is unlexed):
mutex
10.1557 for 112974535 segments = 1235124
2.82384 for 112974535 segments = 14152623
shared_mutex
10.2396 for 112974535 segments = 1235124
2.41063 for 112974535 segments = 14152623
macOS 10.12 (2016) is required for std::shared_mutex, which is a jump from the currently supported 10.9.
OK, for the VC STL, std::mutex is an SRW lock for Win7+, and a critical section for XP and Vista, see https://github.com/microsoft/STL/blob/main/stl/src/primitives.hpp#L111.
Some ideas:
1. Use durationWrapOneByte to estimate whether the current line can be laid out within a small time (e.g. several hundred milliseconds) without UI blocking.
2. (to make all compilers work) A native thread pool in the platform layer for PLAT_XXX, e.g. Win32 QueueUserWorkItem() or CreateThreadpoolWork() (used by MSVC parallel algorithms), macOS dispatch_async(); or a naive thread pool (for pthread) with std::thread::hardware_concurrency().
3. A surface pool, or surface cache (based on style integer value or font; call SetFont() once and reuse the surface for MeasureWidths()).
4. A lock free work list.
Last edit: Zufu Liu 2022-01-09
(2) C++ has std::async with std::launch::async but, like <execution> and std::shared_mutex, there could be portability issues. There's a std::future returned at each call that should be waited for, which may take time.
https://en.cppreference.com/w/cpp/thread/async
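A minimal illustration of that pattern — one task and one std::future per segment, each of which must then be waited on — with a stand-in measurement function:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stand-in for measuring one segment; returns its width.
static int MeasureSegment(int length) {
    return length * 7;
}

// Launch one std::async task per segment and sum the results.
// With millions of segments, creating and waiting on this many
// futures is itself expensive, as noted later in the thread.
int MeasureAllSegments(const std::vector<int> &lengths) {
    std::vector<std::future<int>> futures;
    futures.reserve(lengths.size());
    for (const int len : lengths)
        futures.push_back(std::async(std::launch::async, MeasureSegment, len));
    int total = 0;
    for (std::future<int> &f : futures)
        total += f.get();	// every future must be waited for
    return total;
}
```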
(3) SurfaceD2D::SetFont() isn't prominent in profiles and much of its code is not needed for MeasureWidths as it affects the render target which is not referenced later. The core DirectWrite object for measuring positions that takes the time is IDWriteTextLayout which is created and used inside MeasureWidths and not remembered. It may be possible to make Surface implementations (particularly SurfaceD2D) less stateful (use a local for pTextFormat instead of a field) so MeasureWidths is reentrant and multiple threads don't interfere; then there is no need for multiple surfaces and a SurfacePool. As other platforms differ, there could be a 'reentrant MeasureWidths' feature support flag.
std::async is available on each platform (although only MSVC uses a thread pool) and performs OK with moderate numbers of calls, including the [feature-requests:#1422] case. However, the SQL example which was fast with std::execution::par was almost as slow as the original sequential implementation and used much memory. Creating all the (14 million) tasks took ~20 seconds and waiting for and closing their futures took about half as much. More real threads (19) were used than std::execution::par (8). Could be viable with some chunking of segments so that fewer tasks are created.
Related
Feature Requests: #1422
Patch to make SurfaceD2D measurement functions stateless, including the following changes:
1. Changed maxWidthMeasure and maxLenText in SurfaceGDI to static constexpr.
2. Removed null tests for pIDWriteFactory and pD2DFactory; when either of them is null, technology can't be D2D.
3. Removed codePageText, pTextFormat, yAscent, yDescent, yInternalLeading from SurfaceD2D.
4. Renamed SurfaceD2D::SetFont() to SurfaceD2D::SetFontQuality() and only call it once inside SurfaceD2D::DrawTextCommon(); other places don't need to touch pRenderTarget.
Some changes seem not worthwhile:
1. Reorder fields in SurfaceGDI and SurfaceD2D to reduce padding
2. Change Win32MapFontQuality() and DWriteMapFontQuality() to direct computation.
For parallel performance with the surface pool, how about moving surface->MeasureWidths() out of PositionCache::MeasureWidths(), and splitting the code before and after that line into a getter (returns bool and a probe, avoiding surface allocation/locking for monospaced ASCII or text already in the cache) and a setter?
I found another issue: SurfaceD2D is not aware of ClearType font settings due to the cached rendering params; maybe something can be added into WM_SETTINGCHANGE.
All the fonts and surfaces should be recreated after WM_SETTINGCHANGE so rendering params shouldn't be stale.
defaultRenderingParams and customClearTypeRenderingParams created inside LoadD2DOnce() are stale after WM_SETTINGCHANGE or when the window is moved to another monitor. There is sample code for the latter: https://docs.microsoft.com/en-us/windows/win32/DirectWrite/how-to-add-support-for-multiple-monitors
So, have to move defaultRenderingParams and customClearTypeRenderingParams into SurfaceD2D and set them from the monitor of the window in both SurfaceD2D::Inits.
Committed with [9502f2] and [9be3ea].
Related
Commit: [9502f2]
Commit: [9be3ea]
SetFontQuality can be renamed to SetTextRenderingParams.
Something came to mind for the slow SQL example: how about adding a field in ViewStyle to indicate that all fonts are monospaced, then scanning the whole line to find the first non-ASCII position, then going back to the last punctuation, space or control position; the text before this position becomes a simple segment, and the remaining text goes into BreakFinder.
Last edit: Zufu Liu 2022-01-10
I assume this is where all the styles use the same monospaced font.
This will be faster but should be measured to see if it is worth the complexity. Monospaced is already very fast at 130 MB / second.
Updated version that works out how many threads to use, then std::async over that many threads, then loops over multiple segments in each thread. If just 1 thread then use the main thread with launch::deferred. Since the threads are longer-lasting, they hold onto a surface and the SurfacePool is dropped. This is similar in performance to parallel for_each but is portable.
Since this is statically assigning ranges of the line to threads, it could end up with the last slow thread (due to a difficult range or bad scheduling luck) running with no concurrency. An alternative would be to distribute segments as needed but this would have more locking and there may also be memory locality and caching effects.
Lines that are simple enough to not be worth the cost of thread creation could set threads to 1 but this should be informed by measurement of thread creation and may differ between operating systems. std::thread could be used instead of std::async, but that doesn't get the benefit of the thread pool on Win32 and may need to repeat the code for the single threaded case as it doesn't have a launch::deferred option.
The core code looks like this:
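A sketch of the scheme as described, with illustrative names and a stand-in measurement — decide a thread count, launch that many std::async workers each covering a contiguous range of segments, and run deferred on the calling thread when only one thread is wanted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Lay out widths.size() segments using the requested number of workers.
void LayoutSegments(std::vector<int> &widths, size_t threads) {
    threads = std::clamp<size_t>(threads, 1, std::max<size_t>(1, widths.size()));
    // One thread: launch::deferred runs the work on the calling thread.
    const std::launch policy =
        (threads == 1) ? std::launch::deferred : std::launch::async;
    const size_t per = (widths.size() + threads - 1) / threads;
    std::vector<std::future<void>> futures;
    for (size_t t = 0; t < threads; t++) {
        const size_t start = t * per;
        const size_t end = std::min(widths.size(), start + per);
        futures.push_back(std::async(policy, [&widths, start, end]() {
            // A longer-lasting worker could hold one Surface for its range.
            for (size_t i = start; i < end; i++)
                widths[i] = static_cast<int>(i) * 2;	// stand-in measurement
        }));
    }
    for (std::future<void> &f : futures)
        f.get();	// deferred work executes here
}
```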
With std::mutex in PositionCache::MeasureWidths() and something similar to the following code, using Notepad2's current SurfaceD2D, parallel time for the 7,266,428 byte long line reduced from 9 seconds to 2.2 seconds (1.8 seconds with memory_order::seq_cst but that produces overlapped positions, which I haven't figured out why), while sequential time is 28 seconds with D2D and 14 seconds with GDI.
My current SurfaceD2D changes:
1. https://sourceforge.net/p/scintilla/feature-requests/1422/#8c46 increase lengthStartSubdivision and lengthEachSubdivision
2. https://sourceforge.net/p/scintilla/feature-requests/1427/#f884 SetFont() changes; after this change D2D1_FACTORY_TYPE_MULTI_THREADED is not needed since the measurement functions never use pD2DFactory or any object it creates (pRenderTarget). Only pIDWriteFactory is used, and DWRITE_FACTORY_TYPE_SHARED already grants thread safety.
Using an atomic index reduced the time, maybe because each thread runs on nearby segments and so runs in a similar total time compared with the task division approach or std::execution::par (where the total time depends on the hardest/most complex text part).
Also partial line layout can be implemented with the atomic index (replacing while (true)).
Memory order is the second argument to fetch_add so the call should be fetch_add(1, std::memory_order_acq_rel). std::memory_order_acq_rel is 4 on MSVC 2019 with C++17 (and a plain enum so poor type checking) so the code was stepping through the segments by 4 (0, 4, 8, ...) and not performing most of the work.
Performance is similar to parallel3.patch after changing the increment. That 4x speed up would have been great.
Using 16 threads could be faster in some circumstances but I see a slow-down over 8 threads on my 4 core 8 hardware thread machine.
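The corrected use of the atomic index can be sketched as follows (stand-in measurement, illustrative names): the increment is fetch_add's first argument and the memory order its second, so passing std::memory_order_acq_rel (4 on MSVC) as the first argument stepped by 4 and skipped most segments.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Threads claim the next unmeasured segment via a shared atomic counter.
void MeasureWithAtomicIndex(std::vector<int> &widths, unsigned nThreads) {
    std::atomic<size_t> next{0};
    auto worker = [&widths, &next]() {
        for (;;) {
            // Correct: advance by 1; the memory order is the SECOND argument.
            const size_t i = next.fetch_add(1, std::memory_order_acq_rel);
            if (i >= widths.size())
                break;
            widths[i] = static_cast<int>(i) + 1;	// stand-in measurement
        }
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; t++)
        threads.emplace_back(worker);
    for (std::thread &th : threads)
        th.join();
}
```

Each index is claimed exactly once, so every segment is measured by exactly one thread and nearby segments tend to go to the same thread.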
Last edit: Neil Hodgson 2022-01-12
Thanks for the correction, but the time on your i7 is slower than expected. The 9 seconds for me is on an i5 with 4 cores / 4 logical processors; I expected it to be 4 ~ 5 seconds on your 8 logical processor i7 using Notepad2's latest code (in src\Notepad2.c, uncomment the AttachConsole() block, then in EditView::LayoutLine() uncomment the measurement code LayoutWorker block; the end of Editor::WrapLines() has other timing logs).
Following are all my local changes (not much slower than Notepad2) to Scintilla's src and win32 sources (mostly from https://github.com/zufuliu/notepad2/commit/419cbafa60f7097f7ea8fea9c039db4b4801da93 and https://github.com/zufuliu/notepad2/commit/2ab9eda02d5c31eef6e09b0eef3c16da88279fde)
1. Independent part moved into the LayoutWorker class.
2. Increased lengthStartSubdivision and lengthEachSubdivision (the changes in PlatWin.cxx).
3. Added durationWrapOneThread to measure single thread performance, and use it to determine whether to use parallel layout for the current line. This fixes the slow ASCII long line case.
4. Empty lines are specially handled to not be laid out or wrapped.
5. Other small changes: changed ElapsedPeriod::Duration() to const noexcept; moved pdoc to be the first member of EditModel, as it's frequently accessed.
Partial line layout can be implemented by changing EditModel::UseParallelLayout() to return the length that can be processed in 10 milliseconds, and checking the time after processing that many bytes.
Unfortunately, hyper-threading is not as good as additional cores. Intel claimed 15-30% improvement and the 9% improvement above isn't unusual. Hyper-threading effectively uses some memory stalls but the tests may be (mostly) cache-resident and have few stalls to provide opportunities for the other thread.
https://en.wikipedia.org/wiki/Hyper-threading#Performance_claims
There may be negative effects of this if it decreases the number of segments on medium length lines and reduces the potential of parallelization. One of the main pieces needed for parallelization is measurements on each platform to determine when to go parallel and how many threads to use. It may be something as simple as threads = segments / 8 or linelength / 4000 capped by hardware concurrency, or it may be more complex if the different platforms have significantly different costs per thread, which I am expecting as only MSVC has a thread pool.
Added reentrant MeasureWidths and MeasureWidthsUTF8 for Cocoa with [2a972e].
The macOS laptop system tested has a 2.0 GHz i7-4750HQ with 4 cores and hyperthreading. It throttles each core more than a desktop system as more threads are active. The most important CTLine* and CTRun* calls are generally slower (scaling by processor speed) than DirectWrite for simpler ASCII text but faster for the 'unusual' non-ASCII text of the log example.
There's also some text not displayed where there are combining characters (2DEA [COMBINING CYRILLIC LETTER O] [ⷪ]) combined with characters not present in a font (2BEA [STAR WITH LEFT HALF BLACK] [⯪]).
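One form the thread-count heuristic mentioned above could take — the divisors 8 and 4000 are the guesses from this thread, not measured values:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Guess how many threads are worth using for one line, capped by the
// hardware concurrency; returns at least 1 (the sequential case).
size_t ThreadsForLine(size_t segments, size_t lineLength) {
    const size_t byWork = std::max(segments / 8, lineLength / 4000);
    const size_t hw = std::max<size_t>(1, std::thread::hardware_concurrency());
    return std::clamp<size_t>(byWork, 1, hw);
}
```

Per-platform measurement would replace the constants, since thread start-up cost differs where no thread pool backs the implementation.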
For the [feature-requests:#1422] non-ASCII log example:
The SQL file used for testing is at https://www.scintilla.org/enwik.zip and is based on the file linked from [feature-requests:#1373], but modified to have a single extremely long line so as to take a long time and produce good profiles.
On this setup, threading is important for the unlexed state where there is little locking as long lexemes don't go into cache. Once the text is lexed then the number of threads is not as beneficial, likely because of locking overhead.
An as yet unexplored issue is how well the performance will scale to 8 and 16 core processors. Using shared_mutex should benefit more cores as there will be more threads contending on access to the position cache. However, shared_mutex is never better in these tests and is dramatically worse at 8 threads. This could be a problem with the macOS implementation of shared_mutex not being as good as Win32's SRW locks.
https://turingcompl33t.github.io/RWLock-Benchmark/
Related
Feature Requests: #1373
Feature Requests: #1422
Commit: [2a972e]
Last edit: Neil Hodgson 2022-01-17
Profiling PositionCache::MeasureWidths on the Win32 8 thread version with enwik.sql shows that locking is an important contributor to time. 56.7% with std::mutex and 39.7% with std::shared_mutex.
The time for enwiki.sql is doubtful; using Notepad2 with 4 threads on an i5, the measure width block (with monospaced font and styling) takes about 1.2 seconds, while the wrap line block takes about 8 seconds.
This is with a proportional font. The purpose isn't to find the minimum time for that file but to work out any problems with the cache and how to scale parallelization. Similar files may contain non-ASCII text. For me, this is the main use case for threading: wide textual data files that do not benefit from the ASCII monospace feature. The log example is really a binary file (or combined text and binary file) so will be encountered much less often.
Profiles with 2 threads and 1 thread but with locking appear to show that the locks are slower when they are contended.
If the lock time is due to contention, then reducing contention should improve speed. A technique could be to divide the cache into sections, each section having its own lock.
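That sectioned-cache idea could be sketched as follows. The names are illustrative, and a real cache would also bound each section's size; each section has its own lock, chosen by a hash of the text, so threads measuring different lexemes rarely contend:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>

class StripedCache {
    static constexpr size_t sections = 16;
    struct Section {
        std::mutex mutexSection;	// one lock per section, not per cache
        std::map<std::string, int> widths;
    };
    std::array<Section, sections> cache;
    // Pick the section for a given text by hashing it.
    Section &SectionFor(const std::string &text) {
        return cache[std::hash<std::string>{}(text) % sections];
    }
public:
    void Insert(const std::string &text, int width) {
        Section &s = SectionFor(text);
        std::lock_guard<std::mutex> guard(s.mutexSection);
        s.widths[text] = width;
    }
    bool Find(const std::string &text, int &width) {
        Section &s = SectionFor(text);
        std::lock_guard<std::mutex> guard(s.mutexSection);
        const auto it = s.widths.find(text);
        if (it == s.widths.end())
            return false;
        width = it->second;
        return true;
    }
};
```

Two threads only contend when their texts hash to the same section, so contention drops roughly in proportion to the number of sections.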
I measured Notepad2 (cache size 2048, string length less than 64 are cached) for enwiki.sql with proportional font (can be toggled by the menu
Scheme -> Use Default Code Style
), the lock time (const CacheReadLock readLock;
andconst CacheWriteLock writeLock;
, both are SRWLockExclusive) for 4 threads is only about 4%, total LayoutWorker block is about 6.5 seconds.https://github.com/zufuliu/notepad2/blob/main/scintilla/src/PositionCache.cxx#L1140
I see around the same 33% locking cost with 4 threads and shared_mutex either with exclusive or shared locking over the read portion:
Are you isolating just the multi-threaded (intensive) section of the profile after the text has been lexed or is other code like loading and lexing included? 6.5 seconds sounds like the time to lex and layout.
std::shared_mutex could be adding costs over SRW but it appears thin to me. It's possible I am seeing a profiling artifact if the profiler behaves poorly with multiple threads.