
#1427 Improve Scintilla speed with parallel layout

Status: closed (Committed)
Priority: 5
Updated: 2022-09-05
Created: 2022-01-07
Private: No

Benefits

Scintilla currently only executes on the main user interface thread except to allow loading documents from background threads. Laying out text could be sped up by measuring text positions in parallel in some cases. Wide lines can cause user-perceived hangs and hitches due to the time required to measure the positions of text. Current mainstream processors contain multiple (2-16) cores and may also implement hyper-threading. On a 4 core 8 thread Intel i7-6700 Windows 10 system, the attached patch achieves a better than 4x speed up on long lines using Visual C++ 2022 and DirectWrite.

This was originally in response to [feature-requests:#1422] but is useful in other cases.

The patch has benefits but also has shortcomings, so a different approach may be better.

The core change is to separate order-dependent and order-independent parts of the layout loop in EditView::LayoutLine and perform the order-independent code in parallel. The parallelized portion calls Surface::MeasureWidths* either directly or indirectly through PositionCache to discover the relative positions of characters in each measurement segment. This portion accounts for almost all (around 99%) of the runtime in most cases with almost all of that in DirectWrite's IDWriteTextLayout::GetClusterMetrics. After all the segments are measured, the segment positions are made absolute and tabs are expanded in a sequential loop.

The C++17 for_each parallel algorithm is used to perform code in parallel using multiple threads. The C++ runtime library decides how many threads to use and coordinates them.

std::for_each(std::execution::par, segments.begin(), segments.end(),
    [this, &surfacePool, &vstyle, &model, &ll](const TextSegment &ts) {
    // ... measure the characters in ts
});

The same code can be run sequentially by replacing std::execution::par with std::execution::seq.

The MSVC implementation of parallel algorithms uses the Windows system thread pool to avoid excessive thread creation and contention.

With DirectWrite, an example of wide styled text (110 MB of SQL insert data with short segments and some repeated lexemes) on a single line was 4.4 times faster (9.85 versus 43.5 seconds). The example heavy line from [feature-requests:#1422] consisting of many non-ASCII characters with no spaces was 4.1 times faster (9.9 versus 41 seconds) on the test machine.

When the monospaced ASCII feature is strongly active (the text is almost all ASCII and only monospaced fonts are used), DirectWrite is rarely called and the std::for_each(std::execution::par) approach is slightly (7%) slower. This overhead could be (nearly) eliminated with better Surface management to avoid locking overhead.

GDI is also sped up by this code but it is not known if GDI text measurement is really thread-safe.

Problems

By default #include <execution> fails compilation with g++. g++ implements std::execution::par by calling Intel's Threading Building Blocks (TBB) library but g++ does not install TBB. Requiring TBB will make it more difficult to build Scintilla for something that is just an optimization. The patch uses the preprocessor to avoid TBB inclusion and convert the parallel code to sequential for g++. Further preprocessor use could enable users with TBB to optionally allow it.
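The preprocessor arrangement can be sketched roughly as follows; the macro name and the MSVC-only test are illustrative, not the patch's actual code:

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Only MSVC ships a <execution> that works without an extra dependency;
// g++ would pull in TBB, so fall back to a sequential loop there.
// LAYOUT_POLICY and the compiler test are illustrative only.
#if defined(_MSC_VER)
#include <execution>
#define LAYOUT_POLICY std::execution::par,
#else
#define LAYOUT_POLICY
#endif

template <typename It, typename Fn>
void ForEachSegment(It first, It last, Fn fn) {
    // Expands to a parallel for_each on MSVC, a plain sequential one elsewhere.
    std::for_each(LAYOUT_POLICY first, last, fn);
}
```

Further preprocessor tests could let builds that do have TBB opt back into the parallel path.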

Xcode has a null <execution> that implements neither std::execution::par nor std::execution::seq. So there would need to be a fallback for macOS.

GTK is, itself, not thread-safe. The Pango library used for text measurement is thread-safe, except on Win32.
https://gitlab.gnome.org/GNOME/pango/-/issues/256
It's possible that, on Unix, the Pango resources required for text measurement could be created from their GTK contexts by single-threaded initialization code and then used inside the parallel portion.

Haven't looked at Qt.

The application may have a better idea about whether to use threads (perhaps there are memory or power constraints or underlying libraries are not thread-safe) so there should be a runtime switch.

Since there are problems with <execution>, lower level threading could be used. SciTE has used C++ standard library std::thread and std::mutex successfully for 2 years on Win32, Linux, and macOS. A simple design with std::thread would divide the segments for each long line up into n portions for n new threads where n is a setting that can be chosen by the application. This will be more work and may be less efficient without the benefits of the Windows system thread pool or work stealing.
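That simple design can be sketched as below, with plain doubles standing in for segments and a caller-supplied function standing in for the real per-segment measurement (both are assumptions for illustration, not Scintilla code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Statically divide the segments into n contiguous portions, one per new
// thread. MeasureSegment stands in for the real measurement of one segment.
void MeasureAllParallel(std::vector<double> &widths, size_t n,
                        void (*MeasureSegment)(double &)) {
    n = std::max<size_t>(1, std::min(n, widths.size()));
    const size_t perThread = (widths.size() + n - 1) / n;  // ceil division
    std::vector<std::thread> pool;
    for (size_t t = 0; t < n; t++) {
        pool.emplace_back([=, &widths] {
            const size_t first = t * perThread;
            const size_t last = std::min(first + perThread, widths.size());
            for (size_t i = first; i < last; i++)
                MeasureSegment(widths[i]);
        });
    }
    for (std::thread &th : pool)
        th.join();
}
```

Because portions are assigned statically, one unlucky thread with a hard range can finish last with no remaining concurrency, which is one of the costs relative to a work-stealing pool.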

EditView::LayoutLine doesn't have access to the window, technology, or bidi setting, so this is bodged in the patch. This data needs to be correctly transmitted down to this code to ensure that the layout is correct.

Since Surface is stateful, each thread needs an independent instance. As it could be expensive to create a new measurement surface for each text segment, a thread-safe pool is implemented in the patch as SurfacePool but this is a minor issue (extra 0.25% time to create a new Surface for each segment) for DirectWrite.
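The pool idea can be sketched as a mutex-guarded free list; the names and interface here are illustrative, not the patch's actual SurfacePool:

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Minimal thread-safe pool: threads take a surface, use it, and return it,
// so new surfaces are created only when all pooled ones are in use.
// "Surface" here is a stand-in struct, not Scintilla's real class.
struct Surface { /* per-thread measurement state */ };

class SurfacePool {
    std::mutex mtx;
    std::vector<std::unique_ptr<Surface>> free;
public:
    std::unique_ptr<Surface> Take() {
        std::lock_guard<std::mutex> guard(mtx);
        if (free.empty())
            return std::make_unique<Surface>();
        std::unique_ptr<Surface> s = std::move(free.back());
        free.pop_back();
        return s;
    }
    void Return(std::unique_ptr<Surface> s) {
        std::lock_guard<std::mutex> guard(mtx);
        free.push_back(std::move(s));
    }
};
```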

1 Attachments

Related

Bugs: #927
Feature Requests: #1422
Feature Requests: #1432

Discussion

  • Zufu Liu

    Zufu Liu - 2022-01-08

    mutexCache can be changed to std::shared_mutex https://en.cppreference.com/w/cpp/thread/shared_mutex, which is Slim Reader/Writer (SRW) Lock on Windows.

     
    • Neil Hodgson

      Neil Hodgson - 2022-01-08

      std::shared_mutex should work well in this code but when tried, performance was a little worse on the test file. That may be because the SQL test didn't effectively hit the cache. There's a comment on Stack Overflow:

      Consider SRW locks if you are reading at least 4:1 vs writing

      Cases with a higher cache hit ratio should be common so should check a file with more repeated lexemes.

      std::shared_mutex didn't work with the XP build, probably because SRW locks were introduced in Vista.

      ...\positioncache.cxx(377): error C2039: 'shared_mutex': is not a member of 'std'
      
       
      • Neil Hodgson

        Neil Hodgson - 2022-01-11

        With parallel3.patch, shared_mutex shows 17% better performance once the file is lexed (the first number for each technique is unlexed):

        mutex

        10.1557 for 112974535 segments = 1235124
        2.82384 for 112974535 segments = 14152623

        shared_mutex

        10.2396 for 112974535 segments = 1235124
        2.41063 for 112974535 segments = 14152623

        macOS 10.12 (2016) is required for shared_mutex which is a jump from currently supported 10.9.

         
  • Zufu Liu

    Zufu Liu - 2022-01-09

    OK, for the VC STL, std::mutex is an SRW lock for Win7+ and a critical section for XP and Vista, see https://github.com/microsoft/STL/blob/main/stl/src/primitives.hpp#L111.

    Some ideas:
    1. Use durationWrapOneByte to estimate whether the current line can be laid out within a small time (e.g. several hundred milliseconds) without blocking the UI.
    2. (to make all compilers work) A native thread pool in the platform layer for PLAT_XXX, e.g. Win32 QueueUserWorkItem() or CreateThreadpoolWork() (used by MSVC parallel algorithms), macOS dispatch_async(), or a naive thread pool (for pthread) with std::thread::hardware_concurrency().
    3. A surface pool, or surface cache (based on the style integer value or font; call SetFont() once and reuse the surface for MeasureWidths()).
    4. A lock-free work list:

    std::vector<TextSegment> segmentList;
    std::atomic<int> segmentIndex; // increment only
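    Idea 4 might be filled out like this hedged sketch, where worker threads claim segments through an atomic counter and doubling a value stands in for measuring a segment:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker claims the next unprocessed segment with fetch_add, so work
// distributes itself across threads without a lock. Segments are stand-in
// doubles here rather than TextSegment.
void LayoutWithAtomicIndex(std::vector<double> &segments, size_t threads) {
    std::atomic<size_t> nextIndex{0};
    std::vector<std::thread> pool;
    for (size_t t = 0; t < threads; t++) {
        pool.emplace_back([&] {
            for (;;) {
                const size_t i = nextIndex.fetch_add(1, std::memory_order_relaxed);
                if (i >= segments.size())
                    break;
                segments[i] *= 2;  // stand-in for measuring segment i
            }
        });
    }
    for (std::thread &th : pool)
        th.join();
}
```

Each index is claimed exactly once, so the threads write to disjoint elements and need no further synchronization before the joins.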
    
     

    Last edit: Zufu Liu 2022-01-09
    • Neil Hodgson

      Neil Hodgson - 2022-01-09

      (2) C++ has std::async with std::launch::async but, like <execution> and std::shared_mutex, there could be portability issues. Each call returns a std::future that should be waited for, which may take time.
      https://en.cppreference.com/w/cpp/thread/async
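      A minimal illustration of that per-call future cost: every std::async call produces a std::future that must be retrieved individually (the doubling is a stand-in for real work, not Scintilla code):

```cpp
#include <future>
#include <vector>

// One std::async task per value; each returns a future that must be
// waited on, which is where the per-call overhead comes in when there
// are millions of tasks.
double SumOfDoubled(const std::vector<double> &values) {
    std::vector<std::future<double>> futures;
    for (double v : values)
        futures.push_back(std::async(std::launch::async, [v] { return v * 2; }));
    double total = 0;
    for (std::future<double> &f : futures)
        total += f.get();  // blocks until that task finishes
    return total;
}
```

Chunking many values into one task amortizes that overhead, which is what the later patch versions do.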

      (3) SurfaceD2D::SetFont() isn't prominent in profiles and much of its code is not needed for MeasureWidths, as it affects the render target which is not referenced later. The core DirectWrite object for measuring positions that takes the time is IDWriteTextLayout, which is created and used inside MeasureWidths and not remembered. It may be possible to make Surface implementations (particularly SurfaceD2D) less stateful (use a local for pTextFormat instead of a field) so MeasureWidths is reentrant and multiple threads don't interfere; then there is no need for multiple surfaces and a SurfacePool. As other platforms differ, there could be a 'reentrant MeasureWidths' feature support flag.

       
      • Neil Hodgson

        Neil Hodgson - 2022-01-10

        std::async is available on each platform (although only MSVC uses a thread pool) and performs OK with moderate numbers of calls, including the [feature-requests:#1422] case. However, the SQL example which was fast with std::execution::par was almost as slow as the original sequential implementation and used much memory. Creating all the (14 million) tasks took ~20 seconds and waiting for and closing their futures took about half as much. More real threads (19) were used than std::execution::par (8). Could be viable with some chunking of segments so that fewer tasks are created.

         

        Related

        Feature Requests: #1422

  • Zufu Liu

    Zufu Liu - 2022-01-10

    Patch to make the SurfaceD2D measurement functions stateless, including the following changes:
    1. Changed maxWidthMeasure and maxLenText in SurfaceGDI to static constexpr.
    2. Removed null tests for pIDWriteFactory and pD2DFactory; when either is null, the technology can't be D2D.
    3. Removed codePageText, pTextFormat, yAscent, yDescent, yInternalLeading from SurfaceD2D.
    4. Renamed SurfaceD2D::SetFont() to SurfaceD2D::SetFontQuality() and only call it once inside SurfaceD2D::DrawTextCommon(); other places don't need to touch pRenderTarget.

    Some changes seem not worthwhile:
    1. Reorder fields in SurfaceGDI and SurfaceD2D to reduce padding.
    2. Change Win32MapFontQuality() and DWriteMapFontQuality() to direct computation:

    constexpr BYTE Win32MapFontQuality(FontQuality extraFontFlag) noexcept {
        constexpr UINT mask = (DEFAULT_QUALITY << (4 * static_cast<int>(FontQuality::QualityDefault)))
            | (NONANTIALIASED_QUALITY << (4 * static_cast<int>(FontQuality::QualityNonAntialiased)))
            | (ANTIALIASED_QUALITY << (4 * static_cast<int>(FontQuality::QualityAntialiased)))
            | (CLEARTYPE_QUALITY << (4 * static_cast<int>(FontQuality::QualityLcdOptimized)))
            ;
        return static_cast<BYTE>((mask >> (4*static_cast<int>(extraFontFlag & FontQuality::QualityMask))) & 15);
    }
    
    constexpr D2D1_TEXT_ANTIALIAS_MODE DWriteMapFontQuality(FontQuality extraFontFlag) noexcept {
        constexpr UINT mask = (D2D1_TEXT_ANTIALIAS_MODE_DEFAULT << (2 * static_cast<int>(FontQuality::QualityDefault)))
            | (D2D1_TEXT_ANTIALIAS_MODE_ALIASED << (2 * static_cast<int>(FontQuality::QualityNonAntialiased)))
            | (D2D1_TEXT_ANTIALIAS_MODE_GRAYSCALE << (2 * static_cast<int>(FontQuality::QualityAntialiased)))
            | (D2D1_TEXT_ANTIALIAS_MODE_CLEARTYPE << (2 * static_cast<int>(FontQuality::QualityLcdOptimized)))
            ;
        return static_cast<D2D1_TEXT_ANTIALIAS_MODE>((mask >> (2 * static_cast<int>(extraFontFlag & FontQuality::QualityMask))) & 3);
    }
    

    For parallel performance with the surface pool, how about moving surface->MeasureWidths() out of PositionCache::MeasureWidths(), and splitting the code before and after that call into a getter (returning bool, a probe that avoids allocating/locking a surface for monospaced ASCII or text already in the cache) and a setter?

     
    • Zufu Liu

      Zufu Liu - 2022-01-10

      I found another issue: SurfaceD2D is not aware of ClearType font settings due to the cached rendering params; maybe something can be added to WM_SETTINGCHANGE handling.

       
      • Neil Hodgson

        Neil Hodgson - 2022-01-11

        All the fonts and surfaces should be recreated after WM_SETTINGCHANGE so rendering params shouldn't be stale.

         
        • Zufu Liu

          Zufu Liu - 2022-01-11

          defaultRenderingParams and customClearTypeRenderingParams created inside LoadD2DOnce() are stale after WM_SETTINGCHANGE or when the window is moved to another monitor. There is sample code for the latter: https://docs.microsoft.com/en-us/windows/win32/DirectWrite/how-to-add-support-for-multiple-monitors

           
          • Neil Hodgson

            Neil Hodgson - 2022-01-12

            So, defaultRenderingParams and customClearTypeRenderingParams have to be moved into SurfaceD2D and set from the monitor of the window in both SurfaceD2D::Init methods.

             
    • Neil Hodgson

      Neil Hodgson - 2022-01-13

      Committed with [9502f2] and [9be3ea].

       

      Related

      Commit: [9502f2]
      Commit: [9be3ea]

  • Zufu Liu

    Zufu Liu - 2022-01-10

    SetFontQuality can be renamed as SetTextRenderingParams.

    An idea for the slow SQL example: how about adding a field in ViewStyle to indicate that all fonts are monospaced, then scanning the whole line to find the first non-ASCII position and going back to the last punctuation, space, or control position; the text before this position becomes a simple segment and the remaining text goes into BreakFinder.

     

    Last edit: Zufu Liu 2022-01-10
    • Neil Hodgson

      Neil Hodgson - 2022-01-11

      add a field in ViewStyle to indicate that all fonts are monospaced

      I assume this is where all the styles use the same monospaced font.

      This will be faster but should be measured to see if it is worth the complexity. Monospaced is already very fast at 130 MB / second.

       
  • Neil Hodgson

    Neil Hodgson - 2022-01-11

    Updated version that works out how many threads to use, then runs std::async over that many threads, with each thread looping over multiple segments. If just 1 thread is needed, the main thread is used with launch::deferred. Since the threads are longer-lasting, they hold onto a surface and the SurfacePool is dropped. This is similar in performance to parallel for_each but is portable.

    Since this is statically assigning ranges of the line to threads, it could end up with the last slow thread (due to difficult range or bad scheduling luck) running with no concurrency. An alternative would be to distribute segments as needed but this would have more locking and there may also be memory locality and caching effects.

    Lines that are simple enough to not be worth the cost of thread creation could set threads to 1 but this should be informed by measurement of thread creation and may differ between operating systems. std::thread could be used instead of std::async, but that doesn't get the benefit of the thread pool on Win32 and may need to repeat the code for the single threaded case as it doesn't have a launch::deferred option.

    The core code looks like this:

    const size_t threads = std::min<size_t>(segments.size(), std::thread::hardware_concurrency());
    
    // TODO: pass in window,technology,mode to match what is used to make surface or use reentrant surface
    if (threads) {
        // If only 1 thread needed then use the main thread, else spin up multiple
        const std::launch policy = (threads == 1) ? std::launch::deferred : std::launch::async;
    
        // ceil of division since floor may lead to last thread doing more
        const size_t perThread = 1 + (segments.size() - 1) / threads;
    
        std::vector<std::future<void>> futures;
        for (size_t th = 0; th < threads; th++) {
    
            // Find relative positions of everything except for tabs
            std::future<void> fut = std::async(policy,
                [this, surface, &vstyle, &model, &ll, perThread, &segments](size_t thread) {
                    const size_t first = thread * perThread;
                    const size_t last = std::min(first + perThread, segments.size());
                    std::unique_ptr<Surface> surfaceSeparate;
                    Surface *surfaceHere = surface;
                    if (thread > 0) {
                        surfaceSeparate = Surface::Allocate(Technology::DirectWrite);
                        surfaceSeparate->Init(0);
                        surfaceSeparate->SetMode(SurfaceMode(model.pdoc->dbcsCodePage, false));
                        surfaceHere = surfaceSeparate.get();
                    }
                    for (size_t i = first; i < last; i++) {
                        const TextSegment &ts = segments[i];
                        // ...
                    }
                }, th);
            futures.push_back(std::move(fut));
        }
        for (std::future<void> &f : futures) {
            f.wait();
        }
    }
    
     
  • Zufu Liu

    Zufu Liu - 2022-01-11

    With std::mutex in PositionCache::MeasureWidths() and something similar to the following code, using Notepad2's current SurfaceD2D, the parallel time for the 7,266,428-byte long line dropped from 9 seconds to 2.2 seconds (1.8 seconds with memory_order::seq_cst, but that produces overlapped positions which I haven't figured out), while the sequential time is 28 seconds with D2D and 14 seconds with GDI.

    My current SurfaceD2D changes:
    1. https://sourceforge.net/p/scintilla/feature-requests/1422/#8c46 increase lengthStartSubdivision and lengthEachSubdivision.
    2. https://sourceforge.net/p/scintilla/feature-requests/1427/#f884 SetFont() changes; after this change D2D1_FACTORY_TYPE_MULTI_THREADED is not needed since the measurement functions never use pD2DFactory or any object it creates (pRenderTarget). Only pIDWriteFactory is used, and DWRITE_FACTORY_TYPE_SHARED is already thread-safe.

    std::atomic<uint32_t> nextIndex = 0;
    void Layout(const TextSegment &ts) {
        const Style &style = vstyle.styles[ll->styles[ts.start]];
        if (style.visible) {
            //
        }
    }
    
    std::vector<std::future<void>> futures;
    constexpr size_t threads = 4*4;
    for (size_t th = 0; th < threads; th++) {
        std::future<void> fut = std::async(std::launch::async, [this] {
            while (true) {
                const uint32_t index = nextIndex.fetch_add(std::memory_order_acq_rel);
                if (index < segments.size()) {
                    Layout(segments[index]);
                } else {
                    break;
                }
            }
        });
        futures.push_back(std::move(fut));
    }
    for (std::future<void> &f : futures) {
        f.wait();
    }
    

    Using the atomic index may have reduced time because each thread works on nearby segments and the threads finish in similar total times, compared with the static task-division approach or std::execution::par (whose total time depends on the hardest/most complex part of the text).
    Partial line layout could also be implemented with the atomic index (by replacing while (true)).

     
  • Neil Hodgson

    Neil Hodgson - 2022-01-12

    Memory order is the second argument to fetch_add so

    const uint32_t index = nextIndex.fetch_add(std::memory_order_acq_rel);
    

    should be

    const uint32_t index = nextIndex.fetch_add(1, std::memory_order_acq_rel);
    

    std::memory_order_acq_rel is 4 on MSVC 2019 with c++17 (and a plain enum so poor type checking) so the code was stepping through the segments by 4 (0, 4, 8, ...) and not performing most work.

    enum memory_order {
        memory_order_relaxed,
        memory_order_consume,
        memory_order_acquire,
        memory_order_release,
        memory_order_acq_rel,
        memory_order_seq_cst
    };
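    The conversion can be checked directly: under C++17 std::memory_order is an unscoped enum whose enumerators take their position values, so memory_order_acq_rel becomes 4 when used as the operand (this sketch just demonstrates the fix, it is not Scintilla code):

```cpp
#include <atomic>
#include <cstdint>

// memory_order_acq_rel has the value 4, so passing it as the first
// argument stepped the counter by 4 instead of 1.
static_assert(static_cast<int>(std::memory_order_acq_rel) == 4,
              "acq_rel is the fifth enumerator");

uint32_t NextIndex(std::atomic<uint32_t> &n) {
    // Memory order is the second argument; the operand to add is 1.
    return n.fetch_add(1, std::memory_order_acq_rel);
}
```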
    

    Performance is similar to parallel3.patch after changing the increment. That 4x speed up would have been great.

    Threads  Time
    2        19.66
    4        10.88
    8        9.98
    16       10.64

    Using 16 threads could be faster in some circumstances but I see a slow-down over 8 threads on my 4 core 8 hardware thread machine.

     

    Last edit: Neil Hodgson 2022-01-12
  • Zufu Liu

    Zufu Liu - 2022-01-16

    Thanks for the correction, but the time on your i7 is slower than expected. The 9 seconds for me is on an i5 with 4 cores / 4 logical processors; I expected 4-5 seconds on your 8-logical-processor i7 using Notepad2's latest code (in src\Notepad2.c, uncomment the AttachConsole() block, then in EditView::LayoutLine() uncomment the measurement code LayoutWorker block; the end of Editor::WrapLines() has another timing log).

    Following are all my local changes (not much slower than Notepad2) to Scintilla's src and win32 sources (mostly from https://github.com/zufuliu/notepad2/commit/419cbafa60f7097f7ea8fea9c039db4b4801da93 and https://github.com/zufuliu/notepad2/commit/2ab9eda02d5c31eef6e09b0eef3c16da88279fde):
    1. The independent part moved into a LayoutWorker class.
    2. Increased lengthStartSubdivision and lengthEachSubdivision (the changes in PlatWin.cxx).
    3. Added durationWrapOneThread to measure single-thread performance, and used it to determine whether to use parallel layout for the current line. This fixes the slow ASCII long-line case.
    4. Empty lines are special-cased to not lay out or wrap.
    5. Other small changes: changed ElapsedPeriod::Duration() to const noexcept; moved pdoc to the first member of EditModel, as it's frequently accessed.

    Partial line layout can be implemented by changing EditModel::UseParallelLayout() to return the length that can be processed in 10 milliseconds, and checking the time after processing that many bytes.

     
  • Neil Hodgson

    Neil Hodgson - 2022-01-17

    the time on your i7 is slow than expected, The 9 second for me is on i5 with 4 core 4 logical processors, I expected it to be 4 ~ 5 seconds on your 8 logical processors i7

    Unfortunately, hyper-threading is not as good as additional cores. Intel claimed a 15-30% improvement and the 9% improvement above isn't unusual. Hyper-threading mostly exploits memory stalls, but these tests may be (mostly) cache-resident with few stalls to provide opportunities for the other thread.
    https://en.wikipedia.org/wiki/Hyper-threading#Performance_claims

    1. increased lengthStartSubdivision and lengthEachSubdivision (

    There may be negative effects if this decreases the number of segments on medium-length lines and reduces the potential for parallelization. One of the main pieces needed for parallelization is measurement on each platform to determine when to go parallel and how many threads to use. It may be something as simple as threads = segments / 8 or linelength / 4000, capped by hardware concurrency, or it may be more complex if the different platforms have significantly different costs per thread, which I expect as only MSVC has a thread pool.

     
  • Neil Hodgson

    Neil Hodgson - 2022-01-17

    Added reentrant MeasureWidths and MeasureWidthsUTF8 for Cocoa with [2a972e].

    The macOS laptop system tested has a 2.0 GHz i7-4750HQ with 4 cores and hyperthreading. It throttles each core more than a desktop system as more threads are active. The most important CTLine* and CTRun* calls are generally slower (scaling by processor speed) than DirectWrite for simpler ASCII text but faster for the 'unusual' non-ASCII text of the log example.

    There's also some text not displayed where there are combining characters (U+2DEA COMBINING CYRILLIC LETTER O [ⷪ]) combined with characters not present in a font (U+2BEA STAR WITH LEFT HALF BLACK [⯪]).

    For the [feature-requests:#1422] non-ASCII log example:

    Threads  Time
    1        31.7
    2        14.8
    4        9.8
    8        8.0

    The SQL file used for testing is at https://www.scintilla.org/enwik.zip and is based on a file linked from [feature-requests:#1373], modified to have a single extremely long line so it takes a long time and produces good profiles.

    Threads  before lexed  lexed + mutex  lexed + shared_mutex
    1        105.0         7.7            8.0
    2        54.0          7.0            7.4
    4        29.3          6.9            7.4
    8        29.0          8.8            14.9

    On this setup, threading is important for the unlexed state where there is little locking as long lexemes don't go into cache. Once the text is lexed then the number of threads is not as beneficial, likely because of locking overhead.

    An as yet unexplored issue is how well performance will scale to 8- and 16-core processors. Using shared_mutex should benefit more cores, as there will be more threads contending for access to the position cache. However, shared_mutex is never better in these tests and is dramatically worse at 8 threads. This could be a problem with the macOS implementation of shared_mutex not being as good as Win32's SRW locks.

    https://turingcompl33t.github.io/RWLock-Benchmark/

     

    Related

    Feature Requests: #1373
    Feature Requests: #1422
    Commit: [2a972e]


    Last edit: Neil Hodgson 2022-01-17
  • Neil Hodgson

    Neil Hodgson - 2022-01-17

    Profiling PositionCache::MeasureWidths on the Win32 8-thread version with enwik.sql shows that locking is an important contributor to time: 56.7% with std::mutex and 39.7% with std::shared_mutex.

    Using std::mutex
    Function Name   Total CPU [unit, %] Self CPU [unit, %]
    | - Scintilla::Internal::PositionCache::MeasureWidths   14650 (97.01%)  568 (3.76%)
            mtx_do_lock 7854 (52.01%)   7854 (52.01%)
    | - Scintilla::Internal::SurfaceD2D::MeasureWidths  5272 (34.91%)   36 (0.24%)
            dwrite.dll!0x00007ffa4ed8089b   4013 (26.57%)   4013 (26.57%)
            _Mtx_unlock 707 (4.68%) 707 (4.68%)
            DWriteFactory::CreateTextLayout 458 (3.03%) 458 (3.03%)
            dwrite.dll!0x00007ffa4ed808ad   371 (2.46%) 371 (2.46%)
    | - Scintilla::Internal::ReleaseUnknown<ID2D1BitmapBrush>   346 (2.29%) 1 (0.01%)
            dwrite.dll!0x00007ffa4ed18d2e   317 (2.10%) 317 (2.10%)
    Total mutex = 52.01 + 4.68 = 56.7
    
    Using std::shared_mutex
    Function Name   Total CPU [unit, %] Self CPU [unit, %]
    | - Scintilla::Internal::PositionCache::MeasureWidths   10169 (96.19%)  416 (3.93%)
    | - Scintilla::Internal::SurfaceD2D::MeasureWidths  5299 (50.12%)   37 (0.35%)
            dwrite.dll!0x00007ffa4ed8089b   4034 (38.16%)   4034 (38.16%)
            RtlAcquireSRWLockShared 3419 (32.34%)   3419 (32.34%)
            RtlReleaseSRWLockShared 616 (5.83%) 616 (5.83%)
            DWriteFactory::CreateTextLayout 466 (4.41%) 466 (4.41%)
    | - Scintilla::Internal::ReleaseUnknown<ID2D1BitmapBrush>   364 (3.44%) 2 (0.02%)
            dwrite.dll!0x00007ffa4ed808ad   342 (3.23%) 342 (3.23%)
            dwrite.dll!0x00007ffa4ed18d2e   327 (3.09%) 327 (3.09%)
            RtlAcquireSRWLockExclusive  168 (1.59%) 168 (1.59%)
    Read mutex = 32.3 + 5.8 = 38.1
    Write mutex = 1.6 + 0 (no entries) = 1.6
    Total mutex = 38.1 + 1.6 = 39.7
    
     
  • Zufu Liu

    Zufu Liu - 2022-01-17

    The time for enwiki.sql seems doubtful: using Notepad2 with 4 threads on the i5, the measure-width block (with monospaced font and styling) takes about 1.2 seconds, while the wrap-line block takes about 8 seconds.

     
    • Neil Hodgson

      Neil Hodgson - 2022-01-17

      This is with a proportional font. The purpose isn't to find the minimum time for that file but to work out any problems with the cache and how to scale parallelization. Similar files may contain non-ASCII text. For me, this is the main use case for threading: wide textual data files that do not benefit from the ASCII monospace feature. The log example is really a binary file (or combined text and binary file) so will be encountered much less often.

      Profiles with 2 threads and 1 thread but with locking appear to show that the locks are slower when they are contended.

      1 thread still locking each retrieval
      RtlAcquireSRWLockShared 56 (1.66%)
      RtlReleaseSRWLockShared 54 (1.60%)
      RtlAcquireSRWLockExclusive  6 (0.18%)
      
      2 threads
      RtlAcquireSRWLockShared 384 (9.23%)
      RtlReleaseSRWLockShared 154 (3.70%)
      RtlAcquireSRWLockExclusive  10 (0.24%)
      
      4 threads
      RtlAcquireSRWLockShared 1628 (25.74%)
      RtlReleaseSRWLockShared 335 (5.30%)
      RtlAcquireSRWLockExclusive  40 (0.63%)
      

      If the lock time is due to contention, then reducing contention should improve speed. A technique could be to divide the cache into sections, each section having its own lock.
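      That sectioning idea might look like the following sketch; the section count, hash choice, and cache layout are illustrative, not Scintilla's actual PositionCache:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Divide the cache into sections, each with its own lock, so threads
// looking up different keys rarely contend on the same mutex.
class SectionedCache {
    static constexpr size_t sections = 16;
    struct Section {
        std::mutex mtx;
        std::map<std::string, double> widths;
    };
    std::array<Section, sections> table;
    Section &SectionFor(const std::string &key) {
        return table[std::hash<std::string>{}(key) % sections];
    }
public:
    void Insert(const std::string &key, double width) {
        Section &s = SectionFor(key);
        std::lock_guard<std::mutex> guard(s.mtx);
        s.widths[key] = width;
    }
    bool Find(const std::string &key, double &width) {
        Section &s = SectionFor(key);
        std::lock_guard<std::mutex> guard(s.mtx);
        const auto it = s.widths.find(key);
        if (it == s.widths.end())
            return false;
        width = it->second;
        return true;
    }
};
```

With uniformly hashed keys, two threads only contend when their keys land in the same section, so contention drops roughly in proportion to the section count.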

       
      • Zufu Liu

        Zufu Liu - 2022-01-19

        I measured Notepad2 (cache size 2048; strings shorter than 64 are cached) for enwiki.sql with a proportional font (toggled via the menu Scheme -> Use Default Code Style); the lock time (const CacheReadLock readLock; and const CacheWriteLock writeLock;, both SRWLockExclusive) for 4 threads is only about 4%, and the total LayoutWorker block is about 6.5 seconds.
        https://github.com/zufuliu/notepad2/blob/main/scintilla/src/PositionCache.cxx#L1140

         
        • Neil Hodgson

          Neil Hodgson - 2022-01-20

          I see around the same 33% locking cost with 4 threads and shared_mutex either with exclusive or shared locking over the read portion:

          4 threads shared_mutex lock shared
          
          (first numbers are times in milliseconds for whole LayoutLine
          and just the threaded portion) 
          LayoutLine:   2545.23     1731.47  for   112974535   segments = 14152623 threads=4
          
          RtlAcquireSRWLockShared 1886 (27.89%)
          RtlReleaseSRWLockShared 408 (6.03%)
          RtlAcquireSRWLockExclusive  26 (0.38%)
          
          4 threads shared_mutex lock exclusive
          LayoutLine:   2534.85     1733.53  for   112974535   segments = 14152623 threads=4
          
          RtlAcquireSRWLockExclusive  2266 (32.31%)
          RtlpWakeSRWLock 44 (0.63%)
          

          Are you isolating just the multi-threaded (intensive) section of the profile after the text has been lexed, or is other code like loading and lexing included? 6.5 seconds sounds like the time to lex and layout. std::shared_mutex could be adding costs over SRW but it appears thin to me. It's possible I am seeing a profiling artifact if the profiler behaves poorly with multiple threads.

           
