Menu

#166 New OWLNext example - FolderSize

8
pending
1
2025-04-09
2020-07-12
No

I am making a new OWLNext example - a separate application to calculate folder sizes.

Background

No matter how large hard disk drive you have, over time it fills up and you need to clean it. To do that effectively, you need to easily know how much disk space each folder and subfolder occupies. There are probably many applications doing that — in the past I wrote one such in C#/WPF for my own use.

Now I am trying to make the same using OWLNext and Modern C++.

Goals

Since calculating the size of all files and subfolders can take quite a long time, the processing should be done in the background and not block the UI. Instead, it should show progress and allow cancellation.

Related

Discussion: OWLNext 7.0.2 and 6.44.12 updates
Wiki: OWLNext_Roadmap_and_Prereleases

Discussion

1 2 3 4 > >> (Page 1 of 4)
  • Ognyan Chernokozhev

    Initial example added in [r5222].

     

    Related

    Commit: [r5222]


    Last edit: Ognyan Chernokozhev 2020-07-12
    • Ognyan Chernokozhev

      Some notes and questions on the implementation so far:

      • Initially I tried to send a message from the background thread to the UI htread each time a folder is added or processed, and update the tree immediately. But that lead to too many updates and the UI becoming unresponsive. Now I tried to store the processed folder data in a structure, and approximately each second have the UI thread read the structure (protected by a mutex to avoid race conditions) and update the tree. Any better ideas? It is still a bit slow, a better implementation may be not to update the whole tree, but to make use of TVN_ITEMEXPANDING to add the folders only when the user opens them.
      • Using a recursive stucture to store the processed folders:
        struct FolderInfo
        {
        // data
        ...
        // subfolders
          vector<unique_ptr<FolderInfo>> subfolders;
      

      Is this the best way to handle it?

      Future enhancements:

      • Show icons for folders and files in the tree and list view.
      • Toolbar and context menu commands: Open folder in Explorer, Stop current processing, Refresh, Copy full path, etc.
      • Recent Folders list in the File menu.
      • Display tree view info tips, maybe with information how much time it took to process the folder, number of files/subfolders, etc.
      • Better status bar under the tree view.
       

      Last edit: Ognyan Chernokozhev 2020-07-12
  • Sebastian Ledesma

    Thanks Jogy!
    I was looking for the use of the "SysTreeView32" control so I will review the code to learn.

    Sebas

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-14

    Hi Jogy, really cool example project!

    I use TreeSize for this purpose, by the way.

    Here is some feedback on your notes and implementation so far:

    • The recursive implementation of depth-first traversal may run out of stack space.

    • To improve the user experience, use breadth-first traversal rather than depth-first. With breadth-first traversal, you will update the top level of the tree first, which is what the user initially sees. Then, as the user navigates the tree deeper and deeper, the background traversal will process the tree likewise. You may even consider prioritising the branch of the tree that the user is currently navigating (i.e. breadth-first within the selected folder).

    • Don't use shared data, mutexes and locks. Instead, let the background process create the data and pass it through messages. If a message has more data than can be passed in the message parameters (WPARAM and LPARAM), then allocate the data dynamically and pass ownership via pointer (see OWLMaker build output handling, in particular TBuildOutputWindow::EvLaunchAppMessage).

    • The use of mutexes and locks causes continuous start-and-stop throttling of the UI and worker threads, as well as relatively slow data synchronisation between the threads (due to memory and cache coherency protocols). It may not have any significant performance penalty in this program, compared to the other slow operations involved, but it is good programming style to avoid lock-based solutions if you can.

    • If you need to use mutexes, don't lock and unlock them manually. Use an RAII sentry, such as std::lock_guard, to ensure the mutex is unlocked in case of exceptions.

    • Do not traverse the whole tree for every message from the worker thread. Instead, build and update the tree incrementally with the new data passed, using minimal traversal. If Tree View updates are still too slow, then store the passed data in a cached data structure, and use a deferred scheme as you suggest, by only updating the Tree View on demand as the user navigates it.

    • You currently use a std::map to associate path and HTREEITEM. Every iteration in the tree traversal calls std::map::find, which has O(log(n)) complexity, hence every new folder addition to the tree has O(n * log(n)) complexity, and the whole tree completion has O(n * n * log(n)) complexity. You can optimise a little by using std::unordered_map instead (O(1) complexity), but you can go much further by eliminating full tree traversal for every insertion and HTREEITEM lookup for every iteration, e.g. by caching the last insertion point, and by adding structural messages (WM_BEGINFOLDER, WM_ENDFOLDER) so that the construction of the tree structure becomes implicit in the message handling. You may also look into optimising the manipulation of the Tree View; in particular, adding nodes to a tree should ideally be done in reverse order (i.e. inserting at or near the root), since appending at the bottom requires tree traversal (see OldNewThing).

    • To improve UI reponsiveness further, you could look into prioritising user input and UI updates by deferring the handling of the messages from the worker thread, e.g. by putting them in a queue that you handle in batches in a WM_TIMER event handler or in TWindow::IdleAction.

    • Regarding the ugly code for formatting the date and time, I don't think C++17 has anything better, but C++20 has got better facilities (std::chrono::local_time with support for streaming).

     
    • Ognyan Chernokozhev

      Hi, Vidar,
      thanks for the suggestions, I implemented some changes based on them in [r5223].

      To improve the user experience, use breadth-first traversal rather than depth-first. With breadth-first traversal, you will update the top level of the tree first, which is what the user initially sees. Then, as the user navigates the tree deeper and deeper, the background traversal will process the tree likewise. You may even consider prioritising the branch of the tree that the user is currently navigating (i.e. breadth-first within the selected folder).

      I changed the traversal to breadth-first, so that the immediate children of the root folder are displayed. Note that they will ne in Pending state, and each will stay that way untill all it's subfolders are processed - since I display the total sum of files in all subfolders. I don't think logic to try to prioritize the calculation fir a selected folder is neccessary, the process is quite fast (at least on my machine) and will introduce a lot of complexiy in the code.

      Don't use shared data, mutexes and locks. Instead, let the background process create the data and pass it through messages. If a message has more data than can be passed in the message parameters (WPARAM and LPARAM), then allocate the data dynamically and pass ownership via pointer (see OWLMaker build output handling, in particular TBuildOutputWindow::EvLaunchAppMessage).

      I now changed the code so that the background thread sends messages when a folder is added or finished, and the UI htread adds them to a queue, then processes the queue, and during queue processing calls Application::PumpWaitingMessages as to not makde the UI unresponsive.

      You currently use a std::map to associate path and HTREEITEM. Every iteration in the tree traversal calls std::map::find, which has O(log(n)) complexity, hence every new folder addition to the tree has O(n * log(n)) complexity, and the whole tree completion has O(n * n * log(n)) complexity. You can optimise a little by using std::unordered_map instead (O(1) complexity), but you can go much further by eliminating full tree traversal for every insertion and HTREEITEM lookup for every iteration, e.g. by caching the last insertion point, and by adding structural messages (WM_BEGINFOLDER, WM_ENDFOLDER) so that the construction of the tree structure becomes implicit in the message handling. You may also look into optimising the manipulation of the Tree View; in particular, adding nodes to a tree should ideally be done in reverse order (i.e. inserting at or near the root), since appending at the bottom requires tree traversal (see OldNewThing).

      I still need the mapping, since when a folder is finished (all it's children are traversed), I need to find it and update it's text with the calculated size. When initially adding the folders from the queue, it may be possible to have a stack that references the last added tree node, instead of searching the while map. Still, even with the map, it does not seem to be really slowing the process.
      As for inserting the folders in the tree in reverse order, that is an interesting idea. It may have effect only when processing folders that have a really large number of subfolders, and maybe the most effective way to implement it would be if the directory_iterator can be made to iterae a folder in reverse order. Or to first retrieve all subfolders, then sort them in reverse order and then start adding them ...

      To improve UI reponsiveness further, you could look into prioritising user input and UI updates by deferring the handling of the messages from the worker thread, e.g. by putting them in a queue that you handle in batches in a WM_TIMER event handler or in TWindow::IdleAction.
      I can try this too, instead of calling PumpWaitingMessages whe processing the queue, to process only small parts of the queue in IdleAction, untill it is finished.

       

      Related

      Commit: [r5223]

  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-16

    Hi Jogy,

    I didn't mean that you can simply remove the synchronisation. You can only do so if you rewrite the code to not use shared data. In your current code, you still very much use shared data. Your threads share the FolderInfo data structure as well as the "stopped" flag, and you hence need some synchronisation mechanism (e.g. mutex, lock, condition variable, signal, atomic, memory fence) to ensure they see a consistent synchronised version of memory. Otherwise, the threads may see different versions of the data.

    Ideally, you want to minimise shared data. In this case, the worker thread does not need any access to the FolderInfo data structure. Building up that structure can be fully encapsulated within the the main thread. The worker thread just needs to pass enough information in the messages (i.e. folder path and size). To pass the path you can use dynamic memory, as I alluded to in my previous post, which is what OWLMaker does. The important difference here, compared to shared memory, is that this dynamic memory is a parameter package, and the worker thread never accesses it again after passing it.

    Aside: Thinking about this carefully, there might be a problem with passing arguments between threads via dynamic memory: the thread passing the argument and the thread receiving it may not have a consistent view of memory (cache effect). I have assumed that there is some memory synchronisation in the message passing that ensures that the receiving thread sees the updated version of the dynamic memory, but I am not sure that is the case. This potential problem applies to OWLMaker build output as well. It seems to work fine, but I am not sure it is guaranteed. It would make sense if PostMessage has some memory synchronisation though, since it is common for Windows messages to pass pointers to structures. I need to research this further.

    I have experimented with making this change to your code (see attached patch). SetRoot and PostNotification (the worker thread) now look as follows:

    void SetRoot(const path& folder)
    {
      RootFolder.reset();
      Stopped = false;
    
      auto process = [this](path folder, HWND w) -> uintmax_t
      {
        PostNotification(w, WM_FOLDERADDED, folder, 0);
        auto folderSize = ProcessFolder(folder, w);
        PostNotification(w, WM_FOLDERSPROCESSED, folder, folderSize);
        return folderSize;
      };
    
      FolderSize = async(launch::async, process, folder, GetHandle());
      TreeLastUpdateTime = {};
      StaticLastUpdateTime = {};
    }
    
    auto ProcessFolder(path folder, HWND w) -> uintmax_t
    {
      auto folderSize = uintmax_t{0};
      auto subfolders = vector<path>{};
      try
      {
        for (const auto& entry : directory_iterator(folder))
        {
          if (Stopped) return 0;
          if (entry.is_directory())
          {
            const auto p = entry.path();
            subfolders.push_back(p);
            PostNotification(w, WM_FOLDERADDED, p, 0);
          }
          else
            folderSize += entry.file_size();
        }
      }
      catch ([[maybe_unused]] const std::exception& x)
      {
        TRACE("Error processing folder \"" << folder << "\": " << x.what());
        return 0;
      }
    
      // Process subfolders.
      //
      for (const auto& f : subfolders)
      {
        if (Stopped) return 0;
        folderSize += ProcessFolder(f, w);
      }
    
      PostNotification(w, WM_FOLDERSIZECALCULATED, folder, folderSize);
      return folderSize;
    }
    

    However, the worker thread is now so fast that it floods the message queue until PostMessage fails.

    For example, my OWLNext project folder currently is 20 GiB with 67470 files, as I have working copies for all the release branches, and I have built the documentation in each one (Doxygen generates a lot of subfolders, 16 x 256). Testing this folder with my modified code crashes, due to PostMessage failure (ERROR_NOT_ENOUGH_QUOTA), as the worker thread posts exceed the limit of the message queue:

    "There is a limit of 10,000 posted messages per message queue."
    (PostMessage documentation)

    During this message flood, the UI is non-responsive. It seems UI messages can not progress through the clogged message queue in a timely manner. In the original code, the error code from PostMessage was not checked, so the message was lost, and hence the program got into an invalid state (WM_FOLDERSIZECALCULATED with no WM_FOLDERADDED).

    With more robust message posting code, which on failure goes to sleep for a few milliseconds before retrying PostMessage, the program no longer crashes, but the UI remains unresponsive due to the clogged message queue.

    Slowing down the worker thread, by going to sleep for a millisecond after each notification message, unclogs the message queue and makes the UI responsive again.

    To deal with this robustly, I encapsulated the message passing in PostNotification. It simply posts the given notification message, passing the path string (dynamically allocated, with ownership transfer to the receiver) and the folder size. If PostMessage fails (ERROR_NOT_ENOUGH_QUOTA), it pauses and retries until it succeeds (or some different error occurs, in which case it aborts).

    // 
    // Posts the given notification message to the given window.
    //
    // Passes the size of the non-folder items and the full path of the folder.
    // Note that the path string is passed via pointer to a heap-allocated copy. Ownership of this
    // copy is implicitly transferred to the receiver, i.e. the message must be handled, and the
    // receiver must deallocate the string.
    //
    static void PostNotification(HWND receiver, uint notification, const path& folder, uintmax_t folderSize)
    {
      [[maybe_unused]] const auto notificationStr =
        (notification == WM_FOLDERADDED) ? _T("WM_FOLDERADDED") :
        (notification == WM_FOLDERSIZECALCULATED) ? _T("WM_FOLDERSIZECALCULATED") :
        (notification == WM_FOLDERSPROCESSED) ? _T("WM_FOLDERSPROCESSED") :
        nullptr;
    
      auto s = unique_ptr<const tchar[]>(_tcsdup(to_tstring(folder).c_str()));
      do
      {
        const auto r = ::PostMessage(receiver, notification, folderSize, reinterpret_cast<TParam2>(s.get()));
        if (r)
        {
          s.release(); // Receiver has ownership.
          TRACE(_T("[") << (this_thread::get_id()) << _T("] Posted ") << notificationStr
            << _T(" (") << folderSize << _T(", \"") << to_tstring(folder) << _T("\")"));
        }
        else
        {
          const auto error = GetLastError();
          if (error != ERROR_NOT_ENOUGH_QUOTA) throw runtime_error{"PostNotification: PostMessage failed, GetLastError: " + to_tstring(error)};
          TRACE(_T("[") << (this_thread::get_id()) << _T("] PostNotification: PostMessage failed (") << notificationStr
            << _T(", ") << folderSize << _T(", \"") << to_tstring(folder) << _T("\"): ")
            << _T("ERROR_NOT_ENOUGH_QUOTA; pausing and retrying..."));
          Sleep(50); // ms
        }
      }
      while (s);
    
    #if defined(_DEBUG)
    
      // Give the main thread some slack to process the message queue.
      // If we go full tilt, we will clog up the message queue and make the UI unresponsive.
      //
      Sleep(1); // ms
    
    #endif
    }
    

    By the way, PumpWaitingMessages is no good for implementing UI responsiveness in this case. The processing time within the event handler is not the issue; the message flooding is the problem. Actually, when testing with a large folder, the execution gets stuck in the very first call to PumpWaitingMessages, due to the worker thread filling the message queue faster than it is processed. Since UpdateFolder is guarded against reentrancy (with the flag "processingQueue"), all further calls to UpdateFolder (as part of pumping waiting messages) are ignored, thus leaving the UI frozen. In my patch, I have removed PumpWaitingMessages and the reentrancy flag. UpdateFolder has been simplified, incorporating the functionality in AddFolder, which has been removed.

    The cancellation flag, member "Stopped", is shared data, so requires synchronisation. By making it atomic, i.e std::atomic<bool>, C++ guarantees that changes to the flag will be visible to other threads, i.e. synchronised. I've also added code in CleanupWindow that waits for the thread to finish and join after the stop token has been set.

    Regarding further improvement, it is interesting to consider the design and options for minimising the message traffic. The worker thread could build larger parameter packages between messages. It could even adjust dynamically, making sure it does not flood the message queue, e.g. by throttling message passing based on the message queue load, or on frequency. There is no need to update the UI thread thousands of times per second.

    Alternatively, a shared data approach with no message traffic can be envisaged, in which the UI takes a lock at some interval, at which point it inspects the work and progress of the worker thread, updating the UI accordingly. The shared data approach may have an issue with lock overhead and contention, but that may not be significant with just two threads. After all, the UI thread will be requesting a lock relatively infrequently.

    The other interesting aspect to consider is the amount of data produced by a worker thread, to the extent that presenting the data becomes the bottleneck, i.e. when the worker thread produces data faster than you can present it (this is to some degree an issue in OWLMaker as well, if you run heavy parallel builds with the Output pane open). In this case, a deferred scheme makes sense, in which the UI thread defers the presentation of the data until it is actually requested by the user. For example, you mentioned using TVN_ITEMEXPANDING to display data on demand, rather than populating the whole tree beforehand.

    PS. You display file size units with SI prefixes correctly (KB = 1000, MB = 1000000, etc.), while Windows uses non-standard units for these prefixes (KB = 1024, MB = 1048576, etc.). Consider using the ISO/IEC standard for binary prefixes (KiB = 1024, MiB = 1048576, etc.) for clarity (Wikipedia).

     

    Last edit: Vidar Hasfjord 2020-07-16
    • Ognyan Chernokozhev

      Alternatively, a shared data approach with no message traffic can be envisaged, in which the UI takes a lock at some interval, at which point it inspects the work and progress of the worker thread, updating the UI accordingly. The shared data approach may have an issue with lock overhead and contention, but that may not be significant with just two threads. After all, the UI thread will be requesting a lock relatively infrequently.

      I like this idea. There is no reason to post a Windows message each time a folder is added or finished processing.

      The background thread can populate a special queue structure, with the added and finished folders, and the UI thread on a timer or IdleAction reads that structure and updates the UI.
      It can also not be a standard std::queue object with full locking around each access, but some kind of custom structure - the background thread only adds to it's end, and does not modify already present data, and the UI only reads from the beginning. It can keep an index where it has reached, and do a conversion of the queue to a tree structure, that is then used to populate the tree (maybe with defered adding of child nodes).
      So, the only point of contention between the two threads is if the UI has reached the end of the queue, and at the same time the background thread adds data there.
      Would it work if simply the queue count property is somethng like std::atomic<int> ?</int>

      PS. You display file size units with SI prefixes correctly (KB = 1000, MB = 1000000, etc.), while Windows uses non-standard units for these prefixes (KB = 1024, MB = 1048576, etc.). Consider using the ISO/IEC standard for binary prefixes (KiB = 1024, MiB = 1048576, etc.) for clarity (Wikipedia).

      Thanks, I will correct it.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-17

    I like this idea. There is no reason to post a Windows message each time a folder is added or finished processing.

    Agree. Thinking about it, this is a typical producer-consumer problem, which is elegantly solved using an intermediate queue. However, the windows message queue is not suited for this. With the producer and consumer running concurrently, the whole point is that the producer runs at full speed filling the queue, while the consumer runs full speed emptying it. If the producer is faster, the queue will fill. The windows message queue isn't suited for this, since it needs to process UI events in a timely manner. A dedicated queue will eliminate these problems.

    Another thought is to use a coroutine solution, in which the coroutine traverses the folder producing each item as a result. Native coroutines are coming in C++20. This is perhaps a simpler and better solution, if there is little to gain by thread parallelism. You would just call the coroutine in IdleAction to produce as many folder items as you want to process at a time.

    Would it work if simply the queue count property is somethng like std::atomic<int>?

    See Wikipedia for more on the producer-consumer problem, implementation correctness and optimisation.

    The background thread can populate a special queue structure, with the added and finished folders, and the UI thread on a timer or IdleAction reads that structure and updates the UI.

    That's my thinking as well. IdleAction sounds the better way to go. You don't want to artificially limit the update speed by setting a fixed timer. Go full tilt when there is work to do and no messages to process. Aim for the consumer to go faster than the producer. Make the UI thread dynamically consume as much from the queue at a time, as is tolerable for a responsive UI. As you've mentioned, defer invisible updates to the Tree View. Eliminate all useless and redundant UI updates. And optimise the code to the hilt. Challenge: Beat Tree Size. :-)

    PS. I added a Ready flag to my version of your code, to stop the worker thread generating messages when the main thread is processing. The Ready flag is then reset to true in IdleAction. This runs pretty smooth in both debug and release mode, but it has the downside that the worker thread idles a lot. And, fundamentally, the windows message queue, with system calls and elaborate message dispatch, is a relatively slow way to implement a producer-consumer algorithm.

     

    Last edit: Vidar Hasfjord 2020-07-17
    • Ognyan Chernokozhev

      Here is an idea I have in my head about the implementation of a producer-consumer queue for this case:
      Since the size of the queue is not known in advance, and we don't want to deal with array resizing and realocations, implement instead a simple linked list. Each node consits of a path (or maybe a list of paths) and size for it/them (if it is calculated), and a pointer to the next node.
      The producer thread first creates a new node, fills it up, and only then sets the next node pointr in the last node to the new one.
      The consumer reads the nodes, keeping a pointer to the last processed node. When it reaches a node where the next node pointer is null, it stops for the moment. In the next IdleAction, it will continue from that node, and so on, untill a Finished flag is set. It seems to me that no locking is neccessary in this case, since the next node pointer will always be either null, or point to a valid next node.

      Optionally the consumer thread it can release the memory for the processed nodes, or this can be done for all once the processing is finished. Also the allocation of the nodes can be improved with a custom memory allocator.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-18

    It seems to me that no locking is neccessary in this case

    Well, you still want to communicate information between the threads, i.e. the UI thread wants to see the modifications the worker thread makes to the data structure. Unless you synchronise, there is no guarantee you will see those updates. The C++ memory model has specific rules for this; when you are guaranteed to see an update and not, i.e. memory ordering of read and writes, with the headache-inducing concept of memory fences.

    All of this has a lot to do with multicore, simultaneous multithreading (SMT) and caches in modern CPUs. Synchronising the caches between cores is costly. So to run at full speed, you want to allow each thread to work on its own separate non-shared memory as much as possible with no cache synchronisation (see cache coherence at Wikipedia). So the memory model is pretty relaxed to allow that. I find all of this mindboggling. I have seen good presentations on this stuff, and I tried to follow the C++11 developments closely, but I have little practical experience with concurrency and parallelism, so thinking about these things quickly makes my head spin.

    since the next node pointer will always be either null, or point to a valid next node.

    That pointer has be atomically updated, so that you don't get tearing (partial updates visible on a thread). You can e.g. use std::atomic for this, which is implicitly synchronised. As I understand it, any synchronisation will require some sort of memory fence, i.e. a cache synchronisation on the CPU, which will impose a performance cost.

    My advice is to start with a simple std::queue guarded by a mutex. That should be safe. Then benchmark. If you find that the locking is a performance bottleneck, then start thinking about optimising these things.

    In my version of your code, I replaced the use of the Windows message queue by a private std::queue synchronised by a std::mutex. It was perhaps a little faster than the version using the Windows queue (with Ready-flag throttling). For my OWLNext folder (~20 GiB), I think I measured 34 vs 38 seconds in Debug (x64) mode, but it could be within error margins.

    Replacing the std::map by std::unordered_map had much larger effect, with the latter running over three times faster in Debug mode. However, that difference disappears completely in Release mode, with the bottleneck apparently moving elsewhere.

    I note that, with the worker thread allowed to go full speed, it finishes way before the UI thread. So the bottleneck is now in the message dispatch and Tree View building. As you mentioned, the message dispatch can be eliminated (for simplicity, and for comparison, I currently just have the worker thread posting Windows messages to the private queue, which are then dispatched manually to TWindow::HandleMessage in IdleAction).

    Here are some measurements with my version of the code, using a private synchronised queue with the worker thread going full speed (no Ready-flag throttling) and with IdleAction processing of the private message queue in 20ms batches:

    Debug (x64), directory OWLNext:
      map: 34s
      unordered_map: 11s
    
    Debug (x64), System drive (289 GiB):
      map: 384s (6:24)
      unordered_map: 119s (1:59)
    
    Release (x64), System drive (289 GiB):
      map: 28s
      unordered_map: 28s
    
    Tree Size, System drive (289 GiB):
      ~20s (including program startup)
    
     

    Last edit: Vidar Hasfjord 2020-07-18
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-18

    I note that, with the worker thread allowed to go full speed, it finishes way before the UI thread.

    Actually, with further testing, I note that this is only true in Debug mode. In Release mode, the UI thread keeps up with the worker thread. So it seems the bottleneck is in the folder traversal or message passing (either in the queue locking, or in the dynamic memory allocation, which in itself is synchronised, of course, to avoid heap corruption).

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-18

    Here is a lock-less scheme you may think about trying, if you find that locking is a bottleneck:

    Use a single pointer as the only shared data, synchronised using std::atomic. Initially set the pointer to nullptr. Let the worker thread produce data in batches, e.g. 20 ms each. When a batch is ready, the worker thread checks whether the shared pointer is nullptr (i.e. the UI thread is ready for more data), and if it is, then sets it to point to the data batch. The worker thread then immediately proceeds work on the next batch. The UI thread checks the pointer, and if set, takes ownership of the data by copying the pointer, then sets the pointer back to nullptr.

    To prevent a data race when checking and setting the pointer, make the worker thread use atomic::compare_exchange_weak in a loop (compare_exchange_weak is allowed to fail the compare part spuriously, but may perform better than compare_exchange_strong). The UI thread can just use atomic::exchange with a nullptr, then check whether it got a nullptr in return (if it did, the worker thread hasn't completed the next batch yet). These operations are guaranteed to be atomic.

    To increase parallelism, you may extend this algorithm slightly: Let the worker thread proceed with the next batch, if it finds that the pointer is not nullptr (i.e. the UI thread has not yet taken ownership of the previous batch). When finished, the next batch is then appended to the current, and the worker thread again tries to communicate it by setting the pointer. If the UI thread still hasn't processed the previous batch and cleared the pointer, the worker thread proceeds work on the next batch, and so on, while there is work left to do. If all work is completed, the worker thread just goes to sleep for a while (e.g. for the specified batch interval), before trying to set the pointer again.

    Further optimisation ideas

    • Minimise dynamic allocations (implicitly synchronised, thus slow), e.g. by moving the filesystem::path objects (obtained from the folder traversal) into the folder messages, rather than creating a dynamic copy of the string representation.

    • Allocate memory for a batch in one go, rather than for each folder message individually. E.g. use a std::vector for the batch data. Set the capacity to the expected number of folder messages in each batch (can be dynamically adjusted).

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-19

    I have made some further measurements and observations that may be relevant for considering optimisations. Traversal dominates the run time, and of that, most is inherent in std::filesystem, it seems. The bad news is that this means that there are few obvious opportunities for big performance gains. The good news is that locking and data allocation have minor effect on the performance. And the speed is already comparable to Tree Size (which is a mere ~30 percent faster on my system), so that is not too bad. Both are much faster than File Explorer | Properties.

    Test config: Release (x64), System drive (289 GiB, ~269,000 subfolders)
    
    Simple version (private message queue, OWLNext dispatch):
    
      Parallel traversal and UI update:
        28 s
    
      Traversal without data collection (calculating total folder size only):
        26 s
    
      Traversal with data collection, followed by UI update:
        35 s = 27 s (traversal) + 8 s (UI update)
    
    Optimised version (lockless batch processing, no OWLNext dispatch):
    
      Parallel traversal and UI update:
        26 s
    
      Traversal with data collection, followed by UI update:
        33 s = 26 s (traversal) + 6 s (UI update)
    
     
  • Ognyan Chernokozhev

    Hi, Vidar,

    I made a big rewrite of the folder calculation code:

    1. Eliminated the use of the Windows message queue. Instead, use a std::queue class with a mutex to synchronise.
    2. The background thread simply pushes entries in the queue - one when a folder starts being processed, and one when the size of that folder, including subfolders is calculated.
    3. The UI thread calls ProcessQueue on IdleAction, and it processes up to 100 items from the queue.It builds an in-memory tree structure, and also populates the TreeView control.
    4. Populating the TreeView control does not seem to be very slow. I may yet implement handling of TVN_ITEMEXPANDING and deferred adding of nodes to the tree.
    5. A problem with IdleAction is that if the program does not have focus and does not process Windows messages, IdleAction is not called. So maybe there is need for a timer to process the queue.

    Moderator: Fixed list formatting.

     

    Last edit: Vidar Hasfjord 2024-04-21
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-19

    Hi Jogy,

    Argh! You have a devious bug in your latest code. It randomly skips subfolders. Which made me pretty certain you had some subtle synchronisation issue going on. After a lenghty session of debugging, code rewriting and head scratching, I finally spotted the issue. It was much more mundane than I thought.

    Can you see it?

    Trace: [Def:0] [TID:16812] Folder processing completed.
    Trace: [Def:0] ProcessQueue: Queue size: 1193
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 1092
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 991
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 890
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 789
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\MFC\\OLE\\OCLIENT\\RES"
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 688
    Trace: [Def:0] Cannot find parent folder for "C:\\Borland\\BC5\\EXAMPLES\\MFC\\UTILITY\\TRACER\\L.JPN"
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\MFC\\UTILITY\\TRACER\\L.JPN"
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\MFC\\UTILITY\\TRACER"
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 587
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\OWL\\CLASSES\\METAFILE"
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 486
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\OWL\\GAMES\\CRIBBAGE"
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 385
    Trace: [Def:0] Cannot find folder "C:\\Borland\\BC5\\EXAMPLES\\SERIES\\GDIMETA"
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 284
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 183
    Trace: [Def:0] ProcessQueue: Processed entry count: 100
    Trace: [Def:0] ProcessQueue: Queue size: 82
    Trace: [Def:0] ProcessQueue: Processed entry count: 82
    
     

    Last edit: Vidar Hasfjord 2020-07-19
    • Ognyan Chernokozhev

      Is it in the code that traverses the tree to find a folder by a path?
      I am not very happy with that code, there should be a better way to do it.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    Is it in the code that traverses the tree to find a folder by a path?

    No. It is right there in the trace. What is 1193 minus 100?

    Loops and plus/minus one errors. Still get us after all these years. :-)

    I am not very happy with that code, there should be a better way to do it.

    Use a map, like in your original version, but preferably use a std::unordered_map for constant lookup time on average. Here is a hash function for filesystem::path, as required for it to be used as a key in unordered_map:

    namespace std {
    
    template <>
    struct hash<filesystem::path>
    {
      auto operator()(const filesystem::path& p) const noexcept
      {
        return hash<filesystem::path::string_type>{}(p);
      }
    };
    
    }
    
    1. A problem with IdleAction is that if the program does not have focus and does not process Windows messages, IdleAction is not called.

    Good call. The issue is that the UI thread goes to sleep in the message loop if IdleAction returns false (see TApplication::MessageLoop). Make sure IdleAction returns true while there is work to do. Still, if the worker thread is slow, and there is no work for a split second, the message loop may still end up suspending the UI thread.

    The "correct" way to deal with it is to make the main thread wake up again when there is more work to do. The message loop goes to sleep by calling MsgWaitForMultipleObjects with an array of wait handles. You can add wait handles to something waitable (mutex, event, semaphore, etc.), which will cause the thread to wake up again when signalled. See TApplication::WaitOnObject. The function mutex::native_handle gives you a handle to wait on, or you can use some native synchronisation primitive in the Windows API, e.g. an event (see OWLMaker, which uses an event for the stop token).

    All that said, a nudge timer is a simple and good enough solution.

     

    Last edit: Vidar Hasfjord 2022-01-01
    • Ognyan Chernokozhev

      Ah yes ... silly me. Thanks for pointing that out. Should be fixed now.

      As for the tree, I created this desing with the idea to test it and then maybe try to switch to a lock-less queue, if it turned out that the locking is having noticeable performance effect,
      But it does not seems so.

      The problems is that the background thread processes a tree, but pushes each step of the processing as an independent queue entry, and then the UI htread has to restore the tree from those queue entries.

      I am thinking now of having the background thread to build the tree, and the individual tree nodes can have their own locks, instead of one global lock for the whole structure.

      There could still be a queue, but it will contain only pointers to the added/updated nodes, so that the UI knows which ones to process and what to update in the TreeView control.
      This should eliminate all lookups.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    Nice. Starting to take shape now.

    The problems is that the background thread processes a tree, but pushes each step of the processing as an independent queue entry, and then the UI htread has to restore the tree from those queue entries.

    Why is that a problem? Since the file system operations are the slow part of the process, the UI thread has ample processing time to build the structure, leaving the worker thread to just shuttle the data along as quickly as possible as it becomes available from the file system.

    the individual tree nodes can have their own locks.

    I wonder how that will scale. It might work fine for this application, but the code will no longer be a good example of how to do multi-threading well. Avoid shared data if at all possible. Pass results instead. (Shared data isn't really shared — it is essentially automatically copied between caches, by the CPU performing expensive cache-coherency operations.)

    [pointers] should eliminate all lookups.

    Following a pointer is constant time. Lookup with an unordered_map is constant time. Those are pretty much the same from a performance standpoint.

    I am not very happy with [the search] code,

    You can simplify it using path::lexically_relative (although it may be more expensive):

    auto FindFolder(const path& f) -> FolderInfo*
    {
      if (folder == f)
        return this;
    
      FolderInfo* cur = this;
      const auto r = f.lexically_relative(folder);
      for (auto i = r.begin(); i != r.end() && cur; ++i)
        cur = cur->FindChild(*i);
      return cur;
    }
    

    I'd prefer unordered_map though. Faster and even simpler.

     
    • Ognyan Chernokozhev

      Thanks for the idea - I changed the lookups to use unordered_map, and now it is really fast - the UI thread keeps pace with the background one, and fully empties the queue from time to time.

      I am pretty happy now with the program performance.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    I am pretty happy now with the program performance.

    Aye. You are now matching my earlier results — about 27 seconds to scan my System drive.

    However, I had to fix an issue in IdleAction to make your version run full tilt. Otherwise, it would idle, and only make small bursts every second.

    auto IdleAction(long idleCount) -> bool override
    {
      const auto r = TPaneSplitter::IdleAction(idleCount);
      ProcessQueue();
      const auto moreWorkToDo = !DisplayedFinishState;
      return moreWorkToDo || r;
    }
    

    Also consider adjusting the work in ProcessQueue according to an allotted time slice, rather than specifying a set number of items. I used 20 ms (50 fps) in my version, and that provides a buttery smooth UI.

    const auto timeSlice = 20ms;
    const auto deadline = chrono::high_resolution_clock::now() + timeSlice;
    auto entry = Queue.GetNext();
    while (entry != nullptr)
    {
      //...
      if (chrono::high_resolution_clock::now() > deadline)
        break;
      entry = Queue.GetNext();  
    }
    
     

    Last edit: Vidar Hasfjord 2020-07-20
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    Tip: Rather than commenting out TRACE and other logging in your code, use the extended macros (TRACEX, etc.) with a diagnostic level greater than 0. You can then enable and disable the logging by setting the level in the Diagnostic Window.

    For example, using the default diagnostic group Def and level 1:

    TRACEX(Def, 1, _T("[TID:") << this_thread::get_id()
      << _T("] Folder added: ") << to_tstring(folderInfo.folder));
    
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    Multi-threading bug: Variable FinishedBackgroundProcessing is shared data without any synchronisation (updates not guaranteed to be ordered/visible across threads).

     
    • Ognyan Chernokozhev

      Thanks, I pushed a fix making it std::atomic - is this the correct way?

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-07-20

    Observation: The folder icons in the Tree View seem to be converted to 4-bit colour with dithering. The icons in the List View have no dithering.

     

    Last edit: Vidar Hasfjord 2020-07-21
1 2 3 4 > >> (Page 1 of 4)

Anonymous
Anonymous

Add attachments
Cancel





MongoDB Logo MongoDB