I am making a new OWLNext example - a separate application to calculate folder sizes.
Background
No matter how large hard disk drive you have, over time it fills up and you need to clean it. To do that effectively, you need to easily know how much disk space each folder and subfolder occupies. There are probably many applications doing that — in the past I wrote one such in C#/WPF for my own use.
Now I am trying to make the same using OWLNext and Modern C++.
Goals
Since calculating the size of all files and subfolders can take quite a long time, the processing should be done in the background and not block the UI. Instead, it should show progress and allow cancellation.
Discussion: OWLNext 7.0.2 and 6.44.12 updates
Wiki: OWLNext_Roadmap_and_Prereleases
Anonymous
Initial example added in [r5222].
Related
Commit: [r5222]
Last edit: Ognyan Chernokozhev 2020-07-12
Some notes and questions on the implementation so far:
Is this the best way to handle it?
Future enhancements:
Last edit: Ognyan Chernokozhev 2020-07-12
Thanks Jogy!
I was looking for the use of the "SysTreeView32" control so I will review the code to learn.
Sebas
Hi Jogy, really cool example project!
I use TreeSize for this purpose, by the way.
Here is some feedback on your notes and implementation so far:
The recursive implementation of depth-first traversal may run out of stack space.
To improve the user experience, use breadth-first traversal rather than depth-first. With breadth-first traversal, you will update the top level of the tree first, which is what the user initially sees. Then, as the user navigates the tree deeper and deeper, the background traversal will process the tree likewise. You may even consider prioritising the branch of the tree that the user is currently navigating (i.e. breadth-first within the selected folder).
Don't use shared data, mutexes and locks. Instead, let the background process create the data and pass it through messages. If a message has more data than can be passed in the message parameters (WPARAM and LPARAM), then allocate the data dynamically and pass ownership via pointer (see OWLMaker build output handling, in particular TBuildOutputWindow::EvLaunchAppMessage).
The use of mutexes and locks causes continuous start-and-stop throttling of the UI and worker threads, as well as relatively slow data synchronisation between the threads (due to memory and cache coherency protocols). It may not have any significant performance penalty in this program, compared to the other slow operations involved, but it is good programming style to avoid lock-based solutions if you can.
If you need to use mutexes, don't lock and unlock them manually. Use an RAII sentry, such as std::lock_guard, to ensure the mutex is unlocked in case of exceptions.
Do not traverse the whole tree for every message from the worker thread. Instead, build and update the tree incrementally with the new data passed, using minimal traversal. If Tree View updates are still too slow, then store the passed data in a cached data structure, and use a deferred scheme as you suggest, by only updating the Tree View on demand as the user navigates it.
You currently use a std::map to associate path and HTREEITEM. Every iteration in the tree traversal calls std::map::find, which has O(log(n)) complexity, hence every new folder addition to the tree has O(n * log(n)) complexity, and the whole tree completion has O(n * n * log(n)) complexity. You can optimise a little by using std::unordered_map instead (O(1) complexity), but you can go much further by eliminating full tree traversal for every insertion and HTREEITEM lookup for every iteration, e.g. by caching the last insertion point, and by adding structural messages (WM_BEGINFOLDER, WM_ENDFOLDER) so that the construction of the tree structure becomes implicit in the message handling. You may also look into optimising the manipulation of the Tree View; in particular, adding nodes to a tree should ideally be done in reverse order (i.e. inserting at or near the root), since appending at the bottom requires tree traversal (see OldNewThing).
To improve UI reponsiveness further, you could look into prioritising user input and UI updates by deferring the handling of the messages from the worker thread, e.g. by putting them in a queue that you handle in batches in a WM_TIMER event handler or in TWindow::IdleAction.
Regarding the ugly code for formatting the date and time, I don't think C++17 has anything better, but C++20 has got better facilities (std::chrono::local_time with support for streaming).
Hi, Vidar,
thanks for the suggestions, I implemented some changes based on them in [r5223].
I changed the traversal to breadth-first, so that the immediate children of the root folder are displayed. Note that they will ne in Pending state, and each will stay that way untill all it's subfolders are processed - since I display the total sum of files in all subfolders. I don't think logic to try to prioritize the calculation fir a selected folder is neccessary, the process is quite fast (at least on my machine) and will introduce a lot of complexiy in the code.
I now changed the code so that the background thread sends messages when a folder is added or finished, and the UI htread adds them to a queue, then processes the queue, and during queue processing calls Application::PumpWaitingMessages as to not makde the UI unresponsive.
I still need the mapping, since when a folder is finished (all it's children are traversed), I need to find it and update it's text with the calculated size. When initially adding the folders from the queue, it may be possible to have a stack that references the last added tree node, instead of searching the while map. Still, even with the map, it does not seem to be really slowing the process.
As for inserting the folders in the tree in reverse order, that is an interesting idea. It may have effect only when processing folders that have a really large number of subfolders, and maybe the most effective way to implement it would be if the directory_iterator can be made to iterae a folder in reverse order. Or to first retrieve all subfolders, then sort them in reverse order and then start adding them ...
Related
Commit: [r5223]
Hi Jogy,
I didn't mean that you can simply remove the synchronisation. You can only do so if you rewrite the code to not use shared data. In your current code, you still very much use shared data. Your threads share the FolderInfo data structure as well as the "stopped" flag, and you hence need some synchronisation mechanism (e.g. mutex, lock, condition variable, signal, atomic, memory fence) to ensure they see a consistent synchronised version of memory. Otherwise, the threads may see different versions of the data.
Ideally, you want to minimise shared data. In this case, the worker thread does not need any access to the FolderInfo data structure. Building up that structure can be fully encapsulated within the the main thread. The worker thread just needs to pass enough information in the messages (i.e. folder path and size). To pass the path you can use dynamic memory, as I alluded to in my previous post, which is what OWLMaker does. The important difference here, compared to shared memory, is that this dynamic memory is a parameter package, and the worker thread never accesses it again after passing it.
Aside: Thinking about this carefully, there might be a problem with passing arguments between threads via dynamic memory: the thread passing the argument and the thread receiving it may not have a consistent view of memory (cache effect). I have assumed that there is some memory synchronisation in the message passing that ensures that the receiving thread sees the updated version of the dynamic memory, but I am not sure that is the case. This potential problem applies to OWLMaker build output as well. It seems to work fine, but I am not sure it is guaranteed. It would make sense if PostMessage has some memory synchronisation though, since it is common for Windows messages to pass pointers to structures. I need to research this further.
I have experimented with making this change to your code (see attached patch). SetRoot and PostNotification (the worker thread) now look as follows:
However, the worker thread is now so fast that it floods the message queue until PostMessage fails.
For example, my OWLNext project folder currently is 20 GiB with 67470 files, as I have working copies for all the release branches, and I have built the documentation in each one (Doxygen generates a lot of subfolders, 16 x 256). Testing this folder with my modified code crashes, due to PostMessage failure (ERROR_NOT_ENOUGH_QUOTA), as the worker thread posts exceed the limit of the message queue:
"There is a limit of 10,000 posted messages per message queue."
(PostMessage documentation)
During this message flood, the UI is non-responsive. It seems UI messages can not progress through the clogged message queue in a timely manner. In the original code, the error code from PostMessage was not checked, so the message was lost, and hence the program got into an invalid state (WM_FOLDERSIZECALCULATED with no WM_FOLDERADDED).
With more robust message posting code, which on failure goes to sleep for a few milliseconds before retrying PostMessage, the program no longer crashes, but the UI remains unresponsive due to the clogged message queue.
Slowing down the worker thread, by going to sleep for a millisecond after each notification message, unclogs the message queue and makes the UI responsive again.
To deal with this robustly, I encapsulated the message passing in PostNotification. It simply posts the given notification message, passing the path string (dynamically allocated, with ownership transfer to the receiver) and the folder size. If PostMessage fails (ERROR_NOT_ENOUGH_QUOTA), it pauses and retries until it succeeds (or some different error occurs, in which case it aborts).
By the way, PumpWaitingMessages is no good for implementing UI responsiveness in this case. The processing time within the event handler is not the issue; the message flooding is the problem. Actually, when testing with a large folder, the execution gets stuck in the very first call to PumpWaitingMessages, due to the worker thread filling the message queue faster than it is processed. Since UpdateFolder is guarded against reentrancy (with the flag "processingQueue"), all further calls to UpdateFolder (as part of pumping waiting messages) are ignored, thus leaving the UI frozen. In my patch, I have removed PumpWaitingMessages and the reentrancy flag. UpdateFolder has been simplified, incorporating the functionality in AddFolder, which has been removed.
The cancellation flag, member "Stopped", is shared data, so requires synchronisation. By making it atomic, i.e std::atomic<bool>, C++ guarantees that changes to the flag will be visible to other threads, i.e. synchronised. I've also added code in CleanupWindow that waits for the thread to finish and join after the stop token has been set.
Regarding further improvement, it is interesting to consider the design and options for minimising the message traffic. The worker thread could build larger parameter packages between messages. It could even adjust dynamically, making sure it does not flood the message queue, e.g. by throttling message passing based on the message queue load, or on frequency. There is no need to update the UI thread thousands of times per second.
Alternatively, a shared data approach with no message traffic can be envisaged, in which the UI takes a lock at some interval, at which point it inspects the work and progress of the worker thread, updating the UI accordingly. The shared data approach may have an issue with lock overhead and contention, but that may not be significant with just two threads. After all, the UI thread will be requesting a lock relatively infrequently.
The other interesting aspect to consider is the amount of data produced by a worker thread, to the extent that presenting the data becomes the bottleneck, i.e. when the worker thread produces data faster than you can present it (this is to some degree an issue in OWLMaker as well, if you run heavy parallel builds with the Output pane open). In this case, a deferred scheme makes sense, in which the UI thread defers the presentation of the data until it is actually requested by the user. For example, you mentioned using TVN_ITEMEXPANDING to display data on demand, rather than populating the whole tree beforehand.
PS. You display file size units with SI prefixes correctly (KB = 1000, MB = 1000000, etc.), while Windows uses non-standard units for these prefixes (KB = 1024, MB = 1048576, etc.). Consider using the ISO/IEC standard for binary prefixes (KiB = 1024, MiB = 1048576, etc.) for clarity (Wikipedia).
Last edit: Vidar Hasfjord 2020-07-16
I like this idea. There is no reason to post a Windows message each time a folder is added or finished processing.
The background thread can populate a special queue structure, with the added and finished folders, and the UI thread on a timer or IdleAction reads that structure and updates the UI.
It can also not be a standard std::queue object with full locking around each access, but some kind of custom structure - the background thread only adds to it's end, and does not modify already present data, and the UI only reads from the beginning. It can keep an index where it has reached, and do a conversion of the queue to a tree structure, that is then used to populate the tree (maybe with defered adding of child nodes).
So, the only point of contention between the two threads is if the UI has reached the end of the queue, and at the same time the background thread adds data there.
Would it work if simply the queue count property is somethng like std::atomic<int> ?</int>
Thanks, I will correct it.
Agree. Thinking about it, this is a typical producer-consumer problem, which is elegantly solved using an intermediate queue. However, the windows message queue is not suited for this. With the producer and consumer running concurrently, the whole point is that the producer runs at full speed filling the queue, while the consumer runs full speed emptying it. If the producer is faster, the queue will fill. The windows message queue isn't suited for this, since it needs to process UI events in a timely manner. A dedicated queue will eliminate these problems.
Another thought is to use a coroutine solution, in which the coroutine traverses the folder producing each item as a result. Native coroutines are coming in C++20. This is perhaps a simpler and better solution, if there is little to gain by thread parallelism. You would just call the coroutine in IdleAction to produce as many folder items as you want to process at a time.
See Wikipedia for more on the producer-consumer problem, implementation correctness and optimisation.
That's my thinking as well. IdleAction sounds the better way to go. You don't want to artificially limit the update speed by setting a fixed timer. Go full tilt when there is work to do and no messages to process. Aim for the consumer to go faster than the producer. Make the UI thread dynamically consume as much from the queue at a time, as is tolerable for a responsive UI. As you've mentioned, defer invisible updates to the Tree View. Eliminate all useless and redundant UI updates. And optimise the code to the hilt. Challenge: Beat Tree Size. :-)
PS. I added a Ready flag to my version of your code, to stop the worker thread generating messages when the main thread is processing. The Ready flag is then reset to true in IdleAction. This runs pretty smooth in both debug and release mode, but it has the downside that the worker thread idles a lot. And, fundamentally, the windows message queue, with system calls and elaborate message dispatch, is a relatively slow way to implement a producer-consumer algorithm.
Last edit: Vidar Hasfjord 2020-07-17
Here is an idea I have in my head about the implementation of a producer-consumer queue for this case:
Since the size of the queue is not known in advance, and we don't want to deal with array resizing and realocations, implement instead a simple linked list. Each node consits of a path (or maybe a list of paths) and size for it/them (if it is calculated), and a pointer to the next node.
The producer thread first creates a new node, fills it up, and only then sets the next node pointr in the last node to the new one.
The consumer reads the nodes, keeping a pointer to the last processed node. When it reaches a node where the next node pointer is null, it stops for the moment. In the next IdleAction, it will continue from that node, and so on, untill a Finished flag is set. It seems to me that no locking is neccessary in this case, since the next node pointer will always be either null, or point to a valid next node.
Optionally the consumer thread it can release the memory for the processed nodes, or this can be done for all once the processing is finished. Also the allocation of the nodes can be improved with a custom memory allocator.
Well, you still want to communicate information between the threads, i.e. the UI thread wants to see the modifications the worker thread makes to the data structure. Unless you synchronise, there is no guarantee you will see those updates. The C++ memory model has specific rules for this; when you are guaranteed to see an update and not, i.e. memory ordering of read and writes, with the headache-inducing concept of memory fences.
All of this has a lot to do with multicore, simultaneous multithreading (SMT) and caches in modern CPUs. Synchronising the caches between cores is costly. So to run at full speed, you want to allow each thread to work on its own separate non-shared memory as much as possible with no cache synchronisation (see cache coherence at Wikipedia). So the memory model is pretty relaxed to allow that. I find all of this mindboggling. I have seen good presentations on this stuff, and I tried to follow the C++11 developments closely, but I have little practical experience with concurrency and parallelism, so thinking about these things quickly makes my head spin.
That pointer has be atomically updated, so that you don't get tearing (partial updates visible on a thread). You can e.g. use std::atomic for this, which is implicitly synchronised. As I understand it, any synchronisation will require some sort of memory fence, i.e. a cache synchronisation on the CPU, which will impose a performance cost.
My advice is to start with a simple std::queue guarded by a mutex. That should be safe. Then benchmark. If you find that the locking is a performance bottleneck, then start thinking about optimising these things.
In my version of your code, I replaced the use of the Windows message queue by a private std::queue synchronised by a std::mutex. It was perhaps a little faster than the version using the Windows queue (with Ready-flag throttling). For my OWLNext folder (~20 GiB), I think I measured 34 vs 38 seconds in Debug (x64) mode, but it could be within error margins.
Replacing the std::map by std::unordered_map had much larger effect, with the latter running over three times faster in Debug mode. However, that difference disappears completely in Release mode, with the bottleneck apparently moving elsewhere.
I note that, with the worker thread allowed to go full speed, it finishes way before the UI thread. So the bottleneck is now in the message dispatch and Tree View building. As you mentioned, the message dispatch can be eliminated (for simplicity, and for comparison, I currently just have the worker thread posting Windows messages to the private queue, which are then dispatched manually to TWindow::HandleMessage in IdleAction).Here are some measurements with my version of the code, using a private synchronised queue with the worker thread going full speed (no Ready-flag throttling) and with IdleAction processing of the private message queue in 20ms batches:
Last edit: Vidar Hasfjord 2020-07-18
Actually, with further testing, I note that this is only true in Debug mode. In Release mode, the UI thread keeps up with the worker thread. So it seems the bottleneck is in the folder traversal or message passing (either in the queue locking, or in the dynamic memory allocation, which in itself is synchronised, of course, to avoid heap corruption).
Here is a lock-less scheme you may think about trying, if you find that locking is a bottleneck:
Use a single pointer as the only shared data, synchronised using std::atomic. Initially set the pointer to nullptr. Let the worker thread produce data in batches, e.g. 20 ms each. When a batch is ready, the worker thread checks whether the shared pointer is nullptr (i.e. the UI thread is ready for more data), and if it is, then sets it to point to the data batch. The worker thread then immediately proceeds work on the next batch. The UI thread checks the pointer, and if set, takes ownership of the data by copying the pointer, then sets the pointer back to nullptr.
To prevent a data race when checking and setting the pointer, make the worker thread use atomic::compare_exchange_weak in a loop (compare_exchange_weak is allowed to fail the compare part spuriously, but may perform better than compare_exchange_strong). The UI thread can just use atomic::exchange with a nullptr, then check whether it got a nullptr in return (if it did, the worker thread hasn't completed the next batch yet). These operations are guaranteed to be atomic.
To increase parallelism, you may extend this algorithm slightly: Let the worker thread proceed with the next batch, if it finds that the pointer is not nullptr (i.e. the UI thread has not yet taken ownership of the previous batch). When finished, the next batch is then appended to the current, and the worker thread again tries to communicate it by setting the pointer. If the UI thread still hasn't processed the previous batch and cleared the pointer, the worker thread proceeds work on the next batch, and so on, while there is work left to do. If all work is completed, the worker thread just goes to sleep for a while (e.g. for the specified batch interval), before trying to set the pointer again.
Further optimisation ideas
Minimise dynamic allocations (implicitly synchronised, thus slow), e.g. by moving the filesystem::path objects (obtained from the folder traversal) into the folder messages, rather than creating a dynamic copy of the string representation.
Allocate memory for a batch in one go, rather than for each folder message individually. E.g. use a std::vector for the batch data. Set the capacity to the expected number of folder messages in each batch (can be dynamically adjusted).
I have made some further measurements and observations that may be relevant for considering optimisations. Traversal dominates the run time, and of that, most is inherent in std::filesystem, it seems. The bad news is that this means that there are few obvious opportunities for big performance gains. The good news is that locking and data allocation have minor effect on the performance. And the speed is already comparable to Tree Size (which is a mere ~30 percent faster on my system), so that is not too bad. Both are much faster than File Explorer | Properties.
Hi, Vidar,
I made a big rewrite of the folder calculation code:
Moderator: Fixed list formatting.
Last edit: Vidar Hasfjord 2024-04-21
Hi Jogy,
Argh! You have a devious bug in your latest code. It randomly skips subfolders. Which made me pretty certain you had some subtle synchronisation issue going on. After a lenghty session of debugging, code rewriting and head scratching, I finally spotted the issue. It was much more mundane than I thought.
Can you see it?
Last edit: Vidar Hasfjord 2020-07-19
Is it in the code that traverses the tree to find a folder by a path?
I am not very happy with that code, there should be a better way to do it.
No. It is right there in the trace. What is 1193 minus 100?
Loops and plus/minus one errors. Still get us after all these years. :-)
Use a map, like in your original version, but preferably use a std::unordered_map for constant lookup time on average. Here is a hash function for filesystem::path, as required for it to be used as a key in unordered_map:
Good call. The issue is that the UI thread goes to sleep in the message loop if IdleAction returns false (see TApplication::MessageLoop). Make sure IdleAction returns true while there is work to do. Still, if the worker thread is slow, and there is no work for a split second, the message loop may still end up suspending the UI thread.
The "correct" way to deal with it is to make the main thread wake up again when there is more work to do. The message loop goes to sleep by calling MsgWaitForMultipleObjects with an array of wait handles. You can add wait handles to something waitable (mutex, event, semaphore, etc.), which will cause the thread to wake up again when signalled. See TApplication::WaitOnObject. The function mutex::native_handle gives you a handle to wait on, or you can use some native synchronisation primitive in the Windows API, e.g. an event (see OWLMaker, which uses an event for the stop token).
All that said, a nudge timer is a simple and good enough solution.
Last edit: Vidar Hasfjord 2022-01-01
Ah yes ... silly me. Thanks for pointing that out. Should be fixed now.
As for the tree, I created this desing with the idea to test it and then maybe try to switch to a lock-less queue, if it turned out that the locking is having noticeable performance effect,
But it does not seems so.
The problems is that the background thread processes a tree, but pushes each step of the processing as an independent queue entry, and then the UI htread has to restore the tree from those queue entries.
I am thinking now of having the background thread to build the tree, and the individual tree nodes can have their own locks, instead of one global lock for the whole structure.
There could still be a queue, but it will contain only pointers to the added/updated nodes, so that the UI knows which ones to process and what to update in the TreeView control.
This should eliminate all lookups.
Nice. Starting to take shape now.
Why is that a problem? Since the file system operations are the slow part of the process, the UI thread has ample processing time to build the structure, leaving the worker thread to just shuttle the data along as quickly as possible as it becomes available from the file system.
I wonder how that will scale. It might work fine for this application, but the code will no longer be a good example of how to do multi-threading well. Avoid shared data if at all possible. Pass results instead. (Shared data isn't really shared — it is essentially automatically copied between caches, by the CPU performing expensive cache-coherency operations.)
Following a pointer is constant time. Lookup with an unordered_map is constant time. Those are pretty much the same from a performance standpoint.
You can simplify it using path::lexically_relative (although it may be more expensive):
I'd prefer unordered_map though. Faster and even simpler.
Thanks for the idea - I changed the lookups to use unordered_map, and now it is really fast - the UI thread keeps pace with the background one, and fully empties the queue from time to time.
I am pretty happy now with the program performance.
Aye. You are now matching my earlier results — about 27 seconds to scan my System drive.
However, I had to fix an issue in IdleAction to make your version run full tilt. Otherwise, it would idle, and only make small bursts every second.
Also consider adjusting the work in ProcessQueue according to an allotted time slice, rather than specifying a set number of items. I used 20 ms (50 fps) in my version, and that provides a buttery smooth UI.
Last edit: Vidar Hasfjord 2020-07-20
Tip: Rather than commenting out TRACE and other logging in your code, use the extended macros (TRACEX, etc.) with a diagnostic level greater than 0. You can then enable and disable the logging by setting the level in the Diagnostic Window.
For example, using the default diagnostic group Def and level 1:
Multi-threading bug: Variable FinishedBackgroundProcessing is shared data without any synchronisation (updates not guaranteed to be ordered/visible across threads).
Thanks, I pushed a fix making it std::atomic - is this the correct way?
Observation: The folder icons in the Tree View seem to be converted to 4-bit colour with dithering. The icons in the List View have no dithering.
Last edit: Vidar Hasfjord 2020-07-21