From: K. F. <kfr...@gm...> - 2011-09-30 17:49:01
Hi Kai!

On Fri, Sep 30, 2011 at 12:26 PM, Kai Tietz <kti...@go...> wrote:
> Hi K. Frank,
>
> very interesting results. I'll try to give my five cents on interpreting
> the timings. Case "d" especially I am interested in, as here we seem to
> spend too much time in idle-looping. Not sure why it happens, but it
> looks worthwhile to investigate.

I should point out (I meant to in my original post, but forgot) that
test "d" (mutex contention) is, in some sense, very unrealistic. I have
100 threads (on a 2-core machine), each running the same tight loop
(100,000 iterations): acquire mutex, increment data, release mutex. For
a real multi-threaded program, this would be very bad design. So an
implementation that trades off performing more slowly on this test for
performing well on a sensible program would be making a wise trade-off.
(Of course, there could be an issue that affects well-designed programs
as well, and in that case it would be worth looking into.)

> 2011/9/30 K. Frank <kfr...@gm...>:
>> Hello List!
>>
>> I have run some timings on a couple of my std::thread test programs
>> using the following three std::thread implementations:
>>
>> native windows implementation, mingw-w64 4.5.2, a sezero build
>> pthreads-win32 implementation, mingw-w64 4.5.2, a sezero build
>> winpthreads implementation, mingw-w64 4.7.0, Ruben's std::thread build
>>
>> I list the timings, in seconds, in the above order (native,
>> pthreads-win32, winpthreads):
>>
>> Test "c2" (parallel accumulate):
>>
>> 15.40 (0.09)   15.43 (0.05)   18.79 (0.07)
>
> This slow-down might be caused by different implementation and/or
> code-generation of libstdc++/compiler. Nevertheless, it might also be
> some additional overhead for thread creation that we are seeing here.

My sense is that most of the run time for this program is spent in the
actual accumulation of the numbers, rather than in the thread-specific
work.
This test uses 50 threads -- three seconds for thread creation doesn't
seem likely (and contradicts test "e", where the winpthreads
implementation does everything, including creating 20 threads, in under
a second). So I'm inclined to agree with your speculation that the
actual accumulate code (rather than the thread code) is running a
little slower because of compiler differences. (But I'm just guessing --
I haven't profiled the internals of the test program.)

>> Test "d" (mutex contention):
>>
>> 2.12 (0.02)   2.62 (0.08)   40.41 (0.06)
>
> For this test I would be interested to get some test code. This looks
> to me like an unnecessary sleep or thread-pausing.

The code for test "d" appears below. One comment: the motivation for
the consistent_data class is to have some data that won't remain
consistent if it is not incremented atomically. The idea is that if the
mutex were somehow broken and didn't actually synchronize access to the
data, obvious inconsistencies would show up. (And, by the way, if you
comment out the lock_guard (i.e., the mutex) in guarded_data, the
expected inconsistencies do show up.)

>> Test "e" (condition variables):
>>
>> 3.18 (0.04)   3.31 (0.10)   0.74 (0.04)
>
> That one shows that winpthreads tries to do real round-balancing for
> waiters on condition variables. And the approach seems to be worth it,
> as we see in the timing. I spent some time here implementing proper
> concurrency-balancing code for waiters.

Nice :)

Yes, that's good. I don't know -- but you might -- whether
pthreads-win32 implements its condition variables with round-balancing.
I also don't know whether native windows condition variables (which I
use in what I've called the native implementation) use round-balancing.
It would be interesting to know, one way or another.

(A little off topic here, but to elaborate... In general I think having
the round-balancing is good, but not for all use cases.
The question of whether thread synchronization should be "fair," i.e.,
whether all threads get a guaranteed or probabilistic fair shot at
acquiring a released mutex or at being notified by a condition
variable, depends on what you want to accomplish. Normally we think
about different threads doing different things, but needing access to
some shared resource. In this case, fairness makes sense. But if you
think about a thread pool (a very common use of condition variables) of
fungible threads -- by this I mean that you have a pool of equivalent
threads, any of which can be substituted for one another, so that it
doesn't matter which specific thread actually does the work -- fairness
can actually be a detriment. Let's say you have eight (fungible) worker
threads running on eight processor cores, but, for possibly legitimate
reasons, you have twenty threads in your thread pool. Ideally you'd
like those eight threads to keep running, rather than context-switch
other threads from the thread pool onto the cores. Fairness would
require exactly this -- in this case unnecessary -- context switching.
I've wondered sometimes whether it would make sense, for this reason,
to have two kinds of synchronization objects: those that are fair, and
a second set that are "anti-fair," so to speak.)

> Regards,
> Kai

Thanks to you (and to Ruben, too) for all of your work on this. Having
some of the new c++0x features, and std::thread in particular, is very
nice.

Best.

K.
Frank

Test "d" code follows:

=================================================

#include <iostream>
#include <thread>
#include <mutex>
#include <vector>       // for std::vector (missing in the original listing)
#include <algorithm>    // for std::for_each
#include <functional>   // for std::ref
#include <memory>       // for std::make_shared

struct consistent_data {
    consistent_data() : a_(0), b_(0), a_plus_b_(0), a_minus_b_(0) {}

    void increment() {
        a_ = ( (a_plus_b_ + a_minus_b_) / 2 ) + 1;
        b_ = ( (a_plus_b_ - a_minus_b_) / 2 ) - 1;
        a_plus_b_ = a_ + b_;
        a_minus_b_ = a_ - b_;
    }

    long a_;
    long b_;
    long a_plus_b_;
    long a_minus_b_;
};

std::ostream& operator<< (std::ostream &os, const consistent_data &d) {
    os << "a_ = " << d.a_ << '\n';
    os << "b_ = " << d.b_ << '\n';
    os << "a_plus_b_ = " << d.a_plus_b_ << '\n';
    os << "a_minus_b_ = " << d.a_minus_b_ << std::endl;
    return os;
}

struct guarded_data {
    void operator() (unsigned long count) {
        for (unsigned long i = 0; i < count; i++) {
            std::lock_guard<std::mutex> guard (mutex_);
            data_.increment();
        }
    }

    guarded_data() = default;
    guarded_data (const guarded_data&) = delete;

    consistent_data data_;
    std::mutex mutex_;
};

int main (int argc, char *argv[]) {
    std::cout << "std_thread_test_d..." << std::endl;

    unsigned long count = 100000;
    unsigned long num_threads = 100;

    std::vector<std::shared_ptr<std::thread>> threads;
    guarded_data data;

    std::cout << "data.data_ = ..." << std::endl;
    std::cout << data.data_;

    for (unsigned long i = 0; i < num_threads; i++) {
        threads.push_back (std::make_shared<std::thread> (
            std::thread (std::ref(data), count)));
    }

    std::for_each (threads.begin(), threads.end(),
        std::mem_fn(&std::thread::join));

    std::cout << "data.data_ = ..." << std::endl;
    std::cout << data.data_;
}

=================================================