From: K. F. <kfr...@gm...> - 2011-09-30 17:49:01
Hi Kai!

On Fri, Sep 30, 2011 at 12:26 PM, Kai Tietz <kti...@go...> wrote:
> Hi K. Frank,
>
> very interesting results. I'll try to give my five cents on interpreting
> the timings. Case "d" especially I am interested in, as here we seem to
> spend too much time in idle-looping. Not sure why it happens, but it
> looks worthwhile to investigate.

I should point out (I meant to in my original post, but forgot) that
test "d" (mutex contention) is, in some sense, very unrealistic. I have
100 threads (on a 2-core machine), each running the same tight loop
(100,000 iterations): acquire mutex, increment data, release mutex. For
a real multi-threaded program, this would be very bad design. So an
implementation that trades off performing more slowly on this test for
performing well on a sensible program would be making a wise trade-off.
(Of course, there could be an issue that affects well-designed programs
as well, and in that case it would be worth looking into.)

> 2011/9/30 K. Frank <kfr...@gm...>:
>> Hello List!
>>
>> I have run some timings on a couple of my std::thread test programs
>> using the following three std::thread implementations:
>>
>> native windows implementation, mingw-w64 4.5.2, a sezero build
>> pthreads-win32 implementation, mingw-w64 4.5.2, a sezero build
>> winpthreads implementation, mingw-w64 4.7.0, Ruben's std::thread build
>>
>> I list the timings, in seconds, in the above order (native,
>> pthreads-win32, winpthreads):
>>
>> Test "c2" (parallel accumulate):
>>
>> 15.40 (0.09)   15.43 (0.05)   18.79 (0.07)
>
> This slow-down might be caused by different implementation and/or
> code-generation of libstdc++/compiler. Nevertheless, it might also be
> some additional overhead for thread creation that we are seeing here.

My sense is that most of the run time for this program is spent in the
actual accumulation of the numbers, rather than in the thread-specific
work.
This test uses 50 threads -- three seconds for thread creation doesn't
seem likely (and contradicts test "e", where the winpthreads
implementation does everything, including creating 20 threads, in under
a second). So I'm inclined to agree with your speculation that the
actual accumulate code (rather than the thread code) is running a
little slower because of compiler differences. (But I'm just guessing --
I haven't profiled the internals of the test program.)

>> Test "d" (mutex contention):
>>
>> 2.12 (0.02)   2.62 (0.08)   40.41 (0.06)
>
> For this test I would be interested to get some test code. This looks
> to me like an unnecessary sleep or thread-pausing.

The code for test "d" appears below. One comment: the motivation for
the consistent_data class is to have some data that won't remain
consistent if it is not incremented atomically. The idea is that if the
mutex were somehow broken and didn't actually synchronize access to the
data, obvious inconsistencies would show up. (And, by the way, if you
comment out the lock_guard (i.e., the mutex) in guarded_data, the
expected inconsistencies do show up.)

>> Test "e" (condition variables):
>>
>> 3.18 (0.04)   3.31 (0.10)   0.74 (0.04)
>
> That one shows that winpthreads tries to do real round-balancing for
> waiters on condition variables. And the approach seems to be worth it,
> as we see in the timing. I spent some time here implementing proper
> concurrency-balancing code for waiters.

Nice :)

Yes, that's good. I don't know -- but you might -- whether
pthreads-win32 implements its condition variables with round-balancing.
I also don't know whether native windows condition variables (which I
use in what I've called the native implementation) use round-balancing.
It would be interesting to know, one way or another.

(A little off topic here, but to elaborate... In general I think having
the round-balancing is good, but not for all use cases.
The question of whether thread synchronization should be "fair," i.e.,
whether all threads get a guaranteed or probabilistic fair shot at
acquiring a released mutex or at being notified by a condition
variable, depends on what you want to accomplish. Normally we think
about different threads doing different things, but needing access to
some shared resource. In this case, fairness makes sense. But if you
think about a thread pool (a very common use of condition variables) of
fungible threads -- by this I mean that you have a pool of equivalent
threads, any of which can be substituted for one another, so that it
doesn't matter which specific thread actually does the work -- fairness
can actually be a detriment. Let's say you have eight (fungible) worker
threads running on eight processor cores, but, for possibly legitimate
reasons, you have twenty threads in your thread pool. Ideally you'd
like those eight threads to keep running, rather than context-switch
other threads from the thread pool onto the cores. Fairness would
require exactly this -- in this case unnecessary -- context switching.
I've wondered sometimes whether it would make sense, for this reason,
to have two kinds of synchronization objects: those that are fair, and
a second set that are "anti-fair," so to speak.)

> Regards,
> Kai

Thanks to you (and to Ruben, too) for all of your work on this. Having
some of the new c++0x features, and std::thread in particular, is very
nice.

Best.

K.
Frank

Test "d" code follows:

=================================================

#include <iostream>
#include <thread>
#include <mutex>
#include <vector>       // for std::vector (missing in the original listing)
#include <algorithm>    // for std::for_each
#include <functional>   // for std::ref
#include <memory>       // for std::make_shared

struct consistent_data {
    consistent_data() : a_(0), b_(0), a_plus_b_(0), a_minus_b_(0) {}

    void increment() {
        a_ = ( (a_plus_b_ + a_minus_b_) / 2 ) + 1;
        b_ = ( (a_plus_b_ - a_minus_b_) / 2 ) - 1;
        a_plus_b_ = a_ + b_;
        a_minus_b_ = a_ - b_;
    }

    long a_;
    long b_;
    long a_plus_b_;
    long a_minus_b_;
};

std::ostream& operator<< (std::ostream &os, const consistent_data &d) {
    os << "a_ = " << d.a_ << '\n';
    os << "b_ = " << d.b_ << '\n';
    os << "a_plus_b_ = " << d.a_plus_b_ << '\n';
    os << "a_minus_b_ = " << d.a_minus_b_ << std::endl;
    return os;
}

struct guarded_data {
    void operator() (unsigned long count) {
        for (unsigned long i = 0; i < count; i++) {
            std::lock_guard<std::mutex> guard (mutex_);
            data_.increment();
        }
    }

    guarded_data() = default;
    guarded_data (const guarded_data&) = delete;

    consistent_data data_;
    std::mutex mutex_;
};

int main (int argc, char *argv[]) {
    std::cout << "std_thread_test_d..." << std::endl;

    unsigned long count = 100000;
    unsigned long num_threads = 100;

    std::vector<std::shared_ptr<std::thread>> threads;
    guarded_data data;

    std::cout << "data.data_ = ..." << std::endl;
    std::cout << data.data_;

    for (unsigned long i = 0; i < num_threads; i++) {
        threads.push_back (std::make_shared<std::thread> (
            std::thread (std::ref(data), count)));
    }

    std::for_each (threads.begin(), threads.end(),
        std::mem_fn(&std::thread::join));

    std::cout << "data.data_ = ..." << std::endl;
    std::cout << data.data_;
}

=================================================