Just dropping by to provide some numbers... 

Hmm, there is only one other transparent solution I can think of: TLS. That doesn't require
locking by the user, though I have no idea how the compiler/OS handle it. Traditionally TLS is
expensive. You'd go back to the "store the allocator in a TLS variable and look it up on each
operation" model. If you're using any C-style I/O with errno crud, you're using TLS already.
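
For what it's worth, here is a rough sketch of what I mean by keeping the allocator in a TLS variable. It assumes C++11 `thread_local`; `Arena`, `current_arena()` and `arenas_are_per_thread()` are names I made up for illustration, not from any real library, and the arena ignores alignment and never frees anything.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread arena (illustration only, no alignment handling).
struct Arena {
    std::vector<char> buffer;
    std::size_t used = 0;

    explicit Arena(std::size_t bytes) : buffer(bytes) {}

    // Hands out n bytes from the buffer, or nullptr when it is full.
    void* allocate(std::size_t n) {
        if (n > buffer.size() - used) return nullptr;
        void* p = buffer.data() + used;
        used += n;
        return p;
    }
};

// Each thread gets its own Arena, constructed on that thread's first call.
// No user-side locking: a thread only ever touches its own instance.
Arena& current_arena() {
    thread_local Arena arena(1024);
    return arena;
}

// Returns true if a second thread sees a different Arena than the caller.
bool arenas_are_per_thread() {
    Arena* mine = &current_arena();
    Arena* theirs = nullptr;
    std::thread t([&] {
        theirs = &current_arena();      // this thread's own instance
        current_arena().allocate(32);   // does not touch the caller's arena
    });
    t.join();
    return mine != theirs;
}
```

The cost is the `current_arena()` lookup on every operation, which is exactly the part that has traditionally been expensive.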

If you look at this thread from Boost's mailing list, there's some nice benchmarking showing the performance of TLS (both the pthreads and Boost implementations).

Actually, malloc() has to use synchronisation already, so you're already paying a performance
hit for it. It used to use mutex locks; I don't know what it does today, but it seems a good
candidate for "lock free" operation to me.
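
To show what "lock free" could mean for an allocator, here is a toy sketch. It is not how glibc's malloc or any real allocator works; `BumpPool` and `hammer()` are hypothetical names, and I'm assuming `std::atomic<std::size_t>` is lock-free on the target, which is the usual case on mainstream platforms. Threads claim bytes by bumping a shared offset with compare-and-swap instead of taking a mutex.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool: allocation only, no free, no alignment.
class BumpPool {
public:
    explicit BumpPool(std::size_t bytes) : storage_(bytes), next_(0) {}

    // Claims n bytes by advancing next_ with a CAS loop; no mutex involved.
    void* allocate(std::size_t n) {
        std::size_t old = next_.load(std::memory_order_relaxed);
        do {
            if (n > storage_.size() - old) return nullptr;  // pool exhausted
            // On failure, compare_exchange_weak reloads old and we retry.
        } while (!next_.compare_exchange_weak(old, old + n,
                                              std::memory_order_relaxed));
        return storage_.data() + old;
    }

    std::size_t used() const { return next_.load(); }

private:
    std::vector<char> storage_;
    std::atomic<std::size_t> next_;
};

// Allocates from one shared pool on several threads; returns bytes handed out.
std::size_t hammer(std::size_t threads, std::size_t allocs_each,
                   std::size_t size) {
    BumpPool pool(threads * allocs_each * size);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < threads; ++i) {
        workers.emplace_back([&] {
            for (std::size_t j = 0; j < allocs_each; ++j) pool.allocate(size);
        });
    }
    for (auto& w : workers) w.join();
    return pool.used();
}
```

A real malloc also has to handle free() and reuse, which is where lock-free designs get hard, so treat this only as an illustration of the idea.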

 The graphs generated to show tcmalloc's performance vs glibc's malloc (found here) give a nice idea of how well malloc handles multiple cores.