|
From: Julian S. <js...@ac...> - 2013-06-20 09:27:24
|
Transactional memory support in hardware is about to hit mainstream
cpus, with the arrival of Intel's Haswell family. It also exists or
is about to exist (I don't know which) in the s390 and POWER architectures.
So we need to have a Plan for dealing with it. I looked at the Intel
(Haswell) instructions a bit, but don't know about the s390 or POWER
ones.
IIRC, the Intel scheme requires two instructions:
XBEGIN fail-addr
XEND
XBEGIN starts a transaction and registers fail-addr as the fallback
address. If the transaction fails for whatever reason (there are many
reasons why it might fail) then the processor rolls back register/memory
state to how it was at the XBEGIN, and executes code at fail-addr
instead.
XEND finishes the transaction.
Transactions can be nested.
We can't hope to emulate the conflict checking done by the real CPU
in any sane (performance or complexity) way. So I see two remaining
options:
(1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push
simulated execution directly onto the failure path. This is simple
but will have poor performance, if (as is likely) the failure path
uses normal locking and is not tuned for speed.
(2) translate "XBEGIN fail-addr" by handing it through to the real CPU,
in the same kind of way we handle CPUID, RDTSC, etc. Of course we will
have to put our own failure-handler address so we don't lose control
of execution if the transaction is aborted, but that's not difficult.
---
(1) is simple but likely to work. (2) is probably preferable but I don't
know if there will be hidden problems in it.
This also assumes that the s390 and POWER instructions can be mapped to
this same structure: TRANSACTION-START(fail-addr) and TRANSACTION-END.
That's probably the first thing that we should investigate.
J
|
|
From: Josef W. <Jos...@gm...> - 2013-06-20 12:15:58
|
Am 20.06.2013 11:27, schrieb Julian Seward: > (1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push > simulated execution directly onto the failure path. This is simple > but will have poor performance, if (as is likely) the failure path > uses normal locking and is not tuned for speed. > > (2) translate "XBEGIN fail-addr" by handing it through to the real CPU, > in the same kind of way we handle CPUID, RDTSC, etc. Of course we will > have to put our own failure-handler address so we don't lose control > of execution if the transaction is aborted, but that's not difficult. > > --- > > (1) is simple but likely to work. (2) is probably preferable but I don't > know if there will be hidden problems in it. I do not see a hidden problem at the moment. But the code added by VG + tool (just assume cache simulation) will raise the probability for transaction failure significantly. And doing a rollback + failure path is slower than doing the failure path from the beginning. So (1) may not be a bad option at all. What about (3) if XBEGIN/XEND is found in the same SB, remove them. As VG is serializing threads, there is no way for a conflict within a SB anyway. Hm. A problem would be exceptions raised between XBEGIN/XEND. A transaction would simply fail... Josef PS: Using TM ourself should be a very nice solution to make memcheck fast when we remove serializing of threads at one point. The move forward here would be to let the tool decide whether VG core should do serialization or not. If TM is not available, memcheck would go with thread serialization as now. PS2: making the handling of TM transparent to tools means that e.g. cache simulation cannot tell anything about how TM would influence the cache. Current TM implementations use ways of cache sets to store transaction changes, thus influencing miss behavior. This is just a note; i do not suggest to simulate TM in VG ;-) > This also assumes that the s390 and POWER instructions can be mapped to > this same structure: TRANSACTION-START(fail-addr) and TRANSACTION-END. > That's probably the first thing that we should investigate. > > J > > ------------------------------------------------------------------------------ > This SF.net email is sponsored by Windows: > > Build for Windows Store. > > http://p.sf.net/sfu/windows-dev2dev > _______________________________________________ > Valgrind-developers mailing list > Val...@li... > https://lists.sourceforge.net/lists/listinfo/valgrind-developers > |
|
From: Julian S. <js...@ac...> - 2013-06-20 13:39:26
|
On 06/20/2013 02:15 PM, Josef Weidendorfer wrote: > But the code added by VG + > tool (just assume cache simulation) will raise the probability for > transaction failure significantly. Yes .. I agree with that. > And doing a rollback + failure path > is slower than doing the failure path from the beginning. So (1) may > not be a bad option at all. OK, worth a try. At least it gives us an ultra-simple baseline implementation. I expect to have a Haswell box to try with, soon. That's Intel-specific, though. The question of whether we can fit Power/s390 into the same framework is also important. > What about > (3) if XBEGIN/XEND is found in the same SB, remove them. As VG is > serializing threads, there is no way for a conflict within a SB anyway. OK for now, but if we continue Philippe Waroquiers' work on multithreaded Valgrind, then that serialization might go away one day. > PS: Using TM ourself should be a very nice solution to make memcheck > fast when we remove serializing of threads at one point. The move > forward here would be to let the tool decide whether VG core should > do serialization or not. If TM is not available, memcheck would > go with thread serialization as now. Hmm, interesting. Some time earlier this year I worked out most of the details for making memcheck multithreaded without using TM, and posted the details to the list (I think). Philippe points out though that we will in any case need to retain the ability to serialize so as not to force tool authors to completely rewrite tools for multithreaded operation. J |
|
From: Niall D. <ndo...@bl...> - 2013-06-20 15:08:15
Attachments:
smime.p7s
|
> Some time earlier this year I worked out most of the details for making > memcheck multithreaded without using TM, and posted the details to the list (I > think). Philippe points out though that we will in any case need to retain the > ability to serialize so as not to force tool authors to completely rewrite tools for > multithreaded operation. You are aware there is a perfectly good mature compiler based on GCC which implements TM in software and uses HTM for acceleration where possible? TM in software isn't slow either, in fact the software converted to TM has shown very significant performance improvements. My point is you can convert valgrind to TM right now, and it works on non-Haswell. Your only problem would be if SG5 significantly changes the present TM spec for C/C++ and makes your code stop compiling - presently, this seems highly unlikely so long as you don't do anything too exotic. Niall --- Opinions expressed here are my own and do not necessarily represent those of BlackBerry Inc. |
|
From: Maran P. <ma...@li...> - 2013-06-21 10:13:20
|
On 06/20/2013 07:09 PM, Julian Seward wrote: > >> And doing a rollback + failure path >> is slower than doing the failure path from the beginning. So (1) may >> not be a bad option at all. > OK, worth a try. At least it gives us an ultra-simple baseline > implementation. I expect to have a Haswell box to try with, soon. > > That's Intel-specific, though. The question of whether we can fit > Power/s390 into the same framework is also important. > With s390, there are 2 modes of transaction execution. 1) Non-constrained transaction 2) Constrained transaction Constrained-transaction as stated in PoP is "a constrained transaction will eventually complete, thus an abort-handler routine is not required." (Refer to: Principles of operations: "Constrained Transaction", pg. no: 5-105) So, a failure (fallback) path need not be present for all the transactions. So, on encountering a transaction, directly jumping to fallback path may not be a viable option atleast with constrained transactions. As far as I know, constrained transaction seem to be specific to s390. (These seem to be no equivalent for this type of transaction in ppc). Even with non-constrained transaction, the s390 architecture does not enforce an abort handler to be specified for a transaction. Also, it is possible to specify a set of (general) registers to be restored to the original value on a transaction abort whereas the remaining registers will retain the value updated in the transaction. And this kind of partial register set restoration seems to be specific to s390. Similary, it is possible to do a non-transactional store within a transaction - that is, the store operation designated as non-transactional store will be retained on a transaction abort while all the other data stores within a transaction are discarded. This feature also seems to be specific to s390. --Maran |
|
From: Josef W. <Jos...@gm...> - 2013-07-12 09:06:35
|
Am 20.06.2013 14:15, schrieb Josef Weidendorfer: > Am 20.06.2013 11:27, schrieb Julian Seward: >> (2) translate "XBEGIN fail-addr" by handing it through to the real CPU, >> in the same kind of way we handle CPUID, RDTSC, etc. Of course we will >> have to put our own failure-handler address so we don't lose control >> of execution if the transaction is aborted, but that's not difficult. > I do not see a hidden problem at the moment. But the code added by VG + > tool (just assume cache simulation) will raise the probability for > transaction failure significantly. And that actually may be the problem with (2): * the probability for failure may not just raise, but switch to a persistant failure, e.g. if the added code by a tool does a system call. * if hardware ensures that some transaction will eventually succeed (as on s390), and the compiler does not produce a failure path, any changes in the original instructions (already done be VEX itself) may destroy the hardware guarantee, and then we have a problem (ie. probably a livelock). TM detects conflicts by snooping memory accesses of other cores. Thus, LL/SC can be seen as primitive to implement a poor-man's TM (just one address). With LL/SC, VEX hands it through to the real CPU, and this results in the following effect: * with multithreaded ARM code, when asked for verbose debug output e.g. with Callgrind, the program may stuck - because LL/SC always will fail (e.g. a printf in the middle), but the original program expects LL/SC to eventually succeed. * I remember that for the MIPS port, LL/SC failed for some reason already with the regular tool handling with cachegrind/callgrind. The solution was to move the call to the cache simulator due to the memory access of LL *before* the passed-through LL instruction, thus reducing the amount of code executed between LL and SC, and thus reducing the probability for failure. And of course, this is not a real solution, but more a work-around which seems to work. I think with (2), we will run into similar issues as seen with LL/SC. It may be not too complex to emulate TM by using TM itself for hardware-supported conflict detection. Josef And doing a rollback + failure path > is slower than doing the failure path from the beginning. So (1) may > not be a bad option at all. > > What about > (3) if XBEGIN/XEND is found in the same SB, remove them. As VG is > serializing threads, there is no way for a conflict within a SB anyway. > Hm. A problem would be exceptions raised between XBEGIN/XEND. A > transaction would simply fail... > > Josef > > PS: Using TM ourself should be a very nice solution to make memcheck > fast when we remove serializing of threads at one point. The move > forward here would be to let the tool decide whether VG core should > do serialization or not. If TM is not available, memcheck would > go with thread serialization as now. > > PS2: making the handling of TM transparent to tools means that > e.g. cache simulation cannot tell anything about how TM would > influence the cache. Current TM implementations use ways > of cache sets to store transaction changes, thus influencing miss > behavior. > This is just a note; i do not suggest to simulate TM in VG ;-) > > > >> This also assumes that the s390 and POWER instructions can be mapped to >> this same structure: TRANSACTION-START(fail-addr) and TRANSACTION-END. >> That's probably the first thing that we should investigate. >> >> J >> >> ------------------------------------------------------------------------------ >> This SF.net email is sponsored by Windows: >> >> Build for Windows Store. >> >> http://p.sf.net/sfu/windows-dev2dev >> _______________________________________________ >> Valgrind-developers mailing list >> Val...@li... >> https://lists.sourceforge.net/lists/listinfo/valgrind-developers >> > > > ------------------------------------------------------------------------------ > This SF.net email is sponsored by Windows: > > Build for Windows Store. > > http://p.sf.net/sfu/windows-dev2dev > _______________________________________________ > Valgrind-developers mailing list > Val...@li... > https://lists.sourceforge.net/lists/listinfo/valgrind-developers > |
|
From: Niall D. <ndo...@bl...> - 2013-06-20 14:53:29
Attachments:
smime.p7s
|
This list may find this discussion on WG21 SG5 Transactional Memory of use: https://groups.google.com/a/isocpp.org/d/msg/tm/zfRyY_XfxSU/47esUzUcHX0J In short, I think you should simulate TM directly as effectively a "super longjmp" which preserves memory as well as registers. That's the most future proof, and it gives you a third way from the two you proposed to handle transactional memory. I'd recommend that approach instead, because (a) Haswell's HTM is known to be incomplete by Intel's own definition and they're planning on improving it and (ii) you can't handle it using the real CPU because Haswell's HTM either captures all memory writes or it doesn't i.e. all of valgrind's writes in addition to program writes. The fact Haswell's HTM is all or nothing I think suggests strongly what Intel will improve upon. Niall --- Opinions expressed here are my own and do not necessarily represent those of BlackBerry Inc. |
|
From: John R. <jr...@bi...> - 2013-06-20 15:32:04
|
> (1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push > simulated execution directly onto the failure path. This is simple > but will have poor performance, if (as is likely) the failure path > uses normal locking and is not tuned for speed. That strategy might induce a deadlock, infinite loop, or application failure in some cases. For example, one fallback strategy for the application when the transaction fails is: grab a larger lock (more powerful) which logically guarantees that the transaction will succeed, then re-run the transaction while holding that lock. If VG always interprets XBEGIN as "goto transaction failed" then the application cannot make progress. -- |
|
From: Julian S. <js...@ac...> - 2013-06-20 15:59:26
|
On 06/20/2013 05:33 PM, John Reiser wrote: >> (1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push >> simulated execution directly onto the failure path. This is simple >> but will have poor performance, if (as is likely) the failure path >> uses normal locking and is not tuned for speed. > > That strategy might induce a deadlock, infinite loop, or application failure > in some cases. For example, one fallback strategy for the application when > the transaction fails is: grab a larger lock (more powerful) which > logically guarantees that the transaction will succeed, then re-run > the transaction while holding that lock. Are you sure about that? The transaction might fail for entirely arbitrary reasons, such as resource exhaustion (for conflict tracking) in the hardware or due to a context switch whilst inside the transaction, so the fallback path must always be usable. Hence not providing a usable fallback path is a straight-out application bug. Do I misunderstand something? J |
|
From: John R. <jr...@bi...> - 2013-06-20 16:18:37
|
On 06/20/2013 08:59 AM, Julian Seward wrote: > On 06/20/2013 05:33 PM, John Reiser wrote: >>> (1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push >>> simulated execution directly onto the failure path. This is simple >>> but will have poor performance, if (as is likely) the failure path >>> uses normal locking and is not tuned for speed. >> >> That strategy might induce a deadlock, infinite loop, or application failure >> in some cases. For example, one fallback strategy for the application when >> the transaction fails is: grab a larger lock (more powerful) which >> logically guarantees that the transaction will succeed, then re-run >> the transaction while holding that lock. > > Are you sure about that? The transaction might fail for entirely arbitrary > reasons, such as resource exhaustion (for conflict tracking) in the hardware > or due to a context switch whilst inside the transaction, so the fallback > path must always be usable. Hence not providing a usable fallback path is > a straight-out application bug. The lock might guarantee that P(failure) < (1.0 - delta) for non-infinitesimal delta > 0. For instance, the lock might guarantee no [other] memory accesses to the relevant memory region, no context switching, no other XBEGIN (in the strongest case: no other process, and no other thread), etc. Thus progress will occur unless VG forces perpetual failure. -- |
|
From: Eliot M. <mo...@cs...> - 2013-06-20 16:52:44
|
On 6/20/2013 12:19 PM, John Reiser wrote: > On 06/20/2013 08:59 AM, Julian Seward wrote: >> On 06/20/2013 05:33 PM, John Reiser wrote: > The lock might guarantee that P(failure) < (1.0 - delta) for non-infinitesimal > delta > 0. For instance, the lock might guarantee no [other] memory accesses > to the relevant memory region, no context switching, no other XBEGIN (in the > strongest case: no other process, and no other thread), etc. Thus progress > will occur unless VG forces perpetual failure. Furthermore, I know that at least some people at IBM are very interested in an HTM definition that *guarantees* that certain transactions will succeed in the absence of conflicting accesses. The existing IBM spec may even make such a guarantee. As one of the originators of TM ( :-) ), my personal opinion is that it would be interesting to simulate TM in Valgrind, which basically means to build a software TM facility. When run on a platform that supports HTM, one could take a Hybrid TM approach, where some transactions may succeed directly in the target hardware while others will proceed in software (e.g., if the hardware version fails enough times or the transaction is too complex for the hardware). In terms of interaction with tools, one can try weaving in the tool rewrites directly -- which may tend to make transactions bigger, and thus more likely to fail in the underlying hardware. To get the proper semantics, we may need an STM emulation of logical success of the hardware transaction (as opposed to throwing the original code down its failure path). *Perhaps* we can "buffer" the actions of the original transaction, and if it commits, do the work that a tool wants to do. This may be too complex, but it's a thought. It has the virtue of not doing tool work on the actions of failing transactions -- actions that logically disappear anyway. ... Which suggests that we need additional clarity as to what should happen w.r.t. tool actions for a failing transaction. If you run it as a hardware txn and it *fails*, then the tool actions disappear -- which might, or might not, be what you want. Regards -- Eliot Moss |
|
From: Julian S. <js...@ac...> - 2013-06-20 16:04:12
|
On 06/20/2013 04:53 PM, Niall Douglas wrote: > https://groups.google.com/a/isocpp.org/d/msg/tm/zfRyY_XfxSU/47esUzUcHX0J > > In short, I think you should simulate TM directly as effectively a "super > longjmp" which preserves memory as well as registers. I skimmed the (very long) thread but am none the wiser. Can you summarise what actions the simulator should take in response to XBEGIN and XEND ? J |
|
From: Niall D. <ndo...@bl...> - 2013-06-20 18:22:00
Attachments:
smime.p7s
|
> https://groups.google.com/a/isocpp.org/d/msg/tm/zfRyY_XfxSU/47esUzUcHX > > 0J > > > > In short, I think you should simulate TM directly as effectively a > > "super longjmp" which preserves memory as well as registers. > > I skimmed the (very long) thread but am none the wiser. Can you summarise > what actions the simulator should take in response to XBEGIN and XEND ? 1. All writes by valgrind on behalf of a program being executed need a new trans_ctx pointer. 2. xbegin starts a new memory transaction context such that subsequent writes occur in that context instead. Other threads can't see anything accumulated there. 3. xend commits that context's changes as visible to all other threads and throws away the context. 4. Any thread reading or writing any cache line straddling a memory transaction context causes that context to abort next timeslice. Any syscall or context switch also aborts. The API I have in that thread is a very reasonable starting prototype for an internal API implementing the above. Quite separately, one might wish to use Transactional GCC to implement TM in valgrind's implementation right now. Niall --- Opinions expressed here are my own and do not necessarily represent those of BlackBerry Inc. |
|
From: Carl E. L. <ce...@li...> - 2013-06-20 16:20:17
|
On Thu, 2013-06-20 at 11:27 +0200, Julian Seward wrote: > Transactional memory support in hardware is about to hit mainstream > cpus, with the arrival of Intel's Haswell family. It also exists or > is about to exist (I don't know which) in the s390 and POWER architectures. > > So we need to have a Plan for dealing with it. I looked at the Intel > (Haswell) instructions a bit, but don't know about the s390 or POWER > ones. The IBM ISA 2.07 published May 10th at https://www.power.org contains the documentation on the IBM TM instructions. The TM instructions in the IBM ISA 2.07 are described in book 2, chapter 5. There are about 15 TM instructions. I have not studied them in any detail as of yet. > > IIRC, the Intel scheme requires two instructions: > > XBEGIN fail-addr > XEND > > XBEGIN starts a transaction and registers fail-addr as the fallback > address. If the transaction fails for whatever reason (there are many > reasons why it might fail) then the processor rolls back register/memory > state to how it was at the XBEGIN, and executes code at fail-addr > instead. > > XEND finishes the transaction. > > Transactions can be nested. > > We can't hope to emulate the conflict checking done by the real CPU > in any sane (performance or complexity) way. So I see two remaining > options: > > (1) translate "XBEGIN fail-addr" as "goto fail-addr"; that is: push > simulated execution directly onto the failure path. This is simple > but will have poor performance, if (as is likely) the failure path > uses normal locking and is not tuned for speed. > > (2) translate "XBEGIN fail-addr" by handing it through to the real CPU, > in the same kind of way we handle CPUID, RDTSC, etc. Of course we will > have to put our own failure-handler address so we don't lose control > of execution if the transaction is aborted, but that's not difficult. > > --- > > (1) is simple but likely to work. (2) is probably preferable but I don't > know if there will be hidden problems in it. > > This also assumes that the s390 and POWER instructions can be mapped to > this same structure: TRANSACTION-START(fail-addr) and TRANSACTION-END. > That's probably the first thing that we should investigate. > > J > |