From: Julian S. <js...@ac...> - 2004-08-22 20:56:53

One of the major stumbling blocks to porting Valgrind is the lack of a
CPU virtualisation framework. UCode has significant limitations, and
rather than try and fix them, I think it better to abandon UCode, learn
from it, and design something clean, and based on 1990s compiler
technology -- UCode is rooted in the 60s.

UCode's limitations, as I see them, are:

* It's very tied to x86. The UInstrs are x86-specific, and there are
  x86-specific kludges, particularly the handling of condition codes.

* Because the UInstrs are x86-specific, all the tools are exposed to
  architecture-specific details. Ports to other platforms (eg,
  experimental PPC port) introduce their own set of PPC-specific stuff.
  This leads to an N x M (archs x tools) maintenance headache.

* It provides insufficient description of SIMD operations, hence
  memcheck et al cannot properly instrument them. This will become
  more of a problem as compilers more routinely generate SIMD code
  (and yes, even gcc is getting there).

* It's microarchitecturally naive. UCode frequently and unavoidably
  swaps the %eflags state between the real and simulated CPUs. What we
  discovered too late is that this is an expensive operation on
  PII/III/4, Athlon -- costing between 10 and 20 cycles. The same
  problem also afflicts the FPU simulation.

* It forces use of a naive, macro-expanding instruction selector which
  often generates poor code. Instruction selectors based on tree
  pattern matching ideas from the late 80s / early 90s are easy to
  implement and produce better code, but UCode precludes their use.

* UCode hardwires the assumption that the guest and host CPU
  architectures are the same. It therefore rules out any future
  possibility of cross-architecture debugging, something which is
  important in the embedded arena.

* UCode offers no support for optimisation across multiple basic
  blocks. Doing so could significantly improve performance.

* The UCode machinery is deeply wired into the rest of Valgrind.
  This is a mistake: it makes it impossible to debug and profile the
  translator/instrumentors using a simple test driver.

What I am thinking of is a framework based on an architecturally-neutral
intermediate representation (IR). This IR will represent superblocks --
single entry, multiple exit regions of code, rather than the basic
blocks we represent at present. The IR will fundamentally be a sequence
of statements, and allow both flat SSA-style code, and arbitrarily deep
expression trees, or any point in between, depending on what is
convenient for the transformation next to be done. The IR will be
strongly typed (machine-level types) and will have a clearly defined
semantics.

Those who have followed such discussions in the past will know
that two alternative schemes have been proposed:

* copy-annotate, in which original insns are copied more or less
  verbatim, and annotations supporting instrumentation are added

* disassemble-resynthesise, in which original insns are unpicked into
  an intermediate representation, which is then instrumented, and real
  insns are then regenerated

At present we use copy-annotate for FP/MMX/SSE/SSE2 instructions, and
disassemble-resynthesise for the integer instruction set, excluding
the condition code handling. This is an unholy mess.

I am leaning towards a framework which is predominantly
disassemble-resynthesise. I see that there are occasions where
explicitly representing guest instruction semantics is infeasible, and
so copy-annotate must be used for those. The framework will need to
support that. However, I hope to keep that to a minimum.

Currently I do not have anything much to show. I hope to come up
with something more definite over the next couple of months.

J
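
[Editorial sketch] To make the IR description above concrete, here is a minimal Python sketch of a statement-sequence IR that admits both deep expression trees and flat SSA-style code, with a pass that converts the former into the latter. Every name in it (Const, Tmp, Binop, Assign, Exit, flatten, is_flat, the "Add32"-style opcode strings and the "I32"/"I1" type tags) is an assumption made for illustration; nothing here is taken from Valgrind's sources, and the message above does not specify any of these details.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical machine-level type tags; the proposal only says "strongly typed".
I32 = "I32"
I1 = "I1"


@dataclass(frozen=True)
class Const:
    value: int
    ty: str = I32


@dataclass(frozen=True)
class Tmp:
    name: str
    ty: str = I32


@dataclass(frozen=True)
class Binop:
    op: str
    lhs: "Expr"
    rhs: "Expr"
    ty: str = I32


Expr = Union[Const, Tmp, Binop]


@dataclass(frozen=True)
class Assign:
    dst: Tmp
    src: Expr


@dataclass(frozen=True)
class Exit:
    """Side exit: leave the superblock for `target` if `cond` is true."""
    cond: Expr
    target: int


Stmt = Union[Assign, Exit]


def is_atom(e: Expr) -> bool:
    return isinstance(e, (Const, Tmp))


def flatten(stmts: List[Stmt]) -> List[Stmt]:
    """Rewrite deep trees so every operand is a constant or a temporary."""
    out: List[Stmt] = []
    counter = 0

    def atom(e: Expr) -> Expr:
        nonlocal counter
        if is_atom(e):
            return e
        flat = Binop(e.op, atom(e.lhs), atom(e.rhs), e.ty)
        tmp = Tmp(f"f{counter}", e.ty)  # fresh temporary, assigned exactly once
        counter += 1
        out.append(Assign(tmp, flat))
        return tmp

    for s in stmts:
        if isinstance(s, Assign):
            src = s.src
            if isinstance(src, Binop):
                src = Binop(src.op, atom(src.lhs), atom(src.rhs), src.ty)
            out.append(Assign(s.dst, src))
        else:
            out.append(Exit(atom(s.cond), s.target))
    return out


def is_flat(stmts: List[Stmt]) -> bool:
    for s in stmts:
        if isinstance(s, Assign):
            if isinstance(s.src, Binop) and not (is_atom(s.src.lhs) and is_atom(s.src.rhs)):
                return False
        elif not is_atom(s.cond):
            return False
    return True


# One superblock: single entry, one side exit in the middle, falls through at the end.
block = [
    Assign(Tmp("t1"), Binop("Add32", Binop("Shl32", Tmp("eax"), Const(2)), Tmp("ebx"))),
    Exit(Binop("CmpEQ32", Tmp("t1"), Const(0), ty=I1), 0x8048000),
    Assign(Tmp("ecx"), Tmp("t1")),
]
flat = flatten(block)
```

The tree form is what a tree-pattern-matching instruction selector would want to see, while the flat form is easier for an instrumentation pass to walk; keeping both in one statement type is what lets a pipeline move between them as convenient.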
From: Nicholas N. <nj...@ca...> - 2004-08-23 09:22:56

On Sun, 22 Aug 2004, Julian Seward wrote:

> UCode's limitations, as I see them, are:
>
> * It's very tied to x86.
>
> * Because the UInstrs are x86-specific, all the tools are exposed to
>   architecture-specific details.
>
> * It provides insufficient description of SIMD operations, hence
>   memcheck et al cannot properly instrument them.
>
> * It's microarchitecturally naive.
>
> * It forces use of a naive, macro-expanding instruction selector which
>   often generates poor code.
>
> * UCode hardwires the assumption that the guest and host CPU
>   architectures are the same.
>
> * UCode offers no support for optimisation across multiple basic
>   blocks.
>
> * The UCode machinery is deeply wired into the rest of Valgrind.

Yeah. Man, UCode sucks -- who thought that up? :)

> Those who have followed such discussions in the past will know
> that two alternative schemes have been proposed:
>
> * copy-annotate, in which original insns are copied more or less
>   verbatim, and annotations supporting instrumentation are added
>
> * disassemble-resynthesise, in which original insns are unpicked into
>   an intermediate representation, which is then instrumented, and real
>   insns are then regenerated
>
> At present we use copy-annotate for FP/MMX/SSE/SSE2 instructions, and
> disassemble-resynthesise for the integer instruction set, excluding
> the condition code handling. This is an unholy mess.
>
> I am leaning towards a framework which is predominantly disassemble-
> resynthesise. I see that there are occasions where explicitly
> representing guest instruction semantics is infeasible, and so
> copy-annotate must be used for those. The framework will need to
> support that. However, I hope to keep that to a minimum.

Just to clarify: my understanding is that for really weird instructions
that cannot be expressed explicitly in the IR (eg. CPUID) there is an
"opaque" IR expression which holds the raw bytes (which the code
generator can spit out, ie. copy-annotate), but about which the tools
get very little info, eg. only the regs/mem inputs and outputs. (There
might also be a string attached "cpuid" so that an arch-specific tool
could know about CPUID if it really needed to.) I'm not sure how X-arch
translation is feasible in the face of such instructions, however.

It's also worth noting that a distinct advantage of
disassemble-resynthesise is that if there's a mistake in the
disassembler, the program will (hopefully) crash and so the bug will be
obvious. Whereas with copy-annotate, if there's a mistake in the
disassembler it is possible that the instruction description seen by
the tool is wrong, but the generated code is right. Such an error would
be very hard to find.

(Julian found one like that with an SSE instruction recently, where it
was IIRC incorrectly marked as reading a memory location instead of
writing.)

N
From: Julian S. <js...@ac...> - 2004-08-23 10:12:57

> Yeah. Man, UCode sucks -- who thought that up? :)

Yeah, what a quarterwitted* hack UCode is :)

> Just to clarify: my understanding is that for really weird instructions
> that cannot be expressed explicitly in the IR (eg. CPUID) there is an
> "opaque" IR expression which holds the raw bytes (which the code generator
> can spit out, ie. copy-annotate), but about which the tools get very
> little info, eg. only the regs/mem inputs and outputs. (There might also
> be a string attached "cpuid" so that an arch-specific tool could know
> about CPUID if it really needed to.) I'm not sure how X-arch translation
> is feasible in the face of such instructions, however.

That's exactly correct. For instructions like CPUID which have
implicit, fixed register uses, the "opaque" expression will need to
contain

* raw bytes

* some indication of host registers def'd and used so that reg-alloc
  can correctly integrate the insn in the flow -- a bit like the
  constraints on gcc in-line assembly, I guess.

Opaque expressions had better not read or write memory (I think that's
OK).

For insns which are strange but have adjustable register fields (eg,
vector frobnication of an SSE register), the opaque expr will need to
contain not exactly literal bytes, but more of a bitfield template,
which tells the register allocator where to slot in the revised
register numbers it has computed.

Of course a general solution will need to deal with both explicit
(pfrobl %xmm1, %xmm4) and fixed-implicit (eg CPUID) reg uses, and any
combination. As you can see I have yet to work out the precise details.

X-arch translation will be impossible in the face of such opaque
expressions. At least this gives a clear line, though: if the x86->IR
front end does not generate these, X-arch is possible; if it does, it's
not. It also gives flexibility: if you want to do X-arch but really
need to handle CPUID, the front end can insert a call to a C helper
function which simulates CPUID rather than pushing a literal CPUID
through the translation pipeline. Have your cake and eat it too.

> It's also worth noting that a distinct advantage of
> disassemble-resynthesise is that if there's a mistake in the disassembler,
> the program will (hopefully) crash and so the bug will be obvious.
> Whereas with copy-annotate, if there's a mistake in the disassembler it is
> possible that the instruction description seen by the tool is wrong, but
> the generated code is right. Such an error would be very hard to find.

Yes, the verifiability problem is one of several things which inclines
me away from full-scale copy-annotate.

> (Julian found one like that with an SSE instruction recently, where it was
> IIRC incorrectly marked as reading a memory location instead of writing.)

coregrind/vg_to_ucode.c rev 1.143. That bug had been there for a year
before I discovered it.

J

* 2^(-n)wit -- a generalised halfwit :-)
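
[Editorial sketch] The opaque-expression idea above can be illustrated with a short Python sketch: an opaque node carries a byte template, the positions of any patchable register fields, and the fixed registers it uses and defines; a front end that wants cross-architecture translation emits a helper call instead. The class and function names (RegField, Opaque, HelperCall, emit, translate_cpuid, cross_arch_possible, "simulate_cpuid") are all hypothetical, since the message explicitly leaves the details unsettled. The two encodings are real x86 ones: CPUID is 0F A2, and PADDD xmm, xmm (66 0F FE /r) stands in for the made-up "pfrobl" as an instruction whose ModRM byte holds two register numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Union


@dataclass(frozen=True)
class RegField:
    """Where one register number is slotted into the template bytes."""
    byte_index: int
    shift: int
    width: int


@dataclass
class Opaque:
    """Hypothetical opaque IR node: bytes plus what the allocator must know."""
    name: str                                   # eg "cpuid", for arch-specific tools
    template: bytes                             # raw bytes, register fields zeroed
    fields: Dict[str, RegField] = field(default_factory=dict)
    fixed_uses: FrozenSet[str] = frozenset()    # implicit host registers read
    fixed_defs: FrozenSet[str] = frozenset()    # implicit host registers written


@dataclass
class HelperCall:
    """Hypothetical IR node: call a C helper that simulates the instruction."""
    helper: str


def emit(op: Opaque, assignment: Dict[str, int]) -> bytes:
    """Patch allocator-chosen register numbers into the bitfield template."""
    out = bytearray(op.template)
    for operand, f in op.fields.items():
        reg = assignment[operand]
        mask = (1 << f.width) - 1
        if reg & ~mask:
            raise ValueError(f"register {reg} does not fit field {operand!r}")
        cleared = out[f.byte_index] & (~(mask << f.shift) & 0xFF)
        out[f.byte_index] = cleared | (reg << f.shift)
    return bytes(out)


# Fixed-implicit case: nothing to patch, but reg-alloc must see the uses/defs.
CPUID = Opaque(
    "cpuid",
    bytes([0x0F, 0xA2]),
    fixed_uses=frozenset({"eax", "ecx"}),
    fixed_defs=frozenset({"eax", "ebx", "ecx", "edx"}),
)

# Explicit case: ModRM byte 0xC0 = mod 11, reg field and rm field both zeroed.
PADDD = Opaque(
    "paddd",
    bytes([0x66, 0x0F, 0xFE, 0xC0]),
    fields={"dst": RegField(3, 3, 3), "src": RegField(3, 0, 3)},
)


def translate_cpuid(cross_arch: bool) -> Union[Opaque, HelperCall]:
    """Front-end choice: literal CPUID, or a helper call that keeps X-arch open."""
    return HelperCall("simulate_cpuid") if cross_arch else CPUID


def cross_arch_possible(stmts: List[object]) -> bool:
    """The 'clear line': X-arch works only if no opaque nodes were generated."""
    return not any(isinstance(s, Opaque) for s in stmts)
```

With the allocator assigning dst=4 and src=1, emit(PADDD, {"dst": 4, "src": 1}) patches the ModRM byte to 0xE1, which is the encoding of paddd %xmm1, %xmm4. Memory operands are deliberately absent from the sketch, following the remark that opaque expressions had better not read or write memory.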
From: Jeremy F. <je...@go...> - 2004-08-23 21:14:52

On Mon, 2004-08-23 at 10:22 +0100, Nicholas Nethercote wrote:

> Just to clarify: my understanding is that for really weird instructions
> that cannot be expressed explicitly in the IR (eg. CPUID) there is an
> "opaque" IR expression which holds the raw bytes (which the code generator
> can spit out, ie. copy-annotate), but about which the tools get very
> little info, eg. only the regs/mem inputs and outputs. (There might also
> be a string attached "cpuid" so that an arch-specific tool could know
> about CPUID if it really needed to.) I'm not sure how X-arch translation
> is feasible in the face of such instructions, however.

CPUID is probably a slightly different case, since we need to do special
things with it rather than just run a native CPUID instruction. But
there are other strange things which we'd probably just run as-is (erm,
like, um, something. All the BCD instructions, maybe?)

J
From: Julian S. <js...@ac...> - 2004-08-23 21:55:42

> CPUID is probably a slightly different case, since we need to do special
> things with it rather than just run a native CPUID instruction. But
> there are other strange things which we'd probably just run as-is (erm,
> like, um, something. All the BCD instructions, maybe?)

Yes, I guess. Or SSE-vector-do-bizarre-arithmetic, of which there
appear to be many such instructions :-(

J