From: Paul M. <pa...@sa...> - 2004-03-05 01:17:17
Here are the broad outlines of the changes I made to valgrind-2.1.0 to add PowerPC support. I have made some changes since I released the original patch, so I am talking about the current patch, which is at:

http://ozlabs.org/~paulus/valgrind-2.1.0-ppc.patch.bz

The major changes were the creation of the PPC versions of VG_(disBB) and VG_(emit_code), in vg_ppc_to_ucode.c and vg_ppc_from_ucode.c. Many of the PPC instructions could be expressed using the existing ucodes, but there were some that couldn't, so I added a set of new ucodes. (I also removed (with ifdefs) some of the existing x86 ucodes (the MMX/SSE ones and the segment ones) to reduce code size, but this isn't strictly necessary.)

PowerPC is a RISC load/store architecture with 3-address arithmetic and logical instructions, so it maps reasonably well onto ucode. Since ucode arithmetic/logical instructions are 2-address, we end up generating some extra MOVs, but those get eliminated in the ucode-to-ppc instructions step. Since arithmetic/logical instructions operate on registers, and in particular on the whole register, the size field of the corresponding uinstrs is always 4. In other words, I don't need to handle ADDB and ADDW, just ADDL.

There were substantial changes to the set of things that get put in VG_(baseBlock). Obviously we have a PowerPC register set in there instead of an x86 set. There isn't any particular distinction between compact and noncompact slots on PowerPC because the load and store instructions have a 16-bit (sign-extended) offset field, so we can get to any word within 32kB of the start of VG_(baseBlock) as easily as any other. In fact I didn't end up needing many helpers.

Similarly there were changes to the set of registers stored in the thread state structure. I included the floating point registers in the thread state but not in VG_(baseBlock). The FP registers are loaded from the thread state at the first use of floating point since the thread was scheduled.
When switching away from the thread, I save the real FP registers into the thread state. This works because I compile valgrind with -msoft-float, which ensures that gcc doesn't use the FP registers in valgrind code.

One of the areas that needed a lot of changes was the handling of flags and jump conditions. x86 has a "flags" register which bundles up condition codes (Z, S), carry (C, A) and overflow (O) bits, together with some other stuff. PowerPC handles this a bit differently. There is a 32-bit condition register (CR) which has eight 4-bit fields, labelled CR0 to CR7. Each 4-bit field can express the result of a comparison. The first 3 bits are called LT (less than), GT (greater than) and EQ (equal). The fourth bit is SO (summary overflow), and is copied from the SO bit in XER (which I describe below) when the other bits are set. Compare instructions come in signed and unsigned flavours and can target any of the 8 fields; each one sets exactly one of LT, GT and EQ and clears the other two. The conditional branch instructions can branch depending on whether any one of the 32 bits in CR is zero or one.

Thus instead of having one type of compare (which is basically a subtract) and two sets of conditional branches (for signed and unsigned relations) as x86 does, PowerPC has two types of compare and one type of conditional branch. The conditional branches test just one bit in the CR instead of branching on a complex combination of the flag bits.

Most instructions that do some arithmetic or logical operation can optionally set CR0 based on a signed comparison of the result with 0. There is a bit in the instruction (in almost all cases) that indicates whether to set CR0 or not.

PowerPC also has an "integer exception register" called XER. It is 32 bits. The top 3 bits are SO (summary overflow), OV (overflow) and CA (carry). Instructions that do addition or subtraction can optionally detect overflow and clear or set the OV bit, and set the SO bit.
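As a rough illustration of the CR layout and compare behaviour described above, here is a minimal C sketch. It is not code from the patch: the function and constant names are hypothetical. It uses the PowerPC convention that CR0 is the most significant nibble of CR and that bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Bits within one 4-bit CR field, most significant first (hypothetical names). */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* XER.SO is the top bit of the 32-bit XER. */
#define XER_SO 0x80000000u

/* Signed compare: exactly one of LT, GT, EQ is set, plus a copy of XER.SO. */
static uint32_t cmp_signed(int32_t a, int32_t b, uint32_t xer)
{
    uint32_t f = (a < b) ? CR_LT : (a > b) ? CR_GT : CR_EQ;
    return f | ((xer & XER_SO) ? CR_SO : 0);
}

/* Unsigned compare: same result encoding, unsigned ordering. */
static uint32_t cmp_unsigned(uint32_t a, uint32_t b, uint32_t xer)
{
    uint32_t f = (a < b) ? CR_LT : (a > b) ? CR_GT : CR_EQ;
    return f | ((xer & XER_SO) ? CR_SO : 0);
}

/* Store a 4-bit result into CR field n (0..7); CR0 is the top nibble. */
static uint32_t cr_set_field(uint32_t cr, unsigned n, uint32_t field)
{
    assert(n < 8 && field < 16);
    unsigned shift = 28 - 4 * n;
    return (cr & ~(0xFu << shift)) | (field << shift);
}

/* Read CR bit `bit` (0..31), counting from the most significant bit. */
static int cr_bit(uint32_t cr, unsigned bit)
{
    assert(bit < 32);
    return (int)((cr >> (31 - bit)) & 1u);
}
```

For example, comparing 0xFFFFFFFF with 1 gives LT as a signed compare but GT as an unsigned compare, and a conditional branch then only has to test one bit of the chosen field via something like cr_bit().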
Addition and subtraction instructions can also optionally set the CA bit to the carry-out of the addition, and optionally use CA as the carry-in.

Because the setting of the carry, overflow and condition results is optional, we don't need to do any lifetime analysis on those bits. In general an instruction won't ask for those bits to be set unless they are going to be used, so if the instruction says to set those bits, we just set them. In fact I have never seen the overflow-detecting forms used.

In the PowerPC port, I have valid bits for each bit of CR and XER. I have GETVF and PUTVF get/set the valid bit for XER.CA. For compares and instruction forms which set CR0, I currently say that all of LT, GT and EQ are invalid if any bit of either operand (for compare instructions) or the result is invalid. That is quite conservative; when setting CR0, for instance, we know that LT is valid if the top bit of the result is valid.

Another point where x86 and PowerPC differ is in the conventions for calling procedures. PowerPC has 32 general-purpose registers (GPRs) and passes the first 8 parameters to a function in R3 - R10. The return value is returned in R3 (or R3 and R4 if the function returns a long long). Functions are called using the "bl" (branch and link) instruction, which sets LR (link register) to the address of the next instruction after the bl. Functions have to save and restore LR if they call other functions, and return with a "blr" (branch to link register) instruction.

By convention, R1 is used as the stack pointer. R1 always points to the bottom of the current stack frame, and the word pointed to by R1 always contains the address of the next stack frame. In other words, R1 is only changed to create a new stack frame, destroy a stack frame, or expand the current stack frame for alloca().
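To make the frame linkage above concrete, here is a minimal C sketch of following the chain of frame pointers starting from R1. It is an illustration only, not the stack tracing code from the patch. The names are hypothetical, memory is simulated as an array of 32-bit words, and it assumes that the outermost frame stores 0 in its link word and that caller frames sit at higher addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the frame chain starting at r1 and record each frame address.
 * mem simulates 32-bit memory: the word at byte address A is mem[A / 4].
 * Returns the number of frames recorded (at most max_frames). */
static size_t walk_back_chain(const uint32_t *mem, size_t mem_words,
                              uint32_t r1, uint32_t *frames, size_t max_frames)
{
    assert(mem != NULL && frames != NULL);
    size_t n = 0;
    uint32_t sp = r1;
    while (sp != 0 && n < max_frames) {
        /* Stop on a misaligned or out-of-range frame pointer. */
        if (sp % 4 != 0 || sp / 4 >= mem_words)
            break;
        frames[n++] = sp;
        uint32_t next = mem[sp / 4];   /* word at R1: address of the next frame */
        /* Assumption: the chain only moves to higher addresses; this also
         * stops a corrupted chain from looping forever. */
        if (next != 0 && next <= sp)
            break;
        sp = next;
    }
    return n;
}
```

With frames at 0x10, 0x20 and 0x30 linked in that order and a 0 link word in the last one, the walk records three frames. A real stack tracer would also need the saved LR of each frame, which this sketch leaves out.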
The differences in function calling conventions have implications for the stack tracing code, the code that sets up a new thread, the signal handling code, the code that starts GDB, etc. The code that gets the argument pointer on a client request and passes the result back is also affected.

Most system calls are very similar between x86 and PowerPC, although many have different system call numbers. The calling convention is that the system call number is in R0, with parameters in R3 - R8. The return value is in R3, with an error indicator in CR0.SO. If CR0.SO is set, then R3 contains the (positive) error number. There are some system calls (e.g. mmap) where x86 puts the arguments in memory and passes a pointer to that block of memory but PPC just passes the arguments in registers. Also, some of the kernel types are different; for example uid_t and gid_t are 32-bit, not 16-bit. Most of the bit definitions are identical, except for a few things like O_DIRECT and O_DIRECTORY.

Atomic operations and spinlocks are different. PowerPC has "load with reservation" and "store conditional" instructions. The "load with reservation" establishes a reservation on the address used for the load (in fact on the cache line containing that address), and "store conditional" will only do the store if the reservation still exists (and it sets CR0 to say whether it did the store or not). Stores to that cache line from other processors will cancel the reservation. Thus you can read a value, modify it, and write back the modified value in a fashion that is effectively atomic. In valgrind I currently use a LOCK uinstr followed by a LOAD or STORE to express the load-with-reservation and store-conditional instructions.

The computation of valid bits for memory loads has changed slightly.
Because PowerPC load instructions always set all 32 bits of the destination register, regardless of whether it is a 1-byte, 2-byte or 4-byte load, I changed MC_(helperc_LOADV1) and friends to set the high bits of the result to 0 rather than 1.

Things I changed in the existing uInstrs include:

* The condition values for the JMP instruction. On PowerPC, condition values 0 .. 31 mean jump if that bit of CR is 0. Values 32 .. 63 mean jump if bit (cond & 0x1f) of CR is 1. Value 64 means jump unconditionally.

The uInstrs I added are:

CMP and CMPU: signed and unsigned comparison.

CMP0: signed comparison against 0, for instructions that set CR0.

SETZ: computes (x ? 0 : -1). Used in conjunction with JIFZ for the branch instruction forms that decrement the CTR.

ICRF: insert condition register field. Sets 4 bits of the destination. Used for instructions that set a CR field.

XBIT: extract bit. Does dst = (src >> n) & 1. Used in CR-logical instructions (ands, ors, etc. between bits of CR).

IBIT: insert bit. Does dst = (dst & ~(1 << n)) | ((src & 1) << n). Used in CR-logical instructions.

MULH, UMULH: signed and unsigned multiply, giving the high 32 bits of the 64-bit product.

DIV, UDIV: signed and unsigned 32-bit divide.

CNTLZ: count leading zeroes. Computes the number of zeroes to the left of the most-significant 1 bit in the source (or 32 if the source operand is 0).

Just recently I have added three new tag ops: Tag_Min4_QT, Tag_Max4_QT and Tag_Cmp0_TQ. The first two are used in computing the exact valid bits for ADD operations. They compute (A & ~B) and (A | B) respectively. Tag_Cmp0_TQ computes the validity of the 4-bit condition result from comparing a value with 0.

Minor changes included:

* Changed __attribute__((regparm(N))) to asmlinkage, and defined that as an empty macro on PPC, to eliminate compiler warnings.
* Moved VG_(helper_offset) from vg_from_ucode.c to vg_main.c so that I didn't have to duplicate it in vg_ppc_from_ucode.c.
* Changed remove_if_exeseg_from_list() to handle removal of parts of executable segments.
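
For readers who want the uInstr semantics given earlier in this message in executable form, here is a minimal C sketch of the JMP condition values and of SETZ, XBIT, IBIT and CNTLZ. It is an illustration written from the descriptions above, not code from the patch, and the function names are hypothetical. One assumption is labelled in the code: for the JMP condition, "bit n of CR" is taken to use PowerPC numbering, where bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* JMP condition: 0..31 jump if that CR bit is 0, 32..63 jump if bit
 * (cond & 0x1f) is 1, 64 jump unconditionally. Returns 1 if taken. */
static int jmp_taken(uint32_t cr, unsigned cond)
{
    assert(cond <= 64);
    if (cond == 64)
        return 1;
    unsigned bit = cond & 0x1f;
    /* Assumption: CR bit 0 is the most significant bit. */
    int set = (int)((cr >> (31 - bit)) & 1u);
    return (cond < 32) ? !set : set;
}

/* SETZ: computes (x ? 0 : -1). */
static uint32_t setz(uint32_t x)
{
    return x ? 0u : 0xFFFFFFFFu;
}

/* XBIT: dst = (src >> n) & 1. */
static uint32_t xbit(uint32_t src, unsigned n)
{
    assert(n < 32);
    return (src >> n) & 1u;
}

/* IBIT: dst = (dst & ~(1 << n)) | ((src & 1) << n). */
static uint32_t ibit(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n < 32);
    return (dst & ~(1u << n)) | ((src & 1u) << n);
}

/* CNTLZ: number of zero bits above the most significant 1 bit,
 * or 32 if the source is 0. */
static uint32_t cntlz(uint32_t x)
{
    uint32_t n = 0;
    while (n < 32 && !(x & (0x80000000u >> n)))
        n++;
    return n;
}
```

For example, with only CR0.EQ set (CR = 0x20000000), condition value 34 (32 + 2) is taken and condition value 2 is not.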