|
From: Greg P. <gp...@us...> - 2005-09-05 02:39:14
|
Valgrind on Darwin/ppc32 is progressing well. Nulgrind simulated its
first instructions yesterday. Unfortunately, a simple do-nothing
program on Mac OS X still does quite a lot, so it falls over before
it can complete a process.

The port is proceeding faster than I expected, partly because
of the improvements in Valgrind 3.x. This is good news for other
potential ports: Darwin is less Linux-like than most other Unix
flavors, so everyone else should have an easier job.

I'm working from svn trunk plus some "no libc" and "no stage1"
changes copied from the aspacem branch. Those changes exactly
match what's needed for Darwin, which is handy.

The biggest problem blocking Darwin/ppc32 right now is VEX/ppc32.
It's not yet complete enough to run Darwin's dynamic loader, and
I don't know enough about VEX yet to fix it myself. If anyone
out there is listening, my current requests are `lwbrx`, `mftb`,
and `lvsl`, particularly the latter. (Unfortunately, I don't have
any Altivec-less machines to play with, and convincing Mac OS X
not to use it when it's present is tricky.)

I also poked at Darwin/x86 briefly. I was quickly stumped by
Darwin's syscalls. For sysenter-based syscalls, Darwin calls `jae`
afterwards to check success vs. failure. If I understand correctly,
this means the syscall is returning the success bit in the Carry
flag. The problem is that I don't know how to copy the Carry flag
back into VEX's state, because I don't understand VEX's complicated
eflags representation.

Other minor porting issues of the moment:
* Signals are nasty. I don't understand which parts of Valgrind's
  signal handling are ugly because Linux is ugly, and which are
  because signals overall are ugly. Luckily, Mac OS X generally
  doesn't use signals much, so I mostly turned off signal handling
  code for now. (Not to mention the problem of the parallel Mach
  exception mechanism.)
* Pulling the ELF-specific code out of the loader and adding a
  Mach-O path was easy. I haven't looked at the debuginfo module
  much, but it looks less easy.
* Handling Darwin/ppc's "set $pc based on syscall success for some
  syscalls" isn't fun, but I think I have it correct now. I can't
  run far enough to test it completely.
* Wrappers to monitor Mach traps are just coming online. My
  Valgrind 2.x experience was that they were mostly just tedious
  to discover and write.
* Actual memory checking will be missing for some time. I likely
  won't bother trying to bring it up until the helpful aspace
  changes are ready.
* My automake skillz ain't l33t enough to conditionalize all the
  build differences, but there's no need to make that work any
  time soon.

--
Greg Parker gp...@us... |
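On the Darwin/x86 point above: `jae` branches when the carry flag (CF, bit 0 of EFLAGS) is clear, so the convention described amounts to "CF clear means success, CF set means failure". Below is a minimal sketch of that convention in C. It assumes a toy guest state that stores EFLAGS as one flat word; `toy_guest_state`, `toy_set_syscall_result` and `toy_jae_taken` are hypothetical names for illustration only, and this is not how VEX represents the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EFLAGS_CF 0x1u  /* carry flag is bit 0 of x86 EFLAGS */

/* Hypothetical, simplified guest state: EFLAGS kept as one flat word.
   VEX's real eflags representation is more involved than this. */
typedef struct {
    uint32_t eax;     /* syscall return value or error code */
    uint32_t eflags;
} toy_guest_state;

/* Record a syscall outcome the way the `jae` check expects:
   CF clear on success, CF set on failure. */
static void toy_set_syscall_result(toy_guest_state *st,
                                   uint32_t value, bool failed)
{
    assert(st != NULL);
    st->eax = value;
    if (failed)
        st->eflags |= TOY_EFLAGS_CF;
    else
        st->eflags &= ~TOY_EFLAGS_CF;
}

/* What the guest's `jae` tests: the branch is taken when CF == 0. */
static bool toy_jae_taken(const toy_guest_state *st)
{
    assert(st != NULL);
    return (st->eflags & TOY_EFLAGS_CF) == 0;
}
```

A syscall wrapper would call something like `toy_set_syscall_result(&st, err, true)` on failure, and the guest's `jae` would then fall through to its error path.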
|
From: Cerion Armour-B. <ce...@op...> - 2005-09-05 21:56:14
|
On Monday 05 September 2005 04:39, Greg Parker wrote:
> Valgrind on Darwin/ppc32 is progressing well. Nulgrind simulated its
> first instructions yesterday. Unfortunately, a simple do-nothing
> program on Mac OS X still does quite a lot, so it falls over before
> it can complete a process.
>
> The port is proceeding faster than I expected, partly because
> of the improvements in Valgrind 3.x. This is good news for other
> potential ports: Darwin is less Linux-like than most other Unix
> flavors, so everyone else should have an easier job.
>
> I'm working from svn trunk plus some "no libc" and "no stage1"
> changes copied from the aspacem branch. Those changes exactly
> match what's needed for Darwin, which is handy.
>
> The biggest problem blocking Darwin/ppc32 right now is VEX/ppc32.
> It's not yet complete enough to run Darwin's dynamic loader, and
> I don't know enough about VEX yet to fix it myself. If anyone
> out there is listening, my current requests are `lwbrx`, `mftb`,
> and `lvsl`, particularly the latter. (Unfortunately, I don't have
> any Altivec-less machines to play with, and convincing Mac OS X
> not to use it when it's present is tricky.)
>

I'm about to get going with remaining ppc32 insns + altivec again -
i'll put those insns at the top of the list :-)

Cerion |
|
From: Greg P. <gp...@us...> - 2005-09-06 01:35:07
|
Cerion Armour-Brown writes:
> On Monday 05 September 2005 04:39, Greg Parker wrote:
> > The biggest problem blocking Darwin/ppc32 right now is VEX/ppc32.
> > It's not yet complete enough to run Darwin's dynamic loader, and
> > I don't know enough about VEX yet to fix it myself. If anyone
> > out there is listening, my current requests are `lwbrx`, `mftb`,
> > and `lvsl`, particularly the latter. (Unfortunately, I don't have
> > any Altivec-less machines to play with, and convincing Mac OS X
> > not to use it when it's present is tricky.)
>
> I'm about to get going with remaining ppc32 insns + altivec again -
> i'll put those insns at the top of the list :-)

Sounds good. I found a kernel switch to disable the Altivec unit,
so that's not blocking me anymore. The commented-out lwbrx seems
to work after a small tweak, and a do-nothing mftb works well
enough for the programs I'm running now.

lswx and stswx are rearing their ugly heads again. I have two
solutions coded up, but neither is great.

1. Abuse Mux0X by performing 128 1-byte memory ops from either
   the right place in memory or a scratch location. lswx generates
   about 1800 instructions, and 1400 for stswx. It also needs
   cooperation from memcheck, because memcheck needs to know about
   the scratch byte used to load and store when it shouldn't hit
   the real memory.

2. Use a dirty helper. This is fast, but for memcheck to work
   properly it really needs the "memory touched" size to be
   an expression instead of a constant. I don't know how messy
   this would be to deal with in memcheck.

However, I don't care about memcheck yet, so both solutions run
well enough.

--
Greg Parker gp...@us... |
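The first option above (a fixed run of 1-byte memory ops, each aimed either at the real address or at a scratch byte) can be modelled in plain C. This is only a sketch of the idea for the store side, not VEX IR: the conditional expression stands in for the Mux0X select, and `toy_stswx_selected`, `TOY_SLOTS` and the local `scratch` byte are hypothetical names. It assumes big-endian byte order within each register and register numbers that wrap from r31 to r0.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SLOTS 128u  /* fixed number of 1-byte ops, as in the message */

/* Model of stswx done as TOY_SLOTS unconditional byte stores.
   Slots beyond nbytes are redirected to a scratch byte, so the loop
   body has no data-dependent branch, only an address select. */
static void toy_stswx_selected(const uint32_t regs[32], unsigned rs,
                               uint8_t *ea, unsigned nbytes)
{
    uint8_t scratch = 0;  /* absorbs the stores that must not hit memory */

    assert(rs < 32 && nbytes <= TOY_SLOTS);
    for (unsigned i = 0; i < TOY_SLOTS; i++) {
        unsigned reg = (rs + i / 4) % 32;      /* registers wrap r31 -> r0 */
        unsigned shift = 24 - 8 * (i % 4);     /* most significant byte first */
        uint8_t byte = (uint8_t)(regs[reg] >> shift);

        /* The select: real destination while i < nbytes, else scratch. */
        uint8_t *dst = (i < nbytes) ? ea + i : &scratch;
        *dst = byte;  /* always exactly one 1-byte store */
    }
}
```

The scratch byte is why this approach needs cooperation from memcheck: the redirected stores are artifacts of the translation, not accesses made by the guest program.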
|
From: Julian S. <js...@ac...> - 2005-09-06 10:33:50
|
> Sounds good. I found a kernel switch to disable the Altivec unit,
> so that's not blocking me anymore. The commented-out lwbrx seems
> to work after a small tweak, and a do-nothing mftb works well
> enough for the programs I'm running now.
lwbrx is done, as is mftb{,u}. We tried the do-nothing game for
the x86/amd64 equivalent (rdtsc) and you end up with hangs,
strange long delays, or divisions by zero.
> lswx and stswx are rearing their ugly heads again. I have two
> solutions coded up, but neither is great.
Yeh.
> 2. Use a dirty helper. This is fast, but for memcheck to work
> properly it really needs the "memory touched" size to be
> an expression instead of a constant. I don't know how messy
> this would be to deal with in memcheck.
The problem is that the instrumentation code generated by memcheck
depends on that size -- that is, the number needs to be known at
translation time.
I'll contemplate this.
Let us know of any other insns you need.
J
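On the do-nothing time base mentioned above, here is one hedged illustration of how a constant counter can lead to a division by zero. The calibration routine is an invented example, not code from any real program; `toy_rate`, `toy_stub_timebase` and `toy_ticking_timebase` are hypothetical names. The same zero delta would also make a spin-wait of the form "loop until the counter has advanced by N" run forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*toy_timebase_fn)(void);

/* A do-nothing time base: every read returns the same value. */
static uint64_t toy_stub_timebase(void)
{
    return 0;
}

/* A fake but advancing time base: 100 ticks per read. */
static uint64_t toy_ticking_timebase(void)
{
    static uint64_t ticks;
    ticks += 100;
    return ticks;
}

/* Hypothetical calibration: units of work done per elapsed tick.
   Returns false at the point where unguarded code would divide by zero. */
static bool toy_rate(toy_timebase_fn read_tb, uint64_t work, uint64_t *rate)
{
    assert(read_tb != NULL && rate != NULL);
    uint64_t start = read_tb();
    uint64_t end = read_tb();
    uint64_t elapsed = end - start;

    if (elapsed == 0)
        return false;      /* work / elapsed would trap here */
    *rate = work / elapsed;
    return true;
}
```

With `toy_stub_timebase` the elapsed time is always zero, so only the explicit guard keeps the division from trapping; guest code written for real hardware typically has no such guard.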
|
|
From: Julian S. <js...@ac...> - 2005-09-06 11:06:45
|
> lswx and stswx are rearing their ugly heads again. I have two
> solutions coded up, but neither is great.
>
> 1. Abuse Mux0X by performing 128 1-byte memory ops from either
> the right place in memory or a scratch location. lswx generates
> about 1800 instructions, and 1400 for stswx.

I figured out in principle how to do this so that it (1) only
references memory it should and (2) is correctly instrumentable
by memcheck. Getting sane performance out of it will involve
adding a bit more magic to the IR optimiser, though.

Can you show me the insns in context? In particular it would
be useful to know how XER is set up prior to the insns.

J |
|
From: Greg P. <gp...@us...> - 2005-09-06 20:56:53
|
Julian Seward writes:
> Let us know of any other insns you need.
I think I'm getting the hang of vex's IR, so I should be able to
write ordinary instructions myself now. I have implementations
of lswi and stswi, though they need a bit more testing.
[lswx/stswx]
> Can you show me the insns in context? In particular it would
> be useful to know how XER is set up prior to the insns.
The uses I've seen so far are the G3 and G4 hand-written bcopy(),
which use lswx/stswx pairs for copying some parts of some lengths.
They're in the Darwin source at xnu:osfmk/ppc/commpage/bcopy_g3.s
and bcopy_g4.s .
Two excerpts are below. They include:
* G3, length <= 32 bytes
* G3, handle misaligned bytes at the start
* G4, handle misaligned bytes at the start.
I've highlighted the more relevant instructions in uppercase.
#define rs r4
#define rd r12
#define rc r5
#define w1 r6
#define w2 r7
#define w3 r8
#define w4 r9
#define w5 r10
#define w6 r11
// G3 bcopy(src=r3, dst=r4 (later rd), len=r5)
bcopy_g3:
cmplwi r5,kLong // length > 32 bytes?
sub w1,r4,r3 // must move in reverse if (rd-rs)<rc
mr rd,r4 // start to move source & dest to canonic spot
bge LLong0 // skip if long operand
MTXER r5 // set length for string ops
LSWX r5,0,r3 // load bytes into r5-r12
STSWX r5,0,r4 // store them
blr
// Long operands (more than 32 bytes.)
// w1 = (rd-rs), used to check for alignment
LLong0: // enter from bcopy()
mr rs,r3 // must leave r3 alone (it is return value for memcpy)
LLong1: // enter from memcpy() and memmove()
cmplw cr1,w1,rc // set cr1 blt iff we must move reverse
rlwinm r0,w1,0,0x3 // are operands relatively word-aligned?
NEG w2,rd // prepare to align destination
cmpwi cr5,r0,0 // set cr5 beq if relatively word aligned
blt cr1,LLongReverse // handle reverse move
ANDI. w4,w2,3 // w4 <- #bytes to word align destination
beq cr5,LLongFloat // relatively aligned so use FPRs
sub rc,rc,w4 // adjust count for alignment
srwi r0,rc,5 // get #chunks to xfer (>=1)
rlwinm rc,rc,0,0x1F // mask down to leftover bytes
mtctr r0 // set up loop count
beq 1f // dest already word aligned
// Word align the destination.
MTXER w4 // byte count to xer
cmpwi r0,0 // any chunks to xfer?
LSWX w1,0,rs // move w4 bytes to align dest
add rs,rs,w4
STSWX w1,0,rd
add rd,rd,w4
beq- 2f // pathologic case, no chunks to xfer
1: [...]
// G4 bcopy, "medium" length (32..95 bytes)
// bcopy(src=rs, dst=rd, len=rc)
// Medium and long operands. Use Altivec if long enough, else scalar loops.
// w1 = (rd-rs), used to check for alignment
// cr1 = blt iff we must move reverse
LMedium:
dcbtst 0,rd // touch in destination
cmplwi cr7,rc,kLong // >= 96, long enough for vectors?
NEG w3,rd // start to compute #bytes to align destination
rlwinm r0,w1,0,0x7 // check relative 8-byte alignment
ANDI. w6,w3,7 // w6 <- #bytes to 8-byte align destination
blt cr1,LMediumReverse // handle reverse moves
rlwinm w4,w3,0,0x1F // w4 <- #bytes to 32-byte align destination
cmpwi cr6,r0,0 // set cr6 beq if relatively aligned
bge cr7,LFwdLong // long enough for vectors
// Medium length: use scalar loops.
// w6/cr0 = #bytes to 8-byte align destination
// cr6 = beq if relatively doubleword aligned
sub rc,rc,w6 // decrement length remaining
beq 1f // skip if dest already doubleword aligned
MTXER w6 // set up count for move
LSWX w1,0,rs // move w6 bytes to align destination
STSWX w1,0,rd
[...]
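In all three excerpts XER is set by an `mtxer` from the register holding the byte count, immediately before the `lswx`/`stswx` pair. As a rough reference model, the short path of bcopy_g3 (`mtxer r5; lswx r5,0,r3; stswx r5,0,r4`) can be written in C as below. This is a sketch under stated assumptions, not emulator code: `toy_ppc`, `toy_lswx`, `toy_stswx` and `toy_bcopy_short` are hypothetical names, only the byte-count field of XER is modelled, and a partially filled last register is assumed to have its unfilled low-order bytes zeroed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy CPU state: 32 GPRs plus the XER byte count. */
typedef struct {
    uint32_t gpr[32];
    unsigned xer_count;   /* byte count that mtxer places in XER */
} toy_ppc;

/* lswx rd,0,ea: load xer_count bytes into rd, rd+1, ...,
   most significant byte first, wrapping from r31 to r0. */
static void toy_lswx(toy_ppc *cpu, unsigned rd, const uint8_t *ea)
{
    assert(rd < 32 && cpu->xer_count <= 128);
    for (unsigned i = 0; i < cpu->xer_count; i++) {
        unsigned reg = (rd + i / 4) % 32;
        unsigned shift = 24 - 8 * (i % 4);
        if (i % 4 == 0)
            cpu->gpr[reg] = 0;   /* start each register from zero */
        cpu->gpr[reg] |= (uint32_t)ea[i] << shift;
    }
}

/* stswx rs,0,ea: store xer_count bytes from rs, rs+1, ... */
static void toy_stswx(const toy_ppc *cpu, unsigned rs, uint8_t *ea)
{
    assert(rs < 32 && cpu->xer_count <= 128);
    for (unsigned i = 0; i < cpu->xer_count; i++) {
        unsigned reg = (rs + i / 4) % 32;
        unsigned shift = 24 - 8 * (i % 4);
        ea[i] = (uint8_t)(cpu->gpr[reg] >> shift);
    }
}

/* Short path of bcopy_g3: copy up to 32 bytes through r5-r12. */
static void toy_bcopy_short(toy_ppc *cpu, const uint8_t *src,
                            uint8_t *dst, unsigned len)
{
    assert(len <= 32);
    cpu->xer_count = len;      /* MTXER r5     */
    toy_lswx(cpu, 5, src);     /* LSWX  r5,0,r3 */
    toy_stswx(cpu, 5, dst);    /* STSWX r5,0,r4 */
}
```

The count is only known at run time, which is the property that makes these instructions awkward to instrument with a translation-time constant size.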
|