Re: [linuxsh-dev] Fixup for the fast SH4 memcpy

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Jivin Stuart Menefy lays it down ...
> Hi David
> 
> On Thu, 27 Feb 2003 17:41:44 +1000 da...@sn... wrote:
> 
> > I found a bug today in the fast SH4 memcpy that Stuart Menefy graciously
> > wrote for us :-).
> 
> All I can say is nuts, and sorry!

No problem,  even with the bug the performance improvements are well
worth the effort and I would rather debug the existing version than
write my own from scratch.  It must have taken a bit of work to get
it going ;-)

> I had a suspicion that there was still a lurking problem, as using
> this memcpy instead of glibc's memcpy caused bash to generate some
> strange errors after a while. However I was never able to pin down the
> circumstances. Before I could investigate further I got diverted onto
> some other work.
> 
> I thought I'd run through all possible tests before releasing this,
> but you found one I'd missed. As soon as I added your size and
> alignments to my test code it spotted it as well. I've now given up on
> testing only selected alignments, and am doing exhaustive tests!

Yeah,  I wrote a quick exhaustive test to check my changes before I
posted them.  Took about 15 minutes to run but was worth it ;-)
I can post a copy if you like.

> > Attached is a patch,  and the changed original just in case.  I would
> > appreciate any feedback on the correctness/speed implications.  I ran
> > it through an extensive alignment/boundary case test without any problems
> > and it fixed my original crash :-)
> 
> Once I had a test case to throw at it, I fixed it independently, and
> came up with an almost identical fix, a copy of which is
> attached. Either will fix it, and I can't see any correctness
> problems with either.
> 
> Performance wise, they are identical: both add one cycle. The only way
> I can see of avoiding the extra test is to use the byte at a time copy
> for remaining lengths 4 to 7.  Best case (4 bytes left) this would
> increase the time from 12 cycles to 19, so I don't think its worth it
> to save one cycle in the 8 to 31 case, which will take a minimum of
> 15 cycles anyway (timings from the 1: label to ret instr).

Cool,  I'll use you version so I am in sync with what you have.  Also I
think the cmp is easier to understand when reading the code ;-)

> So thank you for all the hard work you must have put in tracking this
> down, and sorry again.

No need to be sorry :-)

Thanks,
Davidm

-- 
David McCullough:    Ph: +61 7 3435 2815  http://www.SnapGear.com
da...@sn...  Fx: +61 7 3891 3630  Custom Embedded Solutions + Security