Here's the updated greedy2frame sse2 patch. In contrast to the first try,
I've now made it fall back to the mmxext path when the alignment
restrictions aren't met (I've also renamed that old path from sse, as
it's really mmxext, not sse). Thanks to Petri it also initializes the xmm
variables in a much less crappy way...
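A rough sketch of such a fallback check, not the code from the patch: the names choose_path, is_aligned16 and simd_path are made up for illustration, and it assumes the sse2 path needs 16-byte aligned pointers and a stride that is a multiple of 16 (as aligned 128bit loads and stores would require).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the sse2 path uses aligned 128bit accesses, so 16 bytes. */
#define SSE2_ALIGN 16

typedef enum { PATH_MMXEXT, PATH_SSE2 } simd_path;

/* Returns nonzero if the address is a multiple of 16. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & (SSE2_ALIGN - 1)) == 0;
}

/* Hypothetical dispatcher: take the sse2 path only when the cpu supports
 * it and every line start stays aligned; otherwise fall back to mmxext. */
static simd_path choose_path(const uint8_t *src, const uint8_t *dst,
                             size_t stride, int have_sse2)
{
    if (have_sse2 && is_aligned16(src) && is_aligned16(dst) &&
        (stride % SSE2_ALIGN) == 0)
        return PATH_SSE2;
    return PATH_MMXEXT;
}
```

Checking the stride as well as the base pointers matters because an odd stride would make every second line start unaligned even if the first one is fine.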
I couldn't quite reuse the old template. I'm sure there's some clever
way to reuse the same assembly for mmx and sse2 (the ffmpeg guys do that,
for instance), but that's the way it is now.
Btw, the performance results are interesting. On a c2d-class chip the
performance increase was little more than statistical noise (maybe 2%
overall, including h264 decode), even though this chip really can execute
the arithmetic twice as fast thanks to its 128bit simd units (it did
indeed clearly execute fewer instructions).
On an Athlon64 X2 the performance increase was much more substantial (25%
or so for the deinterlacer alone), even though this chip doesn't really
gain anything from using sse2 over mmx (due to its 64bit simd units).
All the difference came from incorporating the separate line copy loop
into the inline assembly and from the prefetch instructions (slightly more
improvement from the former than the latter). Of course that could be
done for the mmx path too, but I'm not sure how some older cpus would
react to that. Apparently on a c2d it makes no difference anyway (I'm
quite sure the prefetch instructions are just a total waste there; for
these simple patterns the hw prefetcher is more than adequate on that chip).
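To illustrate the idea of folding the line copy into the processing loop and adding prefetches, here is a minimal plain C sketch. It is not the patch's inline assembly: weave_and_interpolate is a hypothetical name, the rounded average is only a stand-in for the real greedy2frame arithmetic, the 64-byte prefetch distance is an assumed cache line size, and __builtin_prefetch is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: produces two output lines in a single pass.
 * out_kept gets a straight copy of the line above, out_interp gets a
 * computed line (here just a rounded average as a placeholder). */
static void weave_and_interpolate(uint8_t *out_kept, uint8_t *out_interp,
                                  const uint8_t *above, const uint8_t *below,
                                  size_t width)
{
    for (size_t x = 0; x < width; x++) {
        /* Once per assumed 64-byte cache line, hint the next line of
         * both inputs; skipped near the end to stay inside the buffers. */
        if ((x & 63) == 0 && x + 64 < width) {
            __builtin_prefetch(above + x + 64, 0, 0);
            __builtin_prefetch(below + x + 64, 0, 0);
        }
        /* The copy that used to be a separate loop over the same data. */
        out_kept[x] = above[x];
        /* Placeholder arithmetic, not the greedy2frame decision logic. */
        out_interp[x] = (uint8_t)((above[x] + below[x] + 1) >> 1);
    }
}
```

Doing the copy in the same pass means each input line is read from memory only once, which is where a memory-bound chip would be expected to benefit.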
So on a c2d things are limited by memory bandwidth, it seems. Might be
more of an improvement on chips which have both 128bit simd units AND a
fast memory interface (like Nehalem, Sandy Bridge, possibly Barcelona,
From: Roland Scheidegger <rscheidegger_lists@hi...> - 2012-06-11 10:14:58
On 10.06.2012 01:52, Darren Salt wrote:
> I demand that Roland Scheidegger may or may not have written...
>> here's the updated greedy2frame sse2 patch. In contrast to the first try
>> I've now made it fall back to the mmxext path when alignment
>> restrictions aren't met (I've also renamed that old path from sse as
>> it's really mmxext not sse). Thanks to Petri it also initializes the xmm
>> variables in a much less crappy way...
> UTTERLY BROKEN PATCHES...
> All is well on x86 and x86_64; that, in and of itself, is fine. BUT due to at
> least some of these MMX/SSE patches, building is now BROKEN on anything not
> x86-based. I don't know x86 asm or I might have given these more than a
> cursory glance...
> Here I was, thinking that I could do a release in plenty of time for the
> Debian wheezy release freeze. Now, 1.2.2 *could* end up missing the freeze
> unless we get this mess fixed *quickly*.
> I STRONGLY suggest that some reading of the build logs is in order. You will
> find the following link to be of use.
Sorry about that, it was just some x86-specific code that was outside the X86 macros.
Looks like Petri already fixed it.
At some point I was actually wondering why the templates get included at
all, since that deinterlacer is x86-only anyway.
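A sketch of what keeping x86-only code inside the architecture guards looks like. The macro names ARCH_X86 and ARCH_X86_64 and the function names are assumptions for illustration, not taken from the actual fix.

```c
#include <stddef.h>

typedef void (*deinterlace_fn)(void);

#if defined(ARCH_X86) || defined(ARCH_X86_64)
/* Only compiled on x86; real code would hold the inline assembly here. */
static void greedy2frame_sse2(void)
{
}
#endif

/* Hypothetical lookup: returns NULL on builds without the x86 path, so
 * no reference to the x86-only function leaks outside the guard. */
static deinterlace_fn pick_greedy2frame(void)
{
#if defined(ARCH_X86) || defined(ARCH_X86_64)
    return greedy2frame_sse2;
#else
    return NULL;
#endif
}
```

The point is that both the definition and every use of the x86-only function sit inside the same guard, so other architectures never see the symbol.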