On 10.04.2012 17:45, Roland Scheidegger wrote:
> On 10.04.2012 15:57, Petri Hintukainen wrote:
>> On ti, 2012-04-10 at 01:55 +0200, Roland Scheidegger wrote:
>>> Missed the availability of pavgb instruction last time (since the code
>>> required mmxext anyway) which is VERY useful. Cuts down assembly
>>> instructions nearly in half (and makes it completely memory bound again,
>>> the performance improvement is less than 10% here). Keep the old version
>>> fixed to work with mmx only if that's still useful.
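[Editor's note: for reference, the per-byte arithmetic pavgb performs can be sketched in scalar C. This is an illustration of the instruction's rounding semantics only, not code from the patch; pavgb does the same thing for 8 bytes at once.]

```c
#include <stdint.h>

/* Illustrative only: what pavgb computes per byte (rounded average)
 * versus the truncating average the plain-mmx shift-based code
 * produces. The intermediate sum is done in int, so it cannot wrap. */
static uint8_t avg_pavgb(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);   /* round half up */
}

static uint8_t avg_trunc(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);       /* truncate */
}
```

For example, averaging 100 and 101 gives 101 with the pavgb semantics but 100 with truncation, which is the rounding difference discussed further down in the thread.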
>> Nice :)
>> Some notes about the new (and old) code:
>> Instructions have some latency, and most of the new instructions now
>> depend on each other (the mm3 register). This slows down execution.
>> Better scheduling might improve performance even more ?
> This is possible, but it is near impossible to measure as waiting on
> memory probably exceeds waiting on instruction dependencies by an order
> of magnitude anyway (well on the cpus I tested this on - it is true they
> have throughput of 1 (or 2 even) and latency of 1 for most of these
> instructions, but OTOH those old ones which have lower throughput/higher
> latency have memory systems which suck so it probably doesn't change
> things much). Also, keep in mind all cpus (ok almost all except atoms
> and pentium mmx and the latter doesn't apply to the mmxext path) we're
> looking at here are out-of-order with instruction windows of ~40 or so
> instructions so manually trying to reorder instructions probably doesn't
> really do much. Interleaving instructions (i.e. process both line1 and
> line2 simultaneously) can be done quite easily (as there are 2 unused mm
> regs actually), though I think it makes the code less readable so that's
> why I mostly didn't do it (considering in the end it doesn't really make
> a difference anyway).
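[Editor's note: the interleaving idea can be sketched in scalar C. This is illustrative only — the names and the trivial body are made up; the real code would do this in asm with the two spare mm registers.]

```c
#include <stdint.h>

/* Illustrative sketch: handle line1 and line2 in the same iteration so
 * the shared chrominance load happens only once per pair of outputs. */
static void avg_two_lines(uint8_t *dst1, uint8_t *dst2,
                          const uint8_t *src1, const uint8_t *src2,
                          const uint8_t *chroma, int n)
{
    for (int i = 0; i < n; i++) {
        uint8_t c = chroma[i];                       /* loaded once */
        dst1[i] = (uint8_t)((src1[i] + c + 1) >> 1); /* both lines reuse it */
        dst2[i] = (uint8_t)((src2[i] + c + 1) >> 1);
    }
}
```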
>> ".align ..." at the beginning of the asm block is useless. The loop is
>> generated by the compiler, so it should take care of aligning the block.
>> An align here can even make the loop slower, if there is some register
>> loading in the loop body before the asm block.
> You are right this should go away here.
>> Incrementing some of the data pointers could be interleaved to asm code.
>> This might speed up things when there are not enough registers to hold
>> all pointers (x86_32). Interleaving could speed up the code on x86_64
>> too (mmx/sse uses different execution units than add, and some bubbles
>> could be filled).
> Yeah, but that makes things more complicated and ugly. It might be a
> better idea to completely skip pointer incrementing and just use more
> complex indirect addressing instead (using the SIB modes), I'm actually
> unsure if that has a performance penalty itself or is free (I did some
> experiments along these lines elsewhere in the deinterlacer, but this
> has the same problem as changing arithmetic portion of the code doesn't
> change performance at all anyway).
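[Editor's note: the SIB idea can be illustrated with a toy x86 snippet — hypothetical, not from the patch. One index register can serve several planes when the SIB scale factor encodes their different advance rates.]

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical x86-only illustration: a single index register addresses
 * both a luma plane (advancing twice as fast, scale 2 in the SIB byte)
 * and a chroma plane (scale 1), so the loop would need one increment
 * instead of several pointer updates. */
static uint8_t load_scaled(const uint8_t *luma, const uint8_t *chroma,
                           size_t idx, uint8_t *c_out)
{
    unsigned int y, c;
    __asm__ ("movzbl (%2,%4,2), %0 \n\t"   /* luma[idx * 2] */
             "movzbl (%3,%4), %1"          /* chroma[idx]   */
             : "=&r"(y), "=r"(c)
             : "r"(luma), "r"(chroma), "r"(idx));
    *c_out = (uint8_t)c;
    return (uint8_t)y;
}
```

Whether the more complex addressing mode is free or costs a cycle is exactly the open question raised above.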
>> Multiplying by 5 and 7 might be faster with pmullw, if instructions are
>> interleaved properly. There are some other instructions that take a long
>> time and have low throughput (unpack). But I don't know if it would
>> speed up the new mmxext version (or even mmx one).
> You mean faster than the shifts/adds? This is possible (especially on
> intel cpus) - it would essentially trade 2 shifts/2 adds for 2 muls,
> though the muls would also need a constant load. Personally I'm not all
> that interested in the mmx version, and there is no way this can help
> the mmxext case (you have twice as many instructions just because you
> are working on 16bit numbers, plus mul is typically a much worse
> instruction than pavgb both in latency and throughput).
> unpack though doesn't really have high latency/low throughput (latency 1
> on all intel cpus, though it might be 2 on some amd cpus).
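[Editor's note: for reference, the shift/add sequences under discussion look like this in scalar form. Illustrative only — the real code does this on packed 16bit words, and these are the standard decompositions, not necessarily the exact ones in the patch.]

```c
#include <stdint.h>

/* Illustrative: the constant multiplies that pmullw could replace.
 * 5x = 4x + x (one shift, one add); 7x = 8x - x (one shift, one sub). */
static uint16_t mul5(uint16_t x) { return (uint16_t)((x << 2) + x); }
static uint16_t mul7(uint16_t x) { return (uint16_t)((x << 3) - x); }
```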
>> Could the interlaced loops (y) be combined ? It could make the code
>> simpler. It could also reduce cache misses and even improve streaming
>> stores. The loop is already over 200 bytes long.
> You mean you only want to run through the code once both for even and
> odd lines? I thought about this and my conclusion was that it would
> probably perform the same. The lines are large enough (well for video
> sizes _I_ care about) that non-contiguous write access of lines should
> have a minimal impact, and on the other hand you need to access more of
> the (reused in several lines) chrominance data simultaneously which
> possibly means it will get evicted from L1. I haven't actually tried
> it; this is by no means the most optimized version possible, just
> reasonably optimized. If you think this is nicer I can change it.
>> I don't know if streaming store is optimal here; in many cases data is
>> used immediately after conversion, and with some CPUs SD video fits
>> perfectly well in L2 cache.
> This is a valid point, but I was optimizing for hd resolutions (and the
> old code used streaming stores too). Granted this could still fit into
> caches with a new enough cpu (just needs more than 4MB cache, which I
> definitely don't have). Even with sd I've got no hope of fitting it into
> cache (2x512KB L2 only). CPUs with such large caches though are likely
> to be "fast enough" anyway hence I think it's reasonable to optimize
> this for cases where it doesn't fit into cache (and in this case
> streaming stores are a big big win). Unless the code would take cache
> (and video) size into account...
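[Editor's note: a minimal sketch of the streaming-store idea, written with SSE2 intrinsics for readability; the asm under discussion uses the non-temporal store instructions directly. The helper name and loop structure are made up.]

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a non-temporal copy: _mm_stream_si128 writes
 * around the cache, which is the big win when the frame does not fit
 * into cache anyway. dst must be 16-byte aligned and n a multiple of
 * 16 in this simplified version. */
static void copy_stream(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    _mm_sfence();   /* order the streamed stores before later reads */
}
```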
>> Have you tried if using prefetch for YUV planes has any effect ?
> I suspect this is just about the only optimization which could really
> make a difference but I didn't try yet (I am more concerned with
> yv12->yuy2 performance including the deinterlacer and the deinterlacer
> is taking a much bigger slice of cpu time, if you don't use deinterlacer
> you typically don't need yv12->yuy2 anyway - I was thinking about
> integrating yv12->yuy2 conversion into deinterlacer too which I think
> would help quite a bit both for performance and quality but that's a
> complex task.)
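[Editor's note: the prefetch idea, sketched with the GCC builtin. The 256-byte distance is a placeholder that would need tuning, and the real asm could equally use a prefetch instruction such as prefetchnta directly.]

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: request source data a fixed distance ahead so it
 * is already in cache when the loop reaches it. Prefetching past the
 * end of the buffer is harmless; the instruction never faults. */
static uint32_t sum_prefetched(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(src + i + 256, 0, 0); /* read, non-temporal */
        sum += src[i];
    }
    return sum;
}
```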
>>> This code should also do better rounding (not 100% correct but better
>>> than the old code which did truncation),
>> Adding rounding to the C version should not hurt performance much? That
>> would allow testing of accelerated versions by running both versions and
>> comparing the results.
> Oh, I didn't think of that. The C version is already very slow though.
> Getting the same rounding might be quite tricky actually (so might
> indeed slow it down quite a bit) at least for the interlaced case (and
> the question is, same to what, as the mmx and mmxext versions don't have
> the same rounding). I think it would be better if you want to compare
> results (programmatically) to just allow a difference of 1 (or 2?).
> Visually the difference should not be noticeable, I think (well, I
> couldn't see any...).
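[Editor's note: the suggested tolerance-based comparison could look like this — a sketch with made-up names, assuming a per-byte tolerance of 1 or 2 as proposed above.]

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */
#include <stddef.h>

/* Sketch of the programmatic check: the C and SIMD paths round
 * differently, so accept a small per-byte difference instead of
 * demanding bit-exact output. */
static int buffers_match(const uint8_t *a, const uint8_t *b,
                         size_t n, int tol)
{
    for (size_t i = 0; i < n; i++)
        if (abs((int)a[i] - (int)b[i]) > tol)
            return 0;
    return 1;
}
```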
>>> and is also changed to use the
>>> same assembly for both odd and even lines (as a simple argument swap is
>>> all that's needed).
>> Applied this part.
>> I have also SSE2 and AVX versions of the code. Maybe we should create
>> template for the function, I don't like the idea of having 5 copies of
>> almost identical function ...
>> I was also thinking that we could leave all mmx (+mmxext?) versions out
>> when compiling to 64-bit systems, as there is always at least SSE2
>> support. I think even the oldest 64-bit kernels support SSE2 too?
> Both suggestions look perfectly reasonable to me. It would be trivial to
> always use sse2 versions instead of mmxext for everything, save for that
> alignment problem which I didn't find a good solution for (this is the
> only reason why I didn't write a sse2 version and stuck with mmxext here
> - well that and performance is the same anyway because of both memory
> bandwidth limitations and the cpu I got having only 64bit simd units).
> 64bit environment has xmm regs defined in the abi for passing arguments
> so yes sse2 support can be assumed.
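[Editor's note: the templating suggestion could be sketched with a function-generating macro. Everything here is hypothetical — the trivial scalar body stands in for the real asm block, which would be parameterised on register width / instruction set instead.]

```c
#include <stdint.h>

/* Hypothetical sketch: generate the near-identical per-ISA variants
 * from one macro body instead of keeping 5 hand-maintained copies. */
#define DEFINE_STORE_YUY2(suffix)                                      \
static void store_yuy2_##suffix(uint8_t *dst, uint8_t y0, uint8_t u,   \
                                uint8_t y1, uint8_t v)                 \
{                                                                      \
    dst[0] = y0; dst[1] = u; dst[2] = y1; dst[3] = v;                  \
}

DEFINE_STORE_YUY2(mmx)
DEFINE_STORE_YUY2(sse2)
```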
Hmm btw I think there was some bug in the earlier patch... Must have
done some "trivial" changes without much testing...