[xine-cvs] HG: xine-lib: Emit vzeroupper after avx memcpy

# HG changeset patch
# User Roland Scheidegger <rsc...@hi...>
# Date 1379458717 -3600
# Node ID 951e79f6a580678c6c2aae0b43638f7d257fcac1
# Branch  default
# Parent  f8839feb700572c2c5ab722c0d827657f77a4a7c
Emit vzeroupper after avx memcpy

Emitting vzeroupper is necessary to avoid avx<->sse transition penalties (when
using avx-256 instructions).
This didn't matter much in the past: other code wasn't using avx, so there was
just a one-time penalty when sse code was executed afterwards.
However, there's code in ffmpeg which mixes avx-128 and sse a lot, and each
time this happens there's a huge penalty. In particular, this slows
ff_deblock_v_luma_8_avx down by a factor of 50 or so, which makes the whole
decoding about twice as slow. This might depend on the h264 stream, or maybe
on the ffmpeg version too: ffmpeg also emits vzeroupper when it uses avx-256,
so not doing it here might not always be an issue, but in the case I was
seeing nothing else used avx-256.
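
For illustration only, here is a minimal sketch of the pattern described above: a
copy loop that goes through 256-bit ymm registers and clears their upper halves
before returning. It is not the xine-lib code. The names sketch_memcpy and
copy_avx256 are hypothetical, it uses compiler intrinsics and plain unaligned
stores instead of the inline asm and non-temporal stores in memcpy.c, and it
assumes GCC or clang (for the target attribute and __builtin_cpu_supports).

```c
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_AVX_SKETCH 1
#endif

#ifdef HAVE_X86_AVX_SKETCH
/* Hypothetical helper: built with AVX enabled for this function only. */
__attribute__((target("avx")))
static void copy_avx256(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy 32 bytes per iteration through a 256-bit ymm register. */
    while (len >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        s += 32;
        d += 32;
        len -= 32;
    }

    /* Emits vzeroupper: zero the upper 128 bits of all ymm registers so
     * that sse code running afterwards does not pay a transition penalty. */
    _mm256_zeroupper();

    /* Copy the remaining tail (fewer than 32 bytes). */
    memcpy(d, s, len);
}
#endif

/* Hypothetical entry point: use the avx path only if the CPU supports it,
 * otherwise fall back to the C library memcpy. */
void *sketch_memcpy(void *dst, const void *src, size_t len)
{
#ifdef HAVE_X86_AVX_SKETCH
    if (__builtin_cpu_supports("avx")) {
        copy_avx256(dst, src, len);
        return dst;
    }
#endif
    return memcpy(dst, src, len);
}
```

Note that a compiler may insert vzeroupper on its own at the end of a function
built with avx enabled; hand-written inline asm such as the code patched below
gets no such help, which is why the instruction has to be emitted explicitly.
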

diff --git a/src/xine-utils/memcpy.c b/src/xine-utils/memcpy.c
--- a/src/xine-utils/memcpy.c
+++ b/src/xine-utils/memcpy.c
@@ -249,6 +249,7 @@
     /* since movntq is weakly-ordered, a "sfence"
      * is needed to become ordered again. */
     __asm__ __volatile__ ("sfence":::"memory");
+    __asm__ __volatile__ ("vzeroupper");
   }
   /*
    *	Now do the tail of the block