Besides the actual stream decoding, the biggest video decoding bottleneck is converting the decoded frames from the YUV colorspace to whatever you require (usually RGB).
Initially I wrote a C implementation and optimised it as much as possible. But that's still a C implementation, and its performance is compiler dependent.
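To give an idea of what such a plain C conversion involves, here is a minimal sketch. It is not the actual theoraplayer code: the function name `yuv420p_to_rgb24`, the planar 4:2:0 input layout, the packed RGB24 output and the fixed-point BT.601 video-range coefficients are all assumptions made for illustration.

```c
#include <stdint.h>

/* Shift a fixed-point value (8 fractional bits) down and clamp it to 0..255.
   Negative values are clamped before shifting to avoid relying on
   implementation-defined right shifts of negative integers. */
static uint8_t clamp_shift8(int v)
{
    if (v < 0) return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical helper: converts a planar YUV 4:2:0 frame to packed RGB24.
   Assumes BT.601 video-range input (Y in 16..235, U/V in 16..240),
   with one U and one V sample shared by each 2x2 block of pixels. */
void yuv420p_to_rgb24(const uint8_t *y, int y_stride,
                      const uint8_t *u, const uint8_t *v, int uv_stride,
                      uint8_t *rgb, int rgb_stride,
                      int width, int height)
{
    for (int row = 0; row < height; ++row) {
        const uint8_t *y_row = y + row * y_stride;
        const uint8_t *u_row = u + (row / 2) * uv_stride;
        const uint8_t *v_row = v + (row / 2) * uv_stride;
        uint8_t *out = rgb + row * rgb_stride;

        for (int col = 0; col < width; ++col) {
            /* Scaled luma with the rounding term already added. */
            int c = 298 * (y_row[col] - 16) + 128;
            int d = u_row[col / 2] - 128;
            int e = v_row[col / 2] - 128;

            out[0] = clamp_shift8(c + 409 * e);           /* R */
            out[1] = clamp_shift8(c - 100 * d - 208 * e); /* G */
            out[2] = clamp_shift8(c + 516 * d);           /* B */
            out += 3;
        }
    }
}
```

Every output pixel costs several multiplications, additions and clamps, and how well that loop gets vectorised is entirely up to the compiler. That is the part hand-written SIMD assembly can speed up.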
So, to reduce the performance impact of this process, I have now integrated Google's libyuv library, which contains assembly-optimised code for colorspace conversion.
In the few cases I've tested so far, I measured up to 5 times better performance on MacOS using libyuv compiled with SSSE3.
At this point I've enabled libyuv on Mac using SSSE3 and on iOS using ARM NEON optimisations.
Next on the list is enabling it on Win32 and Android.
Android is going to be a bit tricky, because at the moment theoraplayer supports only the armv5te architecture, while libyuv supports only ARM NEON (armv7a) and newer. I found some armv5te-optimised assembly code for Android online, so I'll investigate that next.