https://www.phoronix.com/news/FFmpeg-swscale-Rewrite-Landing This swscale rewrite is less dependent on compiler auto-vectorization and introduces a new x86 SIMD back-end.
Niklas Haas reported that the single-threaded code was 2.1x faster overall and as much as 40.3x faster in the best case. For multi-threaded usage it was 2.6x faster overall and as much as 254x faster!
Also, from that forum: "hand crafted assembler will always be faster than compiler optimized code and no programmer worth his salt that really cares about performance would rely on compiler optimizations."