Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
yes, as i said
(they don’t make that comparison in the article)
so its not clear exactly how handwritten asm compares to intrinsics in this specific comparison. we can’t assume their handwritten AVX-512 asm and instrinics AVX-512 will perform identically here, it may be better, or worse.
also worth noting they’re discussing benchmarking of a specific function, so overall performance on executing a given set of commands may be quite different depending what can and can’t be unrolled and in which order for different dependencies.