c++ - SSE and AVX intrinsics mixture
In addition to the question about SSE-copy, AVX-copy and std::copy performance: suppose I need to vectorize a loop in the following manner:

1) Vectorize the first batch of the loop (whose size is a multiple of 8) via AVX.
2) Split the loop's remainder into two batches, and vectorize the batch whose size is a multiple of 4 via SSE.
3) Process the residual batch of the loop via a serial routine.

Let's consider this example of copying arrays:
    #include <immintrin.h>
    #include <algorithm>

    template<int length,
             int unroll_bound_avx = length & (~7),
             int unroll_tail_avx  = length - unroll_bound_avx,
             int unroll_bound_sse = unroll_tail_avx & (~3),
             int unroll_tail_last = unroll_tail_avx - unroll_bound_sse>
    void simd_copy(float *src, float *dest)
    {
        auto src_  = src;
        auto dest_ = dest;

        // Vectorize the first part of the loop via AVX (8 floats per iteration)
        for (; src_ != src + unroll_bound_avx; src_ += 8, dest_ += 8)
        {
            __m256 buffer = _mm256_load_ps(src_);
            _mm256_store_ps(dest_, buffer);
        }

        // Vectorize the remainder of the loop via SSE (4 floats per iteration)
        for (; src_ != src + unroll_bound_avx + unroll_bound_sse; src_ += 4, dest_ += 4)
        {
            __m128 buffer = _mm_load_ps(src_);
            _mm_store_ps(dest_, buffer);
        }

        // Process the residual elements serially
        for (; src_ != src + length; ++src_, ++dest_)
            *dest_ = *src_;
    }

    int main()
    {
        const int sz = 15;
        // 32-byte alignment is required for the aligned AVX loads/stores above
        float *src  = (float *)_mm_malloc(sz * sizeof(float), 32);
        float *dest = (float *)_mm_malloc(sz * sizeof(float), 32);

        float a = 0;
        std::generate(src, src + sz, [&]() { return ++a; });

        simd_copy<sz>(src, dest);

        _mm_free(src);
        _mm_free(dest);
    }
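(Side note: _mm256_load_ps requires 32-byte alignment. If the buffers cannot be guaranteed to be 32-byte aligned, a variant built on the unaligned intrinsics might look like the following sketch; simd_copy_unaligned is just a hypothetical name used for illustration, not part of the code above.)

    template<int length>
    void simd_copy_unaligned(float *src, float *dest)
    {
        int i = 0;

        // AVX part: 8 floats per iteration, unaligned load/store
        for (; i + 8 <= length; i += 8)
            _mm256_storeu_ps(dest + i, _mm256_loadu_ps(src + i));

        // SSE part: 4 floats per iteration
        for (; i + 4 <= length; i += 4)
            _mm_storeu_ps(dest + i, _mm_loadu_ps(src + i));

        // Scalar tail
        for (; i < length; ++i)
            dest[i] = src[i];
    }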
Is it correct to use both SSE and AVX this way? Do I need to avoid AVX-SSE transitions?
You can mix SSE and AVX intrinsics as much as you want.
The only thing you need to make sure of is that you specify the correct compiler flag to enable AVX:
- GCC: -mavx
- Visual Studio: /arch:AVX
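As a quick sanity check that the flag is actually in effect (e.g. after compiling with g++ -O2 -mavx), you can test the predefined macro __AVX__, which both GCC (with -mavx) and Visual Studio (with /arch:AVX) define. A minimal sketch:

    #include <immintrin.h>

    #ifndef __AVX__
    #error "AVX is not enabled -- compile with -mavx (GCC) or /arch:AVX (Visual Studio)"
    #endif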
Failing to do either will result in the code not compiling (GCC), or, in the case of Visual Studio, in very inefficient generated code.
That flag forces all SIMD instructions to use the VEX encoding, which avoids the state-switching penalties described in the question above.
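If your AVX code does end up calling into separately compiled legacy-SSE code (for example, a library built without -mavx or /arch:AVX), the usual way to avoid the transition penalty is to clear the upper halves of the YMM registers with _mm256_zeroupper() before the call. A minimal sketch, where legacy_sse_routine is a hypothetical stand-in for such a routine and the buffer is assumed to hold at least 8 floats:

    #include <immintrin.h>

    // Stand-in for a routine compiled without -mavx / /arch:AVX,
    // i.e. one that uses legacy (non-VEX) SSE encodings.
    void legacy_sse_routine(float *data)
    {
        __m128 v = _mm_loadu_ps(data);
        _mm_storeu_ps(data, _mm_mul_ps(v, v));
    }

    void process(float *data)
    {
        // AVX work: operates on the full 256-bit YMM registers
        __m256 v = _mm256_loadu_ps(data);
        _mm256_storeu_ps(data, _mm256_add_ps(v, v));

        // Clear the upper 128 bits of every YMM register before any
        // legacy-SSE code runs, so the CPU does not pay the
        // AVX -> SSE state-transition penalty.
        _mm256_zeroupper();

        legacy_sse_routine(data);
    }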