c++ - How to optimize my matrix multiplication using SIMD AVX2 instructions? - Stack Overflow

I have implemented a function that calculates the matrix product A[i,k] * B[k,j] and stores it in C[i,j].

Using C++, I know that for matrices A and C the memory access is direct and sequential, but for B I need to walk down its columns, so the only thing I was able to think of is to transpose B.

Here is the code:

#include <immintrin.h>  // AVX intrinsics
#include <cstdlib>      // malloc / free

// BLOCK_SIZE is a tile-size constant defined elsewhere in my code.

void multMat1(int n, float *A, float *B, float *C) {
    float *Bt = (float*)malloc(sizeof(float)*n*n);
    // transposing B to Bt, tile by tile
    for (int col = 0; col < n; col += BLOCK_SIZE) {
        for (int row = 0; row < n; row += BLOCK_SIZE) {
            for (int x = row; x < row + BLOCK_SIZE && x < n; x++) {
                for (int y = col; y < col + BLOCK_SIZE && y < n; y++) {
                    Bt[y + x * n] = B[x + y * n];
                }
            }
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            __m256 C_temp_vect = _mm256_setzero_ps();
            for (int k = 0; k < n; k += 8) {
                // loading 8 elements from A and Bt (transposed B)
                __m256 A_vect = _mm256_loadu_ps(&A[i * n + k]);
                __m256 B_vect = _mm256_loadu_ps(&Bt[j * n + k]);
                C_temp_vect = _mm256_add_ps(C_temp_vect, _mm256_mul_ps(A_vect, B_vect));
            }
            alignas(32) float temp[8];  // aligned so the aligned store below is valid
            float sum = 0.0f;
            _mm256_store_ps(temp, C_temp_vect);
            // summation of all the elements of the vector temp
            for (int index = 0; index < 8; index++)
                sum += temp[index];
            C[i * n + j] = sum;
        }
    }
    free(Bt);
}
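As an aside, the final 8-to-1 reduction of C_temp_vect can also be done in-register instead of going through a temporary array. A minimal sketch using the same <immintrin.h> intrinsics; hsum256_ps is a hypothetical helper name, not part of the benchmarked code:

static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);       // lower 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);     // upper 4 floats
    __m128 s  = _mm_add_ps(lo, hi);              // 8 -> 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // 4 -> 2
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // 2 -> 1
    return _mm_cvtss_f32(s);                     // extract the scalar
}

With that, the reduction loop would become C[i * n + j] = hsum256_ps(C_temp_vect);.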

I also implemented some other versions, such as a naive and a threaded one; here are the results.

- Naive Version : ijk: n = 1000, 7.421 Gflop/s

- Threaded Version : ijk: n = 1000, 42.237 Gflop/s

- SIMD Version (with CXXFLAGS=-ggdb -Wall -pedantic -mavx2 -O3 added to the makefile): ijk: n = 1000, 2.221 Gflop/s

Where is the problem here? From what I know, the SIMD version should be the fastest.

asked Mar 21 at 21:07 by Acno_Sama · edited Mar 21 at 23:24 by Peter Cordes
  • Are you sure the naive version isn't optimized via a higher-level SIMD? Also, have you considered checking the SIMD version with the Bt matrix stored on the side so you don't have to constantly malloc/free it at each call? – ALX23z Commented Mar 21 at 21:25
  • I assume your main problem is the latency when summing C_temp_vect -- your inner loop should ideally accumulate in 8 registers in parallel (when using FMA instead of separate mul/add). Besides that, there is no need to transpose B, and for matrices larger than the CPU cache you need to have a look into blocking; I recommend having a look at existing BLAS implementations. – chtz Commented Mar 21 at 21:37 (a sketch of this suggestion follows the comments)
  • @user4581301 I'm compiling using '-ggdb -Wall -pedantic -mavx2 -O3', and for the matrices I am using malloc(). – Acno_Sama Commented Mar 21 at 21:41
  • @ALX23z I did not get what you mean here, sir. Bt is a matrix that stores B transposed; it is allocated inside the function, and I free it as soon as I finish the multiplication. – Acno_Sama Commented Mar 21 at 21:43
  • Compare your matrix calculation results with np.matmul first, to make sure your functions do the right thing. And/or some well-established C++ linear algebra library, like Boost or Eigen. When your calculations match those, compare your calculation speeds against those libraries. Numerical stability often requires adjusting the compiler flags and calculations in non-trivial ways. – Maxim Egorushkin Commented Mar 21 at 22:26
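A hedged sketch of chtz's suggestion above: keep several independent accumulators in the k loop so the vector adds do not form one long latency chain, and fuse each multiply and add with FMA (this needs -mfma in addition to -mavx2). dot_fma4 is a hypothetical helper, not code from the question, and it assumes n is a multiple of 32; n = 1000 from the question would need a cleanup loop for the remainder.

static float dot_fma4(const float *a, const float *b, int n) {
    // four independent accumulators hide the latency of the dependent adds
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (int k = 0; k < n; k += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k),      _mm256_loadu_ps(b + k),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k + 8),  _mm256_loadu_ps(b + k + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k + 16), _mm256_loadu_ps(b + k + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + k + 24), _mm256_loadu_ps(b + k + 24), acc3);
    }
    // combine the four partial sums, then reduce 8 -> 1 as in the question
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, acc);
    float sum = 0.0f;
    for (int t = 0; t < 8; t++) sum += tmp[t];
    return sum;
}

The question's inner loops would then call C[i * n + j] = dot_fma4(&A[i * n], &Bt[j * n], n);.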

1 Answer


I found a solution that results in a very fast multiplication: 5980 Gflop/s!
Basically, for an element A[0,0] we multiply it by the whole row B[0,J] and accumulate the results into C[0,J]; then for the second element A[0,1] we multiply it by the whole row B[1,J] and again accumulate into C[0,J], i.e. C[0,J] += A[0,1] * B[1,J].
Thus we access A, B, and C all in sequential order, which is cache friendly, and we reuse the same element of A against a whole row of B, which is accumulated into the same row of C.
The code is the following (C must be zero-initialized before this runs):

for (int i = 0; i < n; i++) {            // rows of A and C
    for (int k = 0; k < n; k += 8) {     // 8 consecutive columns of B and C per vector
        // running sum for C[i, k..k+7]; loaded once, before the j loop,
        // so the accumulation survives all n iterations
        __m256 C_vect = _mm256_loadu_ps(&C[i * n + k]);
        for (int j = 0; j < n; j++) {    // walk along row i of A / down the rows of B
            __m256 B0_vect = _mm256_loadu_ps(&B[j * n + k]);
            __m256 A0_vect = _mm256_set1_ps(A[i * n + j]);  // broadcast A[i,j]
            C_vect = _mm256_add_ps(C_vect, _mm256_mul_ps(A0_vect, B0_vect));
        }
        // store 8 values of C after the whole row of A has been combined
        // with these 8 columns of B
        _mm256_storeu_ps(&C[i * n + k], C_vect);
    }
}
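If FMA is available (compiling with -mfma in addition to -mavx2), the multiply and add in the inner loop can be fused into one instruction. A hedged sketch of the same broadcast kernel; multMatFMA is a hypothetical name, and it assumes C is zero-initialized and n is a multiple of 8:

#include <immintrin.h>

void multMatFMA(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k += 8) {
            __m256 c = _mm256_loadu_ps(&C[i * n + k]);      // running sum for C[i, k..k+7]
            for (int j = 0; j < n; j++) {
                __m256 a = _mm256_set1_ps(A[i * n + j]);    // broadcast A[i,j]
                __m256 b = _mm256_loadu_ps(&B[j * n + k]);  // 8 floats of row j of B
                c = _mm256_fmadd_ps(a, b, c);               // c += a * b, fused
            }
            _mm256_storeu_ps(&C[i * n + k], c);
        }
    }
}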
