I have implemented a function that computes the matrix product C[i,j] = sum over k of A[i,k] * B[k,j].
Using C++, I know that for matrices A and C the memory access is direct and sequential, BUT for B I need to walk down the columns, so the only thing I was able to think of is to TRANSPOSE B first.
Here is the code:
#include <immintrin.h>   // AVX intrinsics
#include <cstdlib>       // malloc / free

// BLOCK_SIZE is defined elsewhere in the project.
void multMat1(int n, float *A, float *B, float *C) {
    float *Bt = (float*)malloc(sizeof(float) * n * n);
    // Transpose B into Bt, block by block, so that column j of B becomes row j of Bt.
    for (int col = 0; col < n; col += BLOCK_SIZE) {
        for (int row = 0; row < n; row += BLOCK_SIZE) {
            for (int x = row; x < row + BLOCK_SIZE && x < n; x++) {
                for (int y = col; y < col + BLOCK_SIZE && y < n; y++) {
                    Bt[y + x * n] = B[x + y * n];
                }
            }
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            __m256 C_temp_vect = _mm256_setzero_ps();
            for (int k = 0; k < n; k += 8) {   // assumes n is a multiple of 8
                // Load 8 elements from row i of A and row j of Bt (i.e. column j of B).
                __m256 A_vect = _mm256_loadu_ps(&A[i * n + k]);
                __m256 B_scalar = _mm256_loadu_ps(&Bt[j * n + k]);
                C_temp_vect = _mm256_add_ps(C_temp_vect, _mm256_mul_ps(A_vect, B_scalar));
            }
            float temp[8];
            float sum = 0.0f;
            _mm256_storeu_ps(temp, C_temp_vect);   // unaligned store: temp is not guaranteed to be 32-byte aligned
            // Sum the 8 lanes of the accumulator.
            for (int index = 0; index < 8; index++)
                sum += temp[index];
            C[i * n + j] = sum;
        }
    }
    free(Bt);
}
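(Side note: the 8-lane reduction at the end can also be done with shuffle intrinsics instead of going through the temp array; a minimal sketch of one common way to do it, not the code I benchmarked:)
// Horizontal sum of the 8 float lanes of an __m256.
static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);            // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);          // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);                   // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));           // 2 partial sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));    // final sum in lane 0
    return _mm_cvtss_f32(s);
}
// usage inside the j loop: C[i * n + j] = hsum256_ps(C_temp_vect);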
I also wrote a naive version and a threaded version; here are the results:
- Naive Version (ijk triple loop, sketched below for reference): n = 1000, 7.421 Gflop/s
- Threaded Version (ijk): n = 1000, 42.237 Gflop/s
- SIMD Version (ijk, makefile flags: CXXFLAGS=-ggdb -Wall -pedantic -mavx2 -O3): n = 1000, 2.221 Gflop/s
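For reference, the naive ijk version is just the standard triple loop, roughly like this (a sketch, since the exact code isn't shown above):
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[i * n + k] * B[k * n + j];   // strided access to column j of B
        C[i * n + j] = sum;
    }
}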
Where is the problem here? From what I know, the SIMD version should be the fastest.
1 Answer
I found a solution that results in a very fast multiplication: 5980 Gflop/s!
Basically, for an element A[0,0] we multiply it by the whole row B[0,j] and accumulate the results into C[0,j]; then for the second element A[0,1] we multiply it by the whole row B[1,j] and again accumulate into C[0,j], i.e. C += A[0,k] * B[k,j].
Thus A, B and C are all accessed in sequential (row-major) order, which is cache friendly, and each element of A is broadcast and reused across a whole row of B, with the products accumulated into the same row of C.
The code is the following:
for (int i = 0; i < n; i++) {                            // rows of A and C
    for (int k = 0; k < n; k += 8) {                     // 8-wide column blocks of B and C (assumes n % 8 == 0)
        __m256 C_vect = _mm256_loadu_ps(&C[i * n + k]);  // running sum for C[i, k..k+7]
        for (int j = 0; j < n; j++) {                    // walk along row i of A and down the rows of B
            __m256 B0_vect = _mm256_loadu_ps(&B[j * n + k]);
            __m256 A0_vect = _mm256_set1_ps(A[i * n + j]);   // broadcast A[i,j] to all 8 lanes
            C_vect = _mm256_add_ps(C_vect, _mm256_mul_ps(A0_vect, B0_vect));
        }
        _mm256_storeu_ps(&C[i * n + k], C_vect);         // write the 8 accumulated values back once per block
    }
}
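One assumption this kernel makes, worth stating explicitly: it accumulates into C, so C must contain zeros before the first call, for example:
float *C = (float*)calloc((size_t)n * n, sizeof(float));   // zero-initialized output buffer
// ... run the i/k/j kernel above, then read and free C ...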
Comment by chtz (Mar 21 at 21:37): C_temp_vect -- your inner loop should ideally accumulate in 8 registers in parallel (when using FMA instead of separate mul/add). Besides that, there is no need to transpose B, and for matrices larger than the CPU cache you need to have a look into blocking; I recommend having a look at existing BLAS implementations.
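A minimal sketch of the multi-accumulator idea from that comment, applied to the transposed-B inner loop of the question (an assumption-laden illustration, not the original code): it assumes FMA is available (compile with -mfma as well as -mavx2), that n is a multiple of 32, and that A, Bt, i, j are in scope as in the question; it uses 4 independent accumulators rather than 8 for brevity.
__m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
__m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
for (int k = 0; k < n; k += 32) {   // 4 independent FMA chains help hide FMA latency
    acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[i * n + k +  0]), _mm256_loadu_ps(&Bt[j * n + k +  0]), acc0);
    acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[i * n + k +  8]), _mm256_loadu_ps(&Bt[j * n + k +  8]), acc1);
    acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[i * n + k + 16]), _mm256_loadu_ps(&Bt[j * n + k + 16]), acc2);
    acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[i * n + k + 24]), _mm256_loadu_ps(&Bt[j * n + k + 24]), acc3);
}
__m256 C_temp_vect = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));  // then reduce the 8 lanes as before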