I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, and of my failure to get more than a tepid speedup.
The inclusive scan operation is defined as follows: for an input vector, say [x1,x2,x3,x4], it outputs the sequence of partial sums [x1, x1+x2, x1+x2+x3, x1+x2+x3+x4]. This is a highly parallelizable operation, and ostensibly it has been implemented well in OpenMP.
I looked at the following reference: https://theartofhpc.com/pcse/omp-reduction.html#Scanprefixoperations (the OpenMP manual reference, https://www.openmp.org/spec-html/5.0/openmpsu45.html#x68-1940002.9.6, seems too cryptic to me right now).
The artofhpc website says that the reduction clause gets an inscan modifier:
#pragma omp parallel for reduction(inscan,+:sumvar)
In the body of the parallel loop there is a scan directive that allows you to store the partial results. For inclusive scans the reduction variable is updated before the scan pragma:
sumvar = ... // update the reduction variable
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar
I tried to follow the same syntax and measure the performance versus a standard serial scan, and the results were quite disappointing. My code is at the bottom of the post.
In the code, I am just doing a simple test by considering a very large vector of size 90 million consisting of random values in the interval [-1,1], and performing a scan on it using an increasing number of threads and measuring the speedup. Here are my results (I get the same answer on repeated runs). My laptop CPU has 16 hardware threads, but the overall speedup is a disappointing 1.36. (I would have expected something substantially more!)
Is there something wrong in the way I am using the OpenMP syntax for reduction?
➜ Desktop gcc -fopenmp scantest.c && ./a.out
NumThreads Speedup
1 0.458
2 1.173
3 1.424
4 1.686
5 1.635
6 1.501
7 1.522
8 1.499
9 1.455
10 1.416
11 1.395
12 1.393
13 1.352
14 1.336
15 1.353
16 1.357
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>

int main(int argc, char** argv){
    int N = 9e7;        // vector size
    double* x;          // vector to reduce
    double* partials_s; // vector to scan into
    double* partials_p; // vector to scan into
    double end, start;  // timing variables
    double sumvar;
    int tmax = argc>1 ? atoi(argv[1]) : 35;
    int threadcount;

    // Allocate space for all vectors
    x = (double*) malloc(sizeof(double)*N);
    partials_s = (double*) malloc(sizeof(double)*N);
    partials_p = (double*) malloc(sizeof(double)*N);

    // Populate the input vectors
    for(int i=0 ; i<N ; ++i){
        x[i] = -1+2.0*rand()/(double)RAND_MAX;
        partials_s[i] = 0.0;
        partials_p[i] = 0.0;
    }

    //-----------------------------------------
    // Serial inclusive scan
    //-----------------------------------------
    start = omp_get_wtime();
    sumvar = 0.0;
    for(int i=0 ; i<N ; ++i){
        sumvar += x[i];
        partials_s[i] = sumvar;
    }
    end = omp_get_wtime();
    double stime = end-start; // Time to run the serial code

    //-----------------------------------------------------------------------------
    // Performance of parallel inclusive scan. Print ratio of serial/parallel time
    //-----------------------------------------------------------------------------
    printf("\nNumThreads Speedup \n");
    for(threadcount=1; threadcount<=tmax; ++threadcount){
        start = omp_get_wtime();
        sumvar = 0.0;
        #pragma omp parallel for num_threads(threadcount) reduction(inscan,+:sumvar)
        for(int i=0 ; i<N ; ++i){
            sumvar = sumvar + x[i]; // updating the value of sumvar
            #pragma omp scan inclusive(sumvar)
            partials_p[i] = sumvar;
        }
        end = omp_get_wtime();
        double ptime = end-start;
        printf("%d \t %.3f\n",threadcount,stime/ptime);
    }

    //for(int i=0 ; i<N ; ++i){
    //    printf("%.4f %.4f\n", partials_s[i], partials_p[i]);
    //}

    // Deallocate
    free(x);
    free(partials_s);
    free(partials_p);
    return 0;
}
3 Answers
TL;DR: This does not scale because such a scan operation is simply memory bound. More specifically, it is bound by the bandwidth of the DRAM.
Analysis
On mainstream PCs, only a few cores are needed to saturate the DRAM bandwidth. Sometimes a single core can (typically on a PC with a good CPU and single-channel DRAM). The sequential scan is already a memory-bound operation, and on my machine (i5-9600KF CPU with 2 x 3200MHz DDR4 DRAM) it is not far from saturating the DRAM. Indeed, it reaches 24 GiB/s of throughput. My DRAM can theoretically reach 48 GiB/s, but in practice it is more like 40~42 GiB/s for read-only accesses and 36~38 GiB/s for mixed reads/writes. This means the sequential code already saturates ~65% of my DRAM bandwidth!
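(Back-of-the-envelope, using the sizes from the question: one scan over N = 9e7 doubles reads 9e7 * 8 B = 720 MB of x and writes 720 MB of partials, plus roughly another 720 MB of write-allocate reads on the output, so about 2.2 GB of DRAM traffic per run. At ~36 GiB/s of mixed read/write bandwidth, that is on the order of 0.05~0.06 s per scan, regardless of how many threads participate.)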
With the parallel code, I reach 20 GiB/s with 1 thread, 30 GiB/s with 2 threads, 35 GiB/s with 3 threads, 36 GiB/s with 4 threads, and 35 GiB/s with 6 threads (the maximum useful since I only have 6 cores). As you can see, only 3 threads are enough to nearly saturate the bandwidth on my machine, and 2 threads are actually not far from saturating it (~80%).
In fact, this is rather consistent with the speed up printed by your program:
NumThreads Speedup
1 0.456
2 0.686
3 0.765 <---- optimal speed-up for the parallel code
4 0.749
5 0.700
6 0.666
With more than 3 threads, performance actually decreases. This is probably because DRAM accesses (and L3 cache accesses) look more random from the CPU's perspective, and random accesses are a bit less efficient (harder to prefetch). Moreover, cache thrashing can cause more data to be loaded from DRAM, so efficiency decreases while the throughput stays roughly stable.
There is another interesting thing: the parallel implementation is actually slower than the sequential one on my machine, even with 6 threads! This is because the parallel implementation needs to perform more work. Indeed, AFAIK, a parallel scan requires multiple passes over memory: for example, one to compute partial reductions and one to compute the actual scan block by block based on the partial sums computed before. With this implementation, more data must be read from memory (twice as much), making the operation slower if memory is already the bottleneck. There are other possible implementations, but all the ones I can think of require reading/writing significantly more data from/to DRAM in this loop (with the default static scheduling).
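For reference, here is a rough sketch of what such a two-pass implementation looks like. This is only an illustration of the idea, not the code GCC actually generates; the function name, the block partitioning, and the fixed-size block_sum array are assumptions for the sketch.
#include <omp.h>

// Illustrative two-pass parallel inclusive scan (not GOMP's actual code).
// Pass 1: each thread sums its contiguous block.
// Pass 2: each thread rescans its block, offset by the sums of the preceding blocks.
// The input x is therefore read twice from DRAM.
void two_pass_scan(const double* x, double* out, long N){
    double block_sum[256] = {0};                  // shared; assumes at most 256 threads
    #pragma omp parallel
    {
        int nt = omp_get_num_threads(), t = omp_get_thread_num();
        long lo = N*(long)t/nt, hi = N*(long)(t+1)/nt;
        double s = 0.0;
        for(long i=lo ; i<hi ; ++i) s += x[i];    // pass 1: read x once
        block_sum[t] = s;
        #pragma omp barrier                       // all block sums must be ready
        double offset = 0.0;
        for(int k=0 ; k<t ; ++k) offset += block_sum[k];
        for(long i=lo ; i<hi ; ++i){              // pass 2: read x again, write out
            offset += x[i];
            out[i] = offset;
        }
    }
}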
Note that I used GCC with the flags -fopenmp -O3 to compile the profiled program (-mavx2 did not change the profiling results). In practice, GOMP (the OpenMP implementation of GCC used here) apparently performs two read/write steps, which is a bit less efficient than the scheme mentioned above. I saw that by profiling the generated assembly code: there are two hot memory-bound loops in the parallel code, and both seem to perform a kind of scan-like operation (most likely a local scan followed by a scan update). Such a two read/write step implementation should theoretically be about twice as slow. This is consistent with the speedup of the OpenMP implementation, which is about twice as slow with 1 thread (both on your machine and mine).
Possible optimizations
There are 2 main optimizations we can perform to make this faster. However, they are not trivial to apply and have significant downsides.
First of all, x86-64 CPUs actually read a cache line from DRAM before writing it, due to write allocation. You can use non-temporal streaming stores to avoid this. This is only worth it if you are sure the written data cannot fit in cache. In general, developers cannot guarantee such a thing unless they know the target CPU architecture or make (sometimes reasonable) assumptions about it. For example, they can assume the LLC is not bigger than a few hundred MiB (which is not always true: some CPUs, like the Xeon Phi, have a lot of LLC, and future ones might have such big caches). OpenMP can theoretically generate non-temporal streaming store instructions with the nontemporal clause. That being said, AFAIK it is not yet implemented in mainstream OpenMP implementations at the time of writing. This optimization can make the sequential/parallel codes about 50% faster.
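Since the nontemporal clause is not usable yet, here is a minimal sketch of the idea using SSE2 intrinsics instead (an assumption for illustration, not the only way to do it). It writes the scan output with streaming stores so the output array does not pollute the cache; it assumes N is even and the output buffer is 16-byte aligned (e.g. from aligned_alloc).
#include <emmintrin.h>   // SSE2 intrinsics: _mm_set_pd, _mm_stream_pd, _mm_sfence

// Serial inclusive scan with non-temporal (streaming) stores for the output.
// Assumes: N is even, and out is 16-byte aligned.
void scan_streaming(const double* x, double* out, long N){
    double sum = 0.0;
    for(long i=0 ; i<N ; i+=2){
        double a = sum + x[i];
        double b = a + x[i+1];
        _mm_stream_pd(&out[i], _mm_set_pd(b, a)); // out[i]=a, out[i+1]=b, bypassing the cache
        sum = b;
    }
    _mm_sfence(); // make the streaming stores globally visible before the data is reused
}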
Another way to make this faster is to write your own, more efficient, cache-friendly parallel implementation. You can read the data chunk by chunk (with chunks that fit comfortably in the LLC). Once a chunk has been read from memory, you can compute the partial sums of its sub-chunks in parallel, and then compute the scan more efficiently (since the data should already be in the LLC and does not need to be reloaded). This should strongly improve the efficiency of the parallel implementation and make it much more competitive with the sequential one (though still a bit slower with 1 thread). Note that the additional synchronizations can decrease the performance of this implementation. Besides, NUMA effects can make things more complicated on machines with multiple NUMA nodes. Having to do such an optimization yourself is not ideal; I think it should be the job of the OpenMP implementation... A rough sketch of this chunked approach is shown below.
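The sketch below shows one way to do this. The chunk size is an assumption and would need tuning for the target LLC; NUMA placement and streaming stores are ignored for brevity.
#include <omp.h>

// Cache-blocked parallel inclusive scan: each chunk is scanned while it is
// (mostly) resident in the LLC, so the extra pass of the parallel scan does
// not hit DRAM again.
void blocked_scan(const double* x, double* out, long N){
    const long CHUNK = 2*1024*1024;   // 2M doubles = 16 MiB per chunk (assumption, tune for your LLC)
    double carry = 0.0;               // running total of all previous chunks
    for(long lo=0 ; lo<N ; lo+=CHUNK){
        long hi = (lo+CHUNK < N) ? lo+CHUNK : N;
        double s = 0.0;
        // Parallel scan of one cache-resident chunk
        #pragma omp parallel for reduction(inscan,+:s)
        for(long i=lo ; i<hi ; ++i){
            s += x[i];
            #pragma omp scan inclusive(s)
            out[i] = s;
        }
        // Shift the chunk by the running total; the chunk is still hot in the LLC
        if(carry != 0.0){
            #pragma omp parallel for
            for(long i=lo ; i<hi ; ++i)
                out[i] += carry;
        }
        carry = out[hi-1];
    }
}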
The two optimizations can be combined. The streaming stores should reduce cache thrashing and improve the performance of the second optimization a bit. On my machine, I expect the two optimizations to make the parallel implementation about twice as fast as the provided sequential one. It still does not scale, but there is not much more you can do about that, since the operation is memory-bound in the first place and memory-bound codes generally do not scale.
An initial disclaimer: when benchmarking such simple codes, you always need to make sure that you produce an observable side effect (e.g., printf("%.4f %.4f\n", partials_s[N-1], partials_p[N-1]);) so that the compiler cannot optimize out the whole code.
Your problem has a data dependency between iterations and is therefore difficult to execute efficiently in parallel. A naive parallel implementation would effectively translate to:
#pragma omp parallel for num_threads(threadcount) ordered
for(int i=0 ; i<N ; ++i){
    #pragma omp ordered
    {
        sumvar = sumvar + x[i]; // updating the value of sumvar
        partials_p[i] = sumvar;
    }
}
This code effectively serializes the execution and adds additional thread synchronization cost between the iteration steps.
To actually calculate the scan in parallel, the compiler can generate code that distributes the loop iterations to the threads. In a first step, each thread performs the scan on its local chunk. In a second step, each thread determines its offset and adds it to all of its array values. As already pointed out in the other answer, your problem is memory-bound, and adding a second pass over the whole array is not very beneficial.
Now, how can this ever be beneficial? OpenMP is generally not good at distributing fine-grained and/or memory-bound workloads. The use case for the OpenMP scan is a loop with computational work to distribute to multiple threads, where you also need the scan of the iterations' results:
#pragma omp parallel for num_threads(threadcount) reduction(inscan,+:sumvar)
for(int i=0 ; i<N ; ++i){
    double ires = sin(i+.5) * cos(i+.5); // calculate something for each iteration
    sumvar = sumvar + ires;
    #pragma omp scan inclusive(sumvar)
    partials_p[i] = sumvar;
}
Conceptually, this can again be understood as:
#pragma omp parallel for num_threads(threadcount) ordered
for(int i=0 ; i<N ; ++i){
    double ires = sin(i+.5) * cos(i+.5); // calculate something for each iteration
    #pragma omp ordered
    {
        sumvar = sumvar + ires; // updating the value of sumvar
        partials_p[i] = sumvar;
    }
}
How well this scales with ordered depends on the work to be done in each iteration, i.e., the cost of calculating ires. Additionally, it will only scale well if each iteration has the same calculation cost, because we synchronize on every iteration. Any imbalance between the iterations will hurt scalability.
Using scan has the advantage that we don't synchronize between iterations, but only at the end. So imbalances between the iterations can average out, as long as they are not systematic between the threads.
With substantial compute cost for each iteration, the final pass through memory will not be that expensive.
When I execute the modified code from above, compiled with gcc -fopenmp -O3 -lm, on my notebook with 4 cores, I get these scaling results:
NumThreads Speedup
1 0.870
2 1.849
3 2.568
4 3.607
5 2.838
6 3.247
7 3.654
8 3.031
9 3.868
10 3.568
11 3.852
12 3.759
13 3.971
14 3.961
15 4.005
16 4.016
I've tried this program on a CPU with tons of bandwidth, and I'm still only getting a factor of 2 speedup, reached at 12 cores and never going higher. I blame a bad implementation of the reduction part of the scan. Also, the reduction goes up to 20-some levels, so there is a lot of extra work, with bad cache-line utilization. Since the loop has absolutely no work outside of the reduction, I would indeed expect bad performance. Try introducing some amount of scalar work into the loop.