I was having strange performance issues in a single-threaded real-time program which I eventually narrowed down to this minimal example. It takes an array of 1 million floats as inputs, does some arbitrary calculations with them, and writes 1 million floats to an output array:
#include <random>
#include <chrono>
#include <array>
#include <iostream>
#include <thread>
#include <cmath>     // sqrtf
#include <intrin.h>  // _mm_clflush
#include <Windows.h>

constexpr int no_vals = 1'000'000;
std::array<float, no_vals> in;
std::array<float, no_vals> out;

inline float func(float in) {
    float val;
    for (int p = 0; p < 10; p++) {
        val = in + p;
        val *= val;
        val = sqrtf(val);
    }
    val /= 10.1f;
    return val;
}

int main() {
    //generate random float inputs from -10.f to +10.f
    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_real_distribution<float> random(-10.f, 10.f);
    for (auto& f : in) f = random(rng);

    //pin the thread to CPU core 0 (toggle this on/off for testing):
    //if (!SetThreadAffinityMask(GetCurrentThread(), 0b0001)) return -1;

    int no_loops = 3000;
    while (no_loops-- > 0) {
        //flush all output values from the cache (toggle this on/off for testing):
        for (int n = 0; n < no_vals; n++) _mm_clflush(&out[0] + n);

        auto time_start = std::chrono::steady_clock::now();
        for (int n = 0; n < no_vals; n++) out[n] = func(in[n]);
        auto time_end = std::chrono::steady_clock::now();

        double time_microseconds =
            (double)std::chrono::duration_cast<std::chrono::nanoseconds>
            (time_end - time_start).count() / 1000.0;
        std::cout << time_microseconds << "," << GetCurrentProcessorNumber() << std::endl;

        //sleep for 1 millisecond (toggle this on/off for testing):
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
For some reason, putting a thread to sleep can cause severe performance problems related to the cache. Specifically, it seems to involve values that have been written to previously (the output array in this example). If you flush those values from the cache, the performance problem stops. If you don't put the thread to sleep, the performance problem also stops - but that isn't an option if you're trying to keep a real-time program running at a stable frequency or framerate:
I also checked which logical processor the thread was running on, and in all cases Windows scheduled it onto different logical processors (and cores) throughout the 3000-sample test and frequently moved it around. Since each core has its own cache, I thought this was worth checking. For reference, the system I ran this test on has 8 logical processors (4 cores).
In case it sheds any more light on the problem, here is more detail on the poor performance results; they look suspiciously like the classic cache level staircase:
Can anyone explain why this performance problem is occurring? How does putting a thread to sleep hurt cache performance?
- A thread context switch is an expensive operation. qv Operation Cost in CPU Cycles – Eljay Commented Mar 20 at 15:43
- @Eljay That's a useful reference. If cache invalidation is the cause of the huge performance fluctuations, why does it only seem to occur when the thread is put to sleep? If Windows is frequently re-scheduling the thread to run on different processors & cores in every case, shouldn't there be a context switch and cache invalidation in every case? (excluding the cases where I manually flushed the cache beforehand, which of course avoid paying for cache invalidation during the timed test) – greenlagoon Commented Mar 20 at 15:59
- @ThomasWeller Let's exclude the manual cache flushing cases for simplicity and just compare sleeping VS not sleeping. Regardless of whether the thread sleeps, Windows is re-scheduling it to different cores quite often. Every time it gets re-scheduled it has to use the cache of the processor that it's running on. So why does the thread that sleeps pay for so many cache invalidations, while the non-sleeping thread doesn't seem to pay for them at all? Both switch cores & caches frequently. – greenlagoon Commented Mar 20 at 16:22
- How does performance change when you bind the thread to a single CPU core? – Homer512 Commented Mar 20 at 16:35
- No, the cache doesn't need to be flushed on context switches. It's using physical addresses, so if the virtual address space changes on a context switch, it will not interfere. Only the TLB needs to be flushed since the page table is swapped out. – Homer512 Commented Mar 20 at 20:20
1 Answer
After getting feedback and further analyzing the results, I think the answer is that putting a thread to sleep increases how often Windows reschedules it to different cores, which results in more thread context switches, which in turn causes more (very expensive) cache invalidations. That's how putting a thread to sleep can hurt cache performance. Note that this is in the context of a real-time, single-threaded application that is trying to hold a stable framerate by sleeping periodically (e.g. a game running at 60 FPS).
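A minimal sketch of one way to tally this directly in the loop, using the same GetCurrentProcessorNumber() samples the benchmark already prints (this counting code is an illustration, not the exact code behind the numbers below; the workload itself is elided):
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    DWORD last_cpu = GetCurrentProcessorNumber();
    int migrations = 0;
    for (int i = 0; i < 3000; i++) {
        // ...run one timed iteration of the workload here...
        DWORD cpu = GetCurrentProcessorNumber();
        if (cpu != last_cpu) migrations++; // the scheduler moved the thread
        last_cpu = cpu;
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // the 1 ms sleep under test
    }
    std::cout << "logical-processor changes over 3000 iterations: " << migrations << "\n";
}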
In this 3000-sample test, putting the thread to sleep for 1 ms each iteration caused Windows to reschedule it to a different core about 2.6 times more often than not putting it to sleep:
The sleeping thread was rescheduled to a different core about 2.6 times more often, and its iterations cost about 2.4 times more on average, compared to the non-sleeping thread. I'm not sure those two numbers can be compared quite so directly, but there does seem to be a correlation.
Pinning the thread to core 0 significantly reduced the cost of the sleeping thread, presumably because Windows was no longer moving it between cores, so cache invalidations happened less often (although other processes and their threads probably still evicted its cached data if they ran on that core while it was sleeping):
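For reference, the pinning used here is just the SetThreadAffinityMask line that is commented out in the question's code; a minimal sketch of it with the error handling spelled out (logical processor 0 is an arbitrary choice):
#include <Windows.h>
#include <iostream>

int main() {
    // Restrict the current thread to logical processor 0 so the scheduler
    // cannot migrate it to a core whose caches are cold after each sleep.
    // SetThreadAffinityMask returns the previous affinity mask, or 0 on failure.
    DWORD_PTR previous_mask = SetThreadAffinityMask(GetCurrentThread(), 0b0001);
    if (previous_mask == 0) {
        std::cerr << "SetThreadAffinityMask failed, error " << GetLastError() << "\n";
        return -1;
    }
    // ...run the timed benchmark loop here...
    return 0;
}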
Another note: this is a very simplified example. In a much larger real-time application with several stages where the working set changes completely (the cache is entirely flushed out and replaced by different data during each stage), I suspect this problem might "solve itself": by the time you loop back around to the start of the real-time loop, the cache has probably already been flushed out several times by the other stages, so there shouldn't be many lingering, stale cache lines from the previous loop that require expensive cache-coherence work. See the sketch below for one way to emulate this.
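A hypothetical way to emulate those other stages in the minimal example (I did not measure this) is to read through a scratch buffer larger than the last-level cache between iterations, so that few dirty lines of out survive until the next timed pass; the 32 MB size is an assumption and would need to exceed your CPU's LLC:
#include <array>

// Hypothetical stand-in for "other stages" of a larger application: reading a
// buffer bigger than the last-level cache displaces most of the previously
// cached 'out' lines before the next timed iteration.
static std::array<float, 8'000'000> scratch; // ~32 MB, assumed larger than the LLC

inline void emulate_other_stages() {
    // 'volatile' keeps the compiler from optimizing the read-walk away.
    volatile float sink = 0.f;
    for (const float& s : scratch) sink = sink + s; // touch every cache line
}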
Thanks to all for your help and hopefully this post can help others.