In the previous article, we measured the performance by running 100'000 local workgroups that executed 256 billion floating point operations in total. This might take about 2 milliseconds on a GeForce RTX 5090, about 0.5 seconds on integrated Intel graphics, and time out on a slow Vulkan software driver. In short, a fixed workload does not suit all devices, so our measurement algorithm needs improvement.
Our goal is to perform measurements that take about 20 ms each and to repeat them. We print the output header and start the measurement with a single local workgroup:
// output header
cout << "\n"
        " Measurement Number of Computation Performance\n"
        " time stamp local workgroups time" << endl;

uint32_t workgroupCountX = 1;
uint32_t workgroupCountY = 1;
uint32_t workgroupCountZ = 1;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

    // begin command buffer
    vk::beginCommandBuffer(
        commandBuffer,
        vk::CommandBufferBeginInfo{
            .flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
            .pInheritanceInfo = nullptr,
        }
    );

    // bind pipeline
    vk::cmdBindPipeline(commandBuffer, vk::PipelineBindPoint::eCompute, pipeline);

    // dispatch computation
    vk::cmdDispatch(commandBuffer, workgroupCountX, workgroupCountY, workgroupCountZ);

    // end command buffer
    vk::endCommandBuffer(commandBuffer);
Then, we submit the work and measure the time the device needs to complete it. Once it is done, we print the results. We stop the measurements once three seconds have elapsed:
    // print results
    double time = chrono::duration<double>(t2 - t1).count();
    double totalTime = chrono::duration<double>(t2 - startTime).count();
    uint64_t numInstructions = uint64_t(20000) * 128 * workgroupCountX * workgroupCountY * workgroupCountZ;
    cout << fixed << setprecision(2) << setw(9) << totalTime * 1000 << "ms "
         << setw(9) << workgroupCountX * workgroupCountY * workgroupCountZ << " "
         << " " << formatFloatSI(time) << "s "
         << " " << formatFloatSI(double(numInstructions) / time) << "FLOPS" << endl;

    // stop measurements after three seconds
    if(totalTime >= 3.)
        break;
The function formatFloatSI(float) formats a floating point value compactly. It prints only the three most significant digits followed by an SI prefix, such as M for mega, G for giga and T for tera. Only three digits are printed to keep the output as simple as possible.
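The implementation of formatFloatSI() is not shown here, so the following is only a minimal sketch of how such a function might look. It assumes the return value is the number, a space and the SI prefix (so that appending "s" or "FLOPS" gives "286 us" or "3.53 TFLOPS", as in the output below), it takes a double instead of a float, and its range of prefixes from pico to peta is an arbitrary choice:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Hypothetical sketch, not the article's actual implementation.
// Formats a value with three significant digits followed by a space
// and an SI prefix, e.g. 0.000286 -> "286 u", 3.53e12 -> "3.53 T".
std::string formatFloatSI(double value)
{
    static const char* const prefixes[] =
        { "p", "n", "u", "m", "", "k", "M", "G", "T", "P" };
    constexpr int minExp3 = -4;  // pico (10^-12)
    constexpr int maxExp3 = 5;   // peta (10^15)

    if(!std::isfinite(value))
        return std::isnan(value) ? "nan " : "inf ";
    double mag = std::fabs(value);
    if(mag == 0.)
        return "0.00 ";

    // power of 1000 that brings the magnitude into the range [1, 1000)
    int exp3 = int(std::floor(std::log10(mag) / 3.));
    exp3 = std::max(minExp3, std::min(maxExp3, exp3));
    double scaled = mag / std::pow(10., 3 * exp3);

    // choose the number of decimals so that three significant digits
    // are printed for values inside the supported prefix range
    int decimals;
    if(std::round(scaled * 100.) / 100. < 10.)
        decimals = 2;
    else if(std::round(scaled * 10.) / 10. < 100.)
        decimals = 1;
    else if(std::round(scaled) < 1000. || exp3 == maxExp3)
        decimals = 0;
    else {
        // rounding would print 1000: move to the next prefix instead,
        // e.g. 999.6 k becomes 1.00 M
        scaled /= 1000.;
        exp3++;
        decimals = 2;
    }

    assert(exp3 >= minExp3 && exp3 <= maxExp3);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%s%.*f %s", value < 0. ? "-" : "",
                  decimals, scaled, prefixes[exp3 - minExp3]);
    return buf;
}
```

With this sketch, formatFloatSI(0.0199) + "s" gives "19.9 ms". The extra branch for rounding is there because a value such as 999.6 would otherwise be printed with four digits.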
Then, we update the number of local workgroups for the next measurement. If the previous measurement took less than 2 ms, we just multiply the number of local workgroups by 10. Otherwise, we compute the ratio between the target time and the measured time and scale the number of workgroups to target a 20 ms measurement. The Vulkan specification tells us that at least 65535 local workgroups are supported in each of the X, Y, and Z dimensions. So, we cap each dimension at 10'000 to stay safely below the limit:
    // update number of local workgroups
    // to reach computation time of about 20ms
    constexpr double targetTime = 0.02;
    if(time < targetTime / 10.) {
        if(workgroupCountX <= 1000)
            workgroupCountX *= 10;
        else if(workgroupCountY <= 1000)
            workgroupCountY *= 10;
        else if(workgroupCountZ <= 1000)
            workgroupCountZ *= 10;
    }
    else {
        double ratio = targetTime / time;
        uint64_t newNumGroups = uint64_t(ratio * (uint64_t(workgroupCountX) * workgroupCountY * workgroupCountZ));
        if(newNumGroups > 10000 * 10000) {
            workgroupCountZ = 1 + ((newNumGroups - 1) / (10000 * 10000));
            uint64_t remainder = newNumGroups / workgroupCountZ;
            workgroupCountY = 1 + ((remainder - 1) / 10000);
            workgroupCountX = remainder / workgroupCountY;
        }
        else {
            if(newNumGroups == 0)
                newNumGroups = 1;
            workgroupCountZ = 1;
            workgroupCountY = 1 + ((newNumGroups - 1) / 10000);
            workgroupCountX = newNumGroups / workgroupCountY;
        }
    }

} while(true);
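The way the requested number of workgroups is split into X, Y and Z can be tried in isolation. The following is a minimal sketch of the same arithmetic; the names WorkgroupCounts and splitWorkgroups are hypothetical and do not appear in the application. It folds the two branches of the code above into one, which should be equivalent because the Z count evaluates to 1 whenever the total fits into 10'000 x 10'000:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type, used only in this sketch.
struct WorkgroupCounts {
    uint32_t x, y, z;
};

// Splits the requested total into x*y*z with x and y capped at 10'000.
// Because of integer division, x*y*z may be slightly lower than total.
WorkgroupCounts splitWorkgroups(uint64_t total)
{
    constexpr uint64_t maxPerDim = 10000;
    if(total == 0)
        total = 1;

    WorkgroupCounts c;
    // z = ceil(total / 10^8); this is 1 while total fits into x and y
    c.z = uint32_t(1 + (total - 1) / (maxPerDim * maxPerDim));
    uint64_t remainder = total / c.z;
    // y = ceil(remainder / 10^4)
    c.y = uint32_t(1 + (remainder - 1) / maxPerDim);
    c.x = uint32_t(remainder / c.y);

    assert(c.x >= 1 && c.x <= maxPerDim && c.y <= maxPerDim);
    return c;
}
```

For example, a request for 27391 workgroups yields 9130 x 3 x 1, that is 27390 workgroups. This rounding down is why the workgroup counts that the application reports are always divisible by the Y count.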
When we run the application, we might see output similar to the following:
Compatible devices:
1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
Using device:
Quadro RTX 3000
Measurement   Number of          Computation   Performance
time stamp    local workgroups   time
     0.53ms                  1        286 us   8.96 GFLOPS
     1.30ms                 10        180 us   142 GFLOPS
     2.01ms                100        186 us   1.38 TFLOPS
     3.46ms               1000        935 us   2.74 TFLOPS
    11.32ms              10000       7.30 ms   3.51 TFLOPS
    31.74ms              27390       19.9 ms   3.53 TFLOPS
    53.28ms              27576       20.1 ms   3.52 TFLOPS
    74.43ms              27480       19.8 ms   3.56 TFLOPS
    96.04ms              27774       20.2 ms   3.52 TFLOPS
   117.46ms              27483       20.0 ms   3.53 TFLOPS
   139.08ms              27540       20.1 ms   3.50 TFLOPS
   163.54ms              27363       20.1 ms   3.48 TFLOPS
   184.53ms              27198       19.9 ms   3.50 TFLOPS
   207.91ms              27348       19.7 ms   3.55 TFLOPS
   230.24ms              27744       18.2 ms   3.90 TFLOPS
   245.08ms              30492       11.8 ms   6.63 TFLOPS
   265.70ms              51804       19.6 ms   6.76 TFLOPS
   286.95ms              52806       19.8 ms   6.83 TFLOPS
   309.12ms              53394       20.6 ms   6.62 TFLOPS
   332.74ms              51750       19.7 ms   6.74 TFLOPS
   355.33ms              52626       19.7 ms   6.85 TFLOPS
   377.28ms              53472       20.3 ms   6.74 TFLOPS
   399.08ms              52626       20.3 ms   6.64 TFLOPS
   420.75ms              51858       20.1 ms   6.59 TFLOPS
   444.25ms              51492       19.7 ms   6.71 TFLOPS
[...snip...]
The first column shows the timeline. In the second column, the number of local workgroups grows from 1 in the first measurement to about 27 thousand, stays there for a while, and then grows to about 52 thousand. The computation time reaches 20 milliseconds in the sixth measurement, although a meaningful performance figure of 3.5 TFLOPS is already reported in the fifth. A strange change appears a little over 200 ms after the beginning of the measurements: the performance suddenly starts to rise from about 3.5 to about 6.7 TFLOPS. It almost doubles at that point and stays at about 6.7 TFLOPS until the end of the measurement process. Evidently, the graphics card switches to a higher performance level, with higher power consumption, after about 200 ms. So, the real performance of a graphics card is not always visible immediately; some time might be needed before the highest performance level is reached.
We might also want to measure another graphics card installed in the system. The available options are numbered, from 1 to 4 in our case:
Compatible devices:
1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
We can pass '-' followed by the number on the command line to select a particular device and compute queue. For example, specifying -1 on the command line measures the performance of the integrated Intel GPU.
Passing a substring of the device name is another way to select the device to test. For example, passing llvm on the command line selects the llvmpipe software device, which runs entirely on the CPU.
We can also combine both approaches. Specifying RTX matches two devices; the first one is used unless we also specify -2, which selects the second one, the entry that uses a different compute queue.
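The selection rules described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the application's actual argument parsing: the function name selectDevice is hypothetical, the number is assumed to count within the list that remains after the substring filter, and with no arguments the sketch simply falls back to the first listed device, which may differ from the application's real default:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the device selection rules.
// Returns the index into deviceNames, or -1 if nothing matches.
int selectDevice(const std::vector<std::string>& deviceNames,
                 const std::vector<std::string>& args)
{
    std::string filter;      // an empty filter matches every device
    std::size_t number = 1;  // 1-based position within the filtered list

    for(const std::string& arg : args) {
        // "-N" with up to five digits is a device number;
        // anything else is treated as a name substring
        bool isNumber = arg.size() >= 2 && arg.size() <= 6 && arg[0] == '-' &&
            std::all_of(arg.begin() + 1, arg.end(),
                        [](unsigned char c) { return std::isdigit(c) != 0; });
        if(isNumber)
            number = std::stoul(arg.substr(1));
        else
            filter = arg;
    }

    // walk the list and count only the devices that pass the filter
    std::size_t matches = 0;
    for(std::size_t i = 0; i < deviceNames.size(); i++)
        if(deviceNames[i].find(filter) != std::string::npos)
            if(++matches == number)
                return int(i);
    return -1;
}
```

With the four devices listed above, the arguments RTX -2 would select the third entry (zero-based index 2), and llvm would select the fourth.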