In the previous article, we measured the performance of running 100'000 local workgroups that executed 256 billion floating point operations in total. This might take about 2 milliseconds on a GeForce 5090, about 0.5 seconds on integrated Intel graphics, and might even time out on a slow Vulkan software driver. In short, our measurement algorithm needs improvement: a fixed amount of work completes in vastly different times on different devices, so the workload has to adapt to the device being measured.
Our goal will be to perform measurements that take about 20 ms each, and to repeat them. We will print the output header and start our measurement with a single local workgroup:
// output header
cout << "\n"
        " Measurement  Number of         Computation  Performance\n"
        " time stamp   local workgroups  time" << endl;

uint32_t groupCountX = 1;
uint32_t groupCountY = 1;
uint32_t groupCountZ = 1;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

	// begin command buffer
	vk::beginCommandBuffer(
		commandBuffer,
		vk::CommandBufferBeginInfo{
			.flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
			.pInheritanceInfo = nullptr,
		}
	);

	// bind pipeline
	vk::cmdBindPipeline(commandBuffer, vk::PipelineBindPoint::eCompute, pipeline);

	// dispatch computation
	vk::cmdDispatch(commandBuffer, groupCountX, groupCountY, groupCountZ);

	// end command buffer
	vk::endCommandBuffer(commandBuffer);
Then, we submit the work and measure how long the device takes to complete it. Once done, we print the results, and we stop further measurements if we have already been measuring for more than three seconds:
	// print results
	double time = chrono::duration<double>(t2 - t1).count();
	double totalTime = chrono::duration<double>(t2 - startTime).count();
	uint64_t numInstructions = uint64_t(20000) * 128 * groupCountX * groupCountY * groupCountZ;
	cout << fixed << setprecision(2)
	     << setw(9) << totalTime * 1000 << "ms "
	     << setw(9) << groupCountX * groupCountY * groupCountZ << "         "
	     << formatFloatSI(time) << "s      "
	     << formatFloatSI(double(numInstructions) / time) << "FLOPS" << endl;

	// stop measurements after three seconds
	if(totalTime >= 3.)
		break;
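As a quick sanity check of these numbers (my own arithmetic, not part of the article's code): each workgroup performs 20000 × 128 = 2'560'000 floating point operations, which matches both the 256 billion operations quoted for 100'000 workgroups and the roughly 3.5 TFLOPS reported later for 10'000 workgroups completed in 7.30 ms:

```cpp
#include <cstdint>

// FLOP per workgroup, matching the numInstructions formula above
const uint64_t flopPerGroup = uint64_t(20000) * 128;        // 2'560'000

// 100'000 workgroups give the 256 billion operations
// mentioned at the beginning of the article
const uint64_t totalFlop = flopPerGroup * 100000;           // 256'000'000'000

// 10'000 workgroups in 7.30 ms works out to about 3.51 TFLOPS
const double tflops = double(flopPerGroup) * 10000 / 0.00730 / 1e12;
```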
The formatFloatSI() function just formats a floating point value more nicely. It prints the three most significant digits followed by an SI suffix, such as M for mega, G for giga, or T for tera. Only three digits are printed to keep the output as simple as possible.
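The helper itself is not listed in this article, but a possible implementation might look as follows. Treat it as an illustrative sketch: it rounds to three significant digits and appends an SI prefix, handling both large values (k, M, G, T, P) and small ones (m, u, n, p) as they appear in the output below:

```cpp
#include <cstdio>
#include <string>

// a possible implementation of formatFloatSI() (hypothetical; the article's
// actual helper is not shown): print the three most significant digits
// of `v` followed by an SI prefix
std::string formatFloatSI(double v)
{
	static const char* const up[]   = { "", "k", "M", "G", "T", "P" };
	static const char* const down[] = { "m", "u", "n", "p" };
	const char* prefix = "";
	if(v >= 1000.) {
		size_t i = 0;
		while(v >= 1000. && i < 5) { v /= 1000.; i++; }
		prefix = up[i];
	}
	else if(v > 0. && v < 1.) {
		size_t i = 0;
		while(v < 1. && i < 4) { v *= 1000.; i++; }
		prefix = down[i - 1];
	}
	char buf[32];
	if(v >= 100.)     snprintf(buf, sizeof(buf), "%.0f %s", v, prefix);
	else if(v >= 10.) snprintf(buf, sizeof(buf), "%.1f %s", v, prefix);
	else              snprintf(buf, sizeof(buf), "%.2f %s", v, prefix);
	return buf;
}
```

For example, formatFloatSI(0.0199) returns "19.9 m", so appending "s" in the print statement yields "19.9 ms".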
We continue by updating the number of local workgroups for the next measurement. If the previous measurement took less than 2 ms, we simply multiply the number of local workgroups by ten. Otherwise, we compute the ratio between the target and the measured time and aim for a 20 ms measurement. The Vulkan specification guarantees that at least 65'535 local workgroups are supported in each of the X, Y and Z dimensions. So, we cap each dimension at 10'000 to avoid exceeding this limit:
	// update number of local workgroups
	// to reach computation time of about 20ms
	constexpr double targetTime = 0.02;
	if(time < targetTime / 10.) {
		if(groupCountX <= 1000)
			groupCountX *= 10;
		else if(groupCountY <= 1000)
			groupCountY *= 10;
		else if(groupCountZ <= 1000)
			groupCountZ *= 10;
	}
	else {
		double ratio = targetTime / time;
		uint64_t newNumGroups = uint64_t(ratio * (uint64_t(groupCountX) * groupCountY * groupCountZ));
		if(newNumGroups > 10000 * 10000) {
			groupCountZ = 1 + ((newNumGroups - 1) / (10000 * 10000));
			uint64_t remainder = newNumGroups / groupCountZ;
			groupCountY = 1 + ((remainder - 1) / 10000);
			groupCountX = remainder / groupCountY;
		}
		else {
			if(newNumGroups == 0)
				newNumGroups = 1;
			groupCountZ = 1;
			groupCountY = 1 + ((newNumGroups - 1) / 10000);
			groupCountX = newNumGroups / groupCountY;
		}
	}

} while(true);
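To experiment with the decomposition in isolation, the bottom part of the update step can be transcribed into a standalone function (an illustrative extraction; the article keeps this logic inline):

```cpp
#include <cstdint>
#include <tuple>

// split a desired total number of workgroups into X, Y and Z counts,
// keeping X and Y at 10'000 or less; transcribed from the update step above
std::tuple<uint32_t, uint32_t, uint32_t> decomposeGroupCount(uint64_t newNumGroups)
{
	uint32_t x, y, z;
	if(newNumGroups > uint64_t(10000) * 10000) {
		z = uint32_t(1 + ((newNumGroups - 1) / (uint64_t(10000) * 10000)));
		uint64_t remainder = newNumGroups / z;
		y = uint32_t(1 + ((remainder - 1) / 10000));
		x = uint32_t(remainder / y);
	}
	else {
		if(newNumGroups == 0)
			newNumGroups = 1;
		z = 1;
		y = uint32_t(1 + ((newNumGroups - 1) / 10000));
		x = uint32_t(newNumGroups / y);
	}
	return { x, y, z };
}
```

Because of the integer divisions, the product of the three counts is not always exactly the requested number, but it never exceeds it, and X and Y always stay within the 10'000 cap.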
When we run the application, we might see a result similar to the following one:
Compatible devices:
   1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
   2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
   3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
   4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
Using device: Quadro RTX 3000

 Measurement  Number of         Computation  Performance
 time stamp   local workgroups  time
    0.53ms            1         286 us       8.96 GFLOPS
    1.30ms           10         180 us        142 GFLOPS
    2.01ms          100         186 us       1.38 TFLOPS
    3.46ms         1000         935 us       2.74 TFLOPS
   11.32ms        10000         7.30 ms      3.51 TFLOPS
   31.74ms        27390         19.9 ms      3.53 TFLOPS
   53.28ms        27576         20.1 ms      3.52 TFLOPS
   74.43ms        27480         19.8 ms      3.56 TFLOPS
   96.04ms        27774         20.2 ms      3.52 TFLOPS
  117.46ms        27483         20.0 ms      3.53 TFLOPS
  139.08ms        27540         20.1 ms      3.50 TFLOPS
  163.54ms        27363         20.1 ms      3.48 TFLOPS
  184.53ms        27198         19.9 ms      3.50 TFLOPS
  207.91ms        27348         19.7 ms      3.55 TFLOPS
  230.24ms        27744         18.2 ms      3.90 TFLOPS
  245.08ms        30492         11.8 ms      6.63 TFLOPS
  265.70ms        51804         19.6 ms      6.76 TFLOPS
  286.95ms        52806         19.8 ms      6.83 TFLOPS
  309.12ms        53394         20.6 ms      6.62 TFLOPS
  332.74ms        51750         19.7 ms      6.74 TFLOPS
  355.33ms        52626         19.7 ms      6.85 TFLOPS
  377.28ms        53472         20.3 ms      6.74 TFLOPS
  399.08ms        52626         20.3 ms      6.64 TFLOPS
  420.75ms        51858         20.1 ms      6.59 TFLOPS
  444.25ms        51492         19.7 ms      6.71 TFLOPS
[...snip...]
In the first column, we see the timeline. In the second column, the number of local workgroups grows from 1 in the first measurement to about 27 thousand, stays there for a while, and then grows to about 52 thousand. The computation time reaches 20 milliseconds in the sixth measurement. A meaningful performance of 3.5 TFLOPS is reported already in the fifth measurement. A strange change can be seen about 200 ms after the beginning of the measurements: the performance suddenly rises from about 3.5 to about 6.7 TFLOPS. It almost doubles at that point and stays at about 6.7 TFLOPS until the end of the measurement process. Undoubtedly, after about 200 ms the graphics card switches to a higher performance level with higher power consumption. So, the real performance of a graphics card is not always visible immediately; some time might be needed before the highest performance level is reached.
We might also want to measure another graphics card installed in the system. The possible options are identified by a number, ranging from 1 to 4 in our case:
Compatible devices:
   1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
   2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
   3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
   4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
We can pass '-' followed by a number on the command line to select a particular device and compute queue. For example, specifying -1 on the command line will measure the performance of the integrated Intel GPU.
Passing a substring of the device name is another way to select the device to test. For example, passing llvm on the command line will select the llvmpipe software device, which runs entirely on the CPU.
We can combine both approaches: specifying RTX matches two devices, and the first one is used unless we also pass -2 to select the second one, which uses a different compute queue.
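The selection logic described above might be sketched as follows. This is an illustrative reconstruction, not the article's actual code; the DeviceEntry record and selectDevice() helper are hypothetical names, and the real application would fill the list from Vulkan physical device enumeration:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// hypothetical device record; the real application gets this
// information from Vulkan physical device enumeration
struct DeviceEntry {
	std::string name;
	uint32_t queueFamily;
};

// select a device the way described above: "-N" picks the N-th candidate,
// any other argument filters candidates by a name substring;
// returns an index into `devices`, or -1 when nothing matches
int selectDevice(const std::vector<DeviceEntry>& devices,
                 const std::vector<std::string>& args)
{
	std::string substring;
	int index = 0;  // 0 means "not specified"
	for(const std::string& a : args)
		if(a.size() >= 2 && a[0] == '-' && a[1] >= '0' && a[1] <= '9')
			index = std::stoi(a.substr(1));
		else
			substring = a;

	// collect devices whose name contains the substring
	// (an empty substring matches every device)
	std::vector<int> matches;
	for(size_t i = 0; i < devices.size(); i++)
		if(devices[i].name.find(substring) != std::string::npos)
			matches.push_back(int(i));

	if(matches.empty())
		return -1;
	if(index >= 1 && index <= int(matches.size()))
		return matches[index - 1];
	return matches[0];  // default to the first candidate
}
```

With the four-device list above, passing "llvm" selects llvmpipe, while passing "RTX" together with "-2" selects the second Quadro RTX 3000 entry, the one with compute queue 2.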