In the previous article, we measured the performance of running 100'000 local workgroups that executed 256 billion floating point operations in total. This might take about 2 milliseconds on a GeForce 5090, about 0.5 seconds on integrated Intel graphics, and might even time out on a slow Vulkan software driver. In short, our measurement algorithm needs improvement: a fixed amount of work completes in vastly different times on different devices, so the workload has to adapt to the device being measured.
Our goal will be to perform measurements that take about 20 ms each, and to repeat them. We will print the output header and start our measurement with a single local workgroup:
// output header
cout << "\n"
        " Measurement  Number of         Computation  Performance\n"
        " time stamp   local workgroups  time" << endl;

uint32_t groupCountX = 1;
uint32_t groupCountY = 1;
uint32_t groupCountZ = 1;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

	// begin command buffer
	vk::beginCommandBuffer(
		commandBuffer,
		vk::CommandBufferBeginInfo{
			.flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
			.pInheritanceInfo = nullptr,
		}
	);

	// bind pipeline
	vk::cmdBindPipeline(commandBuffer, vk::PipelineBindPoint::eCompute, pipeline);

	// dispatch computation
	vk::cmdDispatch(commandBuffer, groupCountX, groupCountY, groupCountZ);

	// end command buffer
	vk::endCommandBuffer(commandBuffer);
Then, we submit the work and measure how long the device takes to complete it. Once done, we print the results, and we stop further measurements if we have already been measuring for more than three seconds:
	// print results
	double time = chrono::duration<double>(t2 - t1).count();
	double totalTime = chrono::duration<double>(t2 - startTime).count();
	uint64_t numInstructions = uint64_t(20000) * 128 * groupCountX * groupCountY * groupCountZ;
	cout << fixed << setprecision(2)
	     << setw(9) << totalTime * 1000 << "ms "
	     << setw(9) << groupCountX * groupCountY * groupCountZ << "         "
	     << formatFloatSI(time) << "s      "
	     << formatFloatSI(double(numInstructions) / time) << "FLOPS" << endl;

	// stop measurements after three seconds
	if(totalTime >= 3.)
		break;
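As a quick sanity check of these numbers (my own arithmetic, not part of the article's code): each workgroup performs 20000 × 128 = 2'560'000 floating point operations, which matches both the 256 billion operations quoted for 100'000 workgroups and the roughly 3.5 TFLOPS reported later for 10'000 workgroups completed in 7.30 ms:

```cpp
#include <cstdint>

// FLOP per workgroup, matching the numInstructions formula above
const uint64_t flopPerGroup = uint64_t(20000) * 128;        // 2'560'000

// 100'000 workgroups give the 256 billion operations
// mentioned at the beginning of the article
const uint64_t totalFlop = flopPerGroup * 100000;           // 256'000'000'000

// 10'000 workgroups in 7.30 ms works out to about 3.51 TFLOPS
const double tflops = double(flopPerGroup) * 10000 / 0.00730 / 1e12;
```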
The formatFloatSI() function just formats a floating point value more nicely. It prints the three most significant digits followed by an SI suffix, such as M for mega, G for giga, or T for tera. Only three digits are printed to keep the output as simple as possible.
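The helper itself is not listed in this article, but a possible implementation might look as follows. Treat it as an illustrative sketch: it rounds to three significant digits and appends an SI prefix, handling both large values (k, M, G, T, P) and small ones (m, u, n, p) as they appear in the output below:

```cpp
#include <cstdio>
#include <string>

// a possible implementation of formatFloatSI() (hypothetical; the article's
// actual helper is not shown): print the three most significant digits
// of `v` followed by an SI prefix
std::string formatFloatSI(double v)
{
	static const char* const up[]   = { "", "k", "M", "G", "T", "P" };
	static const char* const down[] = { "m", "u", "n", "p" };
	const char* prefix = "";
	if(v >= 1000.) {
		size_t i = 0;
		while(v >= 1000. && i < 5) { v /= 1000.; i++; }
		prefix = up[i];
	}
	else if(v > 0. && v < 1.) {
		size_t i = 0;
		while(v < 1. && i < 4) { v *= 1000.; i++; }
		prefix = down[i - 1];
	}
	char buf[32];
	if(v >= 100.)     snprintf(buf, sizeof(buf), "%.0f %s", v, prefix);
	else if(v >= 10.) snprintf(buf, sizeof(buf), "%.1f %s", v, prefix);
	else              snprintf(buf, sizeof(buf), "%.2f %s", v, prefix);
	return buf;
}
```

For example, formatFloatSI(0.0199) returns "19.9 m", so appending "s" in the print statement yields "19.9 ms".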
We continue by updating the number of local workgroups for the next measurement. If the previous measurement took less than 2 ms, we simply multiply the number of local workgroups by ten. Otherwise, we compute the ratio between the target and the measured time and aim for a 20 ms measurement. The Vulkan specification guarantees that at least 65'535 local workgroups are supported in each of the X, Y and Z dimensions. So, we cap each dimension at 10'000 to avoid exceeding this limit:
	// update number of local workgroups
	// to reach computation time of about 20ms
	constexpr double targetTime = 0.02;
	if(time < targetTime / 10.) {
		if(groupCountX <= 1000)
			groupCountX *= 10;
		else if(groupCountY <= 1000)
			groupCountY *= 10;
		else if(groupCountZ <= 1000)
			groupCountZ *= 10;
	}
	else {
		double ratio = targetTime / time;
		uint64_t newNumGroups = uint64_t(ratio * (uint64_t(groupCountX) * groupCountY * groupCountZ));
		if(newNumGroups > 10000 * 10000) {
			groupCountZ = 1 + ((newNumGroups - 1) / (10000 * 10000));
			uint64_t remainder = newNumGroups / groupCountZ;
			groupCountY = 1 + ((remainder - 1) / 10000);
			groupCountX = remainder / groupCountY;
		}
		else {
			if(newNumGroups == 0)
				newNumGroups = 1;
			groupCountZ = 1;
			groupCountY = 1 + ((newNumGroups - 1) / 10000);
			groupCountX = newNumGroups / groupCountY;
		}
	}

} while(true);
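To experiment with the decomposition in isolation, the bottom part of the update step can be transcribed into a standalone function (an illustrative extraction; the article keeps this logic inline):

```cpp
#include <cstdint>
#include <tuple>

// split a desired total number of workgroups into X, Y and Z counts,
// keeping X and Y at 10'000 or less; transcribed from the update step above
std::tuple<uint32_t, uint32_t, uint32_t> decomposeGroupCount(uint64_t newNumGroups)
{
	uint32_t x, y, z;
	if(newNumGroups > uint64_t(10000) * 10000) {
		z = uint32_t(1 + ((newNumGroups - 1) / (uint64_t(10000) * 10000)));
		uint64_t remainder = newNumGroups / z;
		y = uint32_t(1 + ((remainder - 1) / 10000));
		x = uint32_t(remainder / y);
	}
	else {
		if(newNumGroups == 0)
			newNumGroups = 1;
		z = 1;
		y = uint32_t(1 + ((newNumGroups - 1) / 10000));
		x = uint32_t(newNumGroups / y);
	}
	return { x, y, z };
}
```

Because of the integer divisions, the product of the three counts is not always exactly the requested number, but it never exceeds it, and X and Y always stay within the 10'000 cap.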
When we run the application, we might see a result similar to the following one:
Compatible devices:
   1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
   2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
   3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
   4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
Using device: Quadro RTX 3000

 Measurement  Number of         Computation  Performance
 time stamp   local workgroups  time
    0.53ms            1         286 us       8.96 GFLOPS
    1.30ms           10         180 us        142 GFLOPS
    2.01ms          100         186 us       1.38 TFLOPS
    3.46ms         1000         935 us       2.74 TFLOPS
   11.32ms        10000         7.30 ms      3.51 TFLOPS
   31.74ms        27390         19.9 ms      3.53 TFLOPS
   53.28ms        27576         20.1 ms      3.52 TFLOPS
   74.43ms        27480         19.8 ms      3.56 TFLOPS
   96.04ms        27774         20.2 ms      3.52 TFLOPS
  117.46ms        27483         20.0 ms      3.53 TFLOPS
  139.08ms        27540         20.1 ms      3.50 TFLOPS
  163.54ms        27363         20.1 ms      3.48 TFLOPS
  184.53ms        27198         19.9 ms      3.50 TFLOPS
  207.91ms        27348         19.7 ms      3.55 TFLOPS
  230.24ms        27744         18.2 ms      3.90 TFLOPS
  245.08ms        30492         11.8 ms      6.63 TFLOPS
  265.70ms        51804         19.6 ms      6.76 TFLOPS
  286.95ms        52806         19.8 ms      6.83 TFLOPS
  309.12ms        53394         20.6 ms      6.62 TFLOPS
  332.74ms        51750         19.7 ms      6.74 TFLOPS
  355.33ms        52626         19.7 ms      6.85 TFLOPS
  377.28ms        53472         20.3 ms      6.74 TFLOPS
  399.08ms        52626         20.3 ms      6.64 TFLOPS
  420.75ms        51858         20.1 ms      6.59 TFLOPS
  444.25ms        51492         19.7 ms      6.71 TFLOPS
[...snip...]
In the first column, we see the timeline. In the second column, the number of local workgroups grows from 1 in the first measurement to about 27 thousand, stays there for a while, and then grows to about 52 thousand. The computation time reaches 20 milliseconds in the sixth measurement. A meaningful performance of 3.5 TFLOPS is reported already in the fifth measurement. A strange change can be seen about 200 ms after the beginning of the measurements: the performance suddenly rises from about 3.5 to about 6.7 TFLOPS. It almost doubles at that point and stays at about 6.7 TFLOPS until the end of the measurement process. Undoubtedly, after about 200 ms the graphics card switches to a higher performance level with higher power consumption. So, the real performance of a graphics card is not always visible immediately; some time might be needed before the highest performance level is reached.
We might also want to measure another graphics card installed in the system. The possible options are identified by a number, ranging from 1 to 4 in our case:
Compatible devices:
   1: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
   2: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
   3: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
   4: llvmpipe (LLVM 20.1.5, 256 bits) (compute queue: 0, type: Cpu)
We can pass '-' followed by a number on the command line to select a particular device and compute queue. For example, specifying -1 on the command line will measure the performance of the integrated Intel GPU.
Passing a substring of the device name is another way to select the device to test. For example, passing llvm on the command line will select the llvmpipe software device, which runs entirely on the CPU.
We can combine both approaches: specifying RTX matches two devices, and the first one is used unless we also pass -2 to select the second one, which uses a different compute queue.
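The selection logic described above might be sketched as follows. This is an illustrative reconstruction, not the article's actual code; the DeviceEntry record and selectDevice() helper are hypothetical names, and the real application would fill the list from Vulkan physical device enumeration:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// hypothetical device record; the real application gets this
// information from Vulkan physical device enumeration
struct DeviceEntry {
	std::string name;
	uint32_t queueFamily;
};

// select a device the way described above: "-N" picks the N-th candidate,
// any other argument filters candidates by a name substring;
// returns an index into `devices`, or -1 when nothing matches
int selectDevice(const std::vector<DeviceEntry>& devices,
                 const std::vector<std::string>& args)
{
	std::string substring;
	int index = 0;  // 0 means "not specified"
	for(const std::string& a : args)
		if(a.size() >= 2 && a[0] == '-' && a[1] >= '0' && a[1] <= '9')
			index = std::stoi(a.substr(1));
		else
			substring = a;

	// collect devices whose name contains the substring
	// (an empty substring matches every device)
	std::vector<int> matches;
	for(size_t i = 0; i < devices.size(); i++)
		if(devices[i].name.find(substring) != std::string::npos)
			matches.push_back(int(i));

	if(matches.empty())
		return -1;
	if(index >= 1 && index <= int(matches.size()))
		return matches[index - 1];
	return matches[0];  // default to the first candidate
}
```

With the four-device list above, passing "llvm" selects llvmpipe, while passing "RTX" together with "-2" selects the second Quadro RTX 3000 entry, the one with compute queue 2.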