Vulkan Tutorial

2-4 - Timestamp Queries

Calling chrono::high_resolution_clock::now() is the CPU way to get the current time. However, Vulkan provides its own way to get the current time directly on the device. Reading the time on the device avoids unnecessary overhead and may give more precise measurements.

Vulkan provides the vkCmdWriteTimestamp function, which requests a timestamp and writes it into a query pool:

void vkCmdWriteTimestamp(
	VkCommandBuffer commandBuffer,
	VkPipelineStageFlagBits pipelineStage,
	VkQueryPool queryPool,
	uint32_t query);

The command is recorded into the command buffer specified by the first parameter. The second parameter specifies a pipeline stage. What is a pipeline stage, and why is it needed as a parameter? Vulkan devices may execute commands in parallel and sometimes out of submission order. Moreover, commands are executed in stages, such as the compute shader stage, vertex shader stage, geometry shader stage and fragment shader stage. All commands submitted before vkCmdWriteTimestamp must complete execution of at least the stage specified by the pipelineStage parameter before the timestamp is written. The third parameter, queryPool, specifies the object where the timestamp will be stored. Finally, the fourth parameter, query, specifies the index into queryPool where the result will be stored.

The command buffer is recorded as follows:

// reset timestamp pool
vk::cmdResetQueryPool(
	commandBuffer,
	timestampPool,
	0,  // firstQuery
	2);  // queryCount

// bind pipeline
vk::cmdBindPipeline(commandBuffer, vk::PipelineBindPoint::eCompute, pipeline);

// write timestamp 0
vk::cmdWriteTimestamp(
	commandBuffer,
	vk::PipelineStageFlagBits::eTopOfPipe,
	timestampPool,
	0);  // query

// dispatch computation
vk::cmdDispatch(commandBuffer, workgroupCountX, workgroupCountY, workgroupCountZ);

// write timestamp 1
vk::cmdWriteTimestamp(
	commandBuffer,
	vk::PipelineStageFlagBits::eBottomOfPipe,
	timestampPool,
	1);  // query

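First, we reset the query pool and bind the pipeline. Then, we write the first timestamp, record the work using vk::cmdDispatch() and write the second timestamp. The first timestamp uses eTopOfPipe, so it does not wait for earlier commands to progress through any stage. The second uses eBottomOfPipe, so it is written only after the dispatch has completed all of its stages.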

When the command buffer is executed, the timestamps are written by the device into the query pool. After waiting for the command buffer execution to complete, we read the timestamps:

// read timestamps
array<uint64_t, 2> timestamps;
vk::getQueryPoolResults(
	timestampPool,  // queryPool
	0,  // firstQuery
	2,  // queryCount
	2 * sizeof(uint64_t),  // dataSize
	timestamps.data(),  // pData
	sizeof(uint64_t),  // stride
	vk::QueryResultFlagBits::e64 | vk::QueryResultFlagBits::eWait  // flags
);

// print results
double time = float((timestamps[1] - timestamps[0]) & timestampValidBitMask) * timestampPeriod / 1e9;
double totalTime = chrono::duration<double>(chrono::high_resolution_clock::now() - startTime).count();
uint64_t numInstructions = uint64_t(20000) * 128 * workgroupCountX * workgroupCountY * workgroupCountZ;
cout << fixed << setprecision(2)
     << setw(9) << totalTime * 1000 << "ms       "
     << setw(9) << workgroupCountX * workgroupCountY * workgroupCountZ << "        "
     << "     " << formatFloatSI(time) << "s   "
     << "    " << formatFloatSI(double(numInstructions) / time) << "FLOPS" << endl;

We read the two timestamps as two 64-bit integers. We subtract them to get their difference and mask the result with timestampValidBitMask to keep only the valid bits. Then we multiply it by timestampPeriod, which holds the number of nanoseconds it takes for the timestamp to increment by one. Finally, we divide the result by 1e9 to convert the value from nanoseconds to seconds. At the end, the results are printed to the screen.
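The conversion described above can be isolated into a small function. The following is only a sketch: the name timestampsToSeconds and its parameters are hypothetical and are not part of the tutorial code. It also converts the tick count to double instead of float, which is my own choice to preserve more digits for long measurements.

```cpp
#include <cstdint>

// Hypothetical helper (not from the tutorial sources):
// converts two raw device timestamps into elapsed seconds.
// validBitMask has the low valid timestamp bits set;
// periodNs is the number of nanoseconds per timestamp tick.
double timestampsToSeconds(uint64_t begin, uint64_t end,
                           uint64_t validBitMask, float periodNs)
{
	// unsigned subtraction wraps around, so masking the difference
	// gives the correct tick count even if the counter overflowed once
	uint64_t ticks = (end - begin) & validBitMask;

	// ticks -> nanoseconds -> seconds
	return double(ticks) * double(periodNs) / 1e9;
}
```

Masking the difference rather than each timestamp separately is what makes a single counter wrap-around harmless: the subtraction is performed modulo 2^64 and the mask then reduces it modulo the number of valid bits.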

The last missing piece of the code is timestamp pool creation:

// timestamp pool
vk::UniqueQueryPool timestampPool =
	vk::createQueryPoolUnique(
		vk::QueryPoolCreateInfo{
			.flags = {},
			.queryType = vk::QueryType::eTimestamp,
			.queryCount = 2,
			.pipelineStatistics = {},
		}
	);

We set queryType to eTimestamp, so the pool will store timestamps, and we set queryCount to 2 to make room for two of them.
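This article does not show where timestampValidBitMask and timestampPeriod come from. In Vulkan, the number of valid timestamp bits is reported per queue family in VkQueueFamilyProperties::timestampValidBits, and the tick length in nanoseconds is reported in VkPhysicalDeviceLimits::timestampPeriod. Assuming the mask is derived from the bit count, a minimal sketch could look like this. The function name makeTimestampValidBitMask is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper (not from the tutorial sources):
// builds a mask with the low `validBits` bits set.
// validBits is expected to come from
// VkQueueFamilyProperties::timestampValidBits.
constexpr uint64_t makeTimestampValidBitMask(uint32_t validBits)
{
	// shifting a 64-bit value by 64 or more is undefined behavior in C++,
	// so the full-width case is handled separately
	return (validBits >= 64) ? ~uint64_t(0)
	                         : (uint64_t(1) << validBits) - 1;
}
```

A timestampValidBits value of 0 means the queue family does not support timestamps at all. In that case the helper returns 0, and the application should not record timestamp queries on that queue.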

We can run the example and see whether we gained any precision in our simple application. On my Quadro RTX 3000, I see a real difference in the first few measurements. Dispatching 1, 10 or 100 workgroups results in measured times as short as 100 us when using Vulkan timestamps. Without timestamps, I see values ranging from 250 to 600 us using the code from the previous article. As we can see, timestamps are more precise than measuring the time on the CPU. The higher values measured on the CPU are probably caused by CPU-GPU synchronization. This difference is usually noticeable on short measurements only. Still, I recommend using Vulkan timestamps whenever measuring execution time on a Vulkan device.