Vulkan Tutorial

2-1 - Command Submission

A logical device contains one or more queues to which work can be submitted for execution. This article covers two main topics: device selection and work submission.

Device selection

It is not uncommon for Vulkan to list two or more physical devices in the system. The programmer then faces a question: Which one should I use? If I take the first one, will I get the high-performance one? Or will I be unlucky and get slow integrated graphics or a software-based CPU device?

We will answer the question step by step. First, we get the list of physical devices in the system. Then, we filter out those that do not support all the functionality we require. Those that satisfy our requirements are stored in the compatibleDevices variable:

// get compatible devices
vk::vector<vk::PhysicalDevice> deviceList = vk::enumeratePhysicalDevices();
vector<tuple<vk::PhysicalDevice, uint32_t, vk::PhysicalDeviceProperties>> compatibleDevices;
vector<vk::PhysicalDeviceProperties> incompatibleDevices;
for(size_t i=0; i<deviceList.size(); i++) {

	// append compatible queue families
	vk::PhysicalDevice pd = deviceList[i];
	vk::PhysicalDeviceProperties props = vk::getPhysicalDeviceProperties(pd);
	vk::vector<vk::QueueFamilyProperties> queueFamilyPropList = vk::getPhysicalDeviceQueueFamilyProperties(pd);
	bool found = false;
	for(uint32_t q=0, c=uint32_t(queueFamilyPropList.size()); q<c; q++) {

		// test for compute operations support
		// (q is the queue family index; it no longer shadows the outer device index i)
		vk::QueueFamilyProperties& qfp = queueFamilyPropList[q];
		if(qfp.queueFlags & vk::QueueFlagBits::eCompute) {
			found = true;
			compatibleDevices.emplace_back(pd, q, props);
		}

	}

	// append incompatible devices
	if(!found)
		incompatibleDevices.emplace_back(props);

}

// print device list
cout << "List of devices:" << endl;
for(size_t i=0, c=compatibleDevices.size(); i<c; i++) {
	auto& t = compatibleDevices[i];
	cout << "   " << i+1 << ": " << get<2>(t).deviceName << " (compute queue: "
			<< get<1>(t) << ", type: " << to_cstr(get<2>(t).deviceType) << ")" << endl;
}
for(size_t i=0, c=incompatibleDevices.size(); i<c; i++) {
	auto& props = incompatibleDevices[i];
	cout << "   incompatible: " << props.deviceName
			<< " (type: " << to_cstr(props.deviceType) << ")" << endl;
}

We iterate over all physical devices and store them in either the compatibleDevices or the incompatibleDevices list. How do we decide whether a device is compatible? In our case, we require compute capability on any queue. So, for each device, we get the list of its queue families and look for those that support compute operations. Each such queue family is recorded in the compatibleDevices list together with its device. If the device has none, we put it into the incompatibleDevices list. Finally, we print all the compatible devices followed by the incompatible ones.
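
The compute test itself can be sketched without any Vulkan headers. The following is only an illustration, not the tutorial's code: the function name firstComputeQueueFamily and the constant kQueueComputeBit are made up for this example, and the constant is assumed to mirror the value of VK_QUEUE_COMPUTE_BIT (0x2) from the core Vulkan headers. Each vector element stands for the queueFlags of one queue family.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for VK_QUEUE_COMPUTE_BIT (0x2 in the core Vulkan headers).
constexpr uint32_t kQueueComputeBit = 0x00000002;

// Returns the index of the first queue family whose flags include compute,
// or std::nullopt when no family of the device supports compute.
// queueFamilyFlags[i] is assumed to hold the queueFlags of queue family i.
std::optional<uint32_t> firstComputeQueueFamily(const std::vector<uint32_t>& queueFamilyFlags)
{
	for(uint32_t i=0, c=uint32_t(queueFamilyFlags.size()); i<c; i++)
		if(queueFamilyFlags[i] & kQueueComputeBit)
			return i;
	return std::nullopt;
}
```

Note the difference in design: the loop in the article records every compute-capable queue family, so a device with several such families appears in the list several times. The sketch keeps only the first one, which might be preferable when one entry per device is wanted.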

The next step is to select the most suitable device. Often, we want the most performant one. But running a benchmark might not be the best solution, because we also want our application to start quickly. So, we will use a different approach: we select the device based on its type. We prefer discrete graphics over integrated, integrated over a virtual GPU, a virtual GPU over a CPU-based device, and all of these over any other type. We simply assign a score to each device type and choose the device with the highest score:

// choose the best device
if(compatibleDevices.empty())
	throw std::runtime_error("No compatible devices.");
auto bestDevice = compatibleDevices.begin();
constexpr const array deviceTypeScore = {
	10,  // vk::PhysicalDeviceType::eOther         - lowest score
	40,  // vk::PhysicalDeviceType::eIntegratedGpu - high score
	50,  // vk::PhysicalDeviceType::eDiscreteGpu   - highest score
	30,  // vk::PhysicalDeviceType::eVirtualGpu    - normal score
	20,  // vk::PhysicalDeviceType::eCpu           - low score
	10,  // unknown vk::PhysicalDeviceType
};
int bestScore = deviceTypeScore[clamp(int(get<2>(*bestDevice).deviceType), 0, int(deviceTypeScore.size())-1)];
for(auto it=compatibleDevices.begin()+1; it!=compatibleDevices.end(); it++) {
	int score = deviceTypeScore[clamp(int(get<2>(*it).deviceType), 0, int(deviceTypeScore.size())-1)];
	if(score > bestScore) {
		bestDevice = it;
		bestScore = score;
	}
}

// device to use
cout << "Using device:\n"
        "   " << get<2>(*bestDevice).deviceName << endl;

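The same selection rule can be tried out in isolation. The sketch below is a hedged, Vulkan-free variant: the DeviceType enum is a stand-in assumed to mirror the numeric values of VkPhysicalDeviceType, and the names deviceTypeScore() and chooseBestDevice() are invented for this illustration. It uses std::max_element instead of the hand-written loop; both keep the first device when scores are equal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in mirroring the numeric values of VkPhysicalDeviceType.
enum class DeviceType : int { Other=0, IntegratedGpu=1, DiscreteGpu=2, VirtualGpu=3, Cpu=4 };

// Same scores as in the article; unknown values get the lowest score.
int deviceTypeScore(DeviceType type)
{
	switch(type) {
		case DeviceType::DiscreteGpu:   return 50;
		case DeviceType::IntegratedGpu: return 40;
		case DeviceType::VirtualGpu:    return 30;
		case DeviceType::Cpu:           return 20;
		default:                        return 10;
	}
}

// Returns the index of the highest-scoring device; the first one wins on ties.
size_t chooseBestDevice(const std::vector<DeviceType>& types)
{
	assert(!types.empty() && "at least one compatible device is required");
	auto best = std::max_element(types.begin(), types.end(),
		[](DeviceType a, DeviceType b) { return deviceTypeScore(a) < deviceTypeScore(b); });
	return size_t(best - types.begin());
}
```

For example, for the list { IntegratedGpu, DiscreteGpu, Cpu }, chooseBestDevice() returns index 1, the discrete GPU.
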
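Then, we create the logical device with one queue that supports compute operations. Here, pd and queueFamily are assumed to hold the physical device and the queue family index of the selected record, that is, get<0>(*bestDevice) and get<1>(*bestDevice):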

// create device
vk::initDevice(
	pd,  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures = nullptr,  // no enabled features
	}
);

We also need a command pool and a command buffer. The command buffer is used to record the work to be done. After recording, it can be submitted to the device for execution.

However, command buffers are not created directly. Instead, we create a command pool and allocate command buffers from it. Command pools allow the driver to amortize the cost of resource allocation across multiple command buffers. They also avoid the need for locks in multithreaded applications, because each thread can use its own command pool. We create the command pool and the command buffer as follows:

// command pool
vk::UniqueCommandPool commandPool =
	vk::createCommandPoolUnique(
		vk::CommandPoolCreateInfo{
			.flags = {},
			.queueFamilyIndex = queueFamily,
		}
	);

// allocate command buffer
vk::CommandBuffer commandBuffer =
	vk::allocateCommandBuffer(
		vk::CommandBufferAllocateInfo{
			.commandPool = commandPool,
			.level = vk::CommandBufferLevel::ePrimary,
			.commandBufferCount = 1,
		}
	);

Now we can record the command buffer. In this article, we just begin and end the recording:

// begin command buffer
vk::beginCommandBuffer(
	commandBuffer,
	vk::CommandBufferBeginInfo{
		.flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
		.pInheritanceInfo = nullptr,
	}
);

// end command buffer
vk::endCommandBuffer(commandBuffer);

Then, we create a fence and submit the command buffer for execution:

// fence
vk::UniqueFence computingFinishedFence =
	vk::createFenceUnique(
		vk::FenceCreateInfo{
			.flags = {}
		}
	);

// submit work
cout << "Submitting work..." << endl;
vk::queueSubmit(
	queue,
	vk::SubmitInfo{
		.waitSemaphoreCount = 0,
		.pWaitSemaphores = nullptr,
		.pWaitDstStageMask = nullptr,
		.commandBufferCount = 1,
		.pCommandBuffers = &commandBuffer,
		.signalSemaphoreCount = 0,
		.pSignalSemaphores = nullptr,
	},
	computingFinishedFence
);

A fence is a synchronization primitive that can be used to wait for the device to finish submitted work. Fences have two states: signaled and unsignaled. When we pass a fence to a queue submit operation, it must be unsignaled. Once the device finishes all the work of that submit operation, it signals the fence. If our application is waiting on the fence, it resumes:

// wait for the work
cout << "Waiting for the work..." << endl;
vk::Result r =
	vk::waitForFence_noThrow(
		computingFinishedFence,
		uint64_t(3e9)  // timeout (3s)
	);
if(r == vk::Result::eTimeout)
	throw std::runtime_error("GPU timeout. Task is probably hanging.");
else
	vk::checkForSuccessValue(r, "vkWaitForFences");

cout << "Done." << endl;

As we can see, we wait for the fence with a timeout of three seconds; the timeout is given in nanoseconds, hence the value 3e9. We call the noThrow variant of the function because we want to inspect the result ourselves instead of letting the function throw. We handle vk::Result::eTimeout ourselves. For all other return codes, we let vk::checkForSuccessValue() check whether the result is vk::Result::eSuccess. If not, it throws, even on non-error return codes. We treat non-error codes this way for two reasons. First, the Vulkan function vkWaitForFences() is not expected to return any success values other than the two we already handle, namely eSuccess and eTimeout. Second, even if some future extension or Vulkan version adds a new success value, such a value does not mean that our job is done. After all, vk::Result::eTimeout is also a non-error return code, and it does not mean that the job was done either. That is why we do not accept any other non-error return code and throw instead.
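
The policy described above can be sketched independently of the wrapper used in this article. In the following illustration, the helper names toFenceTimeout() and handleWaitResult() are hypothetical, and the integer constants are stand-ins assumed to mirror the corresponding VkResult values from the core Vulkan headers; error codes are negative, non-error codes are zero or positive.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins mirroring VkResult values from the core Vulkan headers.
constexpr int kSuccess = 0;           // VK_SUCCESS
constexpr int kNotReady = 1;          // VK_NOT_READY (non-error code)
constexpr int kTimeout = 2;           // VK_TIMEOUT (non-error code)
constexpr int kErrorDeviceLost = -4;  // VK_ERROR_DEVICE_LOST

// Fence timeouts are expressed in nanoseconds; this converts any chrono
// duration that is implicitly convertible to nanoseconds, such as seconds.
inline uint64_t toFenceTimeout(std::chrono::nanoseconds t)
{
	assert(t.count() >= 0 && "timeout must not be negative");
	return uint64_t(t.count());
}

// Accepts only kSuccess. A timeout gets its own message; every other code,
// error or not, is reported as unexpected.
void handleWaitResult(int r)
{
	if(r == kTimeout)
		throw std::runtime_error("GPU timeout. Task is probably hanging.");
	if(r != kSuccess)
		throw std::runtime_error("vkWaitForFences returned unexpected result " + std::to_string(r));
}
```

With these helpers, toFenceTimeout(std::chrono::seconds(3)) yields 3000000000, the same value as uint64_t(3e9) in the code above, and handleWaitResult() returns normally only for kSuccess.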