A logical device contains one or more queues that can be used to submit work for execution. This article covers two main topics: selecting a device and submitting work.
It is not uncommon for Vulkan to list two or more physical devices present in the system. The programmer then faces a question: Which one should I use? If I take the first one, will I get the most performant one? Or will I be unlucky and get slow integrated graphics or a software-based CPU device?
We will try to answer the question, but let's start from the beginning. First, we get the list of the physical devices in the system. Then, we filter out those that do not support all the functionality we require. Those that satisfy our requirements are stored in the compatibleDevices variable:
// get compatible and incompatible devices
//
// required functionality: compute queue
// optional functionality: none
vk::vector<vk::PhysicalDevice> deviceList = vk::enumeratePhysicalDevices();
vector<tuple<vk::PhysicalDevice, uint32_t, vk::PhysicalDeviceProperties>> compatibleDevices;
vector<vk::PhysicalDeviceProperties> incompatibleDevices;
for(vk::PhysicalDevice pd : deviceList) {

	// append compatible queue families
	vk::PhysicalDeviceProperties props = vk::getPhysicalDeviceProperties(pd);
	vk::vector<vk::QueueFamilyProperties> queueFamilyPropList = vk::getPhysicalDeviceQueueFamilyProperties(pd);
	bool found = false;
	for(uint32_t i=0, c=uint32_t(queueFamilyPropList.size()); i<c; i++) {

		// test for compute operations support
		vk::QueueFamilyProperties& qfp = queueFamilyPropList[i];
		if(qfp.queueFlags & vk::QueueFlagBits::eCompute) {
			found = true;
			compatibleDevices.emplace_back(pd, i, props);
		}
	}

	// append incompatible devices
	if(!found)
		incompatibleDevices.emplace_back(props);
}

// print device list
cout << "List of devices:" << endl;
for(size_t i=0, c=compatibleDevices.size(); i<c; i++) {
	auto& t = compatibleDevices[i];
	cout << " " << i+1 << ": " << get<2>(t).deviceName << " (compute queue: "
	     << get<1>(t) << ", type: " << to_cstr(get<2>(t).deviceType) << ")" << endl;
}
for(size_t i=0, c=incompatibleDevices.size(); i<c; i++) {
	auto& props = incompatibleDevices[i];
	cout << " incompatible: " << props.deviceName
	     << " (type: " << to_cstr(props.deviceType) << ")" << endl;
}
We iterate over all physical devices and store them in the compatibleDevices or incompatibleDevices list. How do we decide whether a device is compatible? In our case, we need compute capability. More precisely, we look for devices that have a queue with compute capability. So, for each device, we get the list of its queue families and iterate over them. For each family with compute capability, we make a record in the compatibleDevices list, so a device may appear there once per compatible queue family. If no such family is found, we put the device into the incompatibleDevices list. Finally, we print all the compatible devices followed by the incompatible ones.
The next step is the selection of the most suitable device. Often, we want the device with the highest performance. Running a benchmark might not be the best solution, because we want our application to start quickly. So, we will use a different approach: we will select the device based on its type. We prefer a discrete GPU over an integrated one, an integrated GPU over a virtual one, a virtual GPU over a CPU-based device, and any of these over other types. We just assign a score to each device type and choose the device with the highest score:
// choose the best device
// (assumes compatibleDevices is not empty; otherwise there is nothing to choose from)
auto bestDevice = compatibleDevices.begin();
constexpr const array deviceTypeScore = {
	10, // vk::PhysicalDeviceType::eOther         - lowest score
	40, // vk::PhysicalDeviceType::eIntegratedGpu - high score
	50, // vk::PhysicalDeviceType::eDiscreteGpu   - highest score
	30, // vk::PhysicalDeviceType::eVirtualGpu    - normal score
	20, // vk::PhysicalDeviceType::eCpu           - low score
	10, // unknown vk::PhysicalDeviceType
};
int bestScore = deviceTypeScore[clamp(int(get<2>(*bestDevice).deviceType), 0, int(deviceTypeScore.size())-1)];
for(auto it=compatibleDevices.begin()+1; it!=compatibleDevices.end(); it++) {
	int score = deviceTypeScore[clamp(int(get<2>(*it).deviceType), 0, int(deviceTypeScore.size())-1)];
	if(score > bestScore) {
		bestDevice = it;
		bestScore = score;
	}
}

// device to use
cout << "Using device:\n"
        " " << get<2>(*bestDevice).deviceName << endl;
Then, we create the device with a queue supporting compute operations:
// create device
vk::initDevice(
	pd,  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures = nullptr,  // no enabled features
	}
);
We also need a command pool and a command buffer. The command buffer is used for recording the work to be done; after recording, it can be submitted to the device for execution.
However, command buffers are not created directly. Instead, we create a command pool and allocate command buffers from it. Command pools allow the driver to amortize the cost of resource allocation across multiple command buffers. They also avoid the need for locks in multithreaded applications, because each thread can use its own command pool. We create the command pool and the command buffer as follows:
// command pool
vk::UniqueCommandPool commandPool =
	vk::createCommandPoolUnique(
		vk::CommandPoolCreateInfo{
			.flags = {},
			.queueFamilyIndex = queueFamily,
		}
	);

// allocate command buffer
vk::CommandBuffer commandBuffer =
	vk::allocateCommandBuffer(
		vk::CommandBufferAllocateInfo{
			.commandPool = commandPool,
			.level = vk::CommandBufferLevel::ePrimary,
			.commandBufferCount = 1,
		}
	);
Now, we can record the command buffer. In this article, we just begin and end the recording:
// begin command buffer
vk::beginCommandBuffer(
	commandBuffer,
	vk::CommandBufferBeginInfo{
		.flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
		.pInheritanceInfo = nullptr,
	}
);

// end command buffer
vk::endCommandBuffer(commandBuffer);
Then, we create a fence and submit the command buffer for execution:
// fence
vk::UniqueFence computingFinishedFence =
	vk::createFenceUnique(
		vk::FenceCreateInfo{
			.flags = {}
		}
	);

// submit work
cout << "Submitting work..." << endl;
vk::queueSubmit(
	queue,
	vk::SubmitInfo{
		.waitSemaphoreCount = 0,
		.pWaitSemaphores = nullptr,
		.pWaitDstStageMask = nullptr,
		.commandBufferCount = 1,
		.pCommandBuffers = &commandBuffer,
		.signalSemaphoreCount = 0,
		.pSignalSemaphores = nullptr,
	},
	computingFinishedFence
);
A fence is a synchronization primitive that can be used to wait for the device to finish submitted work. A fence has two states: signaled and unsignaled. When we pass a fence to a queue submit operation, it must be unsignaled. Once the device finishes all the work of that particular submit operation, it signals the fence. If our application is waiting on the fence, its execution resumes:
// wait for the work
cout << "Waiting for the work..." << endl;
vk::Result r =
	vk::waitForFence_noThrow(
		computingFinishedFence,
		uint64_t(1.5e9)  // timeout (1.5 seconds)
	);
if(r == vk::Result::eTimeout) {
	cout << "Vulkan device timeout. Task is probably hanging." << endl;
	// Use std::quick_exit() to terminate the application.
	// (Do not throw, do not return, and do not call std::exit():
	// the device is still busy and is using a number of handles, such as
	// computingFinishedFence and the device handle itself.
	// Destroying handles that are in use, or otherwise accessing them
	// in disallowed ways, is forbidden by the Vulkan specification.)
	quick_exit(-1);
} else
	vk::checkForSuccessValue(r, "vkWaitForFences");
cout << "Done." << endl;
As we can see, we wait on the fence with a timeout of one and a half seconds, which is usually more than enough for a typical GPU task. However, a programmer's mistake or other circumstances might result in a frozen GPU task, for example, one stuck in an endless loop. Depending on the device architecture, a frozen task might leave the screen frozen and unresponsive. Thus, some drivers resort to a GPU reset if a task does not finish within a certain time frame. In particular, some Nvidia drivers kill tasks that are not able to finish within two seconds. Other drivers might allow much longer times. Even if a frozen task does not cause an unresponsive screen on a particular device, we do not want to wait tens of seconds or even indefinitely inside the waitForFence call. Thus, we use a timeout of 1.5 seconds to keep the application responsive from the user's point of view.
Another issue is the handling of the timeout. We cannot just throw an exception, because stack unwinding would destroy many Vulkan handles, including the device itself, before the application terminated. Vulkan does not allow the destruction of handles that are in use by the device, including the device handle itself. We can neither return, for the same reason, nor call std::exit(). So, we resort to calling std::quick_exit() and leave the release of all Vulkan handles to the operating system.
Concerning vk::waitForFence(): we use the noThrow variant of the function because we do not want it to throw in the case of an error. Instead, we handle vk::Result::eTimeout ourselves. For all other return codes, we let vk::checkForSuccessValue() verify that the result is vk::Result::eSuccess. If it is not, it throws, even for non-error return codes. This treatment of non-error codes has two reasons. First, the Vulkan function vkWaitForFences() is not expected to return any success values other than the ones we already handle, namely eSuccess and eTimeout. Second, even if some future Vulkan extension or core version adds a new success value, such a value would not mean that our job is done. vk::Result::eTimeout is also a non-error return code, and yet it does not mean that the job was done. This is why we accept nothing other than vk::Result::eSuccess and vk::Result::eTimeout, and handle everything else by throwing an exception.