A logical device contains one or more queues that can be used to submit work for execution. This article covers two main topics: selecting the device and submitting the work.
It is not uncommon for Vulkan to list two or more physical devices present in the system. The programmer might face the question: Which one should I use? If I take the first one, will I get the high-performance one? Or will I be unlucky and get slow integrated graphics, or a software-based CPU device?
We will try to answer the question from the beginning. First, we get the list of physical devices in the system. Then, we filter out those that do not support all the functionality we require. Those that satisfy our requirements are stored in the compatibleDevices variable:
```cpp
// get compatible devices
vk::vector<vk::PhysicalDevice> deviceList = vk::enumeratePhysicalDevices();
vector<tuple<vk::PhysicalDevice, uint32_t, vk::PhysicalDeviceProperties>> compatibleDevices;
vector<vk::PhysicalDeviceProperties> incompatibleDevices;
for(size_t i=0; i<deviceList.size(); i++) {

	// append compatible queue families
	vk::PhysicalDevice pd = deviceList[i];
	vk::PhysicalDeviceProperties props = vk::getPhysicalDeviceProperties(pd);
	vk::vector<vk::QueueFamilyProperties> queueFamilyPropList =
		vk::getPhysicalDeviceQueueFamilyProperties(pd);
	bool found = false;
	for(uint32_t fi=0, c=uint32_t(queueFamilyPropList.size()); fi<c; fi++) {

		// test for compute operations support
		vk::QueueFamilyProperties& qfp = queueFamilyPropList[fi];
		if(qfp.queueFlags & vk::QueueFlagBits::eCompute) {
			found = true;
			compatibleDevices.emplace_back(pd, fi, props);
		}
	}

	// append incompatible devices
	if(!found)
		incompatibleDevices.emplace_back(props);
}

// print device list
cout << "List of devices:" << endl;
for(size_t i=0, c=compatibleDevices.size(); i<c; i++) {
	auto& t = compatibleDevices[i];
	cout << "   " << i+1 << ": " << get<2>(t).deviceName
	     << " (compute queue: " << get<1>(t)
	     << ", type: " << to_cstr(get<2>(t).deviceType) << ")" << endl;
}
for(size_t i=0, c=incompatibleDevices.size(); i<c; i++) {
	auto& props = incompatibleDevices[i];
	cout << "   incompatible: " << props.deviceName
	     << " (type: " << to_cstr(props.deviceType) << ")" << endl;
}
```
We iterate over all physical devices and store each of them in either the compatibleDevices or the incompatibleDevices list. How do we decide whether a device is compatible? In our case, we require compute capability on any queue. So, for each device, we get the list of its queue families and iterate over them, looking for compute capability. If we find it, we make a record in the compatibleDevices list. Otherwise, we put the device into the incompatibleDevices list. Finally, we print all the compatible devices, followed by the incompatible ones.
The next step is the selection of the most suitable device. Often, we want the most performant one. But running a benchmark might not be the best solution, because we also want our application to start quickly. So, we will use a different approach: we will select the device based on its type. We prefer discrete GPUs over integrated ones, integrated GPUs over virtual ones, and virtual GPUs over CPU-based devices and all the other types. We simply assign a score to each device type and choose the device with the highest score:
```cpp
// choose the best device
if(compatibleDevices.empty())
	throw runtime_error("No compatible device found.");
auto bestDevice = compatibleDevices.begin();
constexpr const array deviceTypeScore = {
	10, // vk::PhysicalDeviceType::eOther         - lowest score
	40, // vk::PhysicalDeviceType::eIntegratedGpu - high score
	50, // vk::PhysicalDeviceType::eDiscreteGpu   - highest score
	30, // vk::PhysicalDeviceType::eVirtualGpu    - normal score
	20, // vk::PhysicalDeviceType::eCpu           - low score
	10, // unknown vk::PhysicalDeviceType
};
int bestScore = deviceTypeScore[
	clamp(int(get<2>(*bestDevice).deviceType), 0, int(deviceTypeScore.size())-1)];
for(auto it=compatibleDevices.begin()+1; it!=compatibleDevices.end(); it++) {
	int score = deviceTypeScore[
		clamp(int(get<2>(*it).deviceType), 0, int(deviceTypeScore.size())-1)];
	if(score > bestScore) {
		bestDevice = it;
		bestScore = score;
	}
}

// device to use
cout << "Using device:\n"
        "   " << get<2>(*bestDevice).deviceName << endl;
```

Note the test at the beginning: dereferencing bestDevice would be undefined behaviour if compatibleDevices were empty, so we bail out with an exception in that case.
Then, we create the device with a queue supporting compute operations:
```cpp
// create device
vk::initDevice(
	pd,  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures = nullptr,  // no enabled features
	}
);
```
We also need a command pool and a command buffer. The command buffer is used for recording the work to be done. After recording, it can be submitted to the device for execution.
However, command buffers are not created directly. Instead, we create a command pool and allocate command buffers from it. Command pools allow the driver to amortize the cost of resource allocation across multiple command buffers. They also avoid the need for locks in multithreaded applications, because each thread can use a different command pool. We create the command pool and the command buffer as follows:
```cpp
// command pool
vk::UniqueCommandPool commandPool =
	vk::createCommandPoolUnique(
		vk::CommandPoolCreateInfo{
			.flags = {},
			.queueFamilyIndex = queueFamily,
		}
	);

// allocate command buffer
vk::CommandBuffer commandBuffer =
	vk::allocateCommandBuffer(
		vk::CommandBufferAllocateInfo{
			.commandPool = commandPool,
			.level = vk::CommandBufferLevel::ePrimary,
			.commandBufferCount = 1,
		}
	);
```
Now, we can record the command buffer. In this article, we just begin and end the recording:
```cpp
// begin command buffer
vk::beginCommandBuffer(
	commandBuffer,
	vk::CommandBufferBeginInfo{
		.flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit,
		.pInheritanceInfo = nullptr,
	}
);

// end command buffer
vk::endCommandBuffer(commandBuffer);
```
Then, we create a fence, and we can submit the command buffer for execution:
```cpp
// fence
vk::UniqueFence computingFinishedFence =
	vk::createFenceUnique(
		vk::FenceCreateInfo{
			.flags = {}
		}
	);

// submit work
cout << "Submitting work..." << endl;
vk::queueSubmit(
	queue,
	vk::SubmitInfo{
		.waitSemaphoreCount = 0,
		.pWaitSemaphores = nullptr,
		.pWaitDstStageMask = nullptr,
		.commandBufferCount = 1,
		.pCommandBuffers = &commandBuffer,
		.signalSemaphoreCount = 0,
		.pSignalSemaphores = nullptr,
	},
	computingFinishedFence
);
```
A fence is a synchronization primitive that can be used to wait for the device to finish submitted work. Fences have two states: signaled and unsignaled. When we specify a fence during the queue submit operation, it must be unsignaled. Once the device finishes all the work of the particular submit operation, it signals the fence. If our application is waiting on the fence, its execution is resumed:
```cpp
// wait for the work
cout << "Waiting for the work..." << endl;
vk::Result r =
	vk::waitForFence_noThrow(
		computingFinishedFence,
		uint64_t(3e9)  // timeout (3s)
	);
if(r == vk::Result::eTimeout)
	throw std::runtime_error("GPU timeout. Task is probably hanging.");
else
	vk::checkForSuccessValue(r, "vkWaitForFences");
cout << "Done." << endl;
```
As we can see, we wait for the fence using a timeout of three seconds. We call the noThrow variant of the function because we do not want it to throw in the case of an error. Instead, we handle vk::Result::eTimeout ourselves. For all other return codes, we let vk::checkForSuccessValue() verify that the result is vk::Result::eSuccess. If it is not, it throws, even on non-error return codes. We treat non-error codes this way for two reasons. First, vkWaitForFences() is not expected to return any success values other than the ones we already handle, namely eSuccess and eTimeout. Second, even if some future extension or Vulkan version adds a new success value, such a value would not mean that our job is done; after all, vk::Result::eTimeout is also a non-error return code, and it does not mean that the work was completed. This is why we are not satisfied with any other non-error return code and throw.