Vulkan Tutorial

2-6 - Float Precision

The standard float type has a limited precision of about 7 decimal digits, so the double type can be used when higher precision is needed. On the other hand, the half type might be an interesting option for performance reasons or for lower storage requirements when a low precision of about 3 digits is acceptable.

Float type (float32)

The float type occupies 4 bytes, i.e. 32 bits:

 sign   exponent (8 bits)   23 fraction bits representing significand 
0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

It is composed of the sign bit, 8 exponent bits and 23 fraction bits that represent the significand. Representable values range from about -3.40e38 to 3.40e38 with roughly 7.2 decimal digits of precision.

Double type (float64)

The double type takes 8 bytes, i.e. 64 bits:

 sign   exponent (11 bits)   52 fraction bits representing significand 
0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

It is composed of the sign bit, 11 exponent bits and 52 fraction bits that represent the significand. Representable values range from about -1.80e308 to 1.80e308 with roughly 16.0 decimal digits of precision.

Half type (float16)

The half type is stored in 2 bytes, i.e. 16 bits:

 sign   exponent (5 bits)   10 fraction bits representing significand 
0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0

It is composed of the sign bit, 5 exponent bits and 10 fraction bits that represent the significand. Representable values range from -65504 to 65504 with about 3.31 decimal digits of precision.

Double precision shader

The compute shader for double precision is the same as the one for float precision except that it uses the double type for the variables:

void main()
{
	// initial values of x, y and z
	double x = gl_GlobalInvocationID.x;
	double y = gl_GlobalInvocationID.y;
	double z = gl_GlobalInvocationID.z;

	FMA10000;

	[...]
}

Half precision shader

The compute shader for half precision is a slightly modified version of the float precision one. Apart from using the half type, we allow two half operations to be executed in parallel in a single shader invocation:

#define FMA10 \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z

#define FMA100 \
	[...]

#define FMA1000 \
	[...]

#define FMA10000 \
	[...]


void main()
{
	// initial values for the computation
	float16_t x1 = float16_t(gl_GlobalInvocationID.x);
	float16_t y1 = float16_t(gl_GlobalInvocationID.y);
	float16_t z  = float16_t(gl_GlobalInvocationID.z);
	float16_t x2 = x1 + float16_t(0.5);
	float16_t y2 = y1 + float16_t(0.5);

	FMA10000;

	[...]
}

The code uses the x1, y1, x2 and y2 variables so that two half-float multiply-adds can be executed in the shader in parallel. Whether half precision operations execute faster depends on the Vulkan device. Often, we get twice the performance of standard float precision.

Double and half type support

We need to know whether double precision and half precision operations are supported. Floats are always supported, but halves and doubles are optional. We find this information in the Vulkan physical device features:

// get supported features
auto [ pipelineCacheControlSupport, float64Support, float16Support ] =
	[&]() -> tuple<bool, bool, bool>
	{
		vk::PhysicalDeviceVulkan13Features features13;
		vk::PhysicalDeviceVulkan12Features features12 =
			{ .pNext = vulkan13Support ? &features13 : nullptr };
		vk::PhysicalDeviceFeatures2 features10 = { .pNext = &features12 };
		vk::getPhysicalDeviceFeatures2(pd, features10);
		return
			tuple{
				vulkan13Support && features13.pipelineCreationCacheControl,
				features10.features.shaderFloat64,
				features12.shaderFloat16,
			};
	}();

A lambda function is used so that the large structures are not kept on the stack once they are no longer needed.

Then, the features have to be enabled during device creation:

// create device
vk::initDevice(
	pd,  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures =
			&(const vk::PhysicalDeviceFeatures&)vk::PhysicalDeviceFeatures{
				.shaderFloat64 = float64Support,
				.shaderInt64 = true,
			},
	}.setPNext(
		&(const vk::PhysicalDeviceVulkan12Features&)vk::PhysicalDeviceVulkan12Features{
			.shaderFloat16 = float16Support,
			.bufferDeviceAddress = true,
		}
		.setPNext(
			vulkan13Support
				? &(const vk::PhysicalDeviceVulkan13Features&)vk::PhysicalDeviceVulkan13Features{
						.pipelineCreationCacheControl = pipelineCacheControlSupport,
					}
				: nullptr
		)
	)
);

Measurement

The heart of the measurement logic is the following loop:

size_t halfNumWorkgroups = 1;
size_t floatNumWorkgroups = 1;
size_t doubleNumWorkgroups = 1;
float halfTime;
float floatTime;
float doubleTime;
vector<float> halfPerformanceList;
vector<float> floatPerformanceList;
vector<float> doublePerformanceList;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

	// perform tests
	if(float16Support) {
		halfTime = performTest(pipelineList[0], halfNumWorkgroups);
		processResult(halfTime, halfNumWorkgroups, halfPerformanceList);
	}
	floatTime = performTest(pipelineList[1], floatNumWorkgroups);
	processResult(floatTime, floatNumWorkgroups, floatPerformanceList);
	if(float64Support) {
		doubleTime = performTest(pipelineList[2], doubleNumWorkgroups);
		processResult(doubleTime, doubleNumWorkgroups, doublePerformanceList);
	}

	// stop measurements after totalMeasuringTime passed
	float totalTime = chrono::duration<float>(chrono::high_resolution_clock::now() - startTime).count();
	if(totalTime >= totalMeasuringTime)
		break;

	// compute new numWorkgroups
	if(float16Support)
		halfNumWorkgroups = computeNumWorkgroups(halfNumWorkgroups, halfTime);
	floatNumWorkgroups = computeNumWorkgroups(floatNumWorkgroups, floatTime);
	if(float64Support)
		doubleNumWorkgroups = computeNumWorkgroups(doubleNumWorkgroups, doubleTime);

} while(true);

The loop keeps going until totalMeasuringTime is reached. It starts by performing the float16, float32 and float64 tests. Then, it checks whether totalMeasuringTime has passed. Finally, a new number of workgroups is computed so that each test eventually runs for the time specified by the singleMeasurementTargetTime variable.

The results are recorded in halfPerformanceList, floatPerformanceList and doublePerformanceList in FLOPS. When the measurements are done, we sort the lists so that we can extract the median:

// sort the results
sort(halfPerformanceList.begin(), halfPerformanceList.end());
sort(floatPerformanceList.begin(), floatPerformanceList.end());
sort(doublePerformanceList.begin(), doublePerformanceList.end());

Finally, we print the median values for all three precisions. We do not use the average, because it is affected by all the extreme values collected during the measurements. We also do not use std::nth_element(), which is more efficient for finding the median than std::sort(), because we will need the other values for computing the dispersion. The code is the same for all three precisions:

// print median
cout << formatFloatSI(performanceList[performanceList.size()/2]) << "FLOPS";

// print dispersion using IQR (Interquartile Range);
// Q1 is the value in 25% and Q3 in 75%
cout << "  (Q1: " << formatFloatSI(performanceList[performanceList.size()/4]) << "FLOPS,"
        " Q3: " << formatFloatSI(performanceList[(performanceList.size()*3)/4]) << "FLOPS,"
        " num measurements: " << performanceList.size() << ")";
cout << endl;

First, we print the median as the value in the middle of the sorted list. Then, we print the IQR (interquartile range). The first quartile Q1 is taken at 25% of the sorted data and the third quartile Q3 at 75%. The IQR gives an idea of the statistical dispersion of the measured values: Q1 can be considered the median of the lower half of the data and Q3 the median of the upper half. One important property is that 50% of the measured values lie between Q1 and Q3.

Running the application, we get output similar to the following:

List of devices:
   1: Quadro RTX 3000 (compute queue: 0, type: DiscreteGpu)
   2: Quadro RTX 3000 (compute queue: 2, type: DiscreteGpu)
   3: Intel(R) UHD Graphics (compute queue: 0, type: IntegratedGpu)
   4: llvmpipe (LLVM 21.1.8, 256 bits) (compute queue: 0, type: Cpu)
Using device:
   Quadro RTX 3000
Creating pipelines... done.
   All pipelines were compiled in 20879ms.
Half (float16) performance:    13.6 TFLOPS  (Q1: 13.3 TFLOPS, Q3: 13.7 TFLOPS, num measurements: 48)
Float (float32) performance:   6.67 TFLOPS  (Q1: 6.50 TFLOPS, Q3: 6.77 TFLOPS, num measurements: 49)
Double (float64) performance:   203 GFLOPS  (Q1:  199 GFLOPS, Q3:  205 GFLOPS, num measurements: 51)

Halves are two times faster than floats on this Nvidia card. And doubles are about 32 times slower than floats.