Vulkan Tutorial

2-5 - Float Precision

Standard float type has limited precision. So, double type is often used as an alternative when higher precision is needed. On the other hand side, half type might be considered for its higher performance potential when half type limited precision is not an issue.

Float type (float32)

Float type occupies 4 bytes, e.g. 32 bits:

 sign   exponent (8 bits)   23 fraction bits representing significand 
0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

It is composed of the sign bit, 8 bits of the exponent and 23 bits of the fraction that represents significand. Representable value ranges from -3.40e38 to 3.40e38 with about 7.2 decimal digits precision.

Double type (float64)

Double type takes 8 bytes, e.g. 64 bits:

 sign   exponent (11 bits)   52 fraction bits representing significand 
0 0 1 1 1 1 1 0 0 0 0 0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

It is composed of the sign bit, 11 bits of the exponent and 52 bits of the fraction that represents significand. Representable value ranges from -1.80e308 to 1.80e308 with about 16.0 decimal digits precision.

Half type (float16)

Half type is stored on 2 bytes, e.g. 16 bits:

 sign   exponent (5 bits)   10 fraction bits representing significand 
0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0

It is composed of the sign bit, 5 bits of the exponent and 10 bits of the fraction that represents significand. Representable value ranges from -65504 to 65504 with about 3.31 decimal digits precision.

Double precision shader

Compute shader for double precision is the same as for float precision except using of double variables:

void main()
{
	// initial values of x, y and z
	double x = gl_GlobalInvocationID.x;
	double y = gl_GlobalInvocationID.y;
	double z = gl_GlobalInvocationID.z;

	FMA10000;

	[...]
}

Half precision shader

Compute shader for half precision is slightly modified version of float precision. Apart from using half type, we need to allow two half operations to be executed in parallel:

#define FMA10 \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z

#define FMA100 \
	[...]

#define FMA1000 \
	[...]

#define FMA10000 \
	[...]


void main()
{
	// initial values of x, y and z
	float16_t x1 = float16_t(gl_GlobalInvocationID.x);
	float16_t y1 = float16_t(gl_GlobalInvocationID.y);
	float16_t z  = float16_t(gl_GlobalInvocationID.z);
	float16_t x2 = x1 + float16_t(0.5);
	float16_t y2 = y1 + float16_t(0.5);

	FMA10000;

	[...]
}

The code uses x1, y1, x2 and y2 variables to allow two independent half precision computations to happen in parallel. It depends on Vulkan device if it can execute half precision operations faster. Often, we get two times higher performance compared standard float precision computations.

Double and half type support

We need to know whether double precision and half precision operations are supported. Floats are always supported, but halfs and doubles are optional. We find the information in Vulkan Physical Device Features:

// get supported features
vk::PhysicalDeviceVulkan12Features features12;
vk::PhysicalDeviceFeatures2 features10 = { .pNext = &features12 };
vk::getPhysicalDeviceFeatures2(get<0>(*selectedDevice), features10);
bool float64Supported = features10.features.shaderFloat64;
bool float16Supported = features12.shaderFloat16;
if(testIndex == 1 && !float64Supported)
	throw runtime_error("Vulkan device does not support double precision computations.");
if(testIndex == 2 && !float16Supported)
	throw runtime_error("Vulkan device does not support half precision computations.");

Then, the features need to be enabled during device creation:

// create device
vk::initDevice(
	get<0>(*selectedDevice),  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,  // at least one queue is mandatory
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures =
			&(const vk::PhysicalDeviceFeatures&)vk::PhysicalDeviceFeatures{
				.shaderFloat64 = float64Supported,
				.shaderInt64 = true,
			},
	}.setPNext(
		&(const vk::PhysicalDeviceVulkan12Features&)vk::PhysicalDeviceVulkan12Features{
			.shaderFloat16 = float16Supported,
			.bufferDeviceAddress = true,
		}
	)
);

Measurement

The heart of measurement logic is the following loop:

size_t halfNumWorkgroups = 1;
size_t floatNumWorkgroups = 1;
size_t doubleNumWorkgroups = 1;
float halfTime;
float floatTime;
float doubleTime;
vector halfPerformanceList;
vector floatPerformanceList;
vector doublePerformanceList;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

	// perform tests
	if(float16Supported) {
		halfTime = performTest(pipelineList[0], halfNumWorkgroups);
		processResult(halfTime, halfNumWorkgroups, halfPerformanceList);
	}
	floatTime = performTest(pipelineList[1], floatNumWorkgroups);
	processResult(floatTime, floatNumWorkgroups, floatPerformanceList);
	if(float64Supported) {
		doubleTime = performTest(pipelineList[2], doubleNumWorkgroups);
		processResult(doubleTime, doubleNumWorkgroups, doublePerformanceList);
	}

	// stop measurements after three seconds
	double totalTime = chrono::duration(chrono::high_resolution_clock::now() - startTime).count();
	if(totalTime >= 3.)
		break;

	// compute new numWorkgroups
	if(float16Supported)
		halfNumWorkgroups = computeNumWorkgroups(halfNumWorkgroups, halfTime);
	floatNumWorkgroups = computeNumWorkgroups(floatNumWorkgroups, floatTime);
	if(float64Supported)
		doubleNumWorkgroups = computeNumWorkgroups(doubleNumWorkgroups, doubleTime);

} while(true);

It keeps going for three seconds. First, it performs float16, float32 and float64 test. Initially, the tests are performed with a single workgroup. Then, test for timeout of three seconds is made. Finally, new number of workgroups is computed. The number of workgroups is updated to make the test lasting about 20ms.

The results are recorded in halfPerformanceList, floatPerformanceList and doublePerformanceList as performance in FLOPS. When we are done with measurements, we sort the lists to be able to get median:

>
// sort the results
sort(halfPerformanceList.begin(), halfPerformanceList.end());
sort(floatPerformanceList.begin(), floatPerformanceList.end());
sort(doublePerformanceList.begin(), doublePerformanceList.end());

Finally, we print median value as the final performance. We do not use average value, because it is affected by all extreme values collected during the measurements. We do not use std::nth_element() that is more effective for median than std::sort, because we will print IQR dispersion as well. The code is the same for all three precisions:

// print median
cout << formatFloatSI(performanceList[performanceList.size()/2]) << "FLOPS";

// print dispersion using IQR (Interquartile Range);
// Q1 is the value in 25% and Q3 in 75%
cout << "  (Q1: " << formatFloatSI(performanceList[performanceList.size()/4]) << "FLOPS";
cout << " Q3: " << formatFloatSI(performanceList[performanceList.size()/4*3]) << "FLOPS)";

First, we print median as the value in the middle of the sorted list. Then, we print IQR (Interquartile Range). First quartile Q1 is taken from 25% of sorted data and third quartile Q3 is taken from 75% of sorted data. IQR gives the idea of statistical dispersion of measured values. Q1 value might be considered median of values in lower half of measured data and Q3 might be considered median of upper half. One important property is that half of measured values are between Q1 and Q3 values.