Vulkan Tutorial

2-5 - Float Precision

Standard float type has limited precision of about 7 digits. So, double type is often used as an alternative when higher precision is needed. On the other hand side, we might sometimes look for performance or small storage requirements. Half floating point type might be interesting option in that case.

Float type (float32)

Float type occupies 4 bytes, e.g. 32 bits:

sign	exponent (8 bits)								23 fraction bits representing significand
0	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

It is composed of the sign bit, 8 bits of the exponent and 23 bits of the fraction that represents significand. Representable value ranges from -3.40e38 to 3.40e38 with about 7.2 decimal digits precision.

Double type (float64)

Double type takes 8 bytes, e.g. 64 bits:

sign	exponent (11 bits)											52 fraction bits representing significand
0	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

It is composed of the sign bit, 11 bits of the exponent and 52 bits of the fraction that represents significand. Representable value ranges from -1.80e308 to 1.80e308 with about 16.0 decimal digits precision.

Half type (float16)

Half type is stored on 2 bytes, e.g. 16 bits:

sign	exponent (5 bits)					10 fraction bits representing significand
0	0	1	1	1	1	0	0	0	0	0	0	0	0	0	0

It is composed of the sign bit, 5 bits of the exponent and 10 bits of the fraction that represents significand. Representable value ranges from -65504 to 65504 with about 3.31 decimal digits precision.

Double precision shader

Compute shader for double precision is the same as for float precision except using of double type for the variables:

void main()
{
	// initial values of x, y and z
	double x = gl_GlobalInvocationID.x;
	double y = gl_GlobalInvocationID.y;
	double z = gl_GlobalInvocationID.z;

	FMA10000;

	[...]
}

Half precision shader

Compute shader for half precision is slightly modified version of float precision. Apart from using half type, we need to allow two half operations to be executed in parallel in single shader invocation:

#define FMA10 \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z; \
	x1 = x1 * y1 + z; \
	x2 = x2 * y2 + z

#define FMA100 \
	[...]

#define FMA1000 \
	[...]

#define FMA10000 \
	[...]


void main()
{
	// initial values of x, y and z
	float16_t x1 = float16_t(gl_GlobalInvocationID.x);
	float16_t y1 = float16_t(gl_GlobalInvocationID.y);
	float16_t z  = float16_t(gl_GlobalInvocationID.z);
	float16_t x2 = x1 + float16_t(0.5);
	float16_t y2 = y1 + float16_t(0.5);

	FMA10000;

	[...]
}

The code uses x1, y1, x2 and y2 variables to allow two independent half precision computations to happen in parallel. It depends on Vulkan device if it can execute half precision operations faster. Often, we get two times higher performance compared to standard float precision computations.

Double and half type support

We need to know whether double precision and half precision operations are supported. Floats are always supported, but halfs and doubles are optional. We find the information in Vulkan physical device features:

// get supported features
vk::PhysicalDeviceVulkan12Features features12;
vk::PhysicalDeviceFeatures2 features10 = { .pNext = &features12 };
vk::getPhysicalDeviceFeatures2(pd, features10);
bool float64Supported = features10.features.shaderFloat64;
bool float16Supported = features12.shaderFloat16;

Then, the features need to be enabled during device creation:

// create device
vk::initDevice(
	pd,  // physicalDevice
	vk::DeviceCreateInfo{  // pCreateInfo
		.flags = {},
		.queueCreateInfoCount = 1,
		.pQueueCreateInfos =
			array{
				vk::DeviceQueueCreateInfo{
					.flags = {},
					.queueFamilyIndex = queueFamily,
					.queueCount = 1,
					.pQueuePriorities = &(const float&)1.f,
				}
			}.data(),
		.enabledLayerCount = 0,  // no enabled layers
		.ppEnabledLayerNames = nullptr,
		.enabledExtensionCount = 0,  // no enabled extensions
		.ppEnabledExtensionNames = nullptr,
		.pEnabledFeatures =
			&(const vk::PhysicalDeviceFeatures&)vk::PhysicalDeviceFeatures{
				.shaderFloat64 = float64Supported,
				.shaderInt64 = true,
			},
	}.setPNext(
		&(const vk::PhysicalDeviceVulkan12Features&)vk::PhysicalDeviceVulkan12Features{
			.shaderFloat16 = float16Supported,
			.bufferDeviceAddress = true,
		}
	)
);

Measurement

The heart of measurement logic is the following loop:

size_t halfNumWorkgroups = 1;
size_t floatNumWorkgroups = 1;
size_t doubleNumWorkgroups = 1;
float halfTime;
float floatTime;
float doubleTime;
vector halfPerformanceList;
vector floatPerformanceList;
vector doublePerformanceList;
chrono::time_point startTime = chrono::high_resolution_clock::now();
do {

	// perform tests
	if(float16Supported) {
		halfTime = performTest(pipelineList[0], halfNumWorkgroups);
		processResult(halfTime, halfNumWorkgroups, halfPerformanceList);
	}
	floatTime = performTest(pipelineList[1], floatNumWorkgroups);
	processResult(floatTime, floatNumWorkgroups, floatPerformanceList);
	if(float64Supported) {
		doubleTime = performTest(pipelineList[2], doubleNumWorkgroups);
		processResult(doubleTime, doubleNumWorkgroups, doublePerformanceList);
	}

	// stop measurements after three seconds
	double totalTime = chrono::duration(chrono::high_resolution_clock::now() - startTime).count();
	if(totalTime >= 3.)
		break;

	// compute new numWorkgroups
	if(float16Supported)
		halfNumWorkgroups = computeNumWorkgroups(halfNumWorkgroups, halfTime);
	floatNumWorkgroups = computeNumWorkgroups(floatNumWorkgroups, floatTime);
	if(float64Supported)
		doubleNumWorkgroups = computeNumWorkgroups(doubleNumWorkgroups, doubleTime);

} while(true);

It keeps going for three seconds. First, it performs float16, float32 and float64 test. Initially, the tests are performed with single workgroup. Then, the test for timeout of three seconds is made. Finally, new number of workgroups is computed in the way to target the test time of about 20ms.

The results are recorded in halfPerformanceList, floatPerformanceList and doublePerformanceList as performance in FLOPS. When we are done with measurements, we sort the lists to be able to get median:

>
// sort the results
sort(halfPerformanceList.begin(), halfPerformanceList.end());
sort(floatPerformanceList.begin(), floatPerformanceList.end());
sort(doublePerformanceList.begin(), doublePerformanceList.end());

Finally, we print median values for all three precisions. We do not use average value, because it is affected by all extreme values collected during the measurements. We do not use std::nth_element() that is more effective for median than std::sort, because we will print IQR dispersion as well. The code is the same for all three precisions:

// print median
cout << formatFloatSI(performanceList[performanceList.size()/2]) << "FLOPS";

// print dispersion using IQR (Interquartile Range);
// Q1 is the value in 25% and Q3 in 75%
cout << "  (Q1: " << formatFloatSI(performanceList[performanceList.size()/4]) << "FLOPS";
cout << " Q3: " << formatFloatSI(performanceList[performanceList.size()/4*3]) << "FLOPS)";

First, we print median as the value in the middle of the sorted list. Then, we print IQR (Interquartile Range). First quartile Q1 is taken from 25% of sorted data and third quartile Q3 is taken from 75% of sorted data. IQR gives the idea of statistical dispersion of measured values. Q1 value might be considered median of values in lower half of measured data and Q3 might be considered median of upper half. One important property is that 50% of measured values are between Q1 and Q3 values.