Carry-less product: std::clmul

Document number:
P3642R1
Date:
2025-05-26
Audience:
LEWGI, SG6
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Reply-to:
Jan Schultke <janschultke@gmail.com>
GitHub Issue:
wg21.link/P3642/github
Source:
github.com/Eisenwave/cpp-proposals/blob/master/src/clmul.cow

Add widening and non-widening carry-less multiplication functions.

Revision history

Changes since R0

Contents

1

Introduction

2

Motivation

2.1

Parity computation and JSON parsing

2.2

Fast space-filling curves

3

Possible implementation

3.1

Hardware support

4

Design considerations

4.1

Naming

4.2

SIMD support

5

Proposed wording

6

References

1. Introduction

Carry-less multiplication is a simple numerical operation on unsigned integers. It can be a seen as a regular multiplication where xor is being used as a reduction instead of +.

It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.

I propose a std::clmul function to perform this operation:

template<class T> constexpr T clmul(T x, T y) noexcept;

I also propose a widening operation in the style of [P3161R4], as follows:

template<class T> struct mul_wide_result { T low_bits; T high_bits; }; template<class T> constexpr mul_wide_result<T> clmul_wide(T x, T y) noexcept;

2. Motivation

Carry-less multiplication is an important operation in a number of use cases:

Carry-less multiplication is of such great utility that there is widespread hardware support, some dating back more than a decade. See below for motivating examples.

2.1. Parity computation and JSON parsing

The parity of an integer x is 0 if the number of one-bits is even, and 1 if it is odd. The parity can also be computed with popcount(x) & 1.

The special form clmul(x, -1) computes the parity of each bit in x and the bits to its right. The most significant bit holds the parity of x as a whole.

bool parity(std::uint32_t x) { return std::clmul(x, -1u) >> 31; }

While the parity of all bits can be obtained with clmul, it computes the inclusive cumulative parity, which can be used to accelerate parsing JSON and other file formats ([SimdJsonClmul]). This can be done by mapping each " character onto a 1-bit, and any other character onto 0. clmul(x, -1) would then produce masks where string characters corresponds to a 1-bit.

abc xxx "foobar" zzz "a" // input string
000000001000000100000101 // quote_mask
00000000.111111.00000.1. // clmul(quote_mask, -1), ignoring 1-bits of quote_mask

2.2. Fast space-filling curves

The special form clmul(x, -1) can be used to accelerate the computation of Hilbert curves. To properly understand this example, I will explain the basic notion of space-filling curves.

We can fill space using a 2D curve by mapping the index i on the curve onto Cartesian coordinates x and y. A naive curve that fills a 4x4 square can be computed as follows:

struct pos { uint32_t x, y; }; pos naive_curve(uint32_t i) { return { i % 4, i / 4 }; }

When mapping the index i = 0, 1, ..., 0xf onto the returned 2D coordinates, we obtain the following pattern:

0 1 2 3
4 5 6 7
8 9 a b
c d e f

The problem with such a naive curve is that adjacent indices can be positioned very far apart (the distance increases with row length). For image processing, if we store pixels in this pattern, cache locality is bad; two adjacent pixels can be very far apart in memory.

A Hilbert curve is a family of space-filling curves where the distance between two adjacent elements is 1:

0 1 e f
3 2 d c
4 7 8 b
5 6 9 a

De-interleaving bits of i into x and y yields a Z-order curve, and performing further transformations yields a Hilbert curve.

clmul can be used to compute the bitwise parity for each bit and the bits to its right, which is helpful for computing Hilbert curves. Note that the following example uses the std::bit_compress function from [P3104R3], which may also be accelerated using clmul.

pos hilbert_to_xy(uint32_t i) { // De-interleave the bits of i. uint32_t i0 = std::bit_compress(i, 0x55555555u); // abcdefgh bdfh uint32_t i1 = std::bit_compress(i, 0xaaaaaaaau); // abcdefgh aceg // Undo the permutation that Hilbert curves apply on top of Z-order curves. uint32_t A = i0 & i1; uint32_t B = i0 ^ i1 ^ 0xffffu; uint32_t C = std::clmul(A, -1u) >> 16; uint32_t D = std::clmul(B, -1u) >> 16; uint32_t a = C ^ (i0 & D); return { .x = a ^ i1, .y = a ^ i0 ^ i1 }; }

This specific example is taken from [FastHilbertCurves]. [HackersDelight] explains the basis behind this computation of Hilbert curves using bitwise operations.

When working with space-filling curves, the inverse operation is also common: mapping the Cartesian coordinates onto an index on the curve. In the case of Z-order curves aka. Morton curves, this can be done by simply interleaving the bits of x and y. A Z-order curve is laid out as follows:

0 1 4 5
2 3 6 7
8 9 c d
a b e f

clmul can be used to implement bit-interleaving in order to generate a Z-order curves.

uint32_t xy_to_morton(uint32_t x, uint32_t y) { uint32_t lo = std::clmul(x, x) << 0; // abcd -> 0a0b0c0d uint32_t hi = std::clmul(y, y) << 1; // abcd -> a0b0c0d0 return hi | lo; }

In the example above, std::clmul(x, x) is equivalent to [P3104R3]'s std::bit_expand(x, 0x55555555u).

3. Possible implementation

A naive and unconstrained implementation looks as follows:

template<class T> constexpr T clmul(const T x, const T y) noexcept { T result = 0; for (int i = 0; i < numeric_limits<T>::digits; ++i) { result ^= (x << i) * ((y >> i) & 1); } return result; }

3.1. Hardware support

The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.

In the following table, let uN denote N-bit unsigned integer operands, and ×N denote the amount of operands that are processed in parallel.

Operationx86_64ARMRV64
clmul u64×4 → u128×4 vpclmulqdqVPCLMULQDQ
clmul u64×2 → u128×2 vpclmulqdqVPCLMULQDQ
clmul u64 → u128 pclmulqdqPCLMULQDQ pmull+pmull2Neon clmul+clmulhZbc, Zbkc
clmul u64 → u128 pclmulqdqPCLMULQDQ pmull+pmull2Neon clmul+clmulhZbc, Zbkc
clmul u64 → u64 pmullNeon clmulZbc, Zbkc
clmul u8×8 → u16×8 pmullNeon
clmul u8×8 → u8×8 pmulNeon

A limited x86_64 implementation of clmul_wide may look as follows:

#include <immintrin.h> #include <cstdint> mul_wide_result<uint64_t> clmul_wide(uint64_t x, uint64_t y) noexcept { __m128i x_128 = _mm_set_epi64x(0, x); __m128i y_128 = _mm_set_epi64x(0, y); __m128i result_128 = _mm_clmulepi64_si128(x_128, y_128, 0); return { .low_bits = uint64_t(_mm_extract_epi64(result_128, 0)), .high_bits = uint64_t(_mm_extract_epi64(result_128, 1)) }; }

There also exists an LLVM pull request ([LLVMClmul]) which would add an @llvm.clmul intrinsic function.

4. Design considerations

Multiple design choices lean on [P0543R3] and [P3161R4]. Specifically,

4.1. Naming

Carry-less multiplication is also commonly called "Galois Field Multiplication" or "Polynomial Multiplication".

The name clmul was chosen because it carries no domain-specific connotation, and because it is widespread:

4.2. SIMD support

The proposal is currently limited to the scalar version of clmul. However, <simd> support may be useful to add, and would be backed by hardware support to an extent.

If LEWG approves, should decide whether <simd> support should be added in this paper, or in a separate paper, if at all.

5. Proposed wording

The proposed changes are relative to the working draft of the standard as of [N5008], with the changes in [P3161R4] applied.

Update subclause [version.syn] paragraph 2 as follows:

#define __cpp_lib_clamp 201603L // also in <algorithm> #define __cpp_lib_clmul 20????L // also in <numeric> [...] #define __cpp_lib_overflow_arithmetic 20????L 20????L // also in <numeric>

Update the synopsis in [numeric.ops.overview] as follows:

template<class T, class U> constexpr T saturate_cast(U x) noexcept; // freestanding template<class T> struct add_carry_result { // freestanding T low_bits; bool overflow; }; template<class T> using sub_borrow_result = add_carry_result; // freestanding template<class T> struct mul_wide_result { // freestanding T low_bits; T high_bits; }; template<class T> struct div_result { // freestanding T quotient; T remainder; }; template<class T> constexpr add_carry_result<T> add_carry(T x, T y, bool carry) noexcept; // freestanding template<class T> constexpr sub_borrow_result<T> sub_borrow(T left, T right, bool borrow) noexcept; // freestanding template<class T> constexpr mul_wide_result<T> mul_wide(T x, T y) noexcept; // freestanding template<class T> constexpr mul_wide_result<T> clmul_wide(T x, T y) noexcept; // freestanding template<class T> constexpr div_result<T> div_wide(T dividend_high, T dividend_low, // freestanding T divisor ) noexcept; template<class T> constexpr bool is_div_defined(T dividend, T divisor) noexcept // freestanding template<class T> constexpr bool is_div_wide_defined(T dividend_high, // freestanding T dividend_low, T divisor) noexcept; // [numeric.clmul], carry-less product template<class T> constexpr T clmul(T x, U y) noexcept; // freestanding

In subclause [numeric.overflow] (known as [numeric.sat] at the time of writing), insert the following item, immediately following the description of mul_wide:

template<class T> constexpr mul_wide_result<T> clmul_wide(T x, T y) noexcept;

Let ⊕ denote the exclusive OR operation ([expr.xor]). Given an integer α, let αi denote the ith least significant bit in the base-2 representation of α.

Constraints: T is an unsigned integer type ([basic.fundamental]).

Returns: An object storing the bits of an integer c, where the value of ci is given by Formula ?.?, x is x, y is y, and N is the width of T. The result object is initialized so that

  • low_bits stores the N least significant bits of c, and
  • high_bits stores the subsequent N bits of c.
[FORMULA ?.?] c i = j=0 i xj y ij

If the mathematical notation in the block above does not render for you, you are using an old browser with no MathML support. Please open the document in a recent version of Firefox or Chrome.

The formula is taken from [IntelClmul], with different variable names, and with no special case for the upper N bits; we can simply treat the integers as mathematical integers with 2N width.

See [iterator.concept.wine] for precedent on using N to denote the width of a type.

See [sf.cmath.riemann.zeta] for precedent on wording which includes formulae.

The formula above in TeX notation is:

c_i = \bigoplus_{j = 0}^i x_i y_{i - j}

In subclause [numeric.ops], append a subclause immediately following [numeric.overflow] (known as [numeric.sat] at the time of writing):

Carry-less product [numeric.clmul]

template<class T> constexpr T clmul(T x, T y) noexcept;

Constraints: T is an unsigned integer type ([basic.fundamental]).

Returns: clmul_wide(x, y).low_bits ([numeric.overflow]).

6. References

[N5008] Thomas Köppe. Working Draft, Programming Languages — C++ 2025-03-15 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5008.pdf
[P3161R4] Tiago Freire. Unified integer overflow arithmetic 2025-03-26 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3161R4.pdf
[BitPermutations] Jan Schultke. C++26 Bit permutations reference implementation https://github.com/Eisenwave/cxx26-bit-permutations
[SimdJsonClmul] Geoff Langdale. Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQD) 2019-03-06 https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[IntelClmul] Shay Gueron, Michael E. Kounavis. Intel® Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode https://www.intel.com/content/dam/develop/external/us/en/documents/clmul-wp-rev-2-02-2014-04-20.pdf
[HackersDelight] Henry S. Warren. Hacker's Delight, Second Edition https://doc.lagout.org/security/Hackers'Delight.pdf
[FastHilbertCurves] rawrunprotected. 2D Hilbert curves in O(1) "http://threadlocalmutex.com/?p=188
[LLVMClmul] Oscar Smith. [IR] Add llvm clmul intrinsic https://github.com/llvm/llvm-project/pull/140301