PxxxxR0
Carry-less product: std::clmul

New Proposal,

This version:
https://eisenwave.github.io/cpp-proposals/clmul.html
Issue Tracking:
Inline In Spec
Author:
Audience:
SG6, SG18, LEWG
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Source:
eisenwave/cpp-proposals

Abstract

This proposal adds carry-less multiplication functions to the standard library. This operation is also known as GF(2) polynomial multiplication.

1. Introduction

Carry-less multiplication is a simple numerical operation on unsigned integers. It can be a seen as a regular multiplication, but with all carry-bits being discarded. In other words, xor is being used as a reduction instead of +.

It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.

This proposal adds a std::clmul function to the bit manipulation library in <bit>:

template<unsigned_integral T, unsigned_integral U> // constraint is exposition-only
constexpr common_type_t<T, U> clmul(T x, U y) noexcept;

2. Motivation

Carry-less multiplication is an important operation in a number of use cases:

Carry-less multiplication is of such great utility that there is widespread hardware support, some of it dating back more than a decade.

2.1. Motivating examples

2.2. Parity computation

The parity of an integer x is 0 if the number of one-bits is even, and 1 if it is odd. It is also equivalent to popcount(x) & 1.

The special form clmul(x, -1) computes the parity of each bit in x and the bits to its right. The most significant bit holds the parity of x as a whole.
bool parity(uint32_t x) {    return std::clmul(x, -1u) >> 31;}

2.3. Space filling curves in O(1)

The special form clmul(x, -1) can be used to accelerate the computation of Hilbert curves. To properly understand this example, I will explain the basic notion of space-filling curves.

We can fill space using a 2D curve by mapping the index i on the curve onto Cartesian coordinates x and y. A naive curve that fills a 4x4 square can be computed as follows:

struct pos { uint32_t x, y; };

pos naive_curve(uint32_t i) { return { i % 4, i / 4 }; }

When mapping the index i = 0, 1, ..., 0xf onto the returned 2D coordinates, we obtain the following pattern:

0 1 2 3
4 5 6 7
8 9 a b
c d e f

This is a suboptimal curve because especially for large squares, the gaps between two points on the curve can be very large. If we use this curve to store images, this is problematic for processing because two adjacent pixels can be very far apart in memory, which is bad for cache locality.

A Hilbert curve is a family of space-filling curves where the distance between two adjacent elements is 1:

0 1 e f
3 2 d c
4 7 8 b
5 6 9 a

De-interleaving bits of i into x and y yields a Z-order curve, and performing further transformations yields a Hilbert curve.

clmul can be used to compute the bitwise parity for each bit and the bits to its right. In Hilbert curve computation, this turns O(log N) bitwise operations into O(1).
pos hilbert_to_xy(uint32_t i){    // Using functions from P3104: Bit permutations, we de-interleave the bits of i.    uint32_t i0 = std::bit_compressr(i, 0x55555555u); // abcdefgh -> bdfh    uint32_t i1 = std::bit_compressr(i, 0xaaaaaaaau); // abcdefgh -> aceg        // Undo the permutation that Hilbert curves apply on top of Z-order curves.    uint32_t A = i0 & i1;    uint32_t B = i0 ^ i1 ^ 0xffffu;    uint32_t C = std::clmul(A, -1u) >> 16;    uint32_t D = std::clmul(B, -1u) >> 16;        uint32_t a = C ^ (i0 & D);    return { .x = a ^ i1, .y = a ^ i0 ^ i1 };}

This specific example is taken from [rawrunprotected1]. [Warren1] explains the basis behind this computation of Hilbert curves using bitwise operations.

When working with space filling curves, the inverse operation is also common: mapping the Cartesian coordinates onto an index on the curve. In the case of Z-order curve, aka. Morton curves, this can be done by simply interleaving the bits of x and y.

clmul can be used to implement bit-interleaving in order to generate a Z-order curves.
uint32_t xy_to_morton(uint32_t x, uint32_t y){    // With P3014: Bit permutations, interleaving can also be implemented using    // bit_expandr(x, 0x55555555)    uint32_t lo = std::clmul(x, x) << 0; // abcd -> 0a0b0c0d    uint32_t hi = std::clmul(y, y) << 1; // abcd -> a0b0c0d0    return hi | lo;}

3. Possible implementation

A branchless naive implementation looks as follows:

template<unsigned_integral T, unsigned_integral U>
constexpr common_type_t<T, U> clmul(T x, U y) noexcept
{
    using Common = common_type_t<T, U>;
    Common result = 0;
    for (int i = 0; i < numeric_limits<Common>::digits; ++i) {
        result ^= (Common{x} << i) * ((Common{y} >> i) & 1);
    }
    return result;
}

3.1. Hardware support

The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.

In the following table, let uN denote N-bit unsigned integer operands.

Operation x86_64 ARM RISC-V
clmul u64 -> u128 pclmulqdqPCLMULQDQ pmull+pmull2Neon
clmul u64 -> u64 pmullNeon clmul+clmulhZbc, Zbkc
clmul u32 -> u64 clmulZbc, Zbkc
clmul u8x8 -> u16x8 pmullNeon
clmul u8x8 -> u8x8 pmulNeon
A limited x86_64 implementation of clmul may look as follows:
uint64_t clmul(uint64_t x, uint64_t y)  {
    __m128i x_128 = _mm_set_epi64x(0, x);
    __m128i neg1_128 = _mm_set_epi64x(0, y);
    __m128i result_128 = _mm_clmulepi64_si128(x_128, neg1_128, 0);
    return static_cast<uint64_t>(_mm_extract_epi64(result_128, 0));
}

4. Design Considerations

4.1. Naming

The name clmul was chosen because it does not carry a domain-specific connotation. clmul is often, but not always used for CRC computation or cryptography in Galois counter mode. Outside these domains, such as in pure bit-manipulation, this interpretation makes no sense.

Ultimately, carry-less multiplication is a fundamental operation on integers like multiplication, and the interpretation of the bits as coefficients in a GF(2)[X] polynomial is domain-specific. Similarly, std::accumulate is not called std::sum despite its predominant use for computing the sum of a range.

4.1.1. Argumentum ad populum

While the consensus is not unanimous,

4.2. Permitting promotion

common_type_t<T, U> is used as the return type. Therefore, the wider of the two types may be returned, and unbeknownst to the developer, clmul is performed with a wider range of operands than anticipated.

This is not any more of an issue than for other fundamental operations. Intuitively, carry-less multiplication should behave similarly to regular multiplication. However, I don’t feel strongly on this issue, and it would be unusual for the <bit> header.

Feedback required: Should the signature be common_type_t<T, U>(T, U), or simply T(T, T)?

4.3. Widening operations

[P3018R0] proposes functions for widening addition, multiplication, and other operations. Due to the excellent hardware support for widening carry-less multiplication, clmul is a strong candidate for a clmul_wide variant.

Feedback required: Should work into a widening variant of clmul be pursued?

4.4. Choice of header

Currently, <bit> seems like a more appropriate candidate than <cmath>. However, [P3018R0] proposes a new <integer> header which may be a more suitable candidate.

Feedback required: Should the proposed function be located in <bit>?

5. Proposed wording

The proposed changes are relative to the working draft of the standard as of [N4917].

Update subclause 17.3.2 [version.syn], paragraph 2 as follows:

#define __cpp_lib_clmul 20XXXXL
#define __cpp_lib_bitops 201411L20XXXXL

In subclause 22.15.2 [bit.syn], update the synopsis as follows:

// 22.15.X, carry-less product
template<class T, class U>
   constexpr common_type_t<T, U> clmul(T x, U y) noexcept;

In subclause 22.15 [bit], add a new subclause:

22.15.X Carry-less product [bit.clmul]

template<class T, class U>
   constexpr common_type_t<T, U> clmul(T x, U y) noexcept;

1 Let N denote numeric_limits<common_type_t<T, U>>::digits. Let ⊕ denote the exclusive bitwise OR operation ([expr.xor]). Let yn denote the n-th least significant bit of y, so that y equals

2 Constraints: T and U are unsigned integer types ([basic.fundamental]).

3 Returns: modulo 2N.
[Note: x * y equals modulo 2N. — end note]

References

Normative References

[N4917]
Thomas Köppe. Working Draft, Standard for Programming Language C++. 5 September 2022. URL: https://wg21.link/n4917

Informative References

[Intel1]
Shay Gueron; Michael E. Kounavis. Intel® Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode. URL: https://www.intel.com/content/dam/develop/external/us/en/documents/clmul-wp-rev-2-02-2014-04-20.pdf
[Intel2]
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. URL: https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4
[P3018R0]
Andreas Weis. Low-Level Integer Arithmetic. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p3018r0.pdf
[P3104]
Jan Schultke. Bit permutations. URL: https://eisenwave.github.io/cpp-proposals/bit-permutations.html
[RAWRUNPROTECTED1]
rawrunprotected. 2D Hilbert curves in O(1). URL: http://threadlocalmutex.com/?p=188
[Schultke1]
Jan Schultke. C++26 Bit permutations reference implementation. URL: https://github.com/Eisenwave/cxx26-bit-permutations
[Warren1]
Henry S. Warren, Jr. Hacker's Delight, Second Edition. URL: https://doc.lagout.org/security/Hackers%20Delight.pdf
[Wikipedia1]
Wikipedia community. Carry-less product. URL: https://en.wikipedia.org/wiki/Carry-less_product

Issues Index

Feedback required: Should the signature be common_type_t<T, U>(T, U), or simply T(T, T)?
Feedback required: Should work into a widening variant of clmul be pursued?
Feedback required: Should the proposed function be located in <bit>?