Carry-less product: std::clmul
- Document number:
- P3642R2
- Date:
2025-07-02 - Audience:
- LEWG
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <janschultke@gmail.com>
- GitHub Issue:
- wg21.link/P3642/github
- Source:
- github.com/Eisenwave/cpp-proposals/blob/master/src/clmul.cow
Contents
Revision history
Changes since R1
Changes since R0
Introduction
Motivation
Parity computation and JSON parsing
Fast space-filling curves
Possible implementation
Hardware support
Design considerations
Naming
Widening operation
SIMD support
SIMD widening operations are out of scope
Proposed wording
References
1. Revision history
1.1. Changes since R1
The paper was seen by SG6 at Sofia 2025 with the following feedback:
Summary: SG6 had no numerics concerns but recommended to include std::simd overloads into the paper.
POLL: Forward P3642R1 to LEWG with the expectation that the next revision includes
overloads.
std :: simd
SF F N A SA 7 4 0 0 0
The following changes were made:
- Provide (non-widening) §5.3. SIMD support
- Use two-space indentation, and generally match the code style of the C++ standard
- Provide detailed design description for §5.2. Widening operation
- Make §6. Proposed wording and design independent of [P3161R4]
- Fix stray
type name in §6. Proposed wording, and improve wording generallyU - Rebase §6. Proposed wording on
naming changes in [P3691R1]std :: simd
1.2. Changes since R0
- Generate the proposal using COWEL instead of bikeshed
- Fix incorrect formula in §6. Proposed wording for bits ≥ the integer width
- Fix §4.1. Hardware support missing new VPCLMULQDQ instructions
- Fix improper uses of
in §2. Introductionstd :: unsigned_integral - Make slight editorial wording adjustments
- Rebase on N5008 and [P3161R4]
- Mention [SimdJsonClmul] in §3. Motivation
2. Introduction
Carry-less multiplication
is a simple numerical operation on unsigned integers.
It can be a seen as a regular multiplication where
is being used as a reduction instead of
.
It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.
I propose a
function to perform this operation:
I also propose a widening operation in the style of [P3161R4], as follows:
3. Motivation
Carry-less multiplication is an important operation in a number of use cases:
-
CRC Computation: While cyclic redundancy checks can theoretically be performed with a finite
field of any length, in practice,
GF(2)[X],
the polynomial ring over the Galois field with two elements is used.
Polynomial addition in this ring can be implemented via
, and multiplication viaxor
, which makes cyclic redundancy checks considerably faster.clmul -
Cryptography:
may be used to implement AES-GCM. [IntelClmul] describes this process in great detail and motivates hardware support for carry-less multiplication via theclmul
instruction.pclmulqdq -
Bit manipulation:
performs a large amount ofclmul
and<<
operations in parallel. This is utilized in the reference implementation [BitPermutations] ofxor
, proposed in [P3104R3]. For example, the formstd :: bit_compressr
computes the bitwise inclusive parity for each bit ofclmul ( x , - 1 u )
and the bits to its right.x
Carry-less multiplication is of such great utility that there is widespread hardware support, some dating back more than a decade. See below for motivating examples.
3.1. Parity computation and JSON parsing
The parity of an integer
is
if the number of one-bits is even,
and
if it is odd.
The parity can also be computed with
.
computes the parity of each bit in
and the bits to its right.
The most significant bit holds the parity of
as a whole.
While the parity of all bits can be obtained with
,
it computes the inclusive cumulative parity,
which can be used to accelerate parsing JSON and other file formats ([SimdJsonClmul]).
This can be done by mapping each
character onto a
-bit,
and any other character onto
.
would then produce masks where string characters
corresponds to a
-bit.
abc xxx" foobar " zzz" a " // input string 000000001 0000001 000001 01 // quote_mask 00000000. 111111 . 00000. 1 . // clmul(quote_mask, -1), ignoring 1-bits of quote_mask
3.2. Fast space-filling curves
The special form
can be used to accelerate the computation of Hilbert curves.
To properly understand this example, I will explain the basic notion of space-filling curves.
We can fill space using a 2D curve by mapping the index
on the curve
onto Cartesian coordinates
and
.
A naive curve that fills a 4x4 square can be computed as follows:
When mapping the index
onto the returned 2D coordinates,
we obtain the following pattern:
0 1 2 3 4 5 6 7 8 9 a b c d e f
The problem with such a naive curve is that adjacent indices can be positioned very far apart (the distance increases with row length). For image processing, if we store pixels in this pattern, cache locality is bad; two adjacent pixels can be very far apart in memory.
A Hilbert curve
is a family of space-filling curves where the distance between two adjacent
elements is
:
0 1 e f 3 2 d c 4 7 8 b 5 6 9 a
De-interleaving bits of
into
and
yields a Z-order curve,
and performing further transformations yields a
Hilbert curve.
can be used to compute the bitwise parity for each bit and the bits to its right,
which is helpful for computing Hilbert curves.
Note that the following example uses the
function from [P3104R3],
which may also be accelerated using
.
This specific example is taken from [FastHilbertCurves]. [HackersDelight] explains the basis behind this computation of Hilbert curves using bitwise operations.
When working with space-filling curves, the inverse operation is also common:
mapping the Cartesian coordinates onto an index on the curve.
In the case of Z-order curves aka. Morton curves,
this can be done by simply interleaving the bits of
and
.
A Z-order curve is laid out as follows:
0 1 4 5 2 3 6 7 8 9 c d a b e f
can be used to implement bit-interleaving in order to generate a
Z-order curves.
is equivalent to
[P3104R3]'s
.
4. Possible implementation
A naive and unconstrained implementation looks as follows:
4.1. Hardware support
The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.
In the following table, let
Operation | x86_64 | ARM | RV64 |
---|---|---|---|
may look as follows:
There also exists an LLVM pull request ([LLVMClmul])
which would add an
5. Design considerations
Multiple design choices lean on [P0543R3] and [P3161R4]. Specifically,
- the choice of header
,<numeric> - the choice to have a widening operation,
- the
naming scheme,_wide - the
template, andwide_result - the decision to have a
parameter list.( T , T )
5.1. Naming
Carry-less multiplication is also commonly called "Galois Field Multiplication" or "Polynomial Multiplication".
The name
was chosen because it carries no domain-specific connotation,
and because it is widespread:
-
Intel refers to
As "Carry-Less Multiplication Quadword" in its manual; see [IntelManual].PCLMULQDQ -
RISC-V refers to
as carry-less multiplication, and this is obvious from the mnemonic.clmul - The Wikipedia article ([WikipediaClmul]) for this operation is titled "Carry-less product".
-
The (proposed) LLVM intrinsic function ([LLVMClmul]) is tentatively named
@llvm.clmul .
5.2. Widening operation
In addition to the
function template,
there exists a
function template:
Such a widening function is important in a various cryptographic use cases. There is universal §4.1. Hardware support for obtaining all 128 bits of a multiplication for that reason.
Most of the design choices take the design of [P3161R4] into consideration:
-
The result type is deliberately not named
so that futureclmul_wide_result
and other operations can use the same result type, which avoids creating an ever-growing set of equivalent (but distinct) types.mul_wide -
The
appear before thelow_bits
, so that on the more widespread little-endian architectures, the layout of thehigh_bits
is identical to that of an integer, which is slightly better for calling conventions.struct
However, the comparison operators are a novel invention of this proposal.
They are intended to behave as if the comparisons were performed on an
integer with twice the width of
.
These comparisons exists so that the result can be easily compared
against expected results in test cases,
stored in containers like
,
used out of the box with
, etc.
Also,
should be a broadly useful vocabulary type
which may be instantiated with user-defined numeric types,
simply because that seems like a useful side product of this proposal.
cannot be a hidden friend because
is meant to compile even when given a type
that has no
operator.
inputs.
5.3. SIMD support
Upon seeing this proposal at Sofia 2025, SG6 recommended to add SIMD support. This recommendation was provided under the assumption that it would be a simple addition in the style of [P2933R4]. Therefore, this proposal provides non-widening SIMD carry-less multiplication with the following signature:
5.3.1. SIMD widening operations are out of scope
AVX-512 provides a
It would take considerable design and wording effort to standardize this,
especially if one wants to expose the full
In conclusion, a proposal for widening SIMD operations in general would be well-motivated.
For
, designing SIMD widening operations would be scope creep.
6. Proposed wording
The proposed changes are relative to [N5008] with the changes in [P3691R1] applied.
Change [version.syn] paragraph 2 as follows:
without creating a new SIMD
feature test macro because [P2933R4] did the same,
and because
can be used to test for the presence
of both the scalar and SIMD version.
Add the following declarations to the synopsis in [numeric.ops.overview], immediately following the declarations associated with [numeric.sat]:
In subclause [numeric.ops], append a subclause immediately following [numeric.sat]:
Carry-less product [numeric.clmul]
1
Returns:
.
2 Let:
- be a reduction using the exclusive OR operation ([expr.xor]);
- be the least significant bit in the base-2 representation of an integer ;
- be the width of
.T
3
Constraints:
is an unsigned integer type ([basic.fundamental]).
4
Returns:
A
object storing the bits of an integer ,
where the value of
is given by Formula
, and
is
.
The result object is initialized so that
stores the least significant bits of , andlow_bits
stores the subsequent bits of .high_bits
5
Effects:
Equivalent to
.
See [iterator.concept.wine] for precedent on using to denote the width of a type.
See [sf.cmath.riemann.zeta] for precedent on wording which includes formulae.
Add the following declarations to the synopsis in [simd.syn]:
In subclause [simd], append a subclause immediately preceding [simd.math]:
basic_vec
carry-less product [simd.clmul]
1
Constraints:
The type
is an unsigned integer type ([basic.fundamental]).
2
Returns:
A
object where the element is initialized
to the result of
([numeric.clmul])
for all in the range [
,
).