Carry-less product: std::clmul
- Document number:
- P3642R1
- Date:
2025-05-26 - Audience:
- LEWGI, SG6
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <janschultke@gmail.com>
- GitHub Issue:
- wg21.link/P3642/github
- Source:
- github.com/Eisenwave/cpp-proposals/blob/master/src/clmul.cow
Revision history
Changes since R0
- Generate the proposal using COWEL instead of bikeshed
- Fix incorrect formula in §5. Proposed wording for bits ≥ the integer width
- Fix §3.1. Hardware support missing new VPCLMULQDQ instructions
- Fix improper uses of
in §1. Introductionstd :: unsigned_integral - Make slight editorial wording adjustments
- Rebase on [N5008] and [P3161R4]
- Mention [SimdJsonClmul] in §2. Motivation
Contents
Introduction
Motivation
Parity computation and JSON parsing
Fast space-filling curves
Possible implementation
Hardware support
Design considerations
Naming
SIMD support
Proposed wording
References
1. Introduction
Carry-less multiplication
is a simple numerical operation on unsigned integers.
It can be a seen as a regular multiplication where
is being used as a reduction instead of
.
It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.
I propose a
function to perform this operation:
I also propose a widening operation in the style of [P3161R4], as follows:
2. Motivation
Carry-less multiplication is an important operation in a number of use cases:
-
CRC Computation: While cyclic redundancy checks can theoretically be performed with a finite
field of any length, in practice,
GF(2)[X],
the polynomial ring over the Galois field with two elements is used.
Polynomial addition in this ring can be implemented via
, and multiplication viaxor
, which makes cyclic redundancy checks considerably faster.clmul -
Cryptography:
may be used to implement AES-GCM. [IntelClmul] describes this process in great detail and motivates hardware support for carry-less multiplication via theclmul
instruction.pclmulqdq -
Bit manipulation:
performs a large amount ofclmul
and<<
operations in parallel. This is utilized in the reference implementation [BitPermutations] ofxor
, proposed in [P3104R3]. For example, the formstd :: bit_compressr
computes the bitwise inclusive parity for each bit ofclmul ( x , - 1 u )
and the bits to its right.x
Carry-less multiplication is of such great utility that there is widespread hardware support, some dating back more than a decade. See below for motivating examples.
2.1. Parity computation and JSON parsing
The parity of an integer
is
if the number of one-bits is even,
and
if it is odd.
The parity can also be computed with
.
computes the parity of each bit in
and the bits to its right.
The most significant bit holds the parity of
as a whole.
While the parity of all bits can be obtained with
,
it computes the inclusive cumulative parity,
which can be used to accelerate parsing JSON and other file formats ([SimdJsonClmul]).
This can be done by mapping each
character onto a
-bit,
and any other character onto
.
would then produce masks where string characters
corresponds to a
-bit.
abc xxx" foobar " zzz" a " // input string 000000001 0000001 000001 01 // quote_mask 00000000. 111111 . 00000. 1 . // clmul(quote_mask, -1), ignoring 1-bits of quote_mask
2.2. Fast space-filling curves
The special form
can be used to accelerate the computation of Hilbert curves.
To properly understand this example, I will explain the basic notion of space-filling curves.
We can fill space using a 2D curve by mapping the index
on the curve
onto Cartesian coordinates
and
.
A naive curve that fills a 4x4 square can be computed as follows:
When mapping the index
onto the returned 2D coordinates,
we obtain the following pattern:
0 1 2 3 4 5 6 7 8 9 a b c d e f
The problem with such a naive curve is that adjacent indices can be positioned very far apart (the distance increases with row length). For image processing, if we store pixels in this pattern, cache locality is bad; two adjacent pixels can be very far apart in memory.
A Hilbert curve
is a family of space-filling curves where the distance between two adjacent
elements is
:
0 1 e f 3 2 d c 4 7 8 b 5 6 9 a
De-interleaving bits of
into
and
yields a Z-order curve,
and performing further transformations yields a
Hilbert curve.
can be used to compute the bitwise parity for each bit and the bits to its right,
which is helpful for computing Hilbert curves.
Note that the following example uses the
function from [P3104R3],
which may also be accelerated using
.
This specific example is taken from [FastHilbertCurves]. [HackersDelight] explains the basis behind this computation of Hilbert curves using bitwise operations.
When working with space-filling curves, the inverse operation is also common:
mapping the Cartesian coordinates onto an index on the curve.
In the case of Z-order curves aka. Morton curves,
this can be done by simply interleaving the bits of
and
.
A Z-order curve is laid out as follows:
0 1 4 5 2 3 6 7 8 9 c d a b e f
can be used to implement bit-interleaving in order to generate a
Z-order curves.
is equivalent to
[P3104R3]'s
.
3. Possible implementation
A naive and unconstrained implementation looks as follows:
3.1. Hardware support
The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.
In the following table, let
Operation | x86_64 | ARM | RV64 |
---|---|---|---|
may look as follows:
There also exists an LLVM pull request ([LLVMClmul])
which would add an
4. Design considerations
Multiple design choices lean on [P0543R3] and [P3161R4]. Specifically,
- the choice of header
,<numeric> - the choice to have a widening operation,
- the
naming scheme,_wide - the
template, andmul_wide_result - the decision to have a
parameter list.( T , T )
4.1. Naming
Carry-less multiplication is also commonly called "Galois Field Multiplication" or "Polynomial Multiplication".
The name
was chosen because it carries no domain-specific connotation,
and because it is widespread:
-
Intel refers to
As "Carry-Less Multiplication Quadword" in its manual; see [IntelManual].PCLMULQDQ -
RISC-V refers to
as carry-less multiplication, and this is obvious from the mnemonic.clmul - The Wikipedia article ([WikipediaClmul]) for this operation is titled "Carry-less product".
-
The (proposed) LLVM intrinsic function ([LLVMClmul]) is tentatively named
@llvm.clmul .
4.2. SIMD support
The proposal is currently limited to the scalar version of
.
However,
support may be useful to add,
and would be backed by hardware support to an extent.
support should be added in this paper,
or in a separate paper, if at all.
5. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N5008], with the changes in [P3161R4] applied.
Update subclause [version.syn] paragraph 2 as follows:
Update the synopsis in [numeric.ops.overview] as follows:
In subclause
[numeric.overflow]
(known as [numeric.sat] at the time of writing),
insert the following item,
immediately following the description of
:
Let ⊕ denote the exclusive OR operation ([expr.xor]). Given an integer , let denote the least significant bit in the base-2 representation of .
Constraints:
is an unsigned integer type ([basic.fundamental]).
Returns:
An object storing the bits of an integer ,
where the value of
is given by Formula
,
is
,
and is the width of
.
The result object is initialized so that
stores the least significant bits of , andlow_bits
stores the subsequent bits of .high_bits
See [iterator.concept.wine] for precedent on using to denote the width of a type.
See [sf.cmath.riemann.zeta] for precedent on wording which includes formulae.
In subclause [numeric.ops], append a subclause immediately following [numeric.overflow] (known as [numeric.sat] at the time of writing):
Carry-less product [numeric.clmul]
Constraints:
is an unsigned integer type ([basic.fundamental]).
Returns:
([numeric.overflow]).