1. Introduction
Carry-less multiplication is a simple numerical operation on unsigned integers.
It can be a seen as a regular multiplication,
but with all carry-bits being discarded.
In other words,
is being used as a reduction instead of
.
It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.
This proposal adds a
function to the bit manipulation library in
:
template < unsigned_integral T , unsigned_integral U > // constraint is exposition-only constexpr common_type_t < T , U > clmul ( T x , U y ) noexcept ;
2. Motivation
Carry-less multiplication is an important operation in a number of use cases:
-
CRC Computation: While cyclic redundancy checks can theretically be performed with a finite field of any length, in practice, GF(2)[X], the polynomial ring over the Golois field with two elements is used. Polynomial addition in this ring can be implemented via
, and multiplication viaxor
, which makes cyclic redundancy checks considerably faster.clmul -
Cryptography:
may be used to implement AES-GCM. [Intel1] describes this process in great detail and motivates hardware support for carry-less multiplication via theclmul
instruction.pclmulqdq -
Bit manipulation:
performs a large amount ofclmul
and<<
operations in parallel. This is utilized in the reference implementation [Schultke1] ofxor
, proposed in [P3104]. Specifically, the formstd :: bit_compressr
computes the bitwise inclusive parity for each bit ofclmul ( x , -1 )
and the bits to its right.x
Carry-less multiplication is of such great utility that there is widespread hardware support, some of it dating back more than a decade.
2.1. Motivating examples
2.2. Parity computation
The parity of an integer
is
if the number of one-bits is even, and
if it is odd.
It is also equivalent to
.
clmul ( x , -1 )
computes the parity of each bit in x
and the bits to its right.
The most significant bit holds the parity of x
as a whole.
bool parity ( uint32_t x ) { return std :: clmul ( x , -1u ) >> 31 ; }
2.3. Space filling curves in O(1)
The special form
can be used to accelerate the computation of Hilbert curves.
To properly understand this example, I will explain the basic notion of space-filling curves.
We can fill space using a 2D curve by mapping the index
on the curve onto Cartesian
coordinates
and
.
A naive curve that fills a 4x4 square can be computed as follows:
struct pos { uint32_t x , y ; }; pos naive_curve ( uint32_t i ) { return { i % 4 , i / 4 }; }
When mapping the index
onto the returned 2D coordinates,
we obtain the following pattern:
0 1 2 3 4 5 6 7 8 9 a b c d e f
This is a suboptimal curve because especially for large squares, the gaps between two points on the curve can be very large. If we use this curve to store images, this is problematic for processing because two adjacent pixels can be very far apart in memory, which is bad for cache locality.
A Hilbert curve is a family of space-filling curves where the distance between two adjacent
elements is
:
0 1 e f 3 2 d c 4 7 8 b 5 6 9 a
De-interleaving bits of
into
and
yields a Z-order curve,
and performing further transformations yields a Hilbert curve.
clmul
can be used to compute the bitwise parity for each bit and the bits to its right.
In Hilbert curve computation, this turns O(log N) bitwise operations into O(1).
pos hilbert_to_xy ( uint32_t i ) { // Using functions from P3104: Bit permutations, we de-interleave the bits of i. uint32_t i0 = std :: bit_compressr ( i , 0x55555555u ); // abcdefgh -> bdfh uint32_t i1 = std :: bit_compressr ( i , 0xaaaaaaaau ); // abcdefgh -> aceg // Undo the permutation that Hilbert curves apply on top of Z-order curves. uint32_t A = i0 & i1 ; uint32_t B = i0 ^ i1 ^ 0xffffu ; uint32_t C = std :: clmul ( A , -1u ) >> 16 ; uint32_t D = std :: clmul ( B , -1u ) >> 16 ; uint32_t a = C ^ ( i0 & D ); return { . x = a ^ i1 , . y = a ^ i0 ^ i1 }; }
This specific example is taken from [rawrunprotected1]. [Warren1] explains the basis behind this computation of Hilbert curves using bitwise operations.
When working with space filling curves, the inverse operation is also common:
mapping the Cartesian coordinates onto an index on the curve.
In the case of Z-order curve,
aka. Morton curves, this can be done by simply interleaving the bits of
and
.
clmul
can be used to implement bit-interleaving in order to generate a Z-order curves.
uint32_t xy_to_morton ( uint32_t x , uint32_t y ) { // With P3014: Bit permutations, interleaving can also be implemented using // bit_expandr(x, 0x55555555) uint32_t lo = std :: clmul ( x , x ) << 0 ; // abcd -> 0a0b0c0d uint32_t hi = std :: clmul ( y , y ) << 1 ; // abcd -> a0b0c0d0 return hi | lo ; }
3. Possible implementation
A branchless naive implementation looks as follows:
template < unsigned_integral T , unsigned_integral U > constexpr common_type_t < T , U > clmul ( T x , U y ) noexcept { using Common = common_type_t < T , U > ; Common result = 0 ; for ( int i = 0 ; i < numeric_limits < Common >:: digits ; ++ i ) { result ^= ( Common { x } << i ) * (( Common { y } >> i ) & 1 ); } return result ; }
3.1. Hardware support
The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.
In the following table, let
denote
-bit unsigned integer operands.
Operation | x86_64 | ARM | RISC-V |
---|---|---|---|
clmul u64 -> u128
| PCLMULQDQ
| + Neon
| |
clmul u64 -> u64
| Neon
| + Zbc, Zbkc
| |
clmul u32 -> u64
| Zbc, Zbkc
| ||
clmul u8x8 -> u16x8
| Neon
| ||
clmul u8x8 -> u8x8
| Neon
|
clmul
may look as follows:
uint64_t clmul ( uint64_t x , uint64_t y ) { __m128i x_128 = _mm_set_epi64x ( 0 , x ); __m128i neg1_128 = _mm_set_epi64x ( 0 , y ); __m128i result_128 = _mm_clmulepi64_si128 ( x_128 , neg1_128 , 0 ); return static_cast < uint64_t > ( _mm_extract_epi64 ( result_128 , 0 )); }
4. Design Considerations
4.1. Naming
The name
was chosen because it does not carry a domain-specific connotation.
is often, but not always used for CRC computation or cryptography in Galois counter mode.
Outside these domains, such as in pure bit-manipulation, this interpretation makes no sense.
Ultimately, carry-less multiplication is a fundamental operation on integers like multiplication,
and the interpretation of the bits as coefficients in a GF(2)[X] polynomial is domain-specific.
Similarly,
is not called
despite its predominant use for computing
the sum of a range.
4.1.1. Argumentum ad populum
While the consensus is not unanimous,
-
Intel refers to
As "Carry-Less Multiplication Quadword" in its manual; see [Intel2]PCLMULQDQ -
RISC-V refers to
as carry-less multiplication, and this is obvious from the mnemonicclmul -
The Wikipedia article for this operation is titled "Carry-less product" [Wikipedia1]
4.2. Permitting promotion
is used as the return type.
Therefore, the wider of the two types may be returned, and unbeknownst to the developer,
is performed with a wider range of operands than anticipated.
This is not any more of an issue than for other fundamental operations.
Intuitively, carry-less multiplication should behave similarly to regular multiplication.
However, I don’t feel strongly on this issue, and it would be unusual for the
header.
Feedback required: Should the signature be
, or simply
?
4.3. Widening operations
[P3018R0] proposes functions for widening addition, multiplication, and other operations.
Due to the excellent hardware support for widening carry-less multiplication,
is a strong candidate for a
variant.
Feedback required: Should work into a widening variant of
be pursued?
4.4. Choice of header
Currently,
seems like a more appropriate candidate than
.
However, [P3018R0] proposes a new
header which may be a more suitable candidate.
Feedback required: Should the proposed function be located in
?
5. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N4917].
Update subclause 17.3.2 [version.syn], paragraph 2 as follows:
#define __cpp_lib_clmul 20XXXXL #define __cpp_lib_bitops 201411L 20XXXXL
In subclause 22.15.2 [bit.syn], update the synopsis as follows:
// 22.15.X, carry-less product template < class T , class U > constexpr common_type_t < T , U > clmul ( T x , U y ) noexcept ;
In subclause 22.15 [bit], add a new subclause:
22.15.X Carry-less product [bit.clmul]
template < class T , class U > constexpr common_type_t < T , U > clmul ( T x , U y ) noexcept ; 1 Let N denote
. Let ⊕ denote the exclusive bitwise OR operation ([expr.xor]). Let yn denote the n-th least significant bit of
numeric_limits < common_type_t < T , U >>:: digits , so that
y equals
y 2 Constraints:
and
T are unsigned integer types ([basic.fundamental]).
U 3 Returns: modulo 2N.
[Note:equals modulo 2N. — end note]
x * y