1. Introduction
Carry-less multiplication is a simple numerical operation on unsigned integers.
It can be a seen as a regular multiplication where
is being used as a reduction instead of
.
It is also known as "XOR multiplication" and "polynomial multiplication". The latter name is used because mathematically, it is equivalent to performing a multiplication of two polynomials in GF(2), where each bit is a coefficient.
I propose a
function to perform this operation:
template < unsigned_integral T > // constraint is exposition-only constexpr T clmul ( T x , T y ) noexcept ;
I also propose a widening operation in the style of [P3161R2], as follows:
template < class T > struct mul_wide_result { T low_bits ; T high_bits ; }; template < unsigned_integral T > constexpr mul_wide_result < T > clmul_wide ( T x , T y ) noexcept ;
2. Motivation
Carry-less multiplication is an important operation in a number of use cases:
-
CRC Computation: While cyclic redundancy checks can theoretically be performed with a finite field of any length, in practice, GF(2)[X], the polynomial ring over the Golois field with two elements is used. Polynomial addition in this ring can be implemented via
, and multiplication viaxor
, which makes cyclic redundancy checks considerably faster.clmul -
Cryptography:
may be used to implement AES-GCM. [Intel1] describes this process in great detail and motivates hardware support for carry-less multiplication via theclmul
instruction.pclmulqdq -
Bit manipulation:
performs a large amount ofclmul
and<<
operations in parallel. This is utilized in the reference implementation [Schultke1] ofxor
, proposed in [P3104R3]. For example, the formstd :: bit_compressr
computes the bitwise inclusive parity for each bit ofclmul ( x , -1u )
and the bits to its right.x
Carry-less multiplication is of such great utility that there is widespread hardware support, some dating back more than a decade. See below for motivating examples.
2.1. Parity computation
The parity of an integer
is
if the number of one-bits is even, and
if it is odd.
It is also equivalent to
.
clmul ( x , -1 )
computes the parity of each bit in x
and the bits to its right.
The most significant bit holds the parity of x
as a whole.
bool parity ( uint32_t x ) { return std :: clmul ( x , -1u ) >> 31 ; }
2.2. Fast space-filling curves
The special form
can be used to accelerate the computation of Hilbert curves.
To properly understand this example, I will explain the basic notion of space-filling curves.
We can fill space using a 2D curve by mapping the index
on the curve
onto Cartesian coordinates
and
.
A naive curve that fills a 4x4 square can be computed as follows:
struct pos { uint32_t x , y ; }; pos naive_curve ( uint32_t i ) { return { i % 4 , i / 4 }; }
When mapping the index
onto the returned 2D coordinates,
we obtain the following pattern:
0 1 2 3 4 5 6 7 8 9 a b c d e f
The problem with such a naive curve is that adjacent indices can be positioned very far apart (the distance increases with row length). For image processing, if we store pixels in this pattern, cache locality is bad; two adjacent pixels can be very far apart in memory.
A Hilbert curve is a family of space-filling curves where the distance between two adjacent
elements is
:
0 1 e f 3 2 d c 4 7 8 b 5 6 9 a
De-interleaving bits of
into
and
yields a Z-order curve,
and performing further transformations yields a Hilbert curve.
clmul
can be used to compute the bitwise parity for each bit and the bits to its right,
which is helpful for computing Hilbert curves.
Note that the following example uses the std :: bit_compress
function from [P3104R3],
which may also be accelerated using std :: clmul
.
pos hilbert_to_xy ( uint32_t i ) { // De-interleave the bits of i. uint32_t i0 = std :: bit_compress ( i , 0x55555555u ); // abcdefgh -> bdfh uint32_t i1 = std :: bit_compress ( i , 0xaaaaaaaau ); // abcdefgh -> aceg // Undo the permutation that Hilbert curves apply on top of Z-order curves. uint32_t A = i0 & i1 ; uint32_t B = i0 ^ i1 ^ 0xffffu ; uint32_t C = std :: clmul ( A , -1u ) >> 16 ; uint32_t D = std :: clmul ( B , -1u ) >> 16 ; uint32_t a = C ^ ( i0 & D ); return { . x = a ^ i1 , . y = a ^ i0 ^ i1 }; }
This specific example is taken from [rawrunprotected1]. [Warren1] explains the basis behind this computation of Hilbert curves using bitwise operations.
When working with space-filling curves, the inverse operation is also common:
mapping the Cartesian coordinates onto an index on the curve.
In the case of Z-order curves aka. Morton curves,
this can be done by simply interleaving the bits of
and
.
A Z-order curve is laid out as follows:
0 1 4 5 2 3 6 7 8 9 c d a b e f
clmul
can be used to implement bit-interleaving in order to generate a Z-order curves.
uint32_t xy_to_morton ( uint32_t x , uint32_t y ) { uint32_t lo = std :: clmul ( x , x ) << 0 ; // abcd -> 0a0b0c0d uint32_t hi = std :: clmul ( y , y ) << 1 ; // abcd -> a0b0c0d0 return hi | lo ; }
Note: In the example above,
is equivalent to [P3104R3]'s
.
3. Possible implementation
A naive implementation looks as follows:
template < unsigned_integral T > constexpr T clmul ( const T x , const T y ) noexcept { T result = 0 ; for ( int i = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { result ^= ( x << i ) * (( y >> i ) & 1 ); } return result ; }
3.1. Hardware support
The implementation difficulty lies mostly in utilizing available hardware instructions, not in the naive fallback implementation.
In the following table, let
denote
-bit unsigned integer operands,
and
denote the amount of operands that are processed in parallel.
Operation | x86_64 | ARM | RISC-V |
---|---|---|---|
clmul u64 -> u128
| PCLMULQDQ
| + Neon
| |
clmul u64 -> u64
| Neon
| + Zbc, Zbkc
| |
clmul u32 -> u64
| Zbc, Zbkc
| ||
clmul u8x8 -> u16x8
| Neon
| ||
clmul u8x8 -> u8x8
| Neon
|
clmul_wide
may look as follows:
#include <immintrin.h>#include <cstdint>mul_wide_result < uint64_t > clmul_wide ( uint64_t x , uint64_t y ) noexcept { __m128i x_128 = _mm_set_epi64x ( 0 , x ); __m128i y_128 = _mm_set_epi64x ( 0 , y ); __m128i result_128 = _mm_clmulepi64_si128 ( x_128 , y_128 , 0 ); return { . low_bits = uint64_t ( _mm_extract_epi64 ( result_128 , 0 )), . high_bits = uint64_t ( _mm_extract_epi64 ( result_128 , 1 )) }; }
4. Design Considerations
Multiple design choices lean on [P0543R3] and [P3161R2]. Specifically,
-
the choice of header
,< numeric > -
the choice to have a widening operation,
-
the
naming scheme,_wide -
the
template, andmul_wide_result -
the decision to have a
parameter list.( T , T )
4.1. Naming
Carry-less multiplication is also commonly called "Galois Field Multiplication" or "Polynomial Multiplication".
The name
was chosen because it carries no domain-specific connotation,
and because it is widespread:
-
Intel refers to
As "Carry-Less Multiplication Quadword" in its manual; see [Intel2].PCLMULQDQ -
RISC-V refers to
as carry-less multiplication, and this is obvious from the mnemonic.clmul -
The Wikipedia article for this operation is titled "Carry-less product" [Wikipedia1].
4.2. SIMD
Many but not all use cases of carry-less multiplication operate on large blocks of data,
and SIMD is ideal for that purpose.
I do not yet propose carry-less multiplication for
,
but a future paper, or perhaps this paper could do so, if LEWG wants that.
Note: Only ARM supports eight 8-bit multiplications in parallel. Other than that, hardware only supports scalar multiplications, albeit using vector registers except for RISC-V; see § 3.1 Hardware support.
5. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N5001], with the changes in [P3161R2] applied.
Update subclause [version.syn] paragraph 2 as follows:
#define __cpp_lib_clamp 201603L // also in <algorithm> #define __cpp_lib_clmul 20????L // also in <numeric> [...] #define __cpp_lib_overflow_arithmetic 20????L 20????L // also in <numeric>
In subclause [numeric.ops.overview], update the synopsis as follows:
template < class T , class U > constexpr T saturate_cast ( U x ) noexcept ; // freestanding template < class T > struct add_carry_result { // freestanding T low_bits ; bool overflow ; }; template < class T > using sub_borrow_result = add_carry_result ; // freestanding template < class T > struct mul_wide_result { // freestanding T low_bits ; T high_bits ; }; template < class T > struct div_result { // freestanding T quotient ; T remainder ; }; template < class T > constexpr add_carry_result < T > add_carry ( T x , T y , bool carry ) noexcept ; // freestanding template < class T > constexpr sub_borrow_result < T > sub_borrow ( T left , T right , bool borrow ) noexcept ; // freestanding template < class T > constexpr mul_wide_result < T > mul_wide ( T x , T y ) noexcept ; // freestanding template < class T > constexpr mul_wide_result < T > clmul_wide ( T x , T y ) noexcept ; // freestanding template < class T > constexpr div_result < T > div_wide ( T dividend_high , T dividend_low , // freestanding T divisor ) noexcept ; template < class T > constexpr bool is_div_defined ( T dividend , T divisor ) noexcept // freestanding template < class T > constexpr bool is_div_wide_defined ( T dividend_high , // freestanding T dividend_low , T divisor ) noexcept ; // [numeric.clmul], carry-less product template < class T > constexpr T clmul ( T x , U y ) noexcept ; // freestanding
In subclause [numeric.overflow] (known as [numeric.sat] at the time of writing),
insert the following, immediately following the description of
:
template < class T > constexpr mul_wide_result < T > clmul_wide ( T x , T y ) noexcept ; Let ⊕ denote the exclusive bitwise OR operation ([expr.xor]). Let xn and yn denote the nth least significant bit in the base-2 representation of
and
x respectively. Let w be the width of
y .
T Constraints:
is an unsigned integer type ([basic.fundamental]).
T Returns: Let c be an integer, where the value of the nth least significant bit in the base-2 representation of c, cn, is given by Formula ?.? below. The result object is initialized so that
- the bits of
are the w least significant bits of c, and
low_bits - the bits of
are subsequent bits of c .
high_bits
Note: Italicized text such as c shall be in math font.
In subclause [numeric.ops], append a subclause immediately following [numeric.overflow] (known as [numeric.sat] at the time of writing):
Carry-less product [numeric.clmul]
template < class T > constexpr T clmul ( T x , T y ) noexcept ; Constraints:
is an unsigned integer type ([basic.fundamental]).
T Returns:
([numeric.overflow]).
clmul_wide ( x , y ). low_bits