1. Revision history
1.1. Changes since R2
Revise the proposed wording. No design changes are made.
1.2. Changes since R1
The paper has been seen in Tokyo by SG18 with positive reception,
except for the
functions.
These have been removed to strengthen consensus.
1.3. Changes since R0
-
Expand § 4.3 Hardware support, taking more instructions into account, including AVX-512.
-
Minor editorial fixes.
2. Introduction
The C++ bit manipulation library in
is an invaluable abstraction from hardware operations.
Functions like
help the programmer avoid use of intrinsic functions or inline
assembly.
However, there are still a few operations which are non-trivial to implement in software and have widely available hardware support. Therefore, I propose to expand the bit manipulation library with multiple bit permutation functions described below. Even without hardware support, these functions provide great utility and/or help the developer write more expressive code.
3. Proposed features
I propose to add the following functions to the bit manipulation library (
).
Note: All constraints are exposition-only. The function bodies contain a naive implementation that merely illustrates the behavior.
Note: See § 4.3 Hardware support for the corresponding hardware instructions.
3.1. bit_reverse
template < unsigned_integral T > constexpr T bit_reverse ( T x ) noexcept { T result = 0 ; for ( int i = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { result <<= 1 ; result |= x & 1 ; x >>= 1 ; } return result ; }
Reverses the bits of
so that the least significant bit becomes the most significant.
3.2. bit_repeat
template < unsigned_integral T > constexpr T bit_repeat ( T x , int length ) noexcept ( false) { T result = 0 ; for ( int i = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { result |= (( x >> ( i % length )) & 1 ) << i ; } return result ; }
Repeats a pattern stored in the least significant
bits of
, as many times as fits into
.
3.3. bit_compress
template < unsigned_integral T > constexpr T bit_compressr ( T x , T m ) noexcept { T result = 0 ; for ( int i = 0 , j = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { bool mask_bit = ( m >> i ) & 1 ; result |= ( mask_bit & ( x >> i )) << j ; j += mask_bit ; } return result ; }
For each one-bit in
, the corresponding bit in
is taken and packed
contiguously into the result, starting with the least significant result bit.
template < unsigned_integral T > constexpr T bit_compressl ( T x , T m ) noexcept { return bit_reverse ( bit_compressr ( bit_reverse ( x )), bit_reverse ( m ))); }
For each one-bit in
, the corresponding bit in
is taken and packed
contiguously into the result, starting with the most significant result bit.
x: +---+---+---+---+ | a | b | c | d | +---+---+---+---+ m: | | bit_compressl(x, m): bit_compressr(x, m): +---+---+---+---+ +---+---+---+---+ +---+---+---+---+ | 0 | 1 | 0 | 1 | | b | d | 0 | 0 | | 0 | 0 | b | d | +---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Note: Intuitively,
is a "mask" that determines which bits of
are packed.
3.4. bit_expand
template < unsigned_integral T > constexpr T bit_expandr ( T x , T m ) noexcept { T result = 0 ; for ( int i = 0 , j = 0 ; i < numeric_limits < T >:: digits ; ++ i ) { bool mask_bit = ( m >> i ) & 1 ; result |= ( mask_bit & ( x >> j )) << i ; j += mask_bit ; } return result ; }
For each one-bit in
, a bit from
, starting with the least significant bit
is taken and shifted into the corresponding position of the
bit.
template < unsigned_integral T > constexpr T bit_expandl ( T x , T m ) noexcept { return bit_reverse ( bit_expandr ( bit_reverse ( x )), bit_reverse ( m ))); }
For each one-bit in
, a bit from
, starting with the most significant bit
is taken and shifted into the corresponding position of the
bit.
x: +---+---+---+---+ | a | b | c | d | +---+---+---+---+ m: bit_expandl(x, m): bit_expandr(x, m): +---+---+---+---+ +---+---+---+---+ +---+---+---+---+ | 0 | 1 | 0 | 1 | | 0 | a | 0 | b | | 0 | c | 0 | d | +---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Note: Intuitively,
is a "mask" that determines where the bits of
are unpacked into.
4. Motivation and scope
Bit-reversal, repetition, compression, and expansion are fundamental operations that meet multiple criteria which make them suitable for standardization:
-
They are common and useful operations.
-
They can be used to implement numerous other operations.
-
At least on some architectures, they have direct hardware support.
-
They are non-trivial to implement efficiently in software.
-
For known masks, numerous optimization opportunities are available.
4.1. Applications
4.1.1. Applications of bit_reverse
Bit-reversal is a common operation with uses in:
-
Cryptography: scrambling bits
-
Networking: as part of cyclic redundancy check computation
-
Graphics: mirroring of images with one bit per pixel
-
Random number generation: reversal can counteract low entropy of low-order bits such as in the case of linear congruential generators
-
Digital signal processing: for radix-2 Cooley-Tukey FFT algorithms
-
Code obfuscation: security by obscurity
4.1.2. Applications of bit_repeat
The generation of recurring bit patterns is such a fundamental operation that it’s hard to tie to any specific domain, but here are some use cases:
-
Debugging: using obvious recurring bit pattenrs like
to identify "garbage memory"0xcccc ... -
Bit manipulation: generating alternating sequences of
and1
for various algorithms0 -
Eliminating integer divison: for
with very smallx >> ( i % N )
,N
can be used insteadbit_repeat ( x , N ) >> i -
Testing: recurring bit patterns can make for good test cases when implementing numerics
-
Error detection: known bit patterns can be introduced to spot failed transmission
4.1.3. Applications of bit_compress
and bit_expand
Compression and expansion are also common, with uses in:
-
Space-filling curves: Morton/Z-Order and Hilbert curves
-
Input/output: especially for variable-length encodings, such as UTF-8 (§ 4.2.3 UTF-8 decoding with bit_compressr)
-
Chess engines: for bitboards; see [ChessProgramming1]
-
Genomics: according to [ARM1]
A GitHub code search for
reveals ~1300 files which use the intrinsic wrappers for the x86 instructions.
4.2. Motivating examples
4.2.1. Implementing countr_zero
with bit_repeat
[Anderson1] contains a vast amount of algorithms,
many of which involve masks of alternating
s and
s.
When written as "magic numbers" in code,
these masks can make it quite hard to understand the overall pattern
and to generalize these algorithms.
allows one to be more expressive here:
unsigned int v ; // 32-bit word input to count zero bits on right unsigned int c = 32 ; // c will be the number of zero bits on the right v &= - v ; if ( v ) c -- ; if ( v & 0x0000FFFF bit_repeat (( 1 << 16 ) - 1 , 32 ) ) c -= 16 ; if ( v & 0x00FF00FF bit_repeat ( ( 1 << 8 ) - 1 , 16 ) ) c -= 8 ; if ( v & 0x0F0F0F0F bit_repeat ( ( 1 << 4 ) - 1 , 8 ) ) c -= 4 ; if ( v & 0x33333333 bit_repeat ( ( 1 << 2 ) - 1 , 4 ) ) c -= 2 ; if ( v & 0x55555555 bit_repeat ( ( 1 << 1 ) - 1 , 2 ) ) c -= 1 ;
It is now obvious how this can be expressed in a loop:
// ... for ( int i = 16 ; i != 0 ; i /= 2 ) { unsigned mask = bit_repeat (( 1u << i ) - 1 , i * 2 ); if ( v & mask ) c -= i ; }
has been an invaluable asset in the implementation of the remaining functions
in this proposal (see [Schultke1] for details).
4.2.2. Interleaving bits with bit_expand
A common use case for expansion is interleaving bits. This translates Cartesian coordinates to the index on a Z-order curve. Space filling curves are a popular technique in compression.
unsigned x = 3 ; // 0b011 unsigned y = 5 ; // 0b101 const auto i = bit_expandr ( x , bit_repeat ( 0b10u , 2 )) // i = 0b01'10'11 | bit_expandr ( y , bit_repeat ( 0b01u , 2 ));
4.2.3. UTF-8 decoding with bit_compressr
and
are useful in various I/O-related applications.
They are particularly helpful for dealing with variable-length encodings,
where "data bits" are interrupted by bits which signal continuation of the data.
uint_least32_t x = load32_little_endian ( utf8_data_pointer ); switch ( countl_one ( uint8_t ( x ))) { case 0 : return bit_compressr ( x , 0b01111111 ); case 1 : /* error */ ; case 2 : return bit_compressr ( x , 0b00111111'00011111 ); case 3 : return bit_compressr ( x , 0b00111111'00111111'00001111 ); case 4 : return bit_compressr ( x , 0b00111111'00111111'00111111'00000111 ); }
4.2.4. Other operations based on bit_compress
and bit_expand
Many operations can be built on top of
and
.
However, direct hardware support is often needed for the proposed functions to efficiently
implement them.
Even without such support, they can be canonalized into a faster form.
The point is that
and
allow you to express these operations.
// x & 0xf bit_expandr ( x , 0xf ) bit_compressr ( x , 0xf ) // (x & 0xf) << 4 bit_expandr ( x , 0xf0 ) // (x >> 4) & 0xf bit_compressr ( x , 0xf0 ) // Clear the least significant one-bit of x. x ^= bit_expandr ( 1 , x ) // Clear the nth least significant one-bit of x. x ^= bit_expandr ( 1 << n , x ) // Clear the n least significant one-bits of x. x ^= bit_expandr (( 1 << n ) - 1 , x ) // (x >> n) & 1 bit_compressr ( x , 1 << n ) // Get the least significant bit of x. bit_compressr ( x , x ) & 1 // Get the nth least significant bit of x. ( bit_compressr ( x , x ) >> n ) & 1 // popcount(x) countr_one ( bit_compressr ( -1u , x )) countr_one ( bit_compressr ( x , x ))
4.3. Hardware support
Operation | x86_64 | ARM | RISC-V |
---|---|---|---|
| AVX512_BITALG, ( )
| SVE2
| Zbb+ Zbkb,( Zbb)
|
| AVX512_BITALG
| ||
| BMI2
| SVE2
| ( V)
|
| BMI2
| SVE2
| ( + V)
|
| ( BMI2+ ABM)
| SVE2, ( SVE2+ SVE)
| ( V)
|
| ( BMI2+ ABM)
| ( SVE2+ SVE)
| ( + V)
|
(Parenthesized) entries signal that the instruction does not directly implement the function, but greatly assists in its implementation.
Note: The AVX-512
instruction can implement any bit permutation in the mathematical sense, and more.
Note: The RISC-V
instruction can also be found under the name
.
There appears to have been a name change in 2022.
4.3.1. Support for bit_reverse
This operation is directly implemented in ARM through
.
Any architecture with support for
(such as x86 with
)
also supports bit-reversal in part. [Warren1] presents an O(log n) algorithm which operates by swapping lower and upper
, ...,
,
,
,
, and
bits in parallel.
Byte-swapping implements these individual swaps up to 8 bits, requiring only three more
parallel swaps in software:
// assuming a byte is an octet of bits, and assuming the width of x is a power of two x = byteswap ( x ); x = ( x & 0x0F0F0F0F ) << 4 | ( x & 0xF0F0F0F0 ) >> 4 ; // ... quartets of bits x = ( x & 0x33333333 ) << 2 | ( x & 0xCCCCCCCC ) >> 2 ; // ... pairs of bits x = ( x & 0x55555555 ) << 1 | ( x & 0xAAAAAAAA ) >> 1 ; // ... individual bits
It is worth noting that clang provides a cross-platform family of intrinsics.
uses byte-swapping or bit-reversal instructions if possible.
Such an intrinsic has been requested from GCC users a number of times in [GNU1].
4.3.2. Support for bit_repeat
Firstly, note that for the pattern length, there are only up to
relevant cases,
where
is the operand width in bits.
It is feasible to
between these cases, where the length is constant in each case.
While the AVX-512 instruction
can be used for all cases, this is not the ideal
solution for most cases.
For very low or very great lengths, a naive solution is sufficient (and even optimal),
where we simply use
and
to duplicate the pattern.
What actually matters is how often the pattern is repeated, i.e.
.
Specific cases like
,
can be implemented using
permutation/duplication/gather/broadcast instructions.
However, note that the primary use of
is to express repeating bit patterns without magic numbers,
i.e. to improve code quality.
Often, both the pattern and the length are known at compile-time, making hardware support less relevant.
Even without hardware support, the reference implementation [Schultke1] requires only O(log N)
fundamental bitwise operations.
4.3.3. Support for bit_compress
and bit_expand
Starting with Haswell (2013), Intel CPUs directly implement compression and expansion with
with
and
respectively.
AMD CPUs starting with Zen 3 implement
and
with 3 cycles
latency, like Intel.
Zen 2 and older implement
/
in microcode, with 18 cycles latency.
ARM also supports these operations directly with
,
, and
in the SVE2 instruction set. [Warren1] mentions other older architectures with direct support.
Overall, only recent instruction set extensions offer this functionality directly. However, when the mask is a constant, many different strategies for hardware acceleration open up. For example
-
interleaving bits can be assisted (though not fully implemented) using ARM
/zip1 zip2 -
other permutations can be assisted by ARM
andtbl tbx
As [Warren1] explains, the cost of computing
and
in software is
dramatically lower for a constant mask.
For specific known masks (such as a mask with a single one-bit), the cost is extremely low.
All in all, there are multiple factors that strongly suggest a standard library implementation:
-
The strategy for computing
andbit_compress
depends greatly on the architecture and on information about the mask, even if the exact mask isn’t known.bit_expand -
,tzcnt
(see [Schultke1] or [Zp7] for specifics), andclmul
are helpful.popcnt
-
-
ISO C++ does not offer a mechanism through which all of this information can be utilized. Namely, it is not possible to change strategy based on information that only becomes available during optimization passes. Compiler extensions such as
offer a workaround.__builtin_constant_p -
ISO C++ does not offer a mechanism through which function implementations can be chosen based on the surrounding context. In a situation where multiple
calls with the same maskbit_compress
are performed, it is significantly faster to pre-compute information based on the mask once, and utilize it in subsequent calls. The same technique can be used to accelerate integer division for multiple divisions with the same divisor.m
Bullets 2. and 3. suggest that
and
benefit from being
implemented directly in the compiler via intrinsic,
even if hardware does not directly implement these operations.
Even with a complete lack of hardware support, a software implementation of
in [Schultke1] emits essentially optimal code if the mask is known.
unsigned bit_compress_known_mask ( unsigned x ) { return cxx26bp :: bit_compressr ( x , 0xf0f0u ); }
Clang 18 emits the following (and GCC virtually the same); see [CompilerExplorer1]:
bit_compress_known_mask ( unsigned int ): # bit_compress_known_mask(unsigned int edi) mov eax , edi # { unsigned int eax = edi; shr eax , 4 # eax >>= 4; and eax , 15 # eax &= 0xf; shr edi , 8 # edi >>= 8; and edi , 240 # edi &= 0xf0; or eax , edi # eax |= edi; ret # return eax; }
Knowing the implementation of
, this feels like dark magic.
This is an optimizing compiler at its finest hour.
5. Impact on existing code
This proposal is purely a standard library expansion. No existing code is affected.
6. Design considerations
The design choices in this paper are based on [P0553R4], wherever applicable.
6.1. Signature of bit_repeat
follows the "use
if possible" rule mentioned in [P0553R4].
Other functions such as
and
also accept an
.
It is also the only function not marked
.
It does not throw, but it is not
due to its narrow contract (Lakos rule).
6.2. Why the names compress and expand?
The use of
and
is consistent with the mask-based permutations for
proposed in [P2664R6].
Furthermore, there are multiple synonymous sets of terminology:
-
anddeposit extract -
andcompress expand -
andgather scatter
I have decided against
and
because of its ambiguity:
Taking the input
and densely packing it to
could be described as:
Extract each second bit from
and densely deposit it into the result.
0b10101
Similarly, taking the input
and expanding it into
could be described as:
Extract each bit from
and sparsely deposit it in the result.
0b111
Both operations can be described with
and
terminology,
making it virtually useless for keeping the operations apart.
and
are simply the least common way to describe these operations, which makes
and
the best candidates.
Further design choices are consistent with [P0553R4].
The abbreviations
and
for left/right are consistent with
/
.
The prefix
is consistent with
and
.
6.3. Why the lack of generalization?
6.3.1. No generalized bit_compress
and bit_expand
[N3864] originally suggested much more general versions of compression and expansion, which support:
-
performing the operation not just on the whole operand, but on "words" of it, in parallel
-
performing the operation not just on bits, but on arbitrarily sized groups of bits
I don’t propose this generality for the following reasons:
-
The utility functions in
are not meant to provide a full bitwise manipulation library, but fundamental operations, especially those that can be accelerated in hardware while still having reasonable software fallbacks.< bit > -
These more general form can be built on top of the proposed hardware-oriented versions. This can be done with relative ease and with little to no overhead.
-
The generality falsely suggests hardware support for all forms, despite the function only being accelerated for specific inputs. This makes the performance characteristics unpredictable.
-
The proposed functions have wide contracts and can be
(Lakos rule). Adding additional parameters would likely require a narrow contract.noexcept -
Generality adds complexity to the standardization process, to implementation, and from the perspective of language users. It is unclear whether this added complexity is worth it in this case.
6.3.2. No generalized bit_reverse
Bit reversal can also be generalized to work with any group size:
template < typename T > T bit_reverse ( T x , int group_size = 1 ) noexcept ( false);
With this generalization,
on conventional platforms
is equivalent to
.
However, this general version is much less used, not as consistently supported in
hardware, and has a narrow contract.
must be a nonzero factor of
for this operation to be meaningful.
Therefore, a generalized bit-reversal is not proposed in this paper.
6.4. Why does the signature of bit_compress
require two same T
s?
Initially, I went through a number of different signatures.
template < unsigned_integral T , unsigned_integral X > constexpr T bit_compressr ( X x , T m ) noexcept ;
This signature is quite clever because the result never has more bits than the mask
.
However, it is surprising that the mask plays such a significant role here.
Furthermore, I’ve realized that while the result never has more bits than
,
must still deposit bits starting with the most significant bits of the result.
This suggests the following:
template < unsigned_integral T , unsigned_integral X > constexpr common_type_t < T , X > bit_compressr ( X x , T m ) noexcept ;
However, it is not trivial to juggle bits between the left and right versions of
and
.
The behavior is also not intuitive when a zero-extension occurs.
For wider
, the mask is always zero-extended to the left, which makes the left and right
versions slightly asymmetrical.
Since this proposal includes low-level bit operations, it is reasonable and safe to require
the user to be explicit.
A call to
or
with two different types is likely a design flaw or bug.
Therefore, I have settled on the very simple signature:
template < unsigned_integral T > constexpr T bit_compressr ( T x , T m ) noexcept ;
7. Possible implementation
7.1. Reference implementation
All proposed functions have been implemented in [Schultke1]. This reference implementation is compatible with all three major compilers, and leverages hardware support from ARM and x86_64 where possible.
7.2. Other implementations
[Warren1] presents algorithms which are the basis for [Schultke1].
-
An O(log n)
bit_reverse -
An O(log2 n)
andbit_compress bit_expand -
can be O(log n) with hardware support for carry-less multiplication aka. GF(2) polynomial multiplication
-
[Zp7] offers fast software implementations for
and
, optimized for x86_64.
[StackOverflow1] contains discussion of various possible software implementations
of
and
.
8. Proposed wording
The wording is relative to [N5001].
In subclause [version.syn], paragraph 2, update the synopsis as follows:
#define __cpp_lib_bitops 201907L 20XXXXL // freestanding, also in <bit>
In subclause [bit.syn], update the synopsis as follows:
[ …] template < class T > constexpr int popcount ( T x ) noexcept ; // [bit.permute], permutations template < class T > constexpr T bit_reverse ( T x ) noexcept ; template < class T > constexpr T bit_repeat ( T x , int l ); template < class T > constexpr T bit_compressr ( T x , T m ) noexcept ; template < class T > constexpr T bit_expandr ( T x , T m ) noexcept ; template < class T > constexpr T bit_compressl ( T x , T m ) noexcept ; template < class T > constexpr T bit_expandl ( T x , T m ) noexcept ; // [bit.endian], endian [ …]
In subclause [bit], add the following subclause after [bit.count]:
X.X.X Permutation [bit.permute]
1 In the following descriptions, let N denote the value of
, and
let αn denote the value of the n-th least significant bit
in the base-2 representation of α, so that α equals
template < class T > constexpr T bit_reverse ( T x ) noexcept ;
2 Constraints:
is an unsigned integer type ([basic.fundamental]).
3 Returns: ,
where is given by [FORMULA 1], and is
.
bit_reverse ( bit_reverse ( x ))
equals x
. — end note] template < class T > constexpr T bit_repeat ( T x , int l );
4 Constraints:
is an unsigned integer type ([basic.fundamental]).
5 Preconditions:
is greater than zero.
6 Returns:
where is given by [FORMULA 2], is
, and is
.
7 Throws: Nothing.
template < class T > constexpr T bit_compressr ( T x , T m ) noexcept ;
8 Constraints:
is an unsigned integer type ([basic.fundamental]).
9 Returns: where is given by [FORMULA 3], is
, and is
.
template < class T > constexpr T bit_expandr ( T x , T m ) noexcept ;
10 Constraints:
is an unsigned integer type ([basic.fundamental]).
11 Returns: where is given by [FORMULA 4], is
, and is
.
x & ~ m
equals zero, then bit_expandr ( bit_compressr ( x , m ), m )
equals x
.
If x >> popcount ( m )
equals zero, then bit_compressr ( bit_expandr ( x , m ), m )
equals x
.
— end note] template < class T > constexpr T bit_compressl ( T x , T m ) noexcept ;
12 Constraints:
is an unsigned integer type ([basic.fundamental]).
13 Returns:
.
template < class T > constexpr T bit_expandl ( T x , T m ) noexcept ;
14 Constraints:
is an unsigned integer type ([basic.fundamental]).
15 Returns:
.
In the subclause above, substitute the symbolic placeholders [FORMULA N] for formula numbers, in the style of subclause [c.math].
Note: I would have preferred a less mathematical approach to defining these functions.
However, it is too difficult to precisely define
and
without
visual aids, pseudo-code, or other crutches.
9. Acknowledgements
I greatly appreciate the assistance of Stack Overflow users in assisting me with research for this proposal. I especially thank Peter Cordes for his tireless and selfless dedication to sharing knowledge.
I also thank various Discord users from Together C & C++ and #include<C++> who have reviewed drafts of this proposal and shared their thoughts.