1. Revision history
1.1. Changes since R1
The paper was seen in Tokyo by SG18 with positive reception, except for certain functions; these have been removed to strengthen consensus.
1.2. Changes since R0

Expand § 4.3 Hardware support, taking more instructions into account, including AVX512.

Minor editorial fixes.
2. Introduction
The C++ bit manipulation library in <bit> is an invaluable abstraction from hardware operations.
Its functions help the programmer avoid the use of intrinsic functions or inline assembly.
However, there are still a few operations which are nontrivial to implement in software and have widely available hardware support. Therefore, I propose to expand the bit manipulation library with multiple bit permutation functions described below. Even without hardware support, these functions provide great utility and/or help the developer write more expressive code.
3. Proposed features
I propose to add the following functions to the bit manipulation library (<bit>).
Note: All constraints are exposition-only. The function bodies contain a naive implementation that merely illustrates the behavior.
Note: See § 4.3 Hardware support for the corresponding hardware instructions.
3.1. bit_reverse
template<unsigned_integral T>
constexpr T bit_reverse(T x) noexcept {
    T result = 0;
    for (int i = 0; i < numeric_limits<T>::digits; ++i) {
        result <<= 1;
        result |= x & 1;
        x >>= 1;
    }
    return result;
}
Reverses the bits of x so that the least significant bit becomes the most significant.
3.2. bit_repeat
template<unsigned_integral T>
constexpr T bit_repeat(T x, int length) noexcept(false) {
    T result = 0;
    for (int i = 0; i < numeric_limits<T>::digits; ++i) {
        result |= ((x >> (i % length)) & 1) << i;
    }
    return result;
}
Repeats a pattern stored in the least significant length bits of x, as many times as fits into the result.
3.3. bit_compress
template<unsigned_integral T>
constexpr T bit_compressr(T x, T m) noexcept {
    T result = 0;
    for (int i = 0, j = 0; i < numeric_limits<T>::digits; ++i) {
        bool mask_bit = (m >> i) & 1;
        result |= (mask_bit & (x >> i)) << j;
        j += mask_bit;
    }
    return result;
}
For each one-bit in m, the corresponding bit in x is taken and packed contiguously into the result, starting with the least significant result bit.
template<unsigned_integral T>
constexpr T bit_compressl(T x, T m) noexcept {
    return bit_reverse(bit_compressr(bit_reverse(x), bit_reverse(m)));
}
For each one-bit in m, the corresponding bit in x is taken and packed contiguously into the result, starting with the most significant result bit.
x:                     m:
+---+---+---+---+      +---+---+---+---+
| a | b | c | d |      | 0 | 1 | 0 | 1 |
+---+---+---+---+      +---+---+---+---+

bit_compressl(x, m):   bit_compressr(x, m):
+---+---+---+---+      +---+---+---+---+
| b | d | 0 | 0 |      | 0 | 0 | b | d |
+---+---+---+---+      +---+---+---+---+
Note: Intuitively, m is a "mask" that determines which bits of x are packed.
3.4. bit_expand
template<unsigned_integral T>
constexpr T bit_expandr(T x, T m) noexcept {
    T result = 0;
    for (int i = 0, j = 0; i < numeric_limits<T>::digits; ++i) {
        bool mask_bit = (m >> i) & 1;
        result |= (mask_bit & (x >> j)) << i;
        j += mask_bit;
    }
    return result;
}
For each one-bit in m, a bit of x is taken, starting with the least significant bit, and shifted into the position of that one-bit in the result.
template<unsigned_integral T>
constexpr T bit_expandl(T x, T m) noexcept {
    return bit_reverse(bit_expandr(bit_reverse(x), bit_reverse(m)));
}
For each one-bit in m, a bit of x is taken, starting with the most significant bit, and shifted into the position of that one-bit in the result.
x:                     m:
+---+---+---+---+      +---+---+---+---+
| a | b | c | d |      | 0 | 1 | 0 | 1 |
+---+---+---+---+      +---+---+---+---+

bit_expandl(x, m):     bit_expandr(x, m):
+---+---+---+---+      +---+---+---+---+
| 0 | a | 0 | b |      | 0 | c | 0 | d |
+---+---+---+---+      +---+---+---+---+
Note: Intuitively, m is a "mask" that determines where the bits of x are unpacked to.
4. Motivation and scope
Bit-reversal, repetition, compression, and expansion are fundamental operations that meet multiple criteria which make them suitable for standardization:

They are common and useful operations.

They can be used to implement numerous other operations.

At least on some architectures, they have direct hardware support.

They are nontrivial to implement efficiently in software.

For known masks, numerous optimization opportunities are available.
4.1. Applications
4.1.1. Applications of bit_reverse
Bit-reversal is a common operation with uses in:

Cryptography: scrambling bits

Networking: as part of cyclic redundancy check computation

Graphics: mirroring of images with one bit per pixel

Random number generation: reversal can counteract the low entropy of low-order bits, such as in the case of linear congruential generators

Digital signal processing: for radix-2 Cooley-Tukey FFT algorithms

Code obfuscation: security by obscurity
4.1.2. Applications of bit_repeat
The generation of recurring bit patterns is such a fundamental operation that it’s hard to tie to any specific domain, but here are some use cases:

Debugging: using obvious recurring bit patterns like 0xcccc... to identify "garbage memory"

Bit manipulation: generating alternating sequences of 0 and 1 for various algorithms

Eliminating integer division: for x >> (i % N) with very small N, bit_repeat(x, N) >> i can be used instead

Testing: recurring bit patterns can make for good test cases when implementing numerics

Error detection: known bit patterns can be introduced to spot failed transmission
4.1.3. Applications of bit_compress
and bit_expand
Compression and expansion are also common, with uses in:

Space-filling curves: Morton/Z-order and Hilbert curves

Input/output: especially for variable-length encodings, such as UTF-8 (§ 4.2.3 UTF-8 decoding with bit_compressr)

Chess engines: for bitboards; see [ChessProgramming1]

Genomics: according to [ARM1]
A GitHub code search for the intrinsic wrappers of the x86 pdep and pext instructions reveals ~1300 files which use them.
4.2. Motivating examples
4.2.1. Implementing countr_zero
with bit_repeat
[Anderson1] contains a vast number of algorithms, many of which involve masks of alternating 0s and 1s.
When written as "magic numbers" in code, these masks can make it quite hard to understand the overall pattern and to generalize these algorithms.
bit_repeat allows one to be more expressive here:
unsigned int v;      // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -v;
if (v) c--;
if (v & bit_repeat((1 << 16) - 1, 32)) c -= 16; // mask: 0x0000FFFF
if (v & bit_repeat((1 <<  8) - 1, 16)) c -=  8; // mask: 0x00FF00FF
if (v & bit_repeat((1 <<  4) - 1,  8)) c -=  4; // mask: 0x0F0F0F0F
if (v & bit_repeat((1 <<  2) - 1,  4)) c -=  2; // mask: 0x33333333
if (v & bit_repeat((1 <<  1) - 1,  2)) c -=  1; // mask: 0x55555555
It is now obvious how this can be expressed in a loop:
// ...
for (int i = 16; i != 0; i /= 2) {
    unsigned mask = bit_repeat((1u << i) - 1, i * 2);
    if (v & mask) c -= i;
}
bit_repeat has been an invaluable asset in the implementation of the remaining functions in this proposal (see [Schultke1] for details).
4.2.2. Interleaving bits with bit_expand
A common use case for expansion is interleaving bits. This translates Cartesian coordinates to the index on a Zorder curve. Space filling curves are a popular technique in compression.
unsigned x = 3; // 0b011
unsigned y = 5; // 0b101
const auto i = bit_expandr(x, bit_repeat(0b10u, 2))  // i = 0b01'10'11
             | bit_expandr(y, bit_repeat(0b01u, 2));
4.2.3. UTF-8 decoding with bit_compressr
bit_compressr and bit_expandr are useful in various I/O-related applications.
They are particularly helpful for dealing with variable-length encodings, where "data bits" are interrupted by bits which signal continuation of the data.
uint_least32_t x = load32_little_endian(utf8_data_pointer);
switch (countl_one(uint8_t(x))) {
case 0: return bit_compressr(x, 0b01111111);
case 1: /* error */;
case 2: return bit_compressr(x, 0b00111111'00011111);
case 3: return bit_compressr(x, 0b00111111'00111111'00001111);
case 4: return bit_compressr(x, 0b00111111'00111111'00111111'00000111);
}
4.2.4. Other operations based on bit_compress
and bit_expand
Many operations can be built on top of bit_compress and bit_expand.
However, direct hardware support is often needed for the proposed functions to implement them efficiently.
Even without such support, they can be canonicalized into a faster form.
The point is that bit_compress and bit_expand allow you to express these operations.
// x & 0xf
bit_expandr(x, 0xf)
bit_compressr(x, 0xf)

// (x & 0xf) << 4
bit_expandr(x, 0xf0)

// (x >> 4) & 0xf
bit_compressr(x, 0xf0)

// Clear the least significant one-bit of x.
x ^= bit_expandr(1, x)

// Clear the n-th least significant one-bit of x.
x ^= bit_expandr(1 << n, x)

// Clear the n least significant one-bits of x.
x ^= bit_expandr((1 << n) - 1, x)

// (x >> n) & 1
bit_compressr(x, 1 << n)

// Get the least significant bit of x.
bit_compressr(x, x) & 1

// Get the n-th least significant bit of x.
(bit_compressr(x, x) >> n) & 1

// popcount(x)
countr_one(bit_compressr(-1u, x))
countr_one(bit_compressr(x, x))
4.3. Hardware support
Operation     | x86_64                                | ARM                                     | RISC-V
--------------|---------------------------------------|-----------------------------------------|----------------------------------------
bit_reverse   | vpshufbitqmb^{AVX512_BITALG}, (bswap) | rbit                                    | rev8^{Zbb} + brev8^{Zbkb}, (rev8^{Zbb})
bit_repeat    | vpshufbitqmb^{AVX512_BITALG}          |                                         |
bit_compressr | pext^{BMI2}                           | bext^{SVE2}                             | (vcompress^{V})
bit_expandr   | pdep^{BMI2}                           | bdep^{SVE2}                             | (viota + vrgather^{V})
bit_compressl | (pext^{BMI2} + popcnt^{ABM})          | bgrp^{SVE2}, (bext^{SVE2} + rbit^{SVE}) | (vcompress^{V})
bit_expandl   | (pdep^{BMI2} + popcnt^{ABM})          | (bdep^{SVE2} + rbit^{SVE})              | (viota + vrgather^{V})
(Parenthesized) entries signal that the instruction does not directly implement the function, but greatly assists in its implementation.
Note: The AVX512 vpshufbitqmb instruction can implement any bit permutation in the mathematical sense, and more.
Note: The RISC-V brev8 instruction can also be found under the name rev.b.
There appears to have been a name change in 2022.
4.3.1. Support for bit_reverse
This operation is directly implemented in ARM through rbit.
Any architecture with support for byte-swapping (such as x86 with bswap) also supports bit-reversal in part.
[Warren1] presents an O(log n) algorithm which operates by swapping the lower and upper halves of progressively smaller groups (16 bits, 8 bits, ..., down to individual bits) in parallel.
Byte-swapping implements these individual swaps up to 8 bits, requiring only three more parallel swaps in software:
// assuming a byte is an octet of bits, and assuming the width of x is a power of two
x = byteswap(x);
x = (x & 0x0F0F0F0F) << 4 | (x & 0xF0F0F0F0) >> 4; // ... quartets of bits
x = (x & 0x33333333) << 2 | (x & 0xCCCCCCCC) >> 2; // ... pairs of bits
x = (x & 0x55555555) << 1 | (x & 0xAAAAAAAA) >> 1; // ... individual bits
It is worth noting that Clang provides a cross-platform family of __builtin_bitreverse intrinsics, which use byte-swapping or bit-reversal instructions if possible.
Such an intrinsic has been requested by GCC users a number of times in [GNU1].
4.3.2. Support for bit_repeat
Firstly, note that for the pattern length, there are only up to N relevant cases, where N is the operand width in bits.
It is feasible to switch between these cases, where the length is constant in each case.
While the AVX512 instruction vpshufbitqmb can be used for all cases, this is not the ideal solution for most cases.
For very low or very great lengths, a naive solution is sufficient (and even optimal), where we simply use shifts and bitwise OR to duplicate the pattern.
What actually matters is how often the pattern is repeated, i.e. N / length.
Specific cases, such as repeating groups of 8 or 16 bits, can be implemented using permutation/duplication/gather/broadcast instructions.
However, note that the primary use of bit_repeat is to express repeating bit patterns without magic numbers, i.e. to improve code quality.
Often, both the pattern and the length are known at compile time, making hardware support less relevant.
Even without hardware support, the reference implementation [Schultke1] requires only O(log N)
fundamental bitwise operations.
4.3.3. Support for bit_compress
and bit_expand
Starting with Haswell (2013), Intel CPUs directly implement compression and expansion with pext and pdep respectively.
AMD CPUs starting with Zen 3 implement pext and pdep with 3 cycles latency, like Intel.
Zen 2 and older implement pext/pdep in microcode, with 18 cycles latency.
ARM also supports these operations directly with bext, bdep, and bgrp in the SVE2 instruction set.
[Warren1] mentions other older architectures with direct support.
Overall, only recent instruction set extensions offer this functionality directly.
However, when the mask is a constant, many different strategies for hardware acceleration open up.
For example:

interleaving bits can be assisted (though not fully implemented) using ARM zip1/zip2

other permutations can be assisted by ARM tbl and tbx
As [Warren1] explains, the cost of computing bit_compress and bit_expand in software is dramatically lower for a constant mask.
For specific known masks (such as a mask with a single one-bit), the cost is extremely low.
All in all, there are multiple factors that strongly suggest a standard library implementation:

1. The strategy for computing bit_compress and bit_expand depends greatly on the architecture and on information about the mask, even if the exact mask isn’t known. Instructions such as tzcnt, clmul (see [Schultke1] or [Zp7] for specifics), and popcnt are helpful.

2. ISO C++ does not offer a mechanism through which all of this information can be utilized. Namely, it is not possible to change strategy based on information that only becomes available during optimization passes. Compiler extensions such as __builtin_constant_p offer a workaround.

3. ISO C++ does not offer a mechanism through which function implementations can be chosen based on the surrounding context. In a situation where multiple bit_compress calls with the same mask m are performed, it is significantly faster to precompute information based on the mask once, and utilize it in subsequent calls. The same technique can be used to accelerate integer division for multiple divisions with the same divisor.
Bullets 2. and 3. suggest that bit_compress and bit_expand benefit from being implemented directly in the compiler via intrinsics, even if hardware does not directly implement these operations.
Even with a complete lack of hardware support, a software implementation of bit_compressr in [Schultke1] emits essentially optimal code if the mask is known.
unsigned bit_compress_known_mask(unsigned x) {
    return cxx26bp::bit_compressr(x, 0xf0f0u);
}
Clang 18 emits the following (and GCC virtually the same); see [CompilerExplorer1]:
bit_compress_known_mask(unsigned int):  # bit_compress_known_mask(unsigned int edi)
        mov eax, edi                    # { unsigned int eax = edi;
        shr eax, 4                      #   eax >>= 4;
        and eax, 15                     #   eax &= 0xf;
        shr edi, 8                      #   edi >>= 8;
        and edi, 240                    #   edi &= 0xf0;
        or  eax, edi                    #   eax |= edi;
        ret                             #   return eax; }
Knowing the implementation of bit_compressr, this feels like dark magic.
This is an optimizing compiler at its finest hour.
5. Impact on existing code
This proposal is purely a standard library expansion. No existing code is affected.
6. Design considerations
The design choices in this paper are based on [P0553R4], wherever applicable.
6.1. Signature of bit_repeat
bit_repeat follows the "use int if possible" rule mentioned in [P0553R4].
Other functions such as rotl and rotr also accept an int.
It is also the only proposed function not marked noexcept.
It does not throw, but it is not noexcept due to its narrow contract (Lakos rule).
6.2. Why the names compress and expand?
The use of compress and expand is consistent with the mask-based permutations for simd proposed in [P2664R6].
Furthermore, there are multiple synonymous sets of terminology:

deposit and extract

compress and expand

gather and scatter
I have decided against deposit and extract because of their ambiguity:
Taking the input 0b10101 and densely packing it to 0b111 could be described as:

Extract each second bit from 0b10101 and densely deposit it into the result.

Similarly, taking the input 0b111 and expanding it into 0b10101 could be described as:

Extract each bit from 0b111 and sparsely deposit it in the result.
Both operations can be described with deposit and extract terminology, making it virtually useless for keeping the operations apart.
gather and scatter are simply the least common way to describe these operations, which makes compress and expand the best candidates.
Further design choices are consistent with [P0553R4].
The abbreviations l and r for left/right are consistent with rotl/rotr.
The bit_ prefix is consistent with bit_cast and bit_width.
6.3. Why the lack of generalization?
6.3.1. No generalized bit_compress
and bit_expand
[N3864] originally suggested much more general versions of compression and expansion, which support:

performing the operation not just on the whole operand, but on "words" of it, in parallel

performing the operation not just on bits, but on arbitrarily sized groups of bits
I don’t propose this generality for the following reasons:

The utility functions in <bit> are not meant to provide a full bitwise manipulation library, but fundamental operations, especially those that can be accelerated in hardware while still having reasonable software fallbacks.

These more general forms can be built on top of the proposed hardware-oriented versions. This can be done with relative ease and with little to no overhead.

The generality falsely suggests hardware support for all forms, despite the function only being accelerated for specific inputs. This makes the performance characteristics unpredictable.

The proposed functions have wide contracts and can be noexcept (Lakos rule). Adding additional parameters would likely require a narrow contract.

Generality adds complexity to the standardization process, to implementation, and from the perspective of language users. It is unclear whether this added complexity is worth it in this case.
6.3.2. No generalized bit_reverse
Bit reversal can also be generalized to work with any group size:
template<typename T>
T bit_reverse(T x, int group_size = 1) noexcept(false);
With this generalization, bit_reverse(x, 8) on conventional platforms is equivalent to byteswap(x).
However, this general version is much less used, not as consistently supported in hardware, and has a narrow contract:
group_size must be a nonzero factor of numeric_limits<T>::digits for this operation to be meaningful.
Therefore, a generalized bitreversal is not proposed in this paper.
6.4. Why does the signature of bit_compress require two parameters of the same type T?
Initially, I went through a number of different signatures.
template<unsigned_integral T, unsigned_integral X>
constexpr T bit_compressr(X x, T m) noexcept;
This signature is quite clever because the result never has more bits than the mask m.
However, it is surprising that the mask plays such a significant role here.
Furthermore, I’ve realized that while the result never has more bits than m, bit_compressl must still deposit bits starting with the most significant bits of the result.
This suggests the following:
template<unsigned_integral T, unsigned_integral X>
constexpr common_type_t<T, X> bit_compressr(X x, T m) noexcept;
However, it is not trivial to juggle bits between the left and right versions of bit_compress and bit_expand.
The behavior is also not intuitive when a zero-extension occurs.
For a wider x, the mask is always zero-extended to the left, which makes the left and right versions slightly asymmetrical.
Since this proposal includes low-level bit operations, it is reasonable and safe to require the user to be explicit.
A call to bit_compress or bit_expand with two different types is likely a design flaw or bug.
Therefore, I have settled on the very simple signature:
template<unsigned_integral T>
constexpr T bit_compressr(T x, T m) noexcept;
7. Possible implementation
7.1. Reference implementation
All proposed functions have been implemented in [Schultke1]. This reference implementation is compatible with all three major compilers, and leverages hardware support from ARM and x86_64 where possible.
7.2. Other implementations
[Warren1] presents algorithms which are the basis for [Schultke1]:

an O(log n) bit_reverse

an O(log^2 n) bit_compress and bit_expand

bit_compress can be O(log n) with hardware support for carry-less multiplication, aka GF(2) polynomial multiplication

[Zp7] offers fast software implementations for bit_compress and bit_expand, optimized for x86_64.
[StackOverflow1] contains discussion of various possible software implementations of bit_compress and bit_expand.
8. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N4917].
8.1. Featuretesting
In subclause 17.3.2 [version.syn] paragraph 2, update the synopsis as follows:
- #define __cpp_lib_bitops 201907L // freestanding, also in <bit>
+ #define __cpp_lib_bitops 20XXXXL // freestanding, also in <bit>
8.2. Header synopsis
In subclause 22.15.2 [bit.syn], update the synopsis as follows:
+ // 22.15.X, permutations
+ template<class T>
+   constexpr T bit_reverse(T x) noexcept;
+ template<class T>
+   constexpr T bit_repeat(T x, int l);
+ template<class T>
+   constexpr T bit_compressr(T x, T m) noexcept;
+ template<class T>
+   constexpr T bit_expandr(T x, T m) noexcept;
+ template<class T>
+   constexpr T bit_compressl(T x, T m) noexcept;
+ template<class T>
+   constexpr T bit_expandl(T x, T m) noexcept;
8.3. New subclause
In subclause 22.15 [bit], add the following subclause:
22.15.X Permutation [bit.permute]
1 In the following descriptions, let N denote numeric_limits<T>::digits.
Let α_n denote the n-th least significant bit of α, so that α equals α_(N-1) * 2^(N-1) + ... + α_1 * 2 + α_0.

template<class T>
  constexpr T bit_reverse(T x) noexcept;

2 Constraints: T is an unsigned integer type ([basic.fundamental]).

3 Returns: The value y for which y_n equals x_(N-1-n) for each n in [0, N).
[Note: bit_reverse(bit_reverse(x)) equals x. — end note]

template<class T>
  constexpr T bit_repeat(T x, int l);

4 Constraints: T is an unsigned integer type ([basic.fundamental]).

5 Preconditions: l is greater than zero.

6 Returns: The value y for which y_n equals x_(n mod l) for each n in [0, N).

7 Throws: Nothing.

template<class T>
  constexpr T bit_compressr(T x, T m) noexcept;

8 Constraints: T is an unsigned integer type ([basic.fundamental]).

9 Returns: The value y for which y_j equals x_(p_j) for each j in [0, popcount(m)), where p_j is the position of the j-th least significant one-bit of m, counting from zero; all other bits of y are zero.

template<class T>
  constexpr T bit_expandr(T x, T m) noexcept;

10 Constraints: T is an unsigned integer type ([basic.fundamental]).

11 Returns: The value y for which y_(p_j) equals x_j for each j in [0, popcount(m)), where p_j is the position of the j-th least significant one-bit of m, counting from zero; all other bits of y are zero.
[Note: If x & ~m equals zero, then bit_expandr(bit_compressr(x, m), m) equals x. If x >> popcount(m) equals zero, then bit_compressr(bit_expandr(x, m), m) equals x. — end note]

template<class T>
  constexpr T bit_compressl(T x, T m) noexcept;

12 Constraints: T is an unsigned integer type ([basic.fundamental]).

13 Returns: bit_reverse(bit_compressr(bit_reverse(x), bit_reverse(m))).

template<class T>
  constexpr T bit_expandl(T x, T m) noexcept;

14 Constraints: T is an unsigned integer type ([basic.fundamental]).

15 Returns: bit_reverse(bit_expandr(bit_reverse(x), bit_reverse(m))).
Note: I would have preferred a less mathematical approach to defining these functions.
However, it is too difficult to precisely define bit_compress and bit_expand without visual aids, pseudocode, or other crutches.
9. Acknowledgements
I greatly appreciate the help of Stack Overflow users with the research for this proposal. I especially thank Peter Cordes for his tireless and selfless dedication to sharing knowledge.
I also thank various Discord users from Together C & C++ and #include<C++> who have reviewed drafts of this proposal and shared their thoughts.