Deprecate implicit conversions
between char8_t and char16_t or char32_t
- Document number: P3695R3
- Date: 2025-11-20
- Audience: EWG
- Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Author: Jan Schultke <janschultke@gmail.com>
- GitHub Issue: wg21.link/P3695/github
- Source: github.com/Eisenwave/cpp-proposals/blob/master/src/deprecate-unicode-conversion.cow
Implicit conversions between char8_t and char16_t or char32_t
are bug-prone and thus harmful to the language.
I propose to deprecate them.
Contents
Revision history
Changes since R2
Changes since R1
Changes since R0
Introduction
It's not hypothetical. This really happens.
The underlying problem
Scope
What about "safe" comparisons?
What about char16_t and char32_t ?
What about char ?
What about wchar_t ?
What about conversions with integers?
What comes after deprecation?
Why not make these conversions narrowing?
Impact on existing code
Replacement for deprecated behavior
Implementation experience
Wording
[conv.integral]
[expr.arith.conv]
[expr.static.cast]
[depr.conv.unicode]
Acknowledgements
References
1. Revision history
1.1. Changes since R2
R2 was seen by SG16 again,
and the question of deprecating char8_t ↔ wchar_t
conversions was reconsidered:
Poll 1: P3695R2: Recommend deprecating conversions between char8_t and wchar_t.
- Attendees: 10
- SF: 3, F: 3, N: 3, A: 1, SA: 0
- Consensus.
Consequently, the following changes were made:
- reverted the deprecation of conversions with wchar_t
- improved the wording based on [isocpp-core] reflector feedback
1.2. Changes since R1
R1 of the paper was seen by SG16, with the following poll results:
P3695R1: Recommend deprecating conversions between char and the charN_t types.
- Attendees: 10
- No objection to unanimous dissent.
P3695R1: Recommend deprecating conversions between char8_t and wchar_t.
- Attendees: 10
- No objection to unanimous consent.
P3695R1: Recommend deprecating conversions between char16_t and char32_t.
- Attendees: 10
- SF: 0, F: 0, N: 3, A: 7, SA: 0
- Consensus against.
Consequently, the following changes were made:
- also deprecated conversion between char8_t and wchar_t
- changed title and abstract to reflect this new direction
- rewrote §3.2. What about char16_t and char32_t?
- expanded note on tautology warnings in §2.1. It's not hypothetical. This really happens.
- added §3.7. Why not make these conversions narrowing?
- restructured §6. Wording and added editorial notes
1.3. Changes since R0
- limited deprecation to conversions involving char8_t; see §3.2. What about char16_t and char32_t?
- rebased §6. Wording on [N5014]
2. Introduction
Implicit conversions between char8_t and char32_t invite bugs:
always fails if is a UTF-8 code unit
because it is equivalent to ,
and a UTF-8 code unit cannot have this value.
The assertion succeeds because Ԡ (U+0520) is UTF-8 encoded as 0xD4, 0xA0,
and NBSP is U+00A0,
so the value matches the second UTF-8 code unit of U+0520.
:
Note that the "bad comparison" occurs between two in ,
which demonstrates that implicit conversions in general are bug-prone, not just comparisons.
We obviously don't want to deprecate .
Conversions "the other way" (e.g. char32_t → char8_t)
are obviously bug-prone too because information is lost,
but such bugs can already be caught by all major compilers' warnings,
and they are problematic for the same reason as → ,
not because of anything specific to character types.
The listed bugs are interesting precisely because no information is lost.
2.1. It's not hypothetical. This really happens.
These kinds of bugs are not far-fetched hypotheticals either;
I have written such bugs myself,
and have had them contributed
to my syntax highlighter [µlight],
which makes extensive use of and .
Very early in development, I realized how dangerous these implicit conversions are,
so most functions in the style of have a deleted overload:
,
but technically, can have the values and ,
so it is undetectable.
Using may raise more tautology warnings because if is signed,
it can only hold values up to ,
meaning it never compares equal to, e.g. .
2.2. The underlying problem
The underlying problem is that is .
In general, it is meaningless to compare code units with different encodings.
To be fair, Unicode character types aren't strictly required to store Unicode code units.
However, that is their primary purpose, and the assumption holds true for any Unicode
3. Scope
I propose to deprecate implicit conversions between
char8_t and char16_t or char32_t.
As demonstrated above, these are extremely bug-prone.
Conversions between char16_t and char32_t are not affected.
3.1. What about "safe" comparisons?
In comparisons between code units,
certain ranges of code points yield the expected result.
For example, is
because all Unicode encodings are ASCII-compatible,
so the numeric value of anything in the Basic Latin block (≤ U+007F)
will have the same single-code-unit value in UTF-8, UTF-16, and UTF-32.
However, even those should be deprecated because:
- Keeping these valid would essentially leak implementation details of UTF-8 into the set of implicit conversions in the C++ core language, which seems like unclean design.
- To rely on this "feature", the developer needs to memorize which code points are "safe to use". It is not obvious whether c == U'€' or c == U'$' are always safe (hint: the latter one is), and it's quite likely that someone uses this "feature" accidentally.
- It would make this "feature" (or lack thereof) harder to teach than it needs to be. The rule can be very simple: char8_t and some other character types cannot be converted to one another. Simple rules are easy to teach.
3.2. What about char16_t and char32_t ?
char16_t and char32_t
are used to store a UTF-16 and UTF-32 code unit, respectively.
Following some negative feedback on [ClangWarning],
the proposal no longer seeks to deprecate conversions between char16_t and char32_t.
While these conversions are not guaranteed to be meaningful,
there are no false positives in comparisons of UTF-16 and UTF-32 code units,
and the comparison is quite likely to be correct.
↔
because any code point in [U+0000, U+D7FF] or [U+E000, U+FFFF]
is encoded using a single code unit equivalent to the code point value,
in both UTF-16 and UTF-32.
Other code points are encoded using high surrogates ([
It is possible to have false negatives
when searching for a UTF-32 code unit
outside the Basic Multilingual Plane (BMP) in UTF-16 text.
However, these searches are tautologically false because values
≥ ,
so compilers may catch some of them already.
It is also much less likely that char16_t ↔ char32_t
conversions actually manifest as a bug.
An application that only uses, say, Basic Latin characters and German or Norwegian
umlauts can use char16_t and char32_t interchangeably.
By contrast, mixing char8_t with other Unicode character types will almost
certainly blow up in the user's face if the application processes any kind of non-ASCII text.
Last but not least, UTF-8 is becoming the "default encoding", especially on the web,
while UTF-16 is increasingly becoming a "legacy encoding".
This makes it unattractive to raise warnings for
when the surrounding code may exist mostly for compatibility purposes,
and C++ users are not interested in sinking much time into its maintenance.
Substantially more code may be affected by a char16_t ↔ char32_t
deprecation because both types were introduced in C++11,
unlike char8_t, which was added in C++20.
See also [WikipediaEncodingPopularity]:
Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, overwhelms any benefits UTF-16 could offer. So newer software systems are starting to use UTF-8. The default string primitive used in newer programming languages, such as Go, Julia, Rust and Swift 5, assume UTF-8 encoding. PyPy also uses UTF-8 for its strings, and Python is looking into storing all strings in UTF-8. Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.
In summary, in char16_t ↔ char32_t comparisons,
there are no false positives,
the only false negatives are tautologically false (warnings exist),
bugs are unlikely to manifest
because code points outside the BMP are relatively uncommon,
and if deprecation warnings were raised,
that may happen in low-priority legacy code.
SG16 also had consensus against deprecating char16_t ↔ char32_t
conversions; see §1.2. Changes since R1.
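The behavior of UTF-16 and UTF-32 comparisons described in this section can be checked with a short sketch, assuming C++20. The helper utf16_contains and the chosen code points (U+00E4 and U+1F600) are illustrative assumptions.

```cpp
#include <string_view>

// Hypothetical helper: naively searches UTF-16 text for a code point.
constexpr bool utf16_contains(std::u16string_view text, char32_t code_point) {
    for (char16_t unit : text) {
        if (unit == code_point) {
            return true;
        }
    }
    return false;
}

// Inside the BMP (and outside the surrogate range), the UTF-16 code unit
// equals the code point, so the search gives the expected answer.
static_assert(utf16_contains(u"\u00E4", U'\u00E4'));

// Outside the BMP, the code point is encoded as a surrogate pair.
constexpr std::u16string_view emoji = u"\U0001F600";
static_assert(emoji.size() == 2);
static_assert(emoji[0] == 0xD83D && emoji[1] == 0xDE00);

// False negative: no single 16-bit code unit can equal 0x1F600.
static_assert(!utf16_contains(emoji, U'\U0001F600'));
```

The false negative in the last assertion is the tautologically false kind: a 16-bit value can never reach 0x10000 or above.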
3.3. What about char ?
As recommended by SG16,
I propose to leave conversions involving char intact.
char ↔ char8_t in particular should not be deprecated
because the encoding of both sides
is likely UTF-8, in which case the conversion is obviously safe.
Substantial amounts of code may already rely on this.
Furthermore, deprecating any conversion from char
to other character types is a bad idea,
and was unanimously recommended against by SG16.
In some code bases, char is used purely for ASCII characters and strings.
In such code bases, comparing char to any other character type
is always correct,
assuming that an ASCII-compatible encoding is used everywhere.
It may also be possible to deprecate conversions with char
depending on ordinary literal encoding,
but char is not necessarily using literal encoding,
and doing so would invite non-portable code that fails to compile on e.g. EBCDIC platforms,
to the great surprise of the author.
3.4. What about wchar_t ?
As recommended by SG16,
wchar_t is also not affected by the deprecation.
While wchar_t almost certainly has a different encoding than char8_t,
converting between it and char8_t is as problematic
as the char16_t and char32_t conversions.
However, while this is practically never the case,
C2y permits UTF-8 as an encoding for wchar_t.
Furthermore, wchar_t is not strictly guaranteed to be any wider than char8_t
or represent any more characters.
Overall, it is a C compatibility hazard to touch wchar_t.
Besides char8_t ↔ wchar_t,
char16_t ↔ wchar_t and char32_t ↔ wchar_t
may be always correct, depending on the platform.
Windows-only code can likely treat wchar_t and char16_t
interchangeably,
and Linux-only code may treat wchar_t and char32_t interchangeably.
3.5. What about conversions with integers?
It is quite common to compare character types to integer types.
For example, we may write
to check whether a character falls into the Basic Latin block.
There is nothing exceptionally bug-prone about comparing with say,
instead of ,
so we are not interested in deprecating character/integer conversions.
3.6. What comes after deprecation?
The goal is to eventually remove these conversions entirely. Since the behavior is easily detected (§5. Implementation experience) and easily replaced (§4.1. Replacement for deprecated behavior), removal should be feasible within one or two revisions of the language.
Furthermore, I don't believe that having "tombstone behavior" would be necessary.
That is, allowing the conversion to happen but making the program ill-formed if it happens.
The reason is that char8_t, char16_t, and char32_t
rarely appear in overload sets that include types that are not characters.
3.7. Why not make these conversions narrowing?
Another possible option (instead of deprecation or following deprecation)
is to make the affected conversions narrowing conversions.
This would make for some ill-formed,
but the implicit conversion from char8_t to char32_t
would remain valid.
There are multiple problems with this approach, which is why it is not proposed:
- char8_t to char32_t is a widening conversion, making the term "narrowing conversion" comically misleading.
- A long time has passed since C++11, and there is a lot of code using list-initialization now. This means that the "blast radius" of the change may still be quite large. If we accept that a non-trivial amount of warnings is raised in existing code, this half-measure seems unattractive.
- A lot of the problematic cases are not initialization, but comparisons as shown in §2. Introduction. Narrowing conversions play no role in equality comparison or in the usual arithmetic conversions.
4. Impact on existing code
It is not trivial to estimate how much code would be affected by a deprecation like this.
However, that is ultimately not what makes or breaks this proposal.
The goal is not to deprecate a rarely used feature to give it new meaning,
like the comma operator in subscript expressions prior to [P1161R3].
The goal is to deprecate a bug-prone and harmful feature to make the language safer.
The longer we wait, the more mistakes will be made using char8_t and other types.
C++ will undoubtedly get improved support for the Unicode character types over time,
and they will be used more frequently as a result,
so we had better deal with this problem sooner rather than later.
4.1. Replacement for deprecated behavior
If the new deprecation warnings spot a bug like in §2. Introduction, some work will be required to fix it, but the deprecation will have done its job.
If the comparison is obviously safe, such as with ,
the resolution is usually trivial, like .
This could even be done automatically with tools like clang-tidy.
5. Implementation experience
Corentin Jabot has recently implemented a
However, the warning is more conservative than the proposed deprecation; it does not warn on "safe comparisons" (§3.1. What about "safe" comparisons?).
6. Wording
The following changes are relative to [N5014].
[conv.integral]
Change [conv.integral] paragraph 1 as follows, and split it into two paragraphs:
1 A prvalue of an integer type
can be converted to a prvalue of another integer type.
The conversion is deprecated ([depr.conv.unicode]) if
one of the types involved in the conversion is char8_t,
and the other type is char16_t or char32_t.
2 A prvalue of an unscoped enumeration type can be converted to a prvalue of an integer type.
[expr.arith.conv]
Change [expr.arith.conv] paragraph 1 as follows:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
- The lvalue-to-rvalue conversion ([conv.lval]) is applied to each operand and the resulting prvalues are used in place of the original operands for the remainder of this section.
- […]
- Otherwise, each operand is converted to a common type C. The conversion is deprecated if one operand is of type char8_t and the other operand is of type char16_t or char32_t. The integral promotion rules ([conv.prom]) are used to determine a type T1 and type T2 for each operand. Then the following rules are applied to determine C:
  - […]
or ,
so if we didn't add this wording,
the conversion would not be deprecated.
[expr.static.cast]
Change [expr.static.cast] paragraph 5 as follows:
Otherwise, an expression E can be explicitly converted
to a type T if there is an implicit conversion sequence ([over.best.ics])
containing no deprecated conversion
from E to T, if […]
[…] Otherwise, the result is direct-initialized from E.
Do not change [expr.static.cast] paragraph 6:
Otherwise, the lvalue-to-rvalue ([conv.lval]), array-to-pointer ([conv.array]), and function-to-pointer ([conv.func]) conversions are applied to the operand, and the conversions that can be performed using static_cast are listed below. No other conversion can be performed using static_cast.
Immediately following [expr.static.cast] paragraph 6, insert a new paragraph:
An expression of type char8_t
can be explicitly converted to char16_t or char32_t,
and vice versa.
Such a conversion to a target type is equivalent to
.
[Note: Integral conversions ([conv.integral]) between these types have the same effect and are deprecated, unlike this explicit conversion ([depr.conv.unicode]). — end note]
[depr.conv.unicode]
Insert a new subclause in [depr] between [depr.local] and [depr.capture.this], containing a single paragraph:
Unicode character conversions [depr.conv.unicode]
The following conversions are deprecated:
- Integral conversions between char8_t and char16_t or char32_t ([conv.integral]).
- Arithmetic conversions from one operand of type char8_t and another operand of type char16_t or char32_t ([expr.arith.conv]).
[Example:
char16_t char32_t char char8_t — end example]
7. Acknowledgements
I thank Jens Maurer for reviewing the wording above and providing multiple suggestions for improvement.