Should Unicode literals be guaranteed to be well-formed?

TL;DR Betteridge’s law applies: No.

Are you still here?

Unicode Literals

In C++20 there are 2 kinds and 6 forms of Unicode literals: character literals and string literals, in UTF-8, UTF-16, and UTF-32 encodings. Each uses a distinct character type to signal in the type system what the encoding of the data is. For plain char literals, the encoding is the execution character set, which is implementation-defined. It might be something useful, like UTF-8, or old, like UCS-2, or something distinctly unhelpful, like EBCDIC. Whatever it is, it is fixed at compile time, and is not affected by things like locale settings. The source character set is the encoding the compiler believes your source code is written in, including the octets that happen to be between " and ' characters.
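
For instance, what a plain char literal holds depends entirely on that execution character set. A small illustration (the values in the comment assume an ASCII-based set versus EBCDIC, which is an assumption about your toolchain, not something the language promises):

char a = 'a'; // 0x61 under an ASCII-based execution character set, 0x81 under EBCDIC

The Unicode literal forms, by contrast, pin the encoding down in the type: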

char32_t s1[] = U"\u0073tring";
char16_t s2[] = u"\u0073tring";
char8_t s3[] = u8"\u0073tring";

char32_t c1 = U'\u0073';
char16_t c2 = u'\u0073';
char8_t c3 = u8'\u0073';

Unicode codepoint U+0073 is ‘s’. So all of the strings are “string”, and all of the characters are ‘s’, but the in-memory representation of each is different. For the string types, each code unit in the string is 32, 16, or 8 bits respectively, and each character literal is a single code unit, again of 32, 16, or 8 bits.
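
A rough way to see that in code, as a sketch that assumes a typical platform where char16_t and char32_t are exactly 16 and 32 bits wide (the standard only promises at-least widths):

static_assert(sizeof(char32_t) == 4);
static_assert(sizeof(char16_t) == 2);
static_assert(sizeof(char8_t) == 1);

static_assert(sizeof(U"\u0073tring") == 7 * sizeof(char32_t)); // 6 code units plus the terminating null
static_assert(sizeof(u"\u0073tring") == 7 * sizeof(char16_t));
static_assert(sizeof(u8"\u0073tring") == 7 * sizeof(char8_t));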

Surprises

This example is due to Zach Laine, an actual expert, and sorting out what happened took several other experts, so the rest of us have little hope.

char8_t surprise[] = u8"ς";
assert(strlen(reinterpret_cast<char const*>(surprise)) == 5);

This comes down to your editor and compiler having a minor disagreement about encoding. The compiler was under the impression that the encoding was cp-1252, while the editor believed it to be UTF-8. The underlying octets for ς are 0xCF 0x82, each of which is a character in cp-1252 (0xCF is ‘Ï’, 0x82 is ‘‚’), and nearly every octet is a valid character in cp-1252. So each was converted to UTF-8 separately, resulting in 0xC3 0x8F and 0xE2 0x80 0x9A: five code units instead of two.

Of course.
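
Spelled out by hand, that leaves something like this in memory. This is a sketch, with the octets written out directly from the description above rather than taken from an actual miscompiled build:

char8_t mojibake[] = {0xC3, 0x8F, 0xE2, 0x80, 0x9A, 0x00}; // what u8"ς" became: five code units plus the null
static_assert(sizeof(mojibake) == 6);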

The contents of the string still represent whatever the compiler thought it saw in the source character set, not the Unicode text you typed. Unless, that is, a Unicode encoding happens to be the source character set.

But at least it’s well formed.

Escapes

The \u escapes, which identify Unicode codepoints by number, will produce well-formed Unicode encodings in strings, and in character literals if they fit. That is, in a char32_t you can put any code point. In a char16_t you can put any character from the basic multilingual plane. In a char8_t you can put 7-bit ASCII characters.
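
A few examples of where those limits land. This is a sketch; the exact diagnostics for the rejected lines vary by compiler:

char32_t any = U'\U0001F9FF';   // any code point fits in a char32_t
char16_t bmp = u'\u20AC';       // any basic multilingual plane code point fits in a char16_t
char8_t ascii = u8'\u0041';     // only code points up to U+007F fit in a single char8_t
// char16_t nope = u'\U0001F9FF';  // rejected: would need a surrogate pair
// char8_t also_nope = u8'\u00E9'; // rejected: would need two UTF-8 code units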

Or you can use hex or octal escapes, whose value is widened to the code unit size and placed into the resulting character or string. No current compiler checks whether that makes sense, although they will warn you if, for example, you try to write something like:

char16_t w4 = u'\x0001f9ff'; // NAZAR AMULET - Unicode 11.0 (June 2018)
char16_t sw4[] = u"\x0001f9ff";

where you’re trying to put a value that won’t fit into a code unit into the result.

warning: hex escape sequence out of range

The \xnn and \nnn hex and octal escapes are currently a hole that lets you construct ill-formed string literals. For example:

char8_t oops[] = u8"\xfe\xed";

But there are lots of ways of constructing ill-formed arrays of char8_t or char16_t. Just spell them out as arrays:

char8_t ill[] = {0xfe, 0xed};

The type system doesn’t guarantee that char8_t means well-formed UTF-8. All it does is tell you that the intended encoding is UTF-8. Which is a huge improvement over char.

But it does not provide any guarantee to an API taking a char8_t*.
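
So an API that takes a char8_t* and needs actual well-formedness still has to check for itself. Here is a minimal sketch of such a check; the function is hand-rolled for illustration and is not any standard facility:

#include <cstddef>

// Hypothetical helper: checks that the n code units at s are well-formed UTF-8,
// rejecting stray continuation bytes, truncated sequences, overlong encodings,
// surrogate code points, and values past U+10FFFF.
constexpr bool is_well_formed_utf8(const char8_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n;) {
        char8_t lead = s[i];
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;                        // invalid lead byte
        if (i + len > n) return false;            // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return false;    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

static_assert( is_well_formed_utf8(u8"string", 6));
static_assert(!is_well_formed_utf8(u8"\xfe\xed", 2));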

\0 is an octal escape sequence

You can spell it \u0000. But that’s weird, since we spell it \0 everywhere, and the fact that \0 is an octal escape is a C++ trivia question.
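
Both spellings produce the same code unit, which a one-line check can confirm:

static_assert(u8"\0"[0] == u8"\u0000"[0]); // octal escape and universal character name agree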

I want to be told if I’m forming ill-formed Unicode.

Then don’t use escape sequences. If you use either a correctly encoded source file or universal character names, you will get well-formed Unicode.

The primary reason for wanting ill-formed Unicode is for testing. It’s a convenience. And there are straightforward workarounds.

But disallowing hex and octal escapes in Unicode strings makes the language less regular while preventing an error that you had to go out of your way to create, and does not actually produce more runtime safety.
