Commit 15d1e07

Unicode char limit (carbon-language#2040)
The `\u{HHHH...}` escape can be of arbitrary length, potentially including `\u{}`. Restrict it to between 1 and 8 hexadecimal characters.
1 parent c9855c4 commit 15d1e07

2 files changed: +117 -2 lines changed

docs/design/lexical_conventions/string_literals.md

+8 -2
@@ -204,8 +204,8 @@ While octal escape sequences are expected to remain not permitted (even though
 In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
 or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
 exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
-points can be expressed by number using `\u{10FFFF}` notation, which accepts any
-number of hexadecimal characters. Any numeric code point in the ranges
+points can be expressed by number using `\u{10FFFF}` notation. This accepts
+between 1 and 8 hexadecimal characters. Any numeric code point in the ranges
 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
 expressed this way.

@@ -338,6 +338,10 @@ string in the type system. In such string literals, we should consider rejecting
 - [Leading whitespace removal](/proposals/p0199.md#leading-whitespace-removal)
 - [Terminating newline](/proposals/p0199.md#terminating-newline)
 - [Escape sequences](/proposals/p0199.md#escape-sequences-1)
+- Unicode escape sequences:
+    - [Allow zero digits](/proposals/p2040.md#allow-zero-digits)
+    - [Allow any number of hexadecimal characters](/proposals/p2040.md#allow-any-number-of-hexadecimal-characters)
+    - [Limiting to 6 digits versus 8](/proposals/p2040.md#limiting-to-6-digits-versus-8)
 - [Raw string literals](/proposals/p0199.md#raw-string-literals-1)
 - [Trailing whitespace](/proposals/p0199.md#trailing-whitespace)
 - [Line separators](/proposals/p0199.md#line-separators)
@@ -347,3 +351,5 @@ string in the type system. In such string literals, we should consider rejecting

 - Proposal
   [#199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+- Proposal
+  [#2040: Unicode escape code length](https://github.com/carbon-language/carbon-lang/pull/2040)
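
The revised rule in `string_literals.md` can be illustrated with a minimal Python sketch. This is not Carbon's lexer. The names `_ESCAPE` and `decode_unicode_escape` are hypothetical, and the uppercase-only digit set is an assumption based on the "(case-sensitive)" wording in the context lines above.

```python
import re

# Hypothetical pattern: `\u{` followed by 1 to 8 hex digits and `}`.
# Only `0`-`9` and `A`-`F` are accepted (assumed from "case-sensitive").
_ESCAPE = re.compile(r"\\u\{([0-9A-F]{1,8})\}")


def decode_unicode_escape(escape: str) -> str:
    # Return the character named by the escape, or raise ValueError.
    match = _ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError(f"malformed escape: {escape!r}")
    value = int(match.group(1), 16)
    # Only 0-D7FF and E000-10FFFF can be expressed; this excludes the
    # surrogate range D800-DFFF and anything above 10FFFF.
    if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        raise ValueError(f"code point out of range: {value:X}")
    return chr(value)
```

Under this sketch, `\u{E9}` and `\u{000000E9}` both decode to the same character, while `\u{}`, a nine-digit escape, `\u{D800}`, and `\u{110000}` are all rejected.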

proposals/p2040.md

+109
@@ -0,0 +1,109 @@
+# Unicode escape code length
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/2040)
+
+<!-- toc -->
+
+## Table of contents
+
+- [Abstract](#abstract)
+- [Problem](#problem)
+- [Background](#background)
+- [Proposal](#proposal)
+- [Rationale](#rationale)
+- [Alternatives considered](#alternatives-considered)
+    - [Allow zero digits](#allow-zero-digits)
+    - [Allow any number of hexadecimal characters](#allow-any-number-of-hexadecimal-characters)
+    - [Limiting to 6 digits versus 8](#limiting-to-6-digits-versus-8)
+
+<!-- tocstop -->
+
+## Abstract
+
+The `\u{HHHH...}` escape can be of arbitrary length, potentially including
+`\u{}`. Restrict it to between 1 and 8 hexadecimal characters.
+
+## Problem
+
+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+says "any number of hexadecimal characters" is valid for `\u{HHHH}`. This is
+undesirable, because it means `\u{000 ... 000E9}` is a valid escape sequence,
+for any number of `0` characters. Additionally, it's not clear whether `\u{}` is
+meant to be valid.
+
+## Background
+
+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+says:
+
+> As in JavaScript, Rust, and Swift, Unicode code points can be expressed by
+> number using `\u{10FFFF}` notation, which accepts any number of hexadecimal
+> characters. Any numeric code point in the ranges
+> 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can
+> be expressed this way.
+
+When it comes to the number of digits, the languages differ:
+
+- In [JavaScript](https://262.ecma-international.org/13.0/#prod-CodePoint),
+  between 1 and 6 digits are supported, and it must be less than or equal to
+  `10FFFF`.
+- In [Rust](https://doc.rust-lang.org/reference/tokens.html), between 1 and 6
+  digits are supported.
+- In
+  [Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html),
+  between 1 and 8 digits are supported.
+
+Unicode's codespace is 0 to [`10FFFF`](https://unicode.org/glossary/#codespace).
64+
## Proposal
65+
66+
The `\u{H...}` syntax is only valid for 1 to 8 unicode characters.
67+
68+
## Rationale
69+
70+
- [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
71+
- This restriction does not affect the ability to write valid Unicode.
72+
Instead, it restricts the ability to write confusing or invalid unicode,
73+
which should make it easier to detect errors.
74+
- [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development)
75+
- Simplifies tooling by reducing the number of syntaxes that need to be
76+
supported, and allowing early failure on obviously invalid inputs.
77+
78+
## Alternatives considered
79+
80+
### Allow zero digits
81+
82+
We could allow `\u{}` as a version of `\u{0}`. However, as shorthand, it doesn't
83+
save much and `\x00` is both equal length and clearer.
84+
85+
Rather than allowing this syntax, we prefer to disallow it for consistency with
86+
other languages.
87+
88+
### Allow any number of hexadecimal characters
89+
90+
We could allow any number of digits in the `\u` escape. However, this has the
91+
consequence of requiring parsing of escapes of completely arbitrary length.
92+
93+
This creates unnecessary complexity in the parser because we need to consider
94+
what happens if the result is greater than 32 bits, significantly larger than
95+
unicode's current `10FFFF` limit. One way to do this would be to store the
96+
result in a 32-bit integer and keep parsing until the value goes above `10FFFF`,
97+
then error as invalid if that's exceeded. This would allow an arbitrary number
98+
of leading `0`'s to correctly parse.
99+
100+
It should make it easier to write a simple parser if we instead limit the number
101+
of digits to a reasonable amount.
102+
103+
### Limiting to 6 digits versus 8
104+
105+
A limit of 6 digits offers a reasonable limit as the minimum needed to represent
106+
Unicode's codespace. A limit of 8 digits offers a reasonable limit as a standard
107+
4-byte value, and roughly matches UTF-32.
108+
109+
While it seems a weak advantage, this proposal leans towards 8.
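
The trade-off described under "Allow any number of hexadecimal characters" can be sketched in Python. This is an illustration under stated assumptions, not the Carbon toolchain's code: the function names are hypothetical, rejecting an empty digit string in the unbounded variant is an assumption (proposal #199 leaves that case unclear), and Python integers do not overflow, so the early exit stands in for the check a fixed-width 32-bit accumulator would need.

```python
HEX_DIGITS = "0123456789ABCDEF"
MAX_CODE_POINT = 0x10FFFF


def parse_unbounded_hex(digits: str) -> int:
    # Rejected alternative: accept any number of digits, and stop as soon
    # as the running value exceeds 10FFFF so it never outgrows 32 bits.
    if not digits:
        raise ValueError("no digits")
    value = 0
    for ch in digits:
        if ch not in HEX_DIGITS:
            raise ValueError(f"not a hexadecimal digit: {ch!r}")
        value = value * 16 + HEX_DIGITS.index(ch)
        if value > MAX_CODE_POINT:
            raise ValueError("value exceeds 10FFFF")
    return value


def parse_bounded_hex(digits: str, max_digits: int = 8) -> int:
    # Proposed rule: 1 to 8 digits. Eight hex digits are exactly 32 bits,
    # so the result always fits and no check is needed inside a loop.
    if not 1 <= len(digits) <= max_digits:
        raise ValueError("expected between 1 and 8 digits")
    if any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("not a hexadecimal digit")
    return int(digits, 16)
```

The unbounded variant accepts `E9` behind any number of leading zeros, which is the behavior the proposal calls undesirable. The bounded variant rejects over-long input from its length alone and leaves the `10FFFF` and surrogate range checks to a separate step.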
