|
| 1 | +# Unicode escape code length |
| 2 | + |
| 3 | +<!-- |
| 4 | +Part of the Carbon Language project, under the Apache License v2.0 with LLVM |
| 5 | +Exceptions. See /LICENSE for license information. |
| 6 | +SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
| 7 | +--> |
| 8 | + |
| 9 | +[Pull request](https://github.com/carbon-language/carbon-lang/pull/2040) |
| 10 | + |
| 11 | +<!-- toc --> |
| 12 | + |
| 13 | +## Table of contents |
| 14 | + |
| 15 | +- [Abstract](#abstract) |
| 16 | +- [Problem](#problem) |
| 17 | +- [Background](#background) |
| 18 | +- [Proposal](#proposal) |
| 19 | +- [Rationale](#rationale) |
| 20 | +- [Alternatives considered](#alternatives-considered) |
| 21 | + - [Allow zero digits](#allow-zero-digits) |
| 22 | + - [Allow any number of hexadecimal characters](#allow-any-number-of-hexadecimal-characters) |
| 23 | + - [Limiting to 6 digits versus 8](#limiting-to-6-digits-versus-8) |
| 24 | + |
| 25 | +<!-- tocstop --> |
| 26 | + |
| 27 | +## Abstract |
| 28 | + |
| 29 | +A `\u{HHHH...}` escape can currently be of arbitrary length, potentially |
| 30 | +including `\u{}`. This proposal restricts it to 1 to 8 hexadecimal digits. |
| 31 | + |
| 32 | +## Problem |
| 33 | + |
| 34 | +[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199) |
| 35 | +says "any number of hexadecimal characters" is valid for `\u{HHHH}`. This is |
| 36 | +undesirable, because it means `\u{000 ... 000E9}` is a valid escape sequence, |
| 37 | +for any number of `0` characters. Additionally, it's not clear if `\u{}` is |
| 38 | +meant to be valid. |
| 39 | + |
| 40 | +## Background |
| 41 | + |
| 42 | +[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199) |
| 43 | +says: |
| 44 | + |
| 45 | +> As in JavaScript, Rust, and Swift, Unicode code points can be expressed by |
| 46 | +> number using `\u{10FFFF}` notation, which accepts any number of hexadecimal |
| 47 | +> characters. Any numeric code point in the ranges |
| 48 | +> 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can |
| 49 | +> be expressed this way. |
| 50 | +
|
| 51 | +When it comes to the number of digits, the languages differ: |
| 52 | + |
| 53 | +- In [JavaScript](https://262.ecma-international.org/13.0/#prod-CodePoint), |
| 54 | + one or more digits are supported, and the value must be less than or equal |
| 55 | + to `10FFFF`. |
| 56 | +- In [Rust](https://doc.rust-lang.org/reference/tokens.html), between 1 and 6 |
| 57 | + digits are supported. |
| 58 | +- In |
| 59 | + [Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html), |
| 60 | + between 1 and 8 digits are supported. |
| 61 | + |
| 62 | +Unicode's codespace is 0 to [`10FFFF`](https://unicode.org/glossary/#codespace). |
| 63 | + |
| 64 | +## Proposal |
| 65 | + |
| 66 | +The `\u{H...}` syntax is only valid with 1 to 8 hexadecimal digits. |
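
As an illustration only, the proposed rule can be sketched as a small validator. The helper name `parse_unicode_escape` is hypothetical and not part of any Carbon toolchain. Accepting lowercase hexadecimal letters is an assumption, since this proposal does not discuss letter case. The range check follows the code point ranges quoted from Proposal #199 in the background section.

```python
import re

# Hypothetical pattern: `\u{`, then 1 to 8 hexadecimal digits, then `}`.
# Accepting lowercase letters is an assumption made for this sketch.
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,8})\}")


def parse_unicode_escape(text: str) -> int:
    """Return the code point named by one complete `\\u{H...}` escape.

    Raises ValueError if the digit count is outside 1 to 8, or if the value
    falls outside 0-D7FF and E000-10FFFF.
    """
    match = _UNICODE_ESCAPE.fullmatch(text)
    if match is None:
        # Zero digits, more than 8 digits, or a non-hexadecimal character.
        raise ValueError(f"expected 1 to 8 hexadecimal digits in {text!r}")
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"U+{code_point:X} is not an expressible code point")
    return code_point
```

Under these assumptions, `parse_unicode_escape(r"\u{E9}")` and `parse_unicode_escape(r"\u{000000E9}")` both return `0xE9`, while `\u{}` and a nine-digit escape are rejected on length alone, before any value is computed.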
| 67 | + |
| 68 | +## Rationale |
| 69 | + |
| 70 | +- [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write) |
| 71 | + - This restriction does not affect the ability to write valid Unicode. |
| 72 | + Instead, it restricts the ability to write confusing or invalid Unicode, |
| 73 | + which should make it easier to detect errors. |
| 74 | +- [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development) |
| 75 | + - Simplifies tooling by reducing the number of syntaxes that need to be |
| 76 | + supported, and allowing early failure on obviously invalid inputs. |
| 77 | + |
| 78 | +## Alternatives considered |
| 79 | + |
| 80 | +### Allow zero digits |
| 81 | + |
| 82 | +We could allow `\u{}` as a version of `\u{0}`. However, as a shorthand it saves |
| 83 | +only one character, and `\x00` is the same length as `\u{}` and clearer. |
| 84 | + |
| 85 | +Rather than allowing this syntax, we prefer to disallow it for consistency with |
| 86 | +other languages. |
| 87 | + |
| 88 | +### Allow any number of hexadecimal characters |
| 89 | + |
| 90 | +We could allow any number of digits in the `\u` escape. However, this has the |
| 91 | +consequence of requiring parsing of escapes of completely arbitrary length. |
| 92 | + |
| 93 | +This creates unnecessary complexity in the parser because we need to consider |
| 94 | +what happens if the result does not fit in 32 bits, far beyond Unicode's |
| 95 | +current `10FFFF` limit. One way to handle this would be to store the result in |
| 96 | +a 32-bit integer and keep parsing until the value goes above `10FFFF`, then |
| 97 | +report the escape as invalid. This would allow an arbitrary number of leading |
| 98 | +`0`s to parse correctly. |
| 99 | + |
| 100 | +Limiting the number of digits to a small fixed maximum should instead make it |
| 101 | +easier to write a simple parser. |
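
For comparison, the accumulate-and-check approach described in this section might look like the following sketch. The function name `parse_unbounded_hex` is hypothetical, and rejecting an empty digit string is an assumption, since whether `\u{}` was meant to be valid is unclear. Python integers are unbounded, so the 32-bit storage is only implied here: the early exit keeps the running value at or below `0x10FFFFF`, which fits in 32 bits.

```python
import string

MAX_CODE_POINT = 0x10FFFF


def parse_unbounded_hex(digits: str) -> int:
    """Accumulate any number of hex digits, failing once U+10FFFF is exceeded."""
    if not digits:
        # Assumption for this sketch: treat zero digits as an error.
        raise ValueError("no digits")
    value = 0
    for digit in digits:
        if digit not in string.hexdigits:
            raise ValueError(f"not a hexadecimal digit: {digit!r}")
        # The previous value is at most 0x10FFFF, so this stays within 32 bits.
        value = value * 16 + int(digit, 16)
        if value > MAX_CODE_POINT:
            raise ValueError("value exceeds 10FFFF")
    return value
```

With this approach the parser has to scan every digit: `parse_unbounded_hex("0" * 1000 + "E9")` returns `0xE9` only after reading all 1002 characters.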
| 102 | + |
| 103 | +### Limiting to 6 digits versus 8 |
| 104 | + |
| 105 | +A limit of 6 digits is reasonable as the minimum needed to represent Unicode's |
| 106 | +codespace. A limit of 8 digits is reasonable as the width of a standard 4-byte |
| 107 | +value, and roughly matches UTF-32. |
| 108 | + |
| 109 | +While it seems a weak advantage, this proposal leans towards 8. |
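
The arithmetic behind the two candidate limits can be checked with a short sketch. The helper names `max_value` and `min_hex_digits` are hypothetical and exist only for this illustration.

```python
MAX_CODE_POINT = 0x10FFFF


def max_value(hex_digits: int) -> int:
    """Largest value expressible with the given number of hex digits."""
    return 16**hex_digits - 1


def min_hex_digits(value: int) -> int:
    """Fewest hex digits needed to write a non-negative value."""
    return max(1, (value.bit_length() + 3) // 4)


# Five digits stop at FFFFF, below 10FFFF, so six is the minimum width.
assert max_value(5) < MAX_CODE_POINT <= max_value(6)
assert min_hex_digits(MAX_CODE_POINT) == 6
# Eight digits cover exactly the values of a 4-byte (32-bit) integer.
assert max_value(8) == 2**32 - 1
```

Either limit covers the full codespace. The 8-digit limit additionally admits up to two more leading zeros on any value and lines up with a 32-bit storage unit.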