Unicode char limit (carbon-language#2040)

jonmeow · web-flow · commit 15d1e07133ee · 2022-08-16T09:41:52.000-07:00
The `\u{HHHH...}` escape can currently contain an arbitrary number of
hexadecimal characters, potentially including none (`\u{}`).
Restrict it to 1 to 8 hexadecimal characters.
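
As a rough illustration of the restricted form, the check might look like the Python sketch below. This is not the Carbon toolchain's lexer: `parse_unicode_escape` is a hypothetical name, and the sketch assumes the uppercase-only hexadecimal digits and the code point ranges described in `string_literals.md`.

```python
import re

# Hypothetical helper for illustration; not part of the Carbon toolchain.
# Matches `\u{`, then 1 to 8 uppercase hexadecimal digits, then `}`.
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-F]{1,8})\}")


def parse_unicode_escape(text: str) -> int:
    # Return the code point named by a complete escape, or raise ValueError.
    match = _UNICODE_ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError(f"expected 1 to 8 hexadecimal digits in {text!r}")
    # At most 8 digits, so the value always fits in 32 bits.
    code_point = int(match.group(1), 16)
    # Only 0-D7FF and E000-10FFFF can be expressed this way.
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        raise ValueError(f"{code_point:X} is not an expressible code point")
    return code_point
```

Because the digit count is capped before conversion, the range check works on a bounded value and leading zeros cannot grow without limit.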
diff --git a/docs/design/lexical_conventions/string_literals.md b/docs/design/lexical_conventions/string_literals.md
@@ -204,8 +204,8 @@ While octal escape sequences are expected to remain not permitted (even though
 In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
 or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
 exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
-points can be expressed by number using `\u{10FFFF}` notation, which accepts any
-number of hexadecimal characters. Any numeric code point in the ranges
+points can be expressed by number using `\u{10FFFF}` notation. This accepts
+between 1 and 8 hexadecimal characters. Any numeric code point in the ranges
 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
 expressed this way.
 
@@ -338,6 +338,10 @@ string in the type system. In such string literals, we should consider rejecting
     -   [Leading whitespace removal](/proposals/p0199.md#leading-whitespace-removal)
     -   [Terminating newline](/proposals/p0199.md#terminating-newline)
 -   [Escape sequences](/proposals/p0199.md#escape-sequences-1)
+    -   Unicode escape sequences:
+        -   [Allow zero digits](/proposals/p2040.md#allow-zero-digits)
+        -   [Allow any number of hexadecimal characters](/proposals/p2040.md#allow-any-number-of-hexadecimal-characters)
+        -   [Limiting to 6 digits versus 8](/proposals/p2040.md#limiting-to-6-digits-versus-8)
 -   [Raw string literals](/proposals/p0199.md#raw-string-literals-1)
     -   [Trailing whitespace](/proposals/p0199.md#trailing-whitespace)
     -   [Line separators](/proposals/p0199.md#line-separators)
@@ -347,3 +351,5 @@ string in the type system. In such string literals, we should consider rejecting
 
 -   Proposal
     [#199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+-   Proposal
+    [#2040: Unicode escape code length](https://github.com/carbon-language/carbon-lang/pull/2040)
diff --git a/proposals/p2040.md b/proposals/p2040.md
@@ -0,0 +1,109 @@
+# Unicode escape code length
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/2040)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Abstract](#abstract)
+-   [Problem](#problem)
+-   [Background](#background)
+-   [Proposal](#proposal)
+-   [Rationale](#rationale)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Allow zero digits](#allow-zero-digits)
+    -   [Allow any number of hexadecimal characters](#allow-any-number-of-hexadecimal-characters)
+    -   [Limiting to 6 digits versus 8](#limiting-to-6-digits-versus-8)
+
+<!-- tocstop -->
+
+## Abstract
+
+The `\u{HHHH...}` escape can currently contain an arbitrary number of
+hexadecimal characters, potentially including none (`\u{}`). Restrict it to 1
+to 8 hexadecimal characters.
+
+## Problem
+
+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+says "any number of hexadecimal characters" is valid for `\u{HHHH}`. This is
+undesirable, because it means `\u{000 ... 000E9}` is a valid escape sequence,
+for any number of `0` characters. Additionally, it's not clear if `\u{}` is
+meant to be valid.
+
+## Background
+
+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
+says:
+
+> As in JavaScript, Rust, and Swift, Unicode code points can be expressed by
+> number using `\u{10FFFF}` notation, which accepts any number of hexadecimal
+> characters. Any numeric code point in the ranges
+> 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can
+> be expressed this way.
+
+When it comes to the number of digits, the languages differ:
+
+-   In [JavaScript](https://262.ecma-international.org/13.0/#prod-CodePoint),
+    between 1 and 6 digits are supported, and it must be less than or equal to
+    `10FFFF`.
+-   In [Rust](https://doc.rust-lang.org/reference/tokens.html), between 1 and 6
+    digits are supported.
+-   In
+    [Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html),
+    between 1 and 8 digits are supported.
+
+Unicode's codespace is 0 to [`10FFFF`](https://unicode.org/glossary/#codespace).
+
+## Proposal
+
+The `\u{H...}` syntax is only valid for 1 to 8 hexadecimal characters.
+
+## Rationale
+
+-   [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
+    -   This restriction does not affect the ability to write valid Unicode.
+        Instead, it restricts the ability to write confusing or invalid Unicode,
+        which should make it easier to detect errors.
+-   [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development)
+    -   Simplifies tooling by reducing the number of syntaxes that need to be
+        supported, and allowing early failure on obviously invalid inputs.
+
+## Alternatives considered
+
+### Allow zero digits
+
+We could allow `\u{}` as a version of `\u{0}`. However, as shorthand, it doesn't
+save much and `\x00` is both equal length and clearer.
+
+Rather than allowing this syntax, we prefer to disallow it for consistency with
+other languages.
+
+### Allow any number of hexadecimal characters
+
+We could allow any number of digits in the `\u` escape. However, this has the
+consequence of requiring parsing of escapes of completely arbitrary length.
+
+This creates unnecessary complexity in the parser because we need to consider
+what happens if the result does not fit in 32 bits, which is significantly
+larger than Unicode's current `10FFFF` limit. One way to handle this would be to
+accumulate the result in a 32-bit integer, keep parsing while the value stays at
+or below `10FFFF`, and reject the escape as invalid once that limit is exceeded.
+This would allow an arbitrary number of leading `0`s to parse correctly.
+
+It should make it easier to write a simple parser if we instead limit the number
+of digits to a reasonable amount.
+
+### Limiting to 6 digits versus 8
+
+A limit of 6 digits is reasonable as the minimum needed to represent Unicode's
+codespace. A limit of 8 digits is reasonable as a standard 4-byte value, and
+roughly matches UTF-32.
+
+While it seems a weak advantage, this proposal leans towards 8.