Skip to content

Conversation

@h-a-n-a
Copy link
Contributor

@h-a-n-a h-a-n-a commented Oct 11, 2025

Description:

This PR reapplied #10987 and replaced lone_surrogates: bool mark with Wtf8Atom. lone_surrogates introduced in #10987 is more of an implicit mark, whereas the use of Wtf8Atom makes it more explicit where users need to explicitly call to_string_lossy to get the lossy UTF-8 result, making a huge difference from the original implementation where users need to convert \uFFFDxxx into the correct unicode.

BREAKING CHANGE:

Both cooked in TemplateElement and value in StringLiteral are replaced with Wtf8Atom introduced in #11104.

Wtf8Atom does not expose a to_string method like the old Atom does. Internally, it stores the code points of the characters. You can call code_points() to get an iterator of the code points or call to_string_lossy() to get an lossy string in which all unpaired surrogates are replaced with U+FFFD(Replacement Character).

Related issue (if exists):

@changeset-bot
Copy link

changeset-bot bot commented Oct 11, 2025

🦋 Changeset detected

Latest commit: 363e389

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@socket-security
Copy link

socket-security bot commented Oct 11, 2025

No dependency changes detected. Learn more about Socket for GitHub.

👍 No dependency changes detected in pull request

@h-a-n-a h-a-n-a force-pushed the surrogates branch 6 times, most recently from f3c6816 to 6d05f37 Compare October 15, 2025 09:36
@codspeed-hq
Copy link

codspeed-hq bot commented Oct 15, 2025

CodSpeed Performance Report

Merging #11144 will improve performances by 2.31%

Comparing h-a-n-a:surrogates (363e389) with main (dd6f71b)

Summary

⚡ 1 improvement
✅ 138 untouched
⏩ 1 skipped1

Benchmarks breakdown

Benchmark BASE HEAD Change
es/transform/baseline/common_reserved_word 285 µs 278.5 µs +2.31%

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

@h-a-n-a h-a-n-a force-pushed the surrogates branch 9 times, most recently from 9df7324 to e9740e5 Compare October 20, 2025 09:05
@h-a-n-a h-a-n-a marked this pull request as ready for review October 20, 2025 09:38
@h-a-n-a h-a-n-a requested review from a team as code owners October 20, 2025 09:38
Copilot AI review requested due to automatic review settings October 20, 2025 09:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves handling of unpaired Unicode surrogates in JavaScript/TypeScript by replacing the implicit lone_surrogates: bool marker with an explicit Wtf8Atom type. This breaking change requires users to call to_string_lossy() or as_str() to convert WTF-8 atoms to valid UTF-8, making surrogate handling more explicit and preventing silent data loss.

Key Changes

  • Replaced Atom with Wtf8Atom for string literal values and template element cooked values
  • Added WTF-8 handling utilities and conversion functions throughout the codebase
  • Updated all consumers to handle potential surrogate pair scenarios explicitly

Reviewed Changes

Copilot reviewed 152 out of 205 changed files in this pull request and generated no comments.

Show a summary per file
File Description
Multiple package.json files Fixed missing newline before closing brace
packages/helpers/package.json Added export entry for new TypeScript helper
crates/swc_typescript/src/fast_dts/ Updated to handle Wtf8Atom with fallback conversion
crates/swc_ecma_utils/src/unicode.rs Added Unicode surrogate pair utilities
crates/swc_ecma_utils/src/lib.rs Added WTF-8 string handling functions
crates/swc_ecma_transforms_*/ Updated string operations to use WTF-8 APIs
crates/swc_ecma_minifier/ Updated string optimization to handle surrogates
crates/swc_ecma_parser/ Core lexer changes to produce Wtf8Atom tokens
Test files Updated expected outputs and added surrogate tests

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

**Description:**

This PR fixed an issue related to lone surrogates handling in Rust.

This fix's credits all go to Oxc team
swc-project#10978 (comment).
What I'm doing is porting the fix that was made in Oxc and make it
working under SWC.

The problem is related to the fundamental difference between how Rust
and JavaScript handle Unicode, especially lone surrogates.

**JavaScript's Unicode Model**

```javascript
// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS
```
JavaScript uses UTF-16 internally and tolerates invalid Unicode
sequences:
- Strings are UTF-16 code unit sequences, not Unicode scalar sequences
- Lone surrogates (U+D800-U+DFFF) are allowed and preserved
- No validation that surrogates come in proper high/low pairs
- Engine just stores the raw UTF-16 code units

**Rust's Unicode Model**

```rust
// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates
```
Rust enforces strict Unicode validity:
- String is UTF-8 and must contain valid Unicode scalar values
- char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
- Surrogate code points (U+D800-U+DFFF) are explicitly excluded
- No way to represent lone surrogates in Rust's standard string types

1. AST Structure: Added `lone_surrogates: bool` field to `Str` and
`TplElement` structs to track when strings contain lone surrogates
2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD}
(replacement character) followed by the original hex digits for internal
representation
3. Code Generation: Modified string output to properly escape lone
surrogates back to \uXXXX format during codegen
4. Test: Also fixed some cases related to member expression
optimizations and string concatenation optimizations

1. Add support for serializing and deserializing literals with lone
surrogates in `swc_estree_compat`
2. Reflect AST changes in `binding` crates

Breaks the AST by adding `lone_surrogates` field to `Str` and
`TplElement` and breaks the `value` and `cooked` respectly in `Str` and
`TplElement`. Both of the field is using `\u{FFFD}` (Replacement
Character) as an escape if `lone_surrogates` set to `true`.

To consume the real value, you need to first check if `lone_surrogates`
is `true`, then unescape it by removing the char and construct it with
the four trailing hexs(from `\u{FFFD}D800` to `\uD800`).

**Related issue:**

 - Closes swc-project#10978
 - Closes swc-project#10353

Fixed a regression of swc-project#7678
Copilot AI review requested due to automatic review settings October 22, 2025 06:46
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 154 out of 207 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (5)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Oct 22, 2025

CI is failed here https://github.com/swc-project/swc/actions/runs/18711211814/job/53359904217?pr=11144 due to different layout of ArchivedString and ArchivedVec<u8>

@kdy1
Copy link
Member

kdy1 commented Oct 22, 2025

@h-a-n-a If so, I think you can reflect the difference in the AST, too. I mean, CI failure seems to be related to mix of Vec and String, but I don't think we need to mix it, because this is a breaking change regardless

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Oct 22, 2025

@h-a-n-a If so, I think you can reflect the difference in the AST, too. I mean, CI failure seems to be related to mix of Vec and String, but I don't think we need to mix it, because this is a breaking change regardless

@kdy1 I thought the breaking change would be only on the API side and not the ABI side, but for now I think it's a breaking change for all scenarios. You're right. We don't need to mix these things.

@kdy1
Copy link
Member

kdy1 commented Oct 27, 2025

Is this PR ready for a review? Asking because CI failed

Copilot AI review requested due to automatic review settings October 27, 2025 06:07
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kdy1 kdy1 self-assigned this Oct 27, 2025
@kdy1 kdy1 added this to the Planned milestone Oct 27, 2025
@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Oct 27, 2025

I think the prior CI failure is related to the wasm breaking. Looks like now the issue is gone? But I have no idea how it's fixed though.

And for the current CI failure, I was not able to reproduce the test failure of terser_drop_unused_drop_toplevel_funcs_retain locally.

https://github.com/swc-project/swc/actions/runs/18831533377/job/53723863035?pr=11144

@kdy1 kdy1 enabled auto-merge (squash) October 27, 2025 06:42
@kdy1 kdy1 disabled auto-merge October 27, 2025 06:42
@kdy1 kdy1 merged commit 845512c into swc-project:main Oct 27, 2025
38 of 39 checks passed
@kdy1
Copy link
Member

kdy1 commented Oct 27, 2025

I'll postpone publish of the PR until reviews are addressed

Ok(Wtf8Atom::new(
// SAFETY: `ArchivedVec<u8>` is guaranteed to be serialized with `Wtf8Atom` byte
// sequence. `Wtf8Atom` byte sequence is identical to `Wtf8` byte sequence.
unsafe { Wtf8::from_bytes_unchecked(&s) },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArchivedVec<u8> is guaranteed to be serialized with Wtf8Atom byte sequence.

This guarantee doesn't seem to be enforced, and users can construct a non-wtf8 ArchivedVec<u8>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation comment is a bit misleading, but in the compatible version of swc_core, the position of the field will ensure that Wtf8Atom is not mixed with other fields.

Copy link
Contributor Author

@h-a-n-a h-a-n-a Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please shed more light on the scenario that you're referring to?

Previously, I was thinking about the case of mixing old plugins with a newer core:

  • a new Str core + old plugins: will deserialize it from ArchivedString, I assume it will return an Result::Err in ArchivedString::deserialize as the ArchivedVec<u8> is different from ArchivedString
  • a new Str core + new plugins , will deserialize it from ArchivedVec<u8> and it will be a success
  • an old Str core + new plugins, will deserialize it with ArchivedVec<u8>, and it will fail as the original value is an ArchivedString
  • an old Str core + old plugins, will be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to the scenario where users construct malicious payload.

I realize that this seems outside of our security model due to the use of unchecked rkyv. However, this would make load untrusted plugins and untrusted caches a security risk.

@h-a-n-a h-a-n-a deleted the surrogates branch October 27, 2025 09:02
kdy1 added a commit that referenced this pull request Oct 28, 2025
**Description:**

This PR fixes a regression of `quote!` macro caused by #11144
@kdy1 kdy1 modified the milestones: Planned, v1.14.0 Oct 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants