Skip to content

Support supplementary Unicode and emoji sequences#357

Open
ccoVeille wants to merge 1 commit into
yaml:mainfrom
ccoVeille:fix-emojis
Open

Support supplementary Unicode and emoji sequences#357
ccoVeille wants to merge 1 commit into
yaml:mainfrom
ccoVeille:fix-emojis

Conversation

@ccoVeille
Copy link
Copy Markdown
Contributor

@ccoVeille ccoVeille commented May 24, 2026

Treat valid 4-byte UTF-8 scalars as printable so supplementary Unicode
characters and emoji sequences parse correctly.

Fixes #354

Supersedes #356

Copilot AI review requested due to automatic review settings May 24, 2026 22:41
@ccoVeille
Copy link
Copy Markdown
Contributor Author

@TomWright I feel like this might be enough for everyone, without a need for a plugin.

I'm curious what you think about my implementation compared to yours in TomWright/dasel#549

cc @colinjlacy @ingydotnet

This comment was marked as outdated.

Treat valid 4-byte UTF-8 scalars as printable so supplementary Unicode
characters and emoji sequences parse correctly.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +631 to +634
// 4-byte UTF-8 range for valid Unicode scalars: U+10000..U+10FFFF.
(b[i] == 0xF0 && b[i+1] >= 0x90 && b[i+1] <= 0xBF) ||
(b[i] > 0xF0 && b[i] < 0xF4 && b[i+1] >= 0x80 && b[i+1] <= 0xBF) ||
(b[i] == 0xF4 && b[i+1] >= 0x80 && b[i+1] <= 0x8F))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing code doesn't do it either

Comment on lines +708 to +715
name: Is printable - emoji with skin tone modifier 👍🏽
func: isPrintable
index: 0 # Only check the first code point (👍) for printability
data: [0xF0, 0x9F, 0x91, 0x8D, 0xF0, 0x9F, 0x8F, 0xBD] # U+1F44D U+1F3FD
want: true

- char-predicate:
name: Is printable - emoji with skin tone modifier 👍🏽
@TomWright
Copy link
Copy Markdown

I've tested these changes in dasel locally using the tests in my PR TomWright/dasel#549 and it's all green.
I've also ran the entire suit that has various YAML read/writes and didn't find any breakage.
I've obviously only tested as a consumer, but it looks good to me.

My only comment would mirror the AI review on the bounds checks to make sure we don't get panics, but since I haven't seen them happen before I'm guessing there's some safety checks somewhere up the call stack?

@ccoVeille
Copy link
Copy Markdown
Contributor Author

I've tested these changes in dasel locally using the tests in my PR TomWright/dasel#549 and it's all green.

👍

I've also ran the entire suit that has various YAML read/writes and didn't find any breakage.
I've obviously only tested as a consumer, but it looks good to me.

🎉

My only comment would mirror the AI review on the bounds checks to make sure we don't get panics, but since I haven't seen them happen before I'm guessing there's some safety checks somewhere up the call stack?

I will assume you are talking about this

isPrintable is used at two places:

  • when scanning
  • when parsing

About the scanner, I think we are safe because libyaml always make sure to fetch a few more character. But scanning a YAML ending with an invalid UTF-8 character could panic.

Then about emitter, I'm unsure it's possible, but to write something invalid. I will give a try.

That said, if you take a look at isPrintable, the line above are also requesting i+1 and i+2. So if the potential bug already exists.

I think this should be addressed in another issue once we will validate it can occur.

I would like to keep the PR simple, but I can give a try and see how much work it could be.

@ccoVeille
Copy link
Copy Markdown
Contributor Author

ccoVeille commented May 25, 2026

For records, I have checked:

  • the scanner cannot fail with a panic

YAML somehow prevents this. And the scanner is built to avoid issues.

I also tried with a YAML !!binary tag to pass an invalid UTF-8 char, but the scanner prevented the issue by returning an error.

  • emmiter can fail when a list of YAML events is forged with byte directly.

The fix for the emitter is easy.

I plan to add tests for both. But as I said it would be in another PR.

For now, I created #358 to keep a track of the issue

@TomWright
Copy link
Copy Markdown

OK great. All sounds good to me, I'm happy for you to go ahead. Appreciate the speedy response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emoji/unicode is incorrectly escaped

4 participants