Support supplementary Unicode and emoji sequences#357
Conversation
|
@TomWright I feel like this might be enough for everyone, without a need for a plugin. I'm curious what you think about my implementation compared to yours in TomWright/dasel#549 |
Treat valid 4-byte UTF-8 scalars as printable so supplementary Unicode characters and emoji sequences parse correctly.
| // 4-byte UTF-8 range for valid Unicode scalars: U+10000..U+10FFFF. | ||
| (b[i] == 0xF0 && b[i+1] >= 0x90 && b[i+1] <= 0xBF) || | ||
| (b[i] > 0xF0 && b[i] < 0xF4 && b[i+1] >= 0x80 && b[i+1] <= 0xBF) || | ||
| (b[i] == 0xF4 && b[i+1] >= 0x80 && b[i+1] <= 0x8F)) |
There was a problem hiding this comment.
Existing code doesn't do it either
| name: Is printable - emoji with skin tone modifier 👍🏽 | ||
| func: isPrintable | ||
| index: 0 # Only check the first code point (👍) for printability | ||
| data: [0xF0, 0x9F, 0x91, 0x8D, 0xF0, 0x9F, 0x8F, 0xBD] # U+1F44D U+1F3FD | ||
| want: true | ||
|
|
||
| - char-predicate: | ||
| name: Is printable - emoji with skin tone modifier 👍🏽 |
|
I've tested these changes in dasel locally using the tests in my PR TomWright/dasel#549 and it's all green. My only comment would mirror the AI review on the bounds checks to make sure we don't get panics, but since I haven't seen them happen before I'm guessing there's some safety checks somewhere up the call stack? |
👍
🎉
I will assume you are talking about this isPrintable is used at two places:
About the scanner, I think we are safe because libyaml always make sure to fetch a few more character. But scanning a YAML ending with an invalid UTF-8 character could panic. Then about emitter, I'm unsure it's possible, but to write something invalid. I will give a try. That said, if you take a look at isPrintable, the line above are also requesting i+1 and i+2. So if the potential bug already exists. I think this should be addressed in another issue once we will validate it can occur. I would like to keep the PR simple, but I can give a try and see how much work it could be. |
|
For records, I have checked:
YAML somehow prevents this. And the scanner is built to avoid issues. I also tried with a YAML
The fix for the emitter is easy. I plan to add tests for both. But as I said it would be in another PR. For now, I created #358 to keep a track of the issue |
|
OK great. All sounds good to me, I'm happy for you to go ahead. Appreciate the speedy response! |
Treat valid 4-byte UTF-8 scalars as printable so supplementary Unicode
characters and emoji sequences parse correctly.
Fixes #354
Supersedes #356