Add `ensure_ascii` option #1689
base: main
CodSpeed Performance Report: Merging #1689 will not alter performance.
    };
}

#[allow(clippy::needless_lifetimes)]
Looks like a false positive from Clippy? There are already a bunch.
Any chance you'd follow the core python default FYI, the
This can only be done in Pydantic V3, and even then I don't know if the motivation is strong enough to make the change.
@@ -363,6 +364,8 @@ class SchemaSerializer:
        Arguments:
            value: The Python object to serialize.
            indent: If `None`, the JSON will be compact, otherwise it will be pretty-printed with the indent provided.
            ensure_ascii: If `True`, the output is guaranteed to have all incoming non-ASCII characters escaped.
                If `False` (the default), these characters will be outputted as-is.
-                If `False` (the default), these characters will be outputted as-is.
+                If `False` (the default), these characters will be output as-is.
Took the wording verbatim from the stdlib `json` module, but it seems "output" is by far the more common form.
src/serializers/shared.rs
Outdated
for ch in fragment.chars() {
    if ch.is_ascii() {
        writer.write_all(ch.encode_utf8(&mut [0; 4]).as_bytes())?;
    } else {
        for escape in ch.encode_utf16(&mut [0; 2]) {
            write!(writer, "\\u{escape:04x}")?;
        }
    }
}
A couple of thoughts here:
- it's probably faster to write whole runs of bytes rather than one char at a time
- I think if the char is below the `0xFFFF` codepoint, it is supposed to be written as that single codepoint: https://www.rfc-editor.org/rfc/rfc8259#section-7
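To make the second point concrete, here is a small sketch of how `char::encode_utf16` behaves on either side of the Basic Multilingual Plane boundary. The helper name `escape_char` is hypothetical and only exists for this illustration; it is not part of pydantic-core. As far as the standard library is concerned, `encode_utf16` already yields a single code unit for any char up to and including U+FFFF, and a surrogate pair only above that, so a per-unit `\uXXXX` loop appears to satisfy the RFC 8259 rule without a separate branch.

```rust
/// Returns the JSON `\uXXXX` escape sequence(s) for a single character.
/// Hypothetical helper for illustration; not part of pydantic-core.
fn escape_char(ch: char) -> String {
    // Two u16 slots are enough for any Unicode scalar value.
    let mut buf = [0u16; 2];
    ch.encode_utf16(&mut buf)
        .iter()
        // One escape per UTF-16 code unit: one for BMP chars,
        // two (a surrogate pair) for chars above U+FFFF.
        .map(|unit| format!("\\u{unit:04x}"))
        .collect()
}
```

For example, `escape_char('é')` gives a single escape, while `escape_char('😮')` gives a high and a low surrogate.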
So I came up with this algorithm:
use std::io::Write;

let mut out = Vec::new();
let mut input = "What does 😮💨 this emoji mean?😮";
while let Some((idx, non_ascii_char)) = input
    .chars()
    .enumerate()
    .find(|(_, c)| !c.is_ascii())
{
    if idx > 0 {
        // write all ascii characters before the non-ascii one
        let ascii_run = &input[..idx];
        out.write_all(ascii_run.as_bytes()).unwrap();
    }
    let codepoint = non_ascii_char as u32;
    if codepoint < 0xFFFF {
        // write basic codepoint as single escape
        write!(out, "\\u{codepoint:04x}").unwrap();
    } else {
        // encode extended plane character as utf16 pair
        for escape in non_ascii_char.encode_utf16(&mut [0; 2]) {
            write!(out, "\\u{escape:04x}").unwrap();
        }
    }
    input = &input[(idx + non_ascii_char.len_utf8())..];
}
// write any ascii trailer
out.write_all(input.as_bytes()).unwrap();
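To check the run-based approach against known output, here is a minimal sketch of the same idea wrapped in a function. The name `write_ascii_escaped` and its signature are assumptions made for illustration, not the code that landed in the PR. It uses `char_indices` so the index is a byte offset by construction (in the snippet above, the `enumerate` index equals the byte offset only because everything before the first non-ASCII char is a single byte), and it leans on `encode_utf16` for both the single-escape and surrogate-pair cases.

```rust
use std::io::{self, Write};

/// Writes `input` to `out`, replacing every non-ASCII char with JSON
/// `\uXXXX` escapes. Hypothetical helper; not the pydantic-core API.
/// Note: this only handles non-ASCII escaping, not quotes or control chars.
fn write_ascii_escaped<W: Write>(out: &mut W, mut input: &str) -> io::Result<()> {
    // `char_indices` yields byte offsets, which is what slicing needs.
    while let Some((idx, ch)) = input.char_indices().find(|(_, c)| !c.is_ascii()) {
        // Write the whole ASCII run before the non-ASCII char in one call.
        out.write_all(input[..idx].as_bytes())?;
        // One escape for BMP chars, a surrogate pair for anything above U+FFFF.
        for unit in ch.encode_utf16(&mut [0; 2]) {
            write!(out, "\\u{unit:04x}")?;
        }
        // Continue with the remainder after the escaped char.
        input = &input[idx + ch.len_utf8()..];
    }
    // Write any trailing ASCII run.
    out.write_all(input.as_bytes())
}
```

Passing a `Vec<u8>` as the writer makes the result easy to inspect with `String::from_utf8`.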
Change Summary
Part of pydantic/pydantic#11202.