Add `ensure_ascii` option #1689

Viicos · 2025-04-11T10:45:19Z

Change Summary

Part of pydantic/pydantic#11202.

Related issue number

Checklist

Unit tests for the changes exist
Documentation reflects the changes where applicable
Pydantic tests pass with this pydantic-core (except for expected changes)
My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

codspeed-hq · 2025-04-11T10:52:08Z

CodSpeed Performance Report

Merging #1689 will not alter performance

Comparing ensure-ascii (5c32b10) with main (52e9a53)

Summary

✅ 157 untouched benchmarks

Viicos · 2025-04-11T10:54:14Z

src/serializers/shared.rs

+    };
+}
+
+#[allow(clippy::needless_lifetimes)]


Looks like a false positive from Clippy? There are already a bunch.

D-stefaang · 2025-05-05T09:43:04Z

Any chance you'd follow the core Python default `ensure_ascii=True` here? That'd be a breaking change, so that's unlikely.
Maybe a mention in the documentation?
https://docs.python.org/3/library/json.html#json.dump

FYI, the httpx library handles Unicode nicely; the underlying requests library doesn't, unless you explicitly send UTF-8 bytes.

In [8]: import httpx

In [9]: data = g.model_dump_json()

In [10]: httpx.post('https://httpbin.org/post', data=data)
[info     ] HTTP Request: POST https://httpbin.org/post "HTTP/1.1 200 OK" [httpx]
Out[10]: <Response [200 OK]>

In [11]: import requests

In [12]: requests.post('https://httpbin.org/post', data=data)
---------------------------------------------------------------------------
UnicodeEncodeError 
...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 110-111: Body ('湫莓') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

In [13]: requests.post('https://httpbin.org/post', data=data.encode())
Out[13]: <Response [200 OK]>
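
The session above can be reproduced without any HTTP library. What follows is only a sketch under assumptions: `json.dumps` stands in for `model_dump_json()`, and the `name` key is made up; only the two characters come from the error message. It shows that a `str` containing these characters cannot be encoded as Latin-1, that an explicit UTF-8 encode works, and that an ASCII-escaped body is safe under either codec.

```python
import json

# Hypothetical payload; json.dumps stands in for model_dump_json() here.
payload = {"name": "湫莓"}

raw = json.dumps(payload, ensure_ascii=False)     # non-ASCII kept as-is
escaped = json.dumps(payload, ensure_ascii=True)  # stdlib default: \uXXXX escapes

# A str body with characters outside Latin-1 cannot be encoded as Latin-1,
# which is the kind of error shown in the traceback above.
try:
    raw.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Encoding explicitly as UTF-8 works for any such str.
utf8_body = raw.encode("utf-8")

# The escaped form is pure ASCII, so a Latin-1 encode succeeds as well.
ascii_body = escaped.encode("latin-1")

print(latin1_ok)
print(escaped)
```

Both bodies decode back to the same payload, so the choice only affects the bytes on the wire, not the data.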

Viicos · 2025-05-05T09:56:21Z

> Any chance you'd follow the core Python default `ensure_ascii=True` here?

This can only be done in Pydantic V3, and even then I don't know if the motivation is strong enough to make the change.

davidhewitt · 2025-04-17T12:23:32Z

python/pydantic_core/_pydantic_core.pyi

@@ -363,6 +364,8 @@ class SchemaSerializer:
        Arguments:
            value: The Python object to serialize.
            indent: If `None`, the JSON will be compact, otherwise it will be pretty-printed with the indent provided.
+            ensure_ascii: If `True`, the output is guaranteed to have all incoming non-ASCII characters escaped.
+                If `False` (the default), these characters will be outputted as-is.


Suggested change

-                If `False` (the default), these characters will be outputted as-is.
+                If `False` (the default), these characters will be output as-is.

Took the wording verbatim from the stdlib `json` module, but it seems "output" is by far the more common form.
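
For reference, this is how the stdlib `json` module, which the docstring wording was taken from, behaves for both values of `ensure_ascii`. It is only an illustration of the stdlib semantics, not of pydantic-core itself; the sample string is made up.

```python
import json

# "café" followed by U+1F62E, an emoji outside the Basic Multilingual Plane.
text = "caf\u00e9 \U0001F62E"

# ensure_ascii=False: non-ASCII characters are written as-is.
as_is = json.dumps(text, ensure_ascii=False)

# ensure_ascii=True (the stdlib default): every non-ASCII character is
# replaced by \uXXXX escapes; U+1F62E becomes a UTF-16 surrogate pair.
escaped = json.dumps(text, ensure_ascii=True)

print(as_is)
print(escaped)  # "caf\u00e9 \ud83d\ude2e"
```

Both forms parse back to the same string with `json.loads`.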


davidhewitt · 2025-06-05T17:07:36Z

src/serializers/shared.rs

+        for ch in fragment.chars() {
+            if ch.is_ascii() {
+                writer.write_all(ch.encode_utf8(&mut [0; 4]).as_bytes())?;
+            } else {
+                for escape in ch.encode_utf16(&mut [0; 2]) {
+                    write!(writer, "\\u{escape:04x}")?;
+                }
+            }
+        }


A couple of thoughts here:

- It's probably faster to write whole runs of bytes rather than one char at a time.

- I think if the char's codepoint is below 0xFFFF, it is supposed to be written as a single escape for that codepoint. https://www.rfc-editor.org/rfc/rfc8259#section-7

So I came up with this algorithm:

use std::io::Write;

let mut out = Vec::new();
let mut input = "What does 😮💨 this emoji mean?😮";
while let Some((idx, non_ascii_char)) = input
    .chars()
    .enumerate()
    .find(|(_, c)| !c.is_ascii())
{
    if idx > 0 {
        // write all ascii characters before the non-ascii one
        // (everything before idx is ASCII, so the char index is also the byte index)
        let ascii_run = &input[..idx];
        out.write_all(ascii_run.as_bytes()).unwrap();
    }
    let codepoint = non_ascii_char as u32;
    if codepoint < 0xFFFF {
        // write basic codepoint as single escape
        write!(out, "\\u{codepoint:04x}").unwrap();
    } else {
        // encode extended plane character as utf16 pair
        for escape in non_ascii_char.encode_utf16(&mut [0; 2]) {
            write!(out, "\\u{escape:04x}").unwrap();
        }
    }
    input = &input[(idx + non_ascii_char.len_utf8())..];
}
// write any ascii trailer
out.write_all(input.as_bytes()).unwrap();
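
One way to sanity-check the expected output of this approach is a small Python port compared against the stdlib `json` module. This is a sketch, not the code in this PR: the function name `escape_non_ascii` is hypothetical, and it assumes the fragment contains no quotes, backslashes, or control characters, which would need their own escaping.

```python
import json


def escape_non_ascii(fragment: str) -> str:
    """Escape non-ASCII characters as \\uXXXX, copying ASCII runs whole."""
    out = []
    run_start = 0  # start of the current run of ASCII characters
    for idx, ch in enumerate(fragment):
        if ch.isascii():
            continue
        # Flush the ASCII run that precedes this character in one piece.
        out.append(fragment[run_start:idx])
        codepoint = ord(ch)
        if codepoint <= 0xFFFF:
            # Basic Multilingual Plane: a single escape.
            out.append(f"\\u{codepoint:04x}")
        else:
            # Supplementary plane: encode as a UTF-16 surrogate pair.
            offset = codepoint - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
        run_start = idx + 1
    # Trailing ASCII run, if any.
    out.append(fragment[run_start:])
    return "".join(out)


sample = "What does 😮💨 this emoji mean?😮"
result = escape_non_ascii(sample)

# Cross-check against the stdlib, stripping the surrounding quotes it adds.
assert result == json.dumps(sample, ensure_ascii=True)[1:-1]
print(result)
```

The cross-check passes because the stdlib also emits lowercase four-digit hex escapes and surrogate pairs for characters above U+FFFF.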

Viicos force-pushed the ensure-ascii branch from 0ce656c to f74399f Compare April 11, 2025 10:47

Viicos force-pushed the ensure-ascii branch from f74399f to 27a843b Compare April 11, 2025 10:53

Viicos force-pushed the ensure-ascii branch from 27a843b to 5aba587 Compare June 5, 2025 13:30

Viicos requested a review from davidhewitt June 5, 2025 13:31

Add ensure_ascii option

5c32b10

Viicos force-pushed the ensure-ascii branch from 5aba587 to 5c32b10 Compare June 5, 2025 17:48

Feedback

236f97f
