feat: make utf-8 validation in default
liuq19 committed Oct 23, 2023
1 parent a5a279c commit f936ac8
Showing 9 changed files with 61 additions and 74 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -32,7 +32,7 @@ jobs:
- name: Run tests
run: |
cargo check
cargo test --features utf8
cargo test
cargo bench check
cargo install cargo-fuzz
cargo +nightly fuzz run fuzz_value -- -max_total_time=5m
@@ -50,7 +50,7 @@ jobs:
- name: Run tests
run: |
cargo check
cargo test --features utf8
cargo test
lint:
runs-on: [self-hosted, X64]
29 changes: 0 additions & 29 deletions Cargo.lock

9 changes: 2 additions & 7 deletions Cargo.toml
@@ -25,7 +25,7 @@ smallvec = "1.11"
bumpalo = "3.13"
bytes = "1.4"
thiserror = "1.0"
simdutf8 = { version = "0.1", optional = true}
simdutf8 = "0.1"

[dev-dependencies]
jemallocator = "0.5"
@@ -36,7 +36,6 @@ core_affinity = "0.8"
criterion = { version = "0.5", features = ["html_reports"] }
gjson = "0.8"
serde_derive = "1.0"
env_logger = "0.10"
faststr = "0.2"
# This config will disable rustc-serialize crate to avoid security warnings in ci
json-benchmark = { git = "https://github.com/serde-rs/json-benchmark", default-features = false, features = ["all-files", "lib-serde"]}
@@ -75,8 +74,4 @@ name = "get_from"
harness = false

[features]
# default feature, not validate utf-8 when parsing json from slice
default = []

# validate the utf-8 when when parsing json from slice
utf8 = ["simdutf8"]
default = []
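
For context on what this change removes: before the commit, validation was compiled in only when the `utf8` Cargo feature was enabled, through `#[cfg(feature = "utf8")]` on a shadowing `let`. The sketch below illustrates that gating pattern with the standard library only. `parse_len` is a hypothetical stand-in for a parser entry point, and `std::str::from_utf8` is an assumed substitute for the crate's own validator.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for a parser entry point: it only reports the input length.
pub fn parse_len(json: &[u8]) -> Result<usize, Utf8Error> {
    // This statement is compiled only when the crate is built with `--features utf8`.
    // Without the feature it disappears entirely and no validation runs.
    #[cfg(feature = "utf8")]
    let json = std::str::from_utf8(json)?.as_bytes();

    Ok(json.len())
}
```

With a feature gate like this, whether `parse_len(&[0x80])` fails depends on how the crate was built, not on which function the caller picked. Making validation unconditional moves that choice into the API surface instead.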
24 changes: 11 additions & 13 deletions README.md
@@ -25,8 +25,7 @@ More details about optimization can be found in [performance.md](docs/performanc

1. Support x86_64 or aarch64. Note that the performance in aarch64 is lower and needs optimization.
2. Requires Rust nightly version, as we use the `packed_simd` crate.
3. Does NOT validate the UTF-8 when parsing from a slice by default. You can use the `utf8` feature to enable validation. The performance loss is about 3% ~ 10%.
4. When using `get_from`, `get_many`, `JsonIter` or `RawValue`, ***Warn:*** the JSON should be well-formed and valid.
3. When using `get_from`, `get_many`, `JsonIter` or `RawValue`, ***Warn:*** the JSON should be well-formed and valid.

## Features
1. Serde into Rust struct as `serde_json` and `serde`.
@@ -45,12 +44,11 @@ More details about optimization can be found in [performance.md](docs/performanc

To ensure that SIMD instruction is used in sonic-rs, you need to add rustflags `-C target-cpu=native` and compile on the host machine. For example, Rust flags can be configured in Cargo [config](.cargo/config).

Choose what features?

`default`: the fast version that does not validate UTF-8 when parsing for performance.

`utf8`: provides UTF-8 validation when parsing JSON from a slice.

Add sonic-rs in `Cargo.toml`
```
[dependencies]
sonic-rs = "0.2.0"
```

## Benchmark

@@ -70,13 +68,13 @@ The serialize benchmarks work in the opposite way.

All deserialize benchmarks enable UTF-8 validation, and `float_roundtrip` is enabled in `serde-json` to get precision comparable to Rust std.

### Deserialize Struct (Enabled utf8 validation)
### Deserialize Struct

The benchmark parses JSON into a Rust struct, and the JSON text contains no unknown fields. All fields in the JSON are parsed into struct fields.

Sonic-rs is faster than simd-json because simd-json (Rust) first parses the JSON into a `tape`, then parses the `tape` into a Rust struct. Sonic-rs directly parses the JSON into a Rust struct, and there are no temporary data structures. The [flamegraph](assets/pngs/) is profiled in the citm_catalog case.

`cargo bench --bench deserialize_struct --features utf8 -- --quiet`
`cargo bench --bench deserialize_struct -- --quiet`

```
twitter/sonic_rs::from_slice
@@ -108,14 +106,14 @@ canada/serde_json::from_str
```


### Deserialize Untyped (Enabled utf8 validation)
### Deserialize Untyped

The benchmark will parse JSON into a document. Sonic-rs seems faster for several reasons:
- There are also no temporary data structures in sonic-rs, as detailed above.
- Sonic-rs uses a memory arena for the whole document, resulting in fewer memory allocations, better cache-friendliness, and mutability.
- The JSON object in sonic-rs's document is actually a vector. Sonic-rs does not build a hashmap.

`cargo bench --bench deserialize_value --features utf8 -- --quiet`
`cargo bench --bench deserialize_value -- --quiet`

```
twitter/sonic_rs_dom::from_slice
@@ -368,7 +366,7 @@ Detailed examples can be found in [raw_value.rs](examples/raw_value.rs) and [jso

By default, sonic-rs does not enable UTF-8 validation. This is a trade-off to achieve the fastest performance.

- For the `from_slice` and `dom_from_slice` interfaces, if you need to validate UTF-8 for the parsed JSON, please use the `utf8` feature.
- For the `from_slice` and `dom_from_slice` interfaces, UTF-8 is validated by default. If you can guarantee that the JSON is valid UTF-8, consider using `from_slice_unchecked` and `dom_from_slice_unchecked` instead.

- For the `get` and `lazyvalue` related interfaces, due to the algorithm design, these interfaces are ***only suitable for use in valid-json scenarios***, and we will not provide UTF-8 validation in the future.

22 changes: 11 additions & 11 deletions README_ZH.md
@@ -25,8 +25,7 @@ sonic-rs 的主要优化是使用 SIMD。然而,sonic-rs 没有使用来自`si

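1. Supports x86_64 or aarch64; performance on aarch64 is lower and needs optimization.
2. Requires the Rust nightly version, because sonic-rs uses the `packed_simd` crate.
3. By default, sonic-rs does not validate UTF-8 when the JSON is a slice. Users can enable UTF-8 validation with the `utf8` feature; the performance loss is about 3% ~ 10%.
4. When using `get_from`, `get_many`, `JsonIter`, or `RawValue`, the JSON should be well-formed and valid.
3. When using `get_from`, `get_many`, `JsonIter`, or `RawValue`, the JSON should be well-formed and valid.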

## Features

@@ -41,11 +40,12 @@ sonic-rs 的主要优化是使用 SIMD。然而,sonic-rs 没有使用来自`si

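To ensure that SIMD instructions are used in sonic-rs, you need to add the rustflags `-C target-cpu=native` and compile on the host machine. For example, Rust flags can be configured in Cargo [config](.cargo/config).

How to choose features?

`default`: does not validate UTF-8 when parsing, for better performance.
Add sonic-rs to the Cargo dependencies:
```
[dependencies]
sonic-rs = "0.2.0"
```

`utf8`: enables UTF-8 validation when the JSON is a slice.

## Benchmark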

@@ -66,13 +66,13 @@ Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

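All parsing benchmarks enable UTF-8 validation, and `serde-json` enables the `float_roundtrip` feature so that parsed floats have sufficient precision, matching the Rust standard library.

### Deserialize Struct (UTF-8 validation enabled)
### Deserialize Struct

The benchmark parses JSON into a Rust struct, and the JSON text contains no unknown fields. All fields in the JSON are parsed into struct fields.

Sonic-rs is faster than simd-json because simd-json (Rust) first parses the JSON into a `tape` and then parses the `tape` into a Rust struct. Sonic-rs parses the JSON directly into a Rust struct, with no temporary data structures. The [flamegraph](assets/pngs/) was profiled on the citm_catalog case.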

`cargo bench --bench deserialize_struct --features utf8 -- --quiet`
`cargo bench --bench deserialize_struct -- --quiet`

```
twitter/sonic_rs::from_slice
@@ -104,14 +104,14 @@ canada/serde_json::from_str
```


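### Parse into a document (UTF-8 validation enabled)
### Parse into a document

This benchmark parses JSON into a document. Sonic-rs appears faster for the following reasons:
- As described above, sonic-rs has no temporary data structures such as a `tape`.
- Sonic-rs uses a memory arena for the whole document, which reduces memory allocations and improves cache-friendliness and mutability.
- A JSON object in a sonic-rs document is actually a vector. Sonic-rs does not build a hashmap.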

`cargo bench --bench deserialize_value --features utf8 -- --quiet`
`cargo bench --bench deserialize_value -- --quiet`

```
twitter/sonic_rs_dom::from_slice
@@ -365,7 +365,7 @@ fn main() {

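By default, sonic-rs does not enable UTF-8 validation; this is a trade-off made for performance.

- For the `from_slice` and `dom_from_slice` interfaces, if you need to validate UTF-8 for the parsed JSON, use the `utf8` feature.
- For the `from_slice` and `dom_from_slice` interfaces, UTF-8 validation is enabled by default. If you can guarantee that the input is valid UTF-8, you can also use `from_slice_unchecked` and `dom_from_slice_unchecked`.

- For the `get` and `lazyvalue` related interfaces, due to the design of the algorithm, these interfaces are ***only suitable for use in valid-json scenarios***, and we will not provide UTF-8 validation for them in the future.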

4 changes: 2 additions & 2 deletions assets/pngs/flamegraph.sh
@@ -1,6 +1,6 @@

# the command to profiling sonic-rs benchmarks

CARGO_PROFILE_BENCH_DEBUG=true cargo flamegraph --bench deserialize_struct --features utf8 -- --bench citm_catalog/sonic --profile-time 5
CARGO_PROFILE_BENCH_DEBUG=true cargo flamegraph --bench deserialize_struct -- --bench citm_catalog/sonic --profile-time 5

CARGO_PROFILE_BENCH_DEBUG=true cargo flamegraph --bench deserialize_struct --features utf8 -- --bench citm_catalog/simd_json --profile-time 5
CARGO_PROFILE_BENCH_DEBUG=true cargo flamegraph --bench deserialize_struct -- --bench citm_catalog/simd_json --profile-time 5
14 changes: 12 additions & 2 deletions src/serde/de.rs
@@ -1052,13 +1052,12 @@ where
}

/// Deserialize an instance of type `T` from bytes of JSON text.
///
/// If the caller can guarantee that the JSON is valid UTF-8, consider using `from_slice_unchecked` instead.
pub fn from_slice<'a, T>(json: &'a [u8]) -> Result<T>
where
T: de::Deserialize<'a>,
{
// validate the utf-8 at first for slice
#[cfg(feature = "utf8")]
let json = {
let json = crate::util::utf8::from_utf8(json)?;
json.as_bytes()
@@ -1067,6 +1066,17 @@ where
from_trait(SliceRead::new(json))
}

/// Deserialize an instance of type `T` from bytes of JSON text.
///
/// # Safety
/// The json passed in must be valid UTF-8.
pub unsafe fn from_slice_unchecked<'a, T>(json: &'a [u8]) -> Result<T>
where
T: de::Deserialize<'a>,
{
from_trait(SliceRead::new(json))
}

/// Deserialize an instance of type `T` from a string of JSON text.
///
pub fn from_str<'a, T>(s: &'a str) -> Result<T>
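
The checked/unchecked pair introduced in this file follows a common Rust pattern: a safe entry point validates and then delegates, while an `unsafe` twin skips validation and documents the caller's obligation under `# Safety`. The following is a minimal standard-library sketch of that pattern, not the sonic-rs implementation. `count_quotes` is a hypothetical stand-in for the real parser, and `std::str::from_utf8` is an assumed substitute for the crate's `util::utf8::from_utf8`.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical stand-in for the parser: counts `"` bytes in the text.
fn count_quotes(json: &str) -> usize {
    json.bytes().filter(|&b| b == b'"').count()
}

/// Checked entry point: validates UTF-8 first, then delegates.
pub fn count_quotes_from_slice(json: &[u8]) -> Result<usize, Utf8Error> {
    let text = str::from_utf8(json)?;
    Ok(count_quotes(text))
}

/// Unchecked entry point: skips validation.
///
/// # Safety
/// `json` must be valid UTF-8.
pub unsafe fn count_quotes_from_slice_unchecked(json: &[u8]) -> usize {
    // SAFETY: the caller guarantees that `json` is valid UTF-8.
    let text = unsafe { str::from_utf8_unchecked(json) };
    count_quotes(text)
}
```

A caller that already holds a `&str` or `String` can soundly use the unchecked variant on its bytes, because the `str` type already guarantees valid UTF-8. Bytes read from a socket or file carry no such guarantee and belong on the checked path.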
1 change: 0 additions & 1 deletion src/util/mod.rs
@@ -3,5 +3,4 @@ pub mod num;
pub mod private;
pub mod string;
pub mod unicode;
#[cfg(feature = "utf8")]
pub mod utf8;
28 changes: 21 additions & 7 deletions src/value/node.rs
@@ -1,8 +1,10 @@
use super::value_trait::{JsonType, JsonValue};
use crate::error::Result;
use crate::parser::Parser;
use crate::pointer::{JsonPointer, PointerNode};
use crate::reader::UncheckedSliceRead;
use crate::serde::tri;
use crate::util::utf8::from_utf8;
use crate::value::Index;
use crate::visitor::JsonVisitor;
use crate::{to_string, IndexMut, Number};
@@ -16,8 +18,6 @@ use std::ops;
use std::ptr::NonNull;
use std::slice::{from_raw_parts, from_raw_parts_mut};

use super::value_trait::{JsonType, JsonValue};

/// Value is a node in the DOM tree.
pub struct Value<'dom> {
typ: NodeMeta,
@@ -825,17 +825,20 @@ impl Default for Document {
}
}

/// Parse a json into a document.
pub fn dom_from_str(json: &str) -> Result<Document> {
let mut dom = Document::new();
dom.parse_bytes_impl(json.as_bytes())?;
Ok(dom)
}

/// Parse a json into a document.
///
/// If the caller can guarantee that the JSON is valid UTF-8, consider using `dom_from_slice_unchecked` instead.
pub fn dom_from_slice(json: &[u8]) -> Result<Document> {
// validate the utf-8 at first for slice
#[cfg(feature = "utf8")]
let json = {
let json = crate::util::utf8::from_utf8(json)?;
let json = from_utf8(json)?;
json.as_bytes()
};

@@ -844,6 +847,16 @@ pub fn dom_from_slice(json: &[u8]) -> Result<Document> {
Ok(dom)
}

/// Parse a json into a document.
///
/// # Safety
/// The json must be valid utf-8.
pub unsafe fn dom_from_slice_unchecked(json: &[u8]) -> Result<Document> {
let mut dom = Document::new();
dom.parse_bytes_impl(json)?;
Ok(dom)
}

/// ValueMut is a mutable reference to a `Value`.
#[derive(Debug)]
pub struct ValueMut<'dom> {
@@ -1547,14 +1560,15 @@ mod test {
}

#[test]
#[cfg(feature = "utf8")]
fn test_invalid_utf8() {
let data = [b'"', 0, 0, 0, 0x80, 0x90, b'"'];
let data = [b'"', 0x80, 0x90, b'"'];
let dom = dom_from_slice(&data);
assert_eq!(
dom.err().unwrap().to_string(),
"Invalid UTF-8 characters in json at line 1 column 4"
"Invalid UTF-8 characters in json at line 1 column 1"
);
let dom = unsafe { dom_from_slice_unchecked(&data) };
assert!(dom.is_ok());

let data = [b'"', b'"', 0x80];
let dom = dom_from_slice(&data);
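
To see why the byte arrays in `test_invalid_utf8` are rejected, the same inputs can be checked with the standard library's validator. This is only an illustration: `first_invalid_offset` is a hypothetical helper, and how sonic-rs derives the line and column in its error message is not shown in this diff.

```rust
use std::str;

/// Returns the byte offset of the first invalid UTF-8 sequence, if any.
pub fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    // `valid_up_to` is the length of the longest valid prefix,
    // which is also the index of the first offending byte.
    str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

For `[b'"', 0x80, 0x90, b'"']` the helper returns `Some(1)`, because `0x80` is a continuation byte with no lead byte before it. For `[b'"', b'"', 0x80]` it returns `Some(2)`, and for well-formed input such as `b"\"ok\""` it returns `None`.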