modules/zstd: add frame header parser #1168

Closed
wants to merge 10 commits into from

Conversation

rdob-ant
Contributor

This PR adds Zstd frame header parsing machinery written in DSLX + tests for it.

NOTE: this is based on #1167 , please ignore commits from that branch when reviewing.

@proppy proppy changed the title from "modules/zstd: add frame header parser (part 3)" to "modules/zstd: add frame header parser" on Nov 9, 2023
@lpawelcz
Contributor

lpawelcz commented Nov 15, 2023

@proppy we updated the code, please take a look:

The C++ tests generate 250 test cases with pseudorandom valid FrameHeaders. The test cases run in parallel through the sharding mechanism.
In a single test case, a frame header is generated as a vector of uint8_t. It is then passed through libzstd, which converts it into a ZSTD_frameHeader struct and also validates the data. After that, the generated vector is converted to DSLX simulation input values, and the ZSTD_frameHeader struct is used to form the expected return values, which are compared against the results of the DSLX simulation.

Please do note that this branch is based on #1167 and it also includes commits cherry-picked from #1166. Please ignore those when reviewing this PR.

EDIT:

I also added the next batch of parallelized tests. These generate random vectors of bytes with a length from 0 to 32 bytes and then run those vectors through libzstd and the DSLX simulation, as in the previous test case. This allows us to test the parser against invalid frame headers and compare the results between the library and the DSLX simulation.

strip_prefix = "zstd-1.4.7",
urls = ["https://github.com/facebook/zstd/releases/download/v1.4.7/zstd-1.4.7.tar.gz"],
build_file = "@//dependency_support/com_github_facebook_zstd:bundled.BUILD.bazel",
patches = ["@com_google_xls//dependency_support/com_github_facebook_zstd:decodecorpus.patch"],
Member

can you add a comment about the patch? has it been submitted upstream?

Contributor

@lpawelcz lpawelcz Nov 24, 2023

Added the comment.
The changes have not been submitted upstream yet. The patch modifies the decodecorpus utility from the zstd library. The unmodified tool generates valid zstd frames with randomized contents and size. With our modifications it is possible to generate only selected parts of a zstd frame; in this case, we modify decodecorpus so that it can generate only frame headers.

args[3] = "-p" + std::string(output_path);

XLS_ASSIGN_OR_RETURN(auto result, CallDecodecorpus(args));
auto raw_data = ReadFileAsRawData(output_path);
Member

any reason we need to use a real file, rather than handling an in-memory buffer returned by libzstd?

Contributor

This was not possible in the original decodecorpus, and we wanted to keep the changes to the minimum required for generating only frame headers.

Member

Would it be a "big change" to patch decodecorpus further so that it exposes its intermediate functions (IIRC they are currently static)?

We could remove the main from the XLS build and directly call them and consume their outputs (rather than dealing with processes and files).

Contributor

I believe this was discussed with you in one of the meetings. We settled on leaving decodecorpus as it is.


TEST_P(ZSTDFrameHeaderSeededTest, ParseMultipleFrameHeaders) {
auto seed = GetParam();
auto frame_header = zstd::GenerateFrameHeader(seed, false);
Member

is this redundant to what we're doing in GenerateRandomFrameHeader or does it have more coverage?

Contributor

GenerateFrameHeader() calls decodecorpus to generate a frame header that is valid according to the zstd frame header specification. The contents and length of these frame headers are randomized, but the header is always valid: it should always be parsed correctly both by libzstd and by our decoder (frame header parser).

GenerateRandomFrameHeader() generates a completely random vector of bytes whose size is picked randomly from 0 up to an arbitrary maximum of 32 bytes. With this generator we can cover more 'negative' test cases, meaning:

  • Not enough data in buffer for finishing parsing
  • Corrupted header - e.g. reserved bit in Frame_Header_Descriptor is set
  • Handling of buffers larger than bare minimum for parsing frame header
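
As a rough illustration of the second generator described above, here is a minimal sketch. The helper name `RandomHeaderBytes`, the constant `kMaxHeaderBytes`, and the use of a seeded `std::mt19937` are assumptions made for this example; the PR's actual GenerateRandomFrameHeader() may be implemented differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Upper bound taken from the comment above ("0 to ... 32 bytes").
// The constant name is hypothetical.
constexpr std::size_t kMaxHeaderBytes = 32;

// Hypothetical stand-in for GenerateRandomFrameHeader(): returns between 0 and
// kMaxHeaderBytes random bytes. Nothing here guarantees a valid frame header,
// which is the point of the 'negative' tests.
std::vector<uint8_t> RandomHeaderBytes(uint32_t seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> size_dist(0, kMaxHeaderBytes);
  std::uniform_int_distribution<int> byte_dist(0, 255);
  std::vector<uint8_t> bytes(size_dist(rng));
  for (uint8_t& b : bytes) b = static_cast<uint8_t>(byte_dist(rng));
  return bytes;
}
```

Seeding the generator keeps a failing input reproducible, which matters when the same bytes have to be fed to both libzstd and the DSLX simulation.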

Member

Makes sense, thanks for the explanation!

Would it make sense to add more docstrings to the tests and helpers so that this is more obvious to future readers?

Contributor

Added more comments

true, true);
}

std::vector<uint8_t> GenerateRandomFrameHeader() {
Member

I wonder if #476 would help with this kind of thing?

Contributor

This kind of testing approach would be great if we didn't have decodecorpus, which always generates valid zstd frames/frame headers. We could say that decodecorpus has the properties that describe a valid frame header embedded in it.
Without it, we would have to specify those properties manually as multiple RC_ASSERT() statements in rapidcheck test cases in order to generate valid test data. I think that for 'positive' tests (comparing against valid frame headers) it is better to use decodecorpus.
However, I'm not sure what it would look like for 'negative' tests (generating potentially invalid frame header data and comparing parsing through libzstd against the simulation of our decoder).
Frame header parsing should fail in basically only 2 situations:

  • there was not enough data to finish the parsing
  • reserved bit in frame header descriptor was set

Currently we generate random test vectors to check that the decoder fails when those situations occur.
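
To make the two failure conditions concrete, the sketch below computes how many bytes a frame header needs and tests the reserved bit, following the Frame_Header_Descriptor layout in the Zstandard format specification (RFC 8878). The function names are hypothetical and this is not the PR's DSLX code; the size excludes the 4-byte magic number, on the assumption that it is parsed separately.

```cpp
#include <cstddef>
#include <cstdint>

// Frame_Header_Descriptor bit layout (RFC 8878):
//   bits 7-6: Frame_Content_Size_flag   bit 5: Single_Segment_flag
//   bit 4:    unused                    bit 3: reserved
//   bit 2:    Content_Checksum_flag     bits 1-0: Dictionary_ID_flag

// Number of header bytes implied by the descriptor byte, descriptor included
// and magic number excluded. Hypothetical helper name.
std::size_t FrameHeaderSize(uint8_t fhd) {
  const unsigned fcs_flag = fhd >> 6;
  const bool single_segment = (fhd >> 5) & 1;
  const unsigned dict_flag = fhd & 3;
  static constexpr std::size_t kDictIdBytes[] = {0, 1, 2, 4};
  static constexpr std::size_t kFcsBytes[] = {0, 2, 4, 8};
  std::size_t size = 1;                            // Frame_Header_Descriptor
  size += single_segment ? 0 : 1;                  // Window_Descriptor
  size += kDictIdBytes[dict_flag];                 // Dictionary_ID
  size += kFcsBytes[fcs_flag];                     // Frame_Content_Size
  if (single_segment && fcs_flag == 0) size += 1;  // 1-byte FCS special case
  return size;
}

// True when the reserved bit (bit 3) of the descriptor is set.
bool ReservedBitSet(uint8_t fhd) { return (fhd >> 3) & 1; }
```

A buffer shorter than `FrameHeaderSize(buffer[0])` corresponds to the "not enough data" case, and `ReservedBitSet(buffer[0])` to the corrupted-header case.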

Member

maybe something like https://github.com/google/fuzztest could be useful; I'd be curious to know what @ericastor thinks, as they migrated a lot of "random generation tests" to fuzztest in 1cfc4cc.

Collaborator

I agree that using GoogleFuzzTest here would probably result in better coverage - though it might end up discovering some valid headers!

Contributor

Thanks for the suggestion, I've replaced generating random frame headers with FuzzTest. It is true that it can discover valid headers. We can work on constraining the tested domain further to precisely test only invalid frame headers but I believe it is better to do a general check with valid and invalid headers.

@lpawelcz lpawelcz force-pushed the 49967-frame-header branch 2 times, most recently from 600d73f to 95685e8 Compare November 27, 2023 06:51
@lpawelcz lpawelcz force-pushed the 49967-frame-header branch 4 times, most recently from 77a2ce2 to c114b98 Compare December 19, 2023 11:36
@lpawelcz
Contributor

Fixed errors in the fuzz tests caused by differences in error-reporting priority between libzstd and our implementation. libzstd prioritizes reporting that not enough data is available in the buffer to parse the whole frame header over reporting that the reserved bit in the frame header descriptor is set.
I've changed our implementation to behave the same way as the library. This required changing the contents of the Buffer returned with the CORRUPTED error. We treat this error as a critical one that requires resetting the whole decoder, because it is impossible to recover once a corrupted frame header has been encountered. Because of that, we can safely return an empty buffer in such a case.
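
The ordering described in the comment above can be sketched as follows. The enum and function names are hypothetical (the PR's DSLX status type may be named differently); only the order of the two checks is taken from the comment.

```cpp
#include <cstddef>

// Hypothetical status names used only for this illustration.
enum class Status { kOk, kCorrupted, kNotEnoughData };

// Checks run in the order the comment attributes to libzstd: a buffer that is
// too short is reported before a set reserved bit is.
Status Classify(std::size_t available_bytes, std::size_t required_bytes,
                bool reserved_bit_set) {
  if (available_bytes < required_bytes) return Status::kNotEnoughData;
  if (reserved_bit_set) return Status::kCorrupted;
  return Status::kOk;
}
```

With this ordering, a truncated header whose reserved bit happens to be set is reported as "not enough data"; swapping the two checks would report it as corrupted, which is the kind of mismatch the fuzz tests flagged.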

@lpawelcz
Contributor

lpawelcz commented Jan 8, 2024

I've noticed a failure in one of the CI runs caused by frame_header_cc_test and specifically by the FuzzTest case:
https://github.com/google/xls/actions/runs/7410803786/job/20163885832?pr=1214

Here is the failure log:

2024-01-04T14:55:33.3138622Z [----------] 1 test from FrameHeaderFuzzTest
2024-01-04T14:55:33.3139507Z [ RUN      ] FrameHeaderFuzzTest.ParseMultipleRandomFrameHeaders
2024-01-04T14:55:33.3140674Z FUZZTEST_PRNG_SEED=NuIZziOUrkbi_BmDtWnqoy7SbNkrX-lwmJlXdfEMCyQ
2024-01-04T14:55:33.3141491Z xls/ir/ir_test_base.cc:264: Failure
2024-01-04T14:55:33.3142131Z Value of: ValuesEqual(expected, result.value)
2024-01-04T14:55:33.3143777Z   Actual: false ((bits[2]:1, (bits[64]:0, bits[64]:0, bits[32]:0, bits[1]:0), (bits[128]:0x0, bits[32]:0)) != (bits[2]:0, (bits[64]:94489280512, bits[64]:18446744073709551615, bits[32]:0, bits[1]:0), (bits[128]:0x0, bits[32]:0)))
2024-01-04T14:55:33.3145422Z Expected: true
2024-01-04T14:55:33.3145803Z (interpreted unoptimized IR)
2024-01-04T14:55:33.3146240Z Google Test trace:
2024-01-04T14:55:33.3146874Z xls/modules/zstd/frame_header_test.cc:137: RunAndExpectEq failed
2024-01-04T14:55:33.3147656Z Stack trace:
2024-01-04T14:55:33.3148169Z   0x7f34a8ec8185: xls::IrTestBase::RunAndExpectEq()
2024-01-04T14:55:33.3148941Z   0x7f34a8eca1db: xls::IrTestBase::RunAndExpectEq()
2024-01-04T14:55:33.3149994Z   0x563f0f26374d: xls::(anonymous namespace)::FrameHeaderTest::ParseAndCompareWithZstd()
2024-01-04T14:55:33.3151137Z   0x563f0f276f37: fuzztest::internal::FixtureDriver<>::Test()
2024-01-04T14:55:33.3152162Z   0x7f349fbaf317: fuzztest::internal::FuzzTestFuzzerImpl::RunOneInput()
2024-01-04T14:55:33.3153286Z   0x7f349fbb19de: fuzztest::internal::FuzzTestFuzzerImpl::RunInUnitTestMode()
2024-01-04T14:55:33.3154354Z   0x7f34ac0abbb3: fuzztest::internal::GTest_TestAdaptor::TestBody()
2024-01-04T14:55:33.3155403Z   0x7f349f443ffd: testing::internal::HandleExceptionsInMethodIfSupported<>()
2024-01-04T14:55:33.3156707Z   0x7f349f443ebe: testing::Test::Run()
2024-01-04T14:55:33.3157594Z   0x7f349f445171: testing::TestInfo::Run()
2024-01-04T14:55:33.3158212Z ... Google Test internal frames ...
2024-01-04T14:55:33.3158607Z 
2024-01-04T14:55:33.3158614Z 
2024-01-04T14:55:33.3158848Z =================================================================
2024-01-04T14:55:33.3159429Z === BUG FOUND!
2024-01-04T14:55:33.3159679Z 
2024-01-04T14:55:33.3160606Z xls/modules/zstd/frame_header_test.cc:298: Counterexample found for FrameHeaderFuzzTest.ParseMultipleRandomFrameHeaders.
2024-01-04T14:55:33.3161924Z The test fails with input:
2024-01-04T14:55:33.3162413Z argument 0: {16, 211}
2024-01-04T14:55:33.3162996Z 
2024-01-04T14:55:33.3163238Z =================================================================
2024-01-04T14:55:33.3164180Z [  FAILED  ] FrameHeaderFuzzTest.ParseMultipleRandomFrameHeaders (837 ms)
2024-01-04T14:55:33.3165326Z [----------] 1 test from FrameHeaderFuzzTest (837 ms total)
2024-01-04T14:55:33.3165859Z 
2024-01-04T14:55:33.3166176Z [----------] Global test environment tear-down
2024-01-04T14:55:33.3166902Z [==========] 2 tests from 2 test suites ran. (1568 ms total)
2024-01-04T14:55:33.3167559Z [  PASSED  ] 1 test.
2024-01-04T14:55:33.3168029Z [  FAILED  ] 1 test, listed below:
2024-01-04T14:55:33.3168813Z [  FAILED  ] FrameHeaderFuzzTest.ParseMultipleRandomFrameHeaders
2024-01-04T14:55:33.3169478Z 
2024-01-04T14:55:33.3169632Z  1 FAILED TEST

I've already investigated the issue and managed to fix it. It was caused by not discarding ZSTD frames whose window_size is larger than the specified limit. The PR will soon be updated with the fix.

EDIT: Updated the PR with fixes
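
For reference, the counterexample `{16, 211}` from the log can be decoded by hand. The sketch below applies the Window_Descriptor formula from the Zstandard format specification (RFC 8878) to the second byte. The function names and the `kMaxWindowLog` value are assumptions for illustration; the limit actually used by the PR is not shown in this thread.

```cpp
#include <cstdint>

// Window_Size per RFC 8878:
//   Exponent = descriptor >> 3, Mantissa = descriptor & 7
//   windowLog = 10 + Exponent, windowBase = 2^windowLog
//   Window_Size = windowBase + (windowBase / 8) * Mantissa
uint64_t WindowSize(uint8_t window_descriptor) {
  const unsigned exponent = window_descriptor >> 3;
  const uint64_t mantissa = window_descriptor & 7;
  const uint64_t window_base = uint64_t{1} << (10 + exponent);
  return window_base + (window_base / 8) * mantissa;
}

// Hypothetical limit, chosen only to illustrate the check.
constexpr unsigned kMaxWindowLog = 31;

// True when the window log encoded in the descriptor exceeds the limit.
bool WindowTooLarge(uint8_t window_descriptor) {
  return 10u + (window_descriptor >> 3) > kMaxWindowLog;
}
```

For the byte 211 (exponent 26, mantissa 3), `WindowSize` yields 94489280512, the same value that appears as the first `bits[64]` field on the mismatching side of the log, with a window log of 36.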

@lpawelcz lpawelcz force-pushed the 49967-frame-header branch 2 times, most recently from fb2294d to 070089f Compare January 15, 2024 14:51
rw1nkler and others added 10 commits February 21, 2024 14:57
This commit adds a DSLX Buffer library that provides the Buffer struct,
and helper functions that can be used to operate on it. The Buffer
is meant to be a storage for data coming from the channel. It acts like
a FIFO, allowing data of any length to be put in or popped out of it.
Provided DSLX tests verify the correct behaviour of the library.

Internal-tag: [#50221]
Signed-off-by: Robert Winkler <[email protected]>
This commit adds a simple test that shows how one can use the Buffer
struct inside a Proc.

Internal-tag: [#50221]
Signed-off-by: Robert Winkler <[email protected]>
This commit adds the library with functions for parsing a magic number and
tests that verify its correctness.

Internal-tag: [#50221]
Signed-off-by: Robert Winkler <[email protected]>
This commit adds the library with functions for parsing a frame header.
The provided tests verify the correctness of the library.

Internal-tag: [#49967]
Co-authored-by: Roman Dobrodii <[email protected]>
Co-authored-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>
Internal-tag: [#53329]
Signed-off-by: Pawel Czarnecki <[email protected]>
Required for expected_status inference in C++ tests for ZSTD decoder
components

Internal-tag: [#53465]
Signed-off-by: Pawel Czarnecki <[email protected]>
Internal-tag: [#50967]
Signed-off-by: Robert Winkler <[email protected]>
This commit adds a binary that calls decoding to generate data and loads
it into a vector of bytes.

Internal-tag: [#50967]
Signed-off-by: Robert Winkler <[email protected]>
Internal-tag: [#50967]
Co-authored-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>
lpawelcz added a commit to antmicro/xls that referenced this pull request Feb 21, 2024
google#1168

modules/zstd: Add library for parsing magic number

This commit adds the library with functions for parsing a magic number and
tests that verify its correctness.

Internal-tag: [#50221]
Signed-off-by: Robert Winkler <[email protected]>

modules/zstd: Add library for parsing frame header

This commit adds the library with functions for parsing a frame header.
The provided tests verify the correctness of the library.

Internal-tag: [#49967]
Co-authored-by: Roman Dobrodii <[email protected]>
Co-authored-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>

modules/zstd/frame_header: Add benchmarking rules

Internal-tag: [#53329]
Signed-off-by: Pawel Czarnecki <[email protected]>

dependency_support/libzstd: Make zstd_errors.h public

Required for expected_status inference in C++ tests for ZSTD decoder
components

Internal-tag: [#53465]
Signed-off-by: Pawel Czarnecki <[email protected]>

dependency_support: Add decodecorpus binary

Internal-tag: [#50967]
Signed-off-by: Robert Winkler <[email protected]>

modules/zstd: Add data generator library

This commit adds a binary that calls decoding to generate data and loads
it into a vector of bytes.

Internal-tag: [#50967]
Signed-off-by: Robert Winkler <[email protected]>

modules/zstd: Add zstd frame header tests

Internal-tag: [#50967]
Co-authored-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>
lpawelcz added a commit to antmicro/xls that referenced this pull request Mar 7, 2024
google#1168

@cdleary cdleary added the app Application level functionality (examples, uses of XLS stack) label Mar 27, 2024
@proppy
Member

proppy commented Mar 29, 2024

should we close this and focus on reviewing #1315 ?

@lpawelcz
Contributor

The review will take place in #1315

@proppy this PR can be closed

@proppy proppy closed this Mar 29, 2024
@tmichalak tmichalak deleted the 49967-frame-header branch April 29, 2025 08:43
Labels
app Application level functionality (examples, uses of XLS stack)
6 participants