Spicy units have significantly more overhead than binpac's records #1989
Replies: 3 comments
---
Interesting analysis. A couple of quick high-level thoughts:
---
Meta-comment: Collecting observations like this may work better as a "Discussion" than a ticket, leaving tickets for anything concretely actionable.
---
tl;dr: I'm actually pretty confident/excited about being able to hit some simple things with meaningful impact, with more to come.

So far I've found two pretty simple optimizations during parser generation (with bad implementations on a branch): […]

With both, I get a ~5% speedup on my end-to-end Spicy SSL analyzer test. My guess is this speedup is carried by point 2; point 1 only kicks in for large vectors with […].

This seems to be a pretty tangible path forward. Optimizing for this case (decreasing overhead in units) gives real, measurable speedups with relatively little work, so some of these microbenchmarks seem like decent proxies for real performance gains. I'll keep trucking along with them and, after a few more, put together a PR and get the benchmarking stuff in. I'm also pretty confident that "optimizations" here will make control-flow/alias analysis a bit easier. I think some of them have more impact in Zeek than in Spicy alone. For example, the […]
---
This is another issue like #1986, namely a relatively vague issue to address a relatively slow part of Spicy. This one seems less direct than regular expressions: Spicy units simply have more capabilities than binpac records. However, that extra capability produces noticeable slowdowns for parsers that should otherwise be quite performant, simply because they are split up into units (sometimes for very valid reasons, like a vector of units).
This Spicy parser is relatively slow compared to its binpac counterpart.
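In rough sketch form (a minimal reconstruction of the shape, not necessarily the exact benchmarked code; `Inner` matches the unit discussed below, while the module and outer names are placeholders):

```spicy
module Bench;

# One Inner unit gets instantiated for every input byte.
type Inner = unit {
    b: uint8;
};

# Public entry point: parse Inner units until end of data.
public type Outer = unit {
    inner: Inner[];
};
```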
That is, a parser which creates a new inner unit for every byte; obviously an extreme example. This takes roughly 14 seconds to parse a 100MB input on my laptop. By comparison, the "equivalent" binpac parser takes 3.8 seconds.
This example seems pretty clearly a place where Spicy simply allows more functionality, so the overhead to enable it is higher. From what I can tell, this manifests as generated code that has a lot of features stripped when optimizing, but still keeps enough of the scaffolding to be noticeably slower than the binpac parser.
Here is the `__parse_stage1` function for the `Inner` unit:

Inner::__parse_stage1 (Spicy)
For a minimal unit, this seems like a lot of overhead, and it doesn't even parse the byte yet. Most of the method arguments aren't needed, and the overhead seems significant for some uses. For comparison, here's the entire parse method for `Inner` in binpac:

Inner::Parse (binpac)
Unfortunately, this isn't really an easy case to solve with some magic "optimizer" that removes unnecessary code. Spicy just gives you access to a lot more features that generate code, and stripping that code out can be hard. I think that, given a bunch of time, we can decrease the code generated here with fine-tuned optimizations, but it's questionable how often such simple units even pop up in practice.
I wonder to what extent the issue is that units are too easy to create for how costly they are. More friction between a user and a costly unit seems good, so that they don't accidentally blow up their parser's performance. Units having to be qualified as `public` to keep a decent amount of the overhead is good here; I wouldn't intuitively understand that a `public` unit is costly without seeing it in the documentation, though.

Maybe there's room for parsing structs as a plain old data type with very few of a unit's capabilities, which could then get inlined or something. Unfortunately there are other costs here: right now structs aren't parseable, so parsing vectors would be different, and recursive types become at best a pain, at worst impossible.
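For concreteness, a minimal sketch of the qualifier in question (hypothetical unit names; as I understand the docs, only `public` units can serve as parsing entry points for host applications, which is why they keep the full external interface):

```spicy
module Sketch;

# Public: usable as a parsing entry point by a host application
# (e.g. Zeek), so the full generated interface sticks around.
public type Banner = unit {
    magic: bytes &size=4;
};

# Module-internal: only reachable from other units in this module,
# leaving more room to strip scaffolding.
type Version = unit {
    major: uint8;
    minor: uint8;
};
```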
I'll leave the rest for further investigation, though. Decreasing the generated code entirely through an optimizer seems quite tricky. A starting point could be to look at the Spicy-generated C++ code for that minimal `Inner` unit and see what tearing some things down does: for example, there are sometimes multiple `__location__` calls and debug information propagated, there are a bunch of unused function parameters, and there are some `self` references that create significant overhead. But those may be necessary, or hard to take out.

Aside: Switches
I also noticed that swapping the `Inner` unit for one with a `switch (True)` before parsing the byte causes a significant slowdown (2.4 seconds, ~25%), despite the unit staying simple; see the sketch below.

I don't think this is worth optimizing for too much; people simply aren't going to write this pattern (switching over a constant like that). But it does give some insight into how switching over a relatively simple field (like an enum value) may affect a parser's performance: the cost is relatively significant. I think that is just more overhead causing more slowdown.
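A minimal sketch of that shape (my reconstruction, not necessarily the exact unit measured):

```spicy
module Bench;

# Same single-byte unit, but the field sits behind a switch over a
# constant. The True branch is always taken, yet the generated
# dispatch scaffolding still costs noticeably at parse time.
type Inner = unit {
    switch ( True ) {
        True -> b: uint8;
    };
};
```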