Spicy units have significantly more overhead than binpac's records #1989
Replies: 3 comments
---
Interesting analysis. A couple of quick high-level thoughts:
---
Meta-comment: Collecting observations like this may work better as a "Discussion" than a ticket, leaving tickets for anything concretely actionable.
---
tl;dr: I'm actually pretty confident/excited about being able to hit some simple things with meaningful impact, with more to come.

So far I've found two pretty simple optimizations during parser generation (with bad implementations on a branch): […]

With both, I get a ~5% speedup on my end-to-end Spicy SSL analyzer test. My guess is this speedup is carried by point 2; point 1 only kicks in for large vectors with […].

This seems to be a pretty tangible path forward. Optimizing for this case (decreasing overhead in units) gives real, measurable speedups with relatively little work, so some of these microbenchmarks seem like decent proxies for real performance gains. I'll keep trucking along with them and, after a few more, put together a PR and get the benchmarking stuff in. I'm also pretty confident that "optimizations" here will make control-flow/alias analysis a bit easier. I think some of them have more impact in Zeek than in Spicy alone. For example, the […]
---
This is another issue like #1986, namely a relatively vague issue to address a relatively slow part of Spicy. This one seems less direct than regular expressions: Spicy units simply have more capabilities than binpac records. However, that extra capability produces noticeable slowdowns for parsers that should otherwise be quite performant, simply because they are split up into units (sometimes for very valid reasons, like a vector of units).
This Spicy parser is relatively slow compared to its binpac counterpart.
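In rough sketch form (a minimal reconstruction of the shape, not necessarily the exact benchmarked code; `Inner` matches the unit discussed below, while the module and outer names are placeholders):

```spicy
module Bench;

# One Inner unit gets instantiated for every input byte.
type Inner = unit {
    b: uint8;
};

# Public entry point: parse Inner units until end of data.
public type Outer = unit {
    inner: Inner[];
};
```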
That is, a parser which creates a new inner unit for every byte; obviously an extreme example. This takes roughly 14 seconds to parse a 100MB input on my laptop. By comparison, the "equivalent" binpac parser takes 3.8 seconds.
This example seems pretty clearly a place where Spicy simply allows more functionality, so the overhead to enable it is higher. From what I can tell, this manifests as generated code that has a lot of features stripped when optimizing, but still keeps enough of the scaffolding to be noticeably slower than the binpac parser.
Here is the `__parse_stage1` function for the `Inner` unit:

Inner::__parse_stage1 (Spicy)
For a minimal unit, this seems like a lot of overhead, and it doesn't even parse the byte yet. Most of the method arguments aren't needed, and the overhead seems significant for some uses. For comparison, here's the entire parse method for `Inner` in binpac:

Inner::Parse (binpac)
Unfortunately, this isn't really an easy case to solve with some magic "optimizer" that removes unnecessary code. Spicy just gives you access to a lot more features that generate code, and stripping that code out can be hard. I think that, given a bunch of time, we can decrease the code generated here with fine-tuned optimizations, but it's questionable how often such simple units even pop up in practice.
I wonder to what extent the issue is that units are too easy to create for how costly they are. More friction between a user and a costly unit seems good, so that they don't accidentally blow up their parser's performance. Units having to be qualified as `public` to keep a decent amount of the overhead is good here; I wouldn't intuitively understand that a `public` unit is costly without seeing it in the documentation, though.

Maybe there's room for parsing structs as a plain old data type with very few of a unit's capabilities, which could then get inlined or something. Unfortunately there are other costs here: right now structs aren't parseable, so parsing vectors would be different, and recursive types become at best a pain, at worst impossible.
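For concreteness, a minimal sketch of the qualifier in question (hypothetical unit names; as I understand the docs, only `public` units can serve as parsing entry points for host applications, which is why they keep the full external interface):

```spicy
module Sketch;

# Public: usable as a parsing entry point by a host application
# (e.g. Zeek), so the full generated interface sticks around.
public type Banner = unit {
    magic: bytes &size=4;
};

# Module-internal: only reachable from other units in this module,
# leaving more room to strip scaffolding.
type Version = unit {
    major: uint8;
    minor: uint8;
};
```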
I'll leave the rest for further investigation, though. Decreasing the generated code entirely through an optimizer seems quite tricky. A starting point could be to look at the Spicy-generated C++ code for that minimal `Inner` unit and see what tearing some things down does: for example, there are sometimes multiple `__location__` calls and debug information propagated, there are a bunch of unused function parameters, and there are some `self` references that create significant overhead. But those may be necessary, or hard to take out.

Aside: Switches
I also noticed that swapping the `Inner` unit for one with a `switch (True)` before parsing the byte causes a significant slowdown (2.4 seconds, ~25%), despite the unit staying simple; see the sketch below.

I don't think this is worth optimizing for too much; people simply aren't going to write this pattern (switching over a constant like that). But it does give some insight into how switching over a relatively simple field (like an enum value) may affect a parser's performance: the cost is relatively significant. I think that is just more overhead causing more slowdown.
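A minimal sketch of that shape (my reconstruction, not necessarily the exact unit measured):

```spicy
module Bench;

# Same single-byte unit, but the field sits behind a switch over a
# constant. The True branch is always taken, yet the generated
# dispatch scaffolding still costs noticeably at parse time.
type Inner = unit {
    switch ( True ) {
        True -> b: uint8;
    };
};
```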