Optimize #add_referenced #238
#add_referenced kept track of existing objects with a hash whose keys were the objects themselves. This seemed to be fine with Ruby 2.7, but became significantly slower in Ruby 3.2. Profiling showed that many of those objects are instances of Hash, and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
Co-authored-by: Jeremy Kirchhoff <Jeremy.Kirchhoff@appfolio.com>
|
@pkmiec our use of combine_pdf still experiences the slowdown on ruby 3.1. I'm wondering if you could explain the use |
|
Using This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications. Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)? |
|
Hi! @BenMorganMY Have you already profiled your code and identified that it is still slow in this @boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers. |
|
I found a related perf improvement in this method, see #241 |
|
I assume this was fixed in #241 and should be closed? |
|
@boazsegev Not quite. I happened to touch the same line, but a different statement. I changed the definition and usage of the As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅 |
|
I love how specific and targeted PR #241 was. However, I assumed it was meant to deal with the However, to restate my previous comment – the SipHash, which is the underlying algorithm for Not that I believe that hash collisions are likely, but this could still happen and this means that @pkmiec, I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge. Thanks. |
Glad to hear :)
I believe you're correct. Apart from a perfect hash table (which has restricted applicability, and Thus, testing for equality will always be necessary to ensure the wrong value isn't returned via collision. I don't see any way around that. I'd be happy to chip in some alternative ideas for speeding this up, but candidly, I had a very hard time following exactly what's going on in this file, and what exact objects are being used as keys. |
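To make the collision concern concrete, here is a minimal sketch of a lookup that buckets by the integer hash but still confirms a match with #eql? before treating two objects as duplicates. The class name `CollisionSafeRegistry` and the optional `key` argument are hypothetical and exist only for illustration; this is not CombinePDF's code, and whether it preserves the speed-up reported in this PR is untested.

```ruby
# Hypothetical sketch, not CombinePDF's implementation.
class CollisionSafeRegistry
  def initialize
    # Integer hash => list of distinct objects that share that hash
    @buckets = Hash.new { |hash, key| hash[key] = [] }
  end

  # Returns an already registered object that is eql? to obj,
  # otherwise registers obj and returns it.
  # `key` defaults to obj.hash; passing it explicitly is only a
  # convenience for simulating a collision.
  def add(obj, key = obj.hash)
    bucket = @buckets[key]
    existing = bucket.find { |candidate| candidate.eql?(obj) }
    return existing if existing

    bucket << obj
    obj
  end
end

registry = CollisionSafeRegistry.new
first    = registry.add({ Type: :Font }, 42)
other    = registry.add({ Type: :Page }, 42) # same key, different content
```

With the equality check in place, `other` is kept as a separate object even though it was filed under the same integer as `first`. The trade-off is that every real duplicate still pays for one #eql? comparison.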
|
The fix from this PR is quite valuable when working with big PDF files. Is it still under consideration? |
|
@boazsegev Can you give some context on how this code path works? The |
Summary
#add_referenced kept track of existing objects with a hash whose keys were the objects themselves. This seemed to be fine with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).
Profiling showed that many of those objects are instances of Hash, and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
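As a rough illustration of the idea (not the actual diff in this PR), the sketch below keys a lookup table by the integer returned from #hash and re-registers an object after it is mutated. The class `ObjectRegistry` and its methods `add` and `update` are hypothetical names. Note that this version trusts the integer alone, so it does not guard against hash collisions.

```ruby
# Hypothetical sketch, not CombinePDF's implementation.
class ObjectRegistry
  def initialize
    @by_hash = {} # Integer (obj.hash) => first object seen with that content
  end

  # Returns the previously registered object with the same content hash,
  # or registers obj and returns it.
  def add(obj)
    @by_hash[obj.hash] ||= obj
  end

  # The integer key is derived from the object's content, so it goes
  # stale when the content changes. Drop the old key, apply the change,
  # then register the object under its new key.
  def update(obj)
    @by_hash.delete(obj.hash) if @by_hash[obj.hash].equal?(obj)
    yield obj
    @by_hash[obj.hash] ||= obj
  end
end

registry = ObjectRegistry.new
font_a = { Type: :Font, Name: "F1" }
font_b = { Type: :Font, Name: "F1" }
registry.add(font_a)
duplicate = registry.add(font_b) # same content, so font_a is returned
```

Because the table's keys are plain integers, looking one up never calls #eql? on the stored Hash objects, which is the cost the profiling pointed at.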
Performance
The benchmark was done with the following script,
FYI ... we end up with 32053 pdf objects in @objects array.
Before
Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)
Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)
After
Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)
Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)