Optimize #add_referenced #238

Open
pkmiec wants to merge 1 commit into boazsegev:master from appfolio:fix-slowdown-in-add-referenced

@pkmiec pkmiec commented Feb 2, 2024

Summary

The #add_referenced method kept track of existing objects with a Hash whose keys were the objects themselves. This seemed to have been OK with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).

Profiling showed that many of those objects are instances of Hash, and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
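For illustration, a minimal sketch of the idea (not the actual patch). The nested Hash is a made-up stand-in for a parsed PDF object, and `index_by_object` / `index_by_hash` are hypothetical names:

```ruby
# Hypothetical stand-ins for two parsed PDF objects with identical content.
obj  = { Type: :Font, Descendants: [{ Subtype: :CIDFontType2 }] }
copy = { Type: :Font, Descendants: [{ Subtype: :CIDFontType2 }] }

# Hash#hash and Hash#eql? are both content-based.
same_digest  = obj.hash == copy.hash
same_content = obj.eql?(copy)

# Before: the objects themselves are the keys (lookup may call #eql?).
index_by_object = { obj => obj }
found_by_object = index_by_object[copy]

# After: the Integer returned by #hash is the key (lookup compares Integers).
index_by_hash = { obj.hash => obj }
found_by_hash = index_by_hash[copy.hash]

# The digest depends on the content, so it has to be recomputed
# (and the index updated) whenever the object is mutated.
old_digest = obj.hash
obj[:Encoding] = :"Identity-H"
digest_changed = obj.hash != old_digest
```

Both lookups find the original object through an equal copy; after the mutation the old integer key no longer corresponds to the object's current digest.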

Performance

The benchmark was run with the following script:

require 'benchmark'
require 'combine_pdf'

puts "Ruby: #{RUBY_VERSION}"
puts "CombinePDF: #{CombinePDF::VERSION}"

files = []
68.times { |i| files << "/tmp/pdfs/#{i}.pdf" }
files = files * 10 # to exacerbate the effect

result_pdf = CombinePDF.new
files.each { |file| result_pdf << CombinePDF.parse(IO.read(file)) }
puts(Benchmark.measure do
    result_pdf.save('/tmp/combined.pdf')
end)

FYI: we end up with 32053 PDF objects in the @objects array.

Before

Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)

Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)

After

Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)

Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)
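
To make the comparison explicit, the ratios can be worked out from the wall-clock figures (the last, parenthesized column of each Benchmark.measure line above). A small sketch; the variable names are only for illustration:

```ruby
# Wall-clock ("real") seconds copied from the benchmark output above.
before = { "2.7.7" => 2.617881, "3.2.2" => 15.139298 }
after  = { "2.7.7" => 2.786386, "3.2.2" => 2.021782 }

# before / after: values above 1 mean the PR made the run faster.
ratios = before.to_h { |ruby, seconds| [ruby, seconds / after[ruby]] }

ratios.each do |ruby, ratio|
  puts format("Ruby %s: before/after = %.2f", ruby, ratio)
end
```

On these numbers the ratio is roughly 7.5 on Ruby 3.2.2 and slightly below 1 on Ruby 2.7.7, i.e. a large speedup on 3.2.2 and a small slowdown on 2.7.7.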

The #add_referenced kept track of existing object with a hash when the keys
were the objects. This seemed to have been ok with Ruby 2.7, but became
significantly slower in Ruby 3.2.

Profiling showed that many of those objects are instances of Hash and Ruby
uses #eql? method to compare Hash keys. It is not clear what actually changed
in Ruby, but we can work around the issue by using #hash so that the key
is an integer. We just need to recompute that integer if the object changes.

Co-authored-by: Jeremy Kirchhoff <Jeremy.Kirchhoff@appfolio.com>
@BenMorganMY

@pkmiec our use of combine_pdf still experiences the slowdown on Ruby 3.1. I'm wondering if you could explain the use of .hash so that I can find other areas for improvement.

@boazsegev
Owner

boazsegev commented Jul 13, 2024

Using .hash instead of .eql? adds a new risk of Hash collisions – rare but definitely possible when squashing variable multi-byte objects into 64 bits (or, I believe that in Ruby, it would actually be limited to 62 bits).

This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications.
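
A sketch of the failure mode being described. Real collisions are rare, so this forces one with a hypothetical `CollidingObject` class whose #hash always returns the same value; the `existing` and `safe` indexes are illustrative, not CombinePDF's actual data structures:

```ruby
# Hypothetical class: every instance has the same digest, which makes
# the (normally rare) collision reproducible.
class CollidingObject
  attr_reader :payload

  def initialize(payload)
    @payload = payload
  end

  def hash
    42
  end

  def eql?(other)
    other.is_a?(CollidingObject) && payload == other.payload
  end
  alias_method :==, :eql?
end

a = CollidingObject.new("page 1 contents")
b = CollidingObject.new("page 2 contents")

# Integer-keyed index: b collides with a and is treated as a duplicate.
existing = {}
existing[a.hash] ||= a
deduped_b = (existing[b.hash] ||= b)

# Object-keyed index: on a digest match Hash also checks #eql?,
# so b is kept as a distinct entry.
safe = {}
safe[a] ||= a
safe_b = (safe[b] ||= b)
```

With integer keys, `deduped_b` is `a` (the wrong object); with object keys, `safe_b` is `b`.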

Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?

@pkmiec
Author

pkmiec commented Jul 15, 2024

Hi!

@BenMorganMY Have you already profiled your code and confirmed that it is still slow in this #add_referenced method? I can imagine that, given a different set of PDFs, you may be hitting slowness in a different part of the code. But to answer your question, the point of the method is to de-dup references to objects. I imagine that #eql? needs to "walk" the object to determine whether it is the same as another object. Since the objects are Hashes, this walking becomes a recursive operation and thus more expensive. Perhaps the result of this "walking" was cached in earlier versions of Ruby and now it is not. Note that I could not find anything in Ruby's changelog, so I do not know for sure. Using #hash was a way to force that computation to produce a number and then compare numbers instead of Hashes.
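
A sketch that makes the #eql? call visible. `CountingKey` is a hypothetical wrapper that counts how often Ruby asks it for equality during a Hash lookup; it says nothing about what changed between Ruby versions:

```ruby
# Hypothetical wrapper that counts #eql? calls made during Hash lookups.
class CountingKey
  class << self
    attr_accessor :eql_calls
  end
  self.eql_calls = 0

  attr_reader :data

  def initialize(data)
    @data = data
  end

  # Content-based digest, delegated to the wrapped (nested) Hash.
  def hash
    data.hash
  end

  # Content-based equality; for nested Hashes this walks the structure.
  def eql?(other)
    self.class.eql_calls += 1
    data.eql?(other.data)
  end
end

stored = CountingKey.new({ Kids: [{ Type: :Page }, { Type: :Page }] })
probe  = CountingKey.new({ Kids: [{ Type: :Page }, { Type: :Page }] })

# Object key: the digests match, so Ruby calls #eql? to confirm.
object_index = { stored => :found }
object_result = object_index[probe]
calls_after_object_lookup = CountingKey.eql_calls

# Integer key: only Integers are compared, #eql? is never called.
integer_index = { stored.hash => :found }
integer_result = integer_index[probe.hash]
calls_after_integer_lookup = CountingKey.eql_calls
```

Note that the digest itself is still computed from the full content in both cases; what the integer key skips is the equality walk.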

@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers.

@amomchilov
Contributor

I found a related perf improvement in this method, see #241

@boazsegev
Owner

I assume this was fixed in #241 and should be closed?

@amomchilov
Contributor

@boazsegev Not quite. I happened to touch the same line, but a different statement.

I changed the definition and usage of the resolved hash, as in resolved[obj.object_id] = obj, but this PR is changing the existing Hash as in existing[obj.hash] = obj.
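
To illustrate the difference between the two kinds of key, a small sketch. The `resolved` and `existing` hashes here are stand-ins named after the variables mentioned above, not the library's actual code:

```ruby
# Two distinct objects with identical content.
a = { Type: :Page }
b = { Type: :Page }

# Keyed by object_id: identity-based, one entry per object instance.
resolved = {}
resolved[a.object_id] = a

# Keyed by hash: content-based, equal objects share a key.
existing = {}
existing[a.hash] = a

identity_hit = resolved.key?(b.object_id) # b is a different instance
content_hit  = existing.key?(b.hash)      # b has the same content

# Mutation leaves the identity key valid but changes the content digest.
a[:Rotate] = 90
identity_still_valid = resolved.key?(a.object_id)
```

So `object_id` keys never confuse two objects but also never de-duplicate equal ones, while `hash` keys de-duplicate by content and inherit the collision question.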

As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅

@boazsegev
Owner

@amomchilov,

I love how specific and targeted PR #241 had been. However, I assumed it was meant to deal with the hash issue as well. My apologies for that.

However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

SipHash, which is the underlying algorithm for obj.hash, isn't cryptographic and is more likely to produce hash collisions than a cryptographic hash. Moreover, Ruby uses 62 bits from the 64-bit SipHash result, slightly increasing the risk.

Not that I believe that hash collisions are likely, but this could still happen and this means that obj.hash requires additional validation. We need to verify that a hash collision doesn't lead to an error (i.e., validating a correct result by manually checking for equality up to a certain depth).
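
One possible shape for that validation, as a sketch only: keep a small bucket of candidates per digest and confirm with #eql? before treating an object as a duplicate. The `dedup` helper and the `Colliding` struct are hypothetical, and a bounded-depth comparison could be substituted for the full #eql? call:

```ruby
# Hypothetical helper: returns the already-known equal object, or
# registers and returns obj if no equal object has been seen.
def dedup(obj, existing)
  bucket = (existing[obj.hash] ||= [])
  # A matching digest only nominates candidates; equality decides.
  match = bucket.find { |candidate| candidate.eql?(obj) }
  return match if match

  bucket << obj
  obj
end

# Hypothetical struct with a forced digest collision, for demonstration.
Colliding = Struct.new(:payload) do
  def hash
    1
  end
end

existing = {}
a       = dedup(Colliding.new("A"), existing)
b       = dedup(Colliding.new("B"), existing)
a_again = dedup(Colliding.new("A"), existing)
```

Here `a_again` is the original `a`, and `b` survives the collision as its own entry. This is essentially what Hash already does with object keys, so whether it keeps the speedup from this PR would need to be measured.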

@pkmiec,

I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge.

Thanks.

@amomchilov
Contributor

amomchilov commented Nov 30, 2024

> I love how specific and targeted PR #241 had been.

Glad to hear :)

> However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

I believe you're correct. Apart from a perfect hash table (which has restricted applicability, and which Ruby's Hash definitely is not), hash collisions are always possible.

Thus, testing for equality will always be necessary to ensure the wrong value isn't returned via a collision. I don't see any way around that.

I'd be happy to chip in some alternative ideas for speeding this up, but candidly, I had a very hard time following exactly what's going on in this file, and what exact objects are being used as keys.

@denislavski
Contributor

denislavski commented Dec 29, 2025

The fix from this PR is quite valuable when working with big PDF files. Is it still under consideration?
The #add_referenced method is slow even after #241.
Using object_id performs many times better on large PDFs than using the object itself as a key, and is the better pick overall IMO if collisions are a concern.

@amomchilov
Contributor

@boazsegev Can you give some context on how this code path works?

The referenced variable might be able to be converted to use compare_by_identity, but that can only work if there are no distinct-but-equal objects to worry about. For example, if there were two Point.new(x: 1, y: 2) objects, they would have different identities and would not be de-duplicated with a compare_by_identity Set or Hash.
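
A sketch of that caveat, using a hypothetical `Point` struct as in the example above:

```ruby
require "set"

# Hypothetical value type; Struct gives content-based #hash and #eql?.
Point = Struct.new(:x, :y, keyword_init: true)

p1 = Point.new(x: 1, y: 2)
p2 = Point.new(x: 1, y: 2) # equal to p1, but a distinct object

# Default Set: membership is decided by #hash and #eql? (content).
by_value = Set.new
by_value << p1 << p2

# Identity Set: membership is decided by object identity only.
by_identity = Set.new.compare_by_identity
by_identity << p1 << p2 << p1

value_count    = by_value.size
identity_count = by_identity.size
```

The default Set keeps one point, while the identity Set keeps both `p1` and `p2` (adding `p1` a second time is still a no-op), so equal-but-distinct objects would not be merged.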


5 participants