Optimize #add_referenced by pkmiec · Pull Request #238 · boazsegev/combine_pdf

pkmiec · 2024-02-02T17:08:20Z

Summary

#add_referenced kept track of existing objects with a hash whose keys were the objects themselves. This seemed to be fine with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).

Profiling showed that many of those objects are instances of Hash, and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
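A minimal sketch of the idea described above, not the PR's actual diff. The helper `index_by_digest` and the font dictionaries are hypothetical stand-ins for parsed PDF objects; the only Ruby behavior relied on is that Hashes with equal content report equal #hash values.

```ruby
# Hypothetical helper: index objects by the Integer returned from #hash,
# so the lookup table compares Integers instead of walking nested Hashes.
def index_by_digest(objects)
  objects.each_with_object({}) { |obj, index| index[obj.hash] ||= obj }
end

# Hypothetical stand-ins for two parsed PDF dictionaries with equal content.
font_a = { Type: :Font, BaseFont: :Helvetica }
font_b = { Type: :Font, BaseFont: :Helvetica }

index = index_by_digest([font_a])
puts index[font_b.hash].equal?(font_a) # the duplicate resolves to the first object

# Changing the object changes its content, so the Integer key is stale
# and has to be recomputed before the next lookup.
font_a[:Encoding] = :WinAnsiEncoding
index = index_by_digest([font_a])
```

Note that this sketch trusts the Integer key completely; it does not check that two objects with the same #hash are actually equal.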

Performance

The benchmark was done with the following script:

require 'benchmark'
require 'combine_pdf'

puts "Ruby: #{RUBY_VERSION}"
puts "CombinePDF: #{CombinePDF::VERSION}"

files = []
68.times { |i| files << "/tmp/pdfs/#{i}.pdf" }
files = files * 10 # to exacerbate the effect

result_pdf = CombinePDF.new
files.each { |file| result_pdf << CombinePDF.parse(IO.read(file)) }
puts(Benchmark.measure do
    result_pdf.save('/tmp/combined.pdf')
end)

FYI: we end up with 32053 PDF objects in the @objects array.

Before

Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)

Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)

After

Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)

Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)

Co-authored-by: Jeremy Kirchhoff <Jeremy.Kirchhoff@appfolio.com>

BenMorganMY · 2024-07-11T18:16:09Z

@pkmiec our use of combine_pdf still experiences the slowdown on Ruby 3.1. I'm wondering if you could explain the use of .hash so that I can find other areas for improvement.

boazsegev · 2024-07-13T22:44:54Z

Using .hash instead of .eql? adds a new risk of Hash collisions – rare but definitely possible when squashing variable multi-byte objects into 64 bits (or, I believe that in Ruby, it would actually be limited to 62 bits).

This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications.
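To make the concern concrete, here is a small sketch that forces a collision. `Colliding`, `dedup_by_digest`, and `dedup_by_object` are hypothetical names for illustration only; the deliberately weak #hash stands in for a (rare) real collision and is not how CombinePDF objects behave.

```ruby
# Hypothetical value object whose #hash is deliberately weak, to force a collision.
Colliding = Struct.new(:payload) do
  def hash
    42 # every instance reports the same hash code
  end
end

# De-duplicate using only the Integer from #hash as the key (no equality check).
def dedup_by_digest(objects)
  existing = {}
  objects.map { |obj| existing[obj.hash] ||= obj }
end

# De-duplicate using the object as the key (Ruby checks #hash, then #eql?).
def dedup_by_object(objects)
  existing = {}
  objects.map { |obj| existing[obj] ||= obj }
end

page_one = Colliding.new("page 1")
page_two = Colliding.new("page 2")

p dedup_by_digest([page_one, page_two]).map(&:payload) # ["page 1", "page 1"]
p dedup_by_object([page_one, page_two]).map(&:payload) # ["page 1", "page 2"]
```

With the Integer key, the second object is silently replaced by the first; with the object key, the #eql? check keeps them apart.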

Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?

pkmiec · 2024-07-15T20:56:43Z

Hi!

@BenMorganMY Have you already profiled your code and confirmed that it is still slow in this #add_referenced method? I can imagine that, given a different set of PDFs, you may be hitting slowness in a different part of the code. But to answer your question, the point of the method is to try to de-dup references to objects. I imagine that #eql? needs to "walk" the object to compute whether it is the same as another object. Since the objects are Hashes, this walking becomes a recursive operation and thus more expensive. Perhaps the result of this "walking" was cached in earlier versions of Ruby and now it is not. Note, I could not find anything in Ruby's changelog, so I do not know for sure. Using #hash was a way to force that computation to produce a number and then compare numbers instead of Hashes.
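The "walking" can be observed with a small probe. This is only a sketch: `CountingLeaf` and `lookup_with_nested_key` are hypothetical, and the counters show that nested values are consulted during a Hash-keyed lookup, not how expensive that is in any particular Ruby version.

```ruby
# Hypothetical leaf value that counts how often Ruby asks it for #hash and #eql?.
class CountingLeaf
  class << self
    attr_accessor :hash_calls, :eql_calls
  end
  self.hash_calls = 0
  self.eql_calls = 0

  attr_reader :value

  def initialize(value)
    @value = value
  end

  def hash
    self.class.hash_calls += 1
    value.hash
  end

  def eql?(other)
    self.class.eql_calls += 1
    other.is_a?(CountingLeaf) && value.eql?(other.value)
  end
  alias_method :==, :eql?
end

# Looks up a nested Hash key that is equal to, but not the same object as,
# the stored key, so Ruby must compare the two keys by content.
def lookup_with_nested_key
  stored_key = { Resources: { Font: CountingLeaf.new("F1") } }
  probe_key  = { Resources: { Font: CountingLeaf.new("F1") } }
  table = { stored_key => :found }
  table[probe_key]
end

result = lookup_with_nested_key
puts "result=#{result} hash_calls=#{CountingLeaf.hash_calls} eql_calls=#{CountingLeaf.eql_calls}"
```

Both counters end up non-zero: the leaf is reached once per key while hashing and again while the two keys are compared with #eql?.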

@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers.

amomchilov · 2024-11-02T19:56:44Z

I found a related perf improvement in this method, see #241

boazsegev · 2024-11-10T03:03:53Z

I assume this was fixed in #241 and should be closed?

amomchilov · 2024-11-10T14:32:10Z

@boazsegev Not quite. I happened to touch the same line, but a different statement.

I changed the definition and usage of the resolved hash, as in resolved[obj.object_id] = obj, but this PR is changing the existing Hash as in existing[obj.hash] = obj.
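For readers comparing the two keys, a short sketch of how they differ. `compare_keys` and the dictionaries are hypothetical; `resolved` and `existing` themselves are not reproduced here.

```ruby
# Returns [same_hash, same_object_id] for two objects.
def compare_keys(left, right)
  [left.hash == right.hash, left.object_id == right.object_id]
end

# Hypothetical stand-ins for two parsed dictionaries with equal content.
dict_a = { Type: :Page }
dict_b = { Type: :Page }

p compare_keys(dict_a, dict_b) # [true, false]: equal content, distinct objects
p compare_keys(dict_a, dict_a) # [true, true]: the very same object

# object_id stays the same when the object is mutated; #hash follows the content.
id_before = dict_a.object_id
dict_a[:Rotate] = 90
puts dict_a.object_id == id_before # true
```

So an object_id key only ever matches the same instance, while a #hash key matches any object with equal content (and, rarely, an unequal one).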

As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅

boazsegev · 2024-11-10T16:20:33Z

@amomchilov ,

I love how specific and targeted PR #241 had been. However, I assumed it was meant to deal with the hash issue as well. My apologies for that.

However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

SipHash, which is the underlying algorithm for obj.hash, isn't cryptographic and is more likely to produce a hash collision than a cryptographic hash. Moreover, Ruby uses 62 bits from the 64-bit SipHash result, slightly increasing the risk.

Not that I believe that hash collisions are likely, but this could still happen and this means that obj.hash requires additional validation. We need to verify that a hash collision doesn't lead to an error (i.e., validating a correct result by manually checking for equality up to a certain depth).
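One possible shape for such validation, as a sketch only: keep a small bucket per #hash value and confirm each candidate with #eql? before reusing it. `DigestRegistry` and `fetch_or_store` are hypothetical names, not part of CombinePDF.

```ruby
# Hypothetical registry: buckets keyed by #hash, with #eql? as the tie-breaker.
class DigestRegistry
  def initialize
    @buckets = Hash.new { |buckets, digest| buckets[digest] = [] }
  end

  # Returns the previously stored equal object, or stores and returns obj.
  def fetch_or_store(obj)
    bucket = @buckets[obj.hash]
    match = bucket.find { |candidate| candidate.eql?(obj) }
    return match if match

    bucket << obj
    obj
  end
end

registry = DigestRegistry.new
first  = { Type: :Font, BaseFont: :Helvetica }
second = { Type: :Font, BaseFont: :Helvetica }
registry.fetch_or_store(first)
puts registry.fetch_or_store(second).equal?(first) # true: confirmed duplicate
```

This is essentially what Ruby's own Hash does for object keys, so every duplicate hit pays for an #eql? comparison again. Whether it still avoids the slowdown reported in this PR is untested here.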

@pkmiec ,

I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge.

Thanks.

amomchilov · 2024-11-30T04:35:35Z

> I love how specific and targeted PR #241 had been.

Glad to hear :)

> However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

I believe you're correct. Apart from a perfect hash table (which has restricted applicability, and Hash definitely is not one), hash collisions are always possible.

Thus, testing for equality will always be necessary to ensure the wrong value isn't returned via a collision. I don't see any way around that.

I'd be happy to chip in some alternative ideas for speeding this up, but candidly, I had a very hard time following exactly what's going on in this file, and what exact objects are being used as keys.

denislavski · 2025-12-29T13:01:01Z

The fix from this PR is quite valuable when working with big PDF files. Is it still under consideration?
Method #add_referenced is slow even after #241.
Using object_id is several times faster on large PDFs than using the object itself as a key, and is the better pick overall IMO if collisions are a concern.

amomchilov · 2026-01-01T20:09:09Z

@boazsegev Can you give some context on how this code path works?

The referenced variable might be able to be converted to use compare_by_identity, but that can only work if there are no distinct-but-equal objects to worry about. For example, if there were two Point.new(x: 1, y: 2) objects, they would have different identities and would _not_ be de-duplicated with a compare_by_identity Set or Hash.
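A small sketch of that caveat. `Point` is defined here as a Struct purely for illustration, and `distinct_count` is a hypothetical helper.

```ruby
# Illustrative value type; Struct instances compare by content by default.
Point = Struct.new(:x, :y, keyword_init: true)

# Counts how many distinct keys a Hash ends up with for the given objects.
def distinct_count(objects, by_identity:)
  seen = {}
  seen.compare_by_identity if by_identity
  objects.each { |obj| seen[obj] = true }
  seen.size
end

first  = Point.new(x: 1, y: 2)
second = Point.new(x: 1, y: 2) # equal to first, but a different object

p distinct_count([first, second, first], by_identity: false) # 1
p distinct_count([first, second, first], by_identity: true)  # 2
```

With compare_by_identity, only the repeated reference to `first` collapses; the equal-but-distinct `second` is kept as a separate entry.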

pkmiec wants to merge 1 commit into boazsegev:master from appfolio:fix-slowdown-in-add-referenced
