tests: look for regressions when converting PDFs #1089

Draft: wants to merge 8 commits into main
Conversation

@almet (Member) commented Mar 5, 2025

I convert all the documents we have in our test suite and store them in a reference folder, and then compare the new outputs bit for bit against these references, using pymupdf pixel buffers.
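For illustration, here is a minimal sketch of that kind of pixel-buffer comparison with pymupdf (the function name and structure are mine, not necessarily what this PR implements):

```
import fitz  # pymupdf

def pdfs_look_identical(reference_path: str, candidate_path: str) -> bool:
    """Render both PDFs page by page and compare the raw pixel buffers."""
    reference = fitz.open(reference_path)
    candidate = fitz.open(candidate_path)
    if reference.page_count != candidate.page_count:
        return False
    for ref_page, new_page in zip(reference, candidate):
        # get_pixmap() rasterizes the page; .samples is the raw pixel buffer.
        if ref_page.get_pixmap().samples != new_page.get_pixmap().samples:
            return False
    return True
```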

We could use another diffing tool and tell it what level of difference is acceptable (if such a tool exists), but I believe that in the end what matters most is the developer experience when the output changes.

Two things come to mind:

  1. Inspecting the differences
  2. Updating the reference version(s)

Inspecting the diff

There are multiple tools that allow doing that, but I found diff-pdf to be good and able to generate an output we can look at without having to run a GUI.

```
diff-pdf /tmp/pytest-of-alexis/pytest-current/sample-docx0.pdf ./tests/test_docs/reference/sample-docx.pdf -m --output-diff=diff.pdf
```

This produces a diff.pdf file showing the changes between the 0.8.1 release and this commit.

Update the reference version

We should have a command to bump all the reference documents (or a specific one).
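As a sketch of what that could look like, a pytest flag added in conftest.py would probably be enough (the flag name below is hypothetical, not necessarily what this PR uses):

```
# conftest.py -- hypothetical flag name
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--regenerate-reference-pdfs",
        action="store_true",
        default=False,
        help="Overwrite the stored reference PDFs with the current conversion output",
    )

@pytest.fixture
def regenerate_reference_pdfs(request):
    return request.config.getoption("--regenerate-reference-pdfs")
```

A developer could then run something like `pytest --regenerate-reference-pdfs tests/` to bump every reference document, or point pytest at a single test to bump only one.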


Status:

This PR currently only fails tests when there is a change in the output. I plan to do the following:

  • Check that PDF outputs are the same (pixel comparison) in our tests
  • Collect all differences and publish them as an artifact so we can inspect them, probably as part of the CI (see the sketch after this list).
  • Add a tool to update all the reference documents.
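For the artifact part, one possible shape (the directory name and helper are hypothetical) is to dump the differing pages as images into a folder that the CI job then uploads, e.g. with actions/upload-artifact:

```
import os
from pathlib import Path

import fitz  # pymupdf

# Hypothetical output directory; CI would upload it as a build artifact.
DIFF_DIR = Path(os.environ.get("PDF_DIFF_DIR", "artifacts/pdf-diffs"))

def save_differing_pages(reference_path: str, candidate_path: str, name: str) -> None:
    """Save a PNG of every page whose pixels differ, for later inspection."""
    DIFF_DIR.mkdir(parents=True, exist_ok=True)
    reference = fitz.open(reference_path)
    candidate = fitz.open(candidate_path)
    for index, (ref_page, new_page) in enumerate(zip(reference, candidate)):
        ref_pix = ref_page.get_pixmap()
        new_pix = new_page.get_pixmap()
        if ref_pix.samples != new_pix.samples:
            ref_pix.save(str(DIFF_DIR / f"{name}-page{index}-reference.png"))
            new_pix.save(str(DIFF_DIR / f"{name}-page{index}-new.png"))
```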

Fixes #321

This stores a reference version of the converted PDFs and diffs it against the newly converted documents during the tests.
@almet almet changed the title tests: test for regressions when converting PDFs when running the tests tests: look for regressions when converting PDFs when running the tests Mar 6, 2025
@almet almet changed the title tests: look for regressions when converting PDFs when running the tests tests: look for regressions when converting PDFs Mar 6, 2025
almet added 7 commits March 10, 2025 15:42
This is useful to reduce the computation time when creating PDF visual diffs. Here is a comparison of the same operation using plain Python arrays and NumPy arrays with lookups:

Python arrays:
```
diff took 5.094218431997433 seconds
diff took 3.1553626069980965 seconds
diff took 3.3721952960004273 seconds
diff took 3.2134646750018874 seconds
diff took 3.3410625500000606 seconds
diff took 3.2893160990024626 seconds
```

Numpy:
```
diff took 0.13705662599750212 seconds
diff took 0.05698924000171246 seconds
diff took 0.15319590600120137 seconds
diff took 0.06126453700198908 seconds
diff took 0.12916332699751365 seconds
diff took 0.05839455900058965 seconds
```
This makes it easier to inspect after CI run failures.
This leverages a new flag that can be passed during the tests to
regenerate the PDFs if needed.
This is to see what a failing CI run would look like.
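Regarding the NumPy-based comparison from the commit message above, here is a minimal sketch of the idea (the actual implementation in the commit may differ):

```
import fitz  # pymupdf
import numpy as np

def diff_mask(ref_pix: fitz.Pixmap, new_pix: fitz.Pixmap) -> np.ndarray:
    """Return a boolean mask of the pixels that differ between two rendered pages."""
    ref = np.frombuffer(ref_pix.samples, dtype=np.uint8).reshape(
        ref_pix.height, ref_pix.width, ref_pix.n
    )
    new = np.frombuffer(new_pix.samples, dtype=np.uint8).reshape(
        new_pix.height, new_pix.width, new_pix.n
    )
    # One vectorized comparison over the whole buffer instead of a per-pixel
    # Python loop, which is where the speedup in the timings above comes from.
    return np.any(ref != new, axis=-1)
```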