Skip to content

feat: update PDF saving logic to ensure valid metadata handling and support for cross-reference streams#21

Merged
balazs-szucs merged 2 commits intogrimmory-tools:mainfrom
balazs-szucs:overheal-save
Mar 29, 2026
Merged

feat: update PDF saving logic to ensure valid metadata handling and support for cross-reference streams#21
balazs-szucs merged 2 commits intogrimmory-tools:mainfrom
balazs-szucs:overheal-save

Conversation

@balazs-szucs
Copy link
Copy Markdown
Member

@balazs-szucs balazs-szucs commented Mar 29, 2026

Summary by CodeRabbit

  • New Features

    • Better support for PDFs using cross-reference streams during save operations
    • Safer fallback/save path that always produces a complete serialized output
  • Bug Fixes

    • More reliable metadata merging and application when saving documents
    • Added corruption-safety validation to ensure saved PDFs can be reopened
    • Improved robustness for incremental update generation and catalog rewrites
  • Documentation

    • Clarified save flag guidance in the API reference
  • Tests

    • Updated tests to validate reopenability and metadata/XMP correctness, including cross-ref-stream cases

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 29, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 68c103e3-b774-40b4-ba46-e6b519aa364d

📥 Commits

Reviewing files that changed from the base of the PR and between 15facf9 and 4a24096.

📒 Files selected for processing (3)
  • src/main/java/org/grimmory/pdfium4j/PdfDocument.java
  • src/main/java/org/grimmory/pdfium4j/PdfSaver.java
  • src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java

📝 Walkthrough

Walkthrough

Version bumped to 0.10.0. Metadata fast-path removed: metadata is merged with native Info on save and all saves route through a unified saveToBytes pipeline. PdfSaver now does tail-focused incremental updates, xref-stream support, and validation-by-reopening to guard against corruption; tests updated for xref-stream scenarios.

Changes

Cohort / File(s) Summary
Build Configuration
build.gradle.kts
Project version updated from 0.9.00.10.0.
PDF document metadata & save orchestration
src/main/java/org/grimmory/pdfium4j/PdfDocument.java
Removed structurallyModified branch and metadata-only save; metadata(MetadataTag) may read native values via helper; added buildMergedMetadata() and unified save paths to always use saveToBytes() with merged metadata.
Incremental save, xref-stream, and validation logic
src/main/java/org/grimmory/pdfium4j/PdfSaver.java
Rewrote incremental/update pipeline: perform native save, append metadata via tail-focused parsing, detect/write xref-streams (or xref table+trailer), removed in-place XMP replacement, added corruption-safety validation (re-open & read metadata), and introduced structured logging.
Native binding docs
src/main/java/org/grimmory/pdfium4j/internal/EditBindings.java
Expanded JavaDoc for FPDF_SaveAsCopy flags and noted FPDF_INCREMENTAL is non-functional in OSS PDFium; no API/signature changes.
Tests (xref-stream cases & assertions)
src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java
Replaced file-size-based assertions with reopen/validate checks; added minimal cross-reference-stream PDF generator and tests that write/reopen metadata and XMP (including repeated writes) to verify persistence and parsing.

Sequence Diagram(s)

sequenceDiagram
    participant Doc as PdfDocument
    participant Saver as PdfSaver
    participant Native as PDFium Native
    participant Validator as Validation

    Doc->>Saver: saveToBytes(handle, pendingMetadata, pendingXmp)
    Saver->>Native: nativeSave(handle)  -- FPDF_SaveAsCopy (full save)
    Native-->>Saver: basePdfBytes

    Saver->>Saver: buildMergedMetadata(basePdfBytes, pendingMetadata)
    Saver->>Saver: detectXrefStream(basePdfBytes)
    alt xref-stream present
        Saver->>Saver: appendXrefStreamSection(basePdfBytes, mergedMetadata, xmp)
    else traditional xref
        Saver->>Saver: appendIncrementalUpdate(basePdfBytes, mergedMetadata, xmp)
    end

    Saver->>Validator: validateOrFallback(candidateBytes, expectedMetadata)
    Validator->>Native: FPDF_LoadMemDocument(candidateBytes)
    Native-->>Validator: loadResult / pageCount / metaRead
    alt validation OK
        Validator-->>Saver: validatedBytes
    else validation failed
        Validator-->>Saver: fallback to basePdfBytes (log warning)
    end

    Saver-->>Doc: finalPdfBytes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

Poem

🐰 Hop, I stitched the Info and tail,

XRef streams now leave a trail,
Merged metadata, safe and bright,
Reopen checks keep bytes just right,
0.10.0 — a tiny, clever sprout.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 65.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title follows the conventional commit format with 'feat:' prefix and clearly describes the main changes: PDF saving logic improvements, metadata handling, and cross-reference stream support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 PMD (7.22.0)
src/main/java/org/grimmory/pdfium4j/PdfSaver.java

A fatal error has been detected by the Java Runtime Environment:

SIGBUS (0x7) at pc=0x00007f75aa417ffa, pid=1, tid=57

JRE version: (17.0.18+8) (build )

Java VM: OpenJDK 64-Bit Server VM (17.0.18+8-Debian-1deb12u1, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)

Problematic frame:

C [libc.so.6+0x16dffa]

No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/hs_err_pid1.log

[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f75aa2d050f]


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java (1)

1435-1435: ⚠️ Potential issue | 🟡 Minor

Remove unused variable originalSize.

The variable is declared but never used after the test assertions were updated. The pipeline also flagged this with a Checkstyle warning.

🧹 Proposed fix
   void xmpOnlySaveUsesIncrementalUpdate(`@TempDir` Path tempDir) throws IOException {
     Path testPdf = getTestPdf();
     if (testPdf == null) return;
 
-    long originalSize = Files.size(testPdf);
     Path pdf = tempDir.resolve("xmp-only.pdf");
     Files.copy(testPdf, pdf);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java` at line 1435,
Remove the unused local variable by deleting the declaration "long originalSize
= Files.size(testPdf);" in PdfDocumentTest (the unused symbol is originalSize
and it uses Files.size(testPdf)); ensure no other code references originalSize
and run tests/style checks to confirm the Checkstyle warning is resolved.
src/main/java/org/grimmory/pdfium4j/PdfDocument.java (1)

54-54: ⚠️ Potential issue | 🟡 Minor

Remove unused field structurallyModified.

This field is set but never read after the removal of the conditional save fast path. The pipeline also flagged this with a Checkstyle warning.

🧹 Proposed fix
-  private volatile boolean structurallyModified = false;

Also remove the assignments at lines 689, 720, and 757 where structurallyModified = true is set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/org/grimmory/pdfium4j/PdfDocument.java` at line 54, Remove the
unused boolean field structurallyModified from class PdfDocument and delete all
assignments to it (the statements setting structurallyModified = true); search
for the symbol structurallyModified and remove its declaration "private volatile
boolean structurallyModified = false;" and any standalone assignment statements
that set it so no dead field or unreachable writes remain (ensure no other code
references the symbol afterwards).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main/java/org/grimmory/pdfium4j/PdfSaver.java`:
- Around line 574-583: The method findStartxrefInTail currently declares an
unused parameter tailOffset; remove the parameter from its signature and update
all call sites that pass tailOffset to call findStartxrefInTail(tail) instead
(e.g., the four locations that currently pass tailOffset). Ensure
imports/overloads are not needed and run tests/compile to catch any remaining
references to tailOffset in PdfSaver.

---

Outside diff comments:
In `@src/main/java/org/grimmory/pdfium4j/PdfDocument.java`:
- Line 54: Remove the unused boolean field structurallyModified from class
PdfDocument and delete all assignments to it (the statements setting
structurallyModified = true); search for the symbol structurallyModified and
remove its declaration "private volatile boolean structurallyModified = false;"
and any standalone assignment statements that set it so no dead field or
unreachable writes remain (ensure no other code references the symbol
afterwards).

In `@src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java`:
- Line 1435: Remove the unused local variable by deleting the declaration "long
originalSize = Files.size(testPdf);" in PdfDocumentTest (the unused symbol is
originalSize and it uses Files.size(testPdf)); ensure no other code references
originalSize and run tests/style checks to confirm the Checkstyle warning is
resolved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 63d18598-b9ec-4ec6-91b5-7e8455c6b5bd

📥 Commits

Reviewing files that changed from the base of the PR and between 0ee1f5e and 15facf9.

📒 Files selected for processing (5)
  • build.gradle.kts
  • src/main/java/org/grimmory/pdfium4j/PdfDocument.java
  • src/main/java/org/grimmory/pdfium4j/PdfSaver.java
  • src/main/java/org/grimmory/pdfium4j/internal/EditBindings.java
  • src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java
📜 Review details
🧰 Additional context used
🪛 GitHub Actions: CI
src/main/java/org/grimmory/pdfium4j/PdfDocument.java

[warning] 54-54: CheckstyleMain: UnusedPrivateField. Avoid unused private fields such as 'structurallyModified'.

src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java

[warning] 1435-1435: CheckstyleTest: UnusedLocalVariable. Avoid unused local variables such as 'originalSize'.

src/main/java/org/grimmory/pdfium4j/PdfSaver.java

[warning] 574-574: CheckstyleMain: UnusedFormalParameter. Avoid unused method parameters such as 'tailOffset'.

🪛 PMD (7.22.0)
src/main/java/org/grimmory/pdfium4j/PdfSaver.java

[Medium] 574-574: Avoid unused method parameters such as 'tailOffset'. (UnusedFormalParameter)

(Best Practices)

🔇 Additional comments (12)
src/main/java/org/grimmory/pdfium4j/internal/EditBindings.java (1)

76-83: Documentation is accurate—no changes needed.

The JavaDoc improvements are verified as correct:

  • Link validity: The Google Groups link (https://groups.google.com/g/pdfium/c/6SklEc2lYNM) is accessible and properly documents the FPDF_INCREMENTAL issue.
  • Flag values: All documented values match the PDFium fpdf_save.h specification (0 = full, 1 = incremental, 2 = no incremental, 4 = remove security).
  • Codebase usage: No code in the repository is using the broken flag 1; the implementation correctly uses flag 0 as warned.

The warning about FPDF_INCREMENTAL is necessary and well-documented, preventing users from introducing corrupt output. The documentation is consistent with explanations in PdfSaver.java.

build.gradle.kts (1)

16-16: LGTM!

Version bump to 0.10.0 appropriately reflects the scope of changes in this PR (refactored save pipeline, cross-reference stream support).

src/main/java/org/grimmory/pdfium4j/PdfDocument.java (3)

869-890: LGTM!

The readNativeMetadata helper correctly implements the double-call pattern for FPDF_GetMetaText and properly handles the UTF-16LE null terminator case.


1273-1293: LGTM!

The merge logic correctly:

  1. Preserves existing metadata entries not being modified
  2. Allows pending values to override existing entries
  3. Supports clearing tags by setting empty string values

1248-1265: LGTM!

The unified save path through saveToBytes() simplifies the implementation and ensures consistent metadata handling across all save methods.

src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java (2)

1461-1515: Well-structured xref stream PDF generator.

The implementation correctly follows PDF spec §7.5.8 for cross-reference streams with proper W=[1,4,0] encoding and Index specification. This provides valuable test coverage for the new xref stream handling in PdfSaver.


1522-1594: LGTM!

Comprehensive test coverage for cross-reference stream PDFs covering:

  • Basic metadata persistence
  • Combined XMP + Info dictionary writes
  • Double write scenarios (incremental updates on incrementally updated PDFs)
src/main/java/org/grimmory/pdfium4j/PdfSaver.java (5)

131-181: LGTM!

Robust validation strategy that guarantees no corrupt output by re-opening the produced bytes with PDFium. The graceful fallback to base bytes on validation failure is a sound approach for production reliability.


470-516: LGTM!

Tail-focused parsing is a sound approach that:

  1. Is immune to false matches in binary stream data
  2. Handles both traditional trailers and xref streams
  3. Uses a generous 4096-byte window (spec requires startxref within last 1024)

419-462: LGTM!

Correct implementation of cross-reference streams per PDF spec §7.5.8:

  • Valid W=[1,4,0] encoding scheme
  • Proper /Index specification for non-contiguous entries
  • Byte-transparent ISO-8859-1 encoding for binary stream data

212-326: LGTM!

The incremental update logic correctly:

  1. Detects PDF format (xref table vs xref stream) and generates appropriate output
  2. Assigns object numbers sequentially from trailer /Size
  3. Automatically updates ModDate timestamp
  4. Chains a Catalog update when XMP metadata is added

331-410: Test coverage for chained incremental updates is comprehensive.

The test suite already covers simultaneous Info dictionary and XMP metadata updates across multiple scenarios: single saves with both Info and XMP (testInfoDictAndXmpSimultaneous), chained updates (testMultipleSavesChainingBothInfoAndXmp, testXrefStreamChainedUpdatesWithBothMetadataAndXmp), and same-file overwrites (testLargeContentPreservationWithMetadata). The appendCatalogUpdate method correctly re-parses the tail to obtain the current /Prev pointer after the initial incremental update, ensuring proper chaining. The concern about this subtle behavior is valid, and verification confirms it is adequately tested.

@balazs-szucs balazs-szucs merged commit 846688f into grimmory-tools:main Mar 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant