feat: update PDF saving logic to ensure valid metadata handling and support for cross-reference streams#21
Conversation
…upport for cross-reference streams
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughVersion bumped to 0.10.0. Metadata fast-path removed: metadata is merged with native Info on save and all saves route through a unified saveToBytes pipeline. PdfSaver now does tail-focused incremental updates, xref-stream support, and validation-by-reopening to guard against corruption; tests updated for xref-stream scenarios. Changes
Sequence Diagram(s)sequenceDiagram
participant Doc as PdfDocument
participant Saver as PdfSaver
participant Native as PDFium Native
participant Validator as Validation
Doc->>Saver: saveToBytes(handle, pendingMetadata, pendingXmp)
Saver->>Native: nativeSave(handle) -- FPDF_SaveAsCopy (full save)
Native-->>Saver: basePdfBytes
Saver->>Saver: buildMergedMetadata(basePdfBytes, pendingMetadata)
Saver->>Saver: detectXrefStream(basePdfBytes)
alt xref-stream present
Saver->>Saver: appendXrefStreamSection(basePdfBytes, mergedMetadata, xmp)
else traditional xref
Saver->>Saver: appendIncrementalUpdate(basePdfBytes, mergedMetadata, xmp)
end
Saver->>Validator: validateOrFallback(candidateBytes, expectedMetadata)
Validator->>Native: FPDF_LoadMemDocument(candidateBytes)
Native-->>Validator: loadResult / pageCount / metaRead
alt validation OK
Validator-->>Saver: validatedBytes
else validation failed
Validator-->>Saver: fallback to basePdfBytes (log warning)
end
Saver-->>Doc: finalPdfBytes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~55 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 PMD (7.22.0)src/main/java/org/grimmory/pdfium4j/PdfSaver.javaA fatal error has been detected by the Java Runtime Environment:SIGBUS (0x7) at pc=0x00007f75aa417ffa, pid=1, tid=57JRE version: (17.0.18+8) (build )Java VM: OpenJDK 64-Bit Server VM (17.0.18+8-Debian-1deb12u1, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)Problematic frame:C [libc.so.6+0x16dffa]No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java againAn error report file with more information is saved as:/hs_err_pid1.log[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f75aa2d050f] Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java (1)
1435-1435:⚠️ Potential issue | 🟡 MinorRemove unused variable
originalSize.The variable is declared but never used after the test assertions were updated. The pipeline also flagged this with a Checkstyle warning.
🧹 Proposed fix
void xmpOnlySaveUsesIncrementalUpdate(`@TempDir` Path tempDir) throws IOException { Path testPdf = getTestPdf(); if (testPdf == null) return; - long originalSize = Files.size(testPdf); Path pdf = tempDir.resolve("xmp-only.pdf"); Files.copy(testPdf, pdf);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java` at line 1435, Remove the unused local variable by deleting the declaration "long originalSize = Files.size(testPdf);" in PdfDocumentTest (the unused symbol is originalSize and it uses Files.size(testPdf)); ensure no other code references originalSize and run tests/style checks to confirm the Checkstyle warning is resolved.src/main/java/org/grimmory/pdfium4j/PdfDocument.java (1)
54-54:⚠️ Potential issue | 🟡 MinorRemove unused field
structurallyModified.This field is set but never read after the removal of the conditional save fast path. The pipeline also flagged this with a Checkstyle warning.
🧹 Proposed fix
- private volatile boolean structurallyModified = false;Also remove the assignments at lines 689, 720, and 757 where
structurallyModified = trueis set.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main/java/org/grimmory/pdfium4j/PdfDocument.java` at line 54, Remove the unused boolean field structurallyModified from class PdfDocument and delete all assignments to it (the statements setting structurallyModified = true); search for the symbol structurallyModified and remove its declaration "private volatile boolean structurallyModified = false;" and any standalone assignment statements that set it so no dead field or unreachable writes remain (ensure no other code references the symbol afterwards).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main/java/org/grimmory/pdfium4j/PdfSaver.java`:
- Around line 574-583: The method findStartxrefInTail currently declares an
unused parameter tailOffset; remove the parameter from its signature and update
all call sites that pass tailOffset to call findStartxrefInTail(tail) instead
(e.g., the four locations that currently pass tailOffset). Ensure
imports/overloads are not needed and run tests/compile to catch any remaining
references to tailOffset in PdfSaver.
---
Outside diff comments:
In `@src/main/java/org/grimmory/pdfium4j/PdfDocument.java`:
- Line 54: Remove the unused boolean field structurallyModified from class
PdfDocument and delete all assignments to it (the statements setting
structurallyModified = true); search for the symbol structurallyModified and
remove its declaration "private volatile boolean structurallyModified = false;"
and any standalone assignment statements that set it so no dead field or
unreachable writes remain (ensure no other code references the symbol
afterwards).
In `@src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java`:
- Line 1435: Remove the unused local variable by deleting the declaration "long
originalSize = Files.size(testPdf);" in PdfDocumentTest (the unused symbol is
originalSize and it uses Files.size(testPdf)); ensure no other code references
originalSize and run tests/style checks to confirm the Checkstyle warning is
resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 63d18598-b9ec-4ec6-91b5-7e8455c6b5bd
📒 Files selected for processing (5)
build.gradle.ktssrc/main/java/org/grimmory/pdfium4j/PdfDocument.javasrc/main/java/org/grimmory/pdfium4j/PdfSaver.javasrc/main/java/org/grimmory/pdfium4j/internal/EditBindings.javasrc/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java
📜 Review details
🧰 Additional context used
🪛 GitHub Actions: CI
src/main/java/org/grimmory/pdfium4j/PdfDocument.java
[warning] 54-54: CheckstyleMain: UnusedPrivateField. Avoid unused private fields such as 'structurallyModified'.
src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java
[warning] 1435-1435: CheckstyleTest: UnusedLocalVariable. Avoid unused local variables such as 'originalSize'.
src/main/java/org/grimmory/pdfium4j/PdfSaver.java
[warning] 574-574: CheckstyleMain: UnusedFormalParameter. Avoid unused method parameters such as 'tailOffset'.
🪛 PMD (7.22.0)
src/main/java/org/grimmory/pdfium4j/PdfSaver.java
[Medium] 574-574: Avoid unused method parameters such as 'tailOffset'. (UnusedFormalParameter)
(Best Practices)
🔇 Additional comments (12)
src/main/java/org/grimmory/pdfium4j/internal/EditBindings.java (1)
76-83: Documentation is accurate—no changes needed.The JavaDoc improvements are verified as correct:
- Link validity: The Google Groups link (https://groups.google.com/g/pdfium/c/6SklEc2lYNM) is accessible and properly documents the FPDF_INCREMENTAL issue.
- Flag values: All documented values match the PDFium fpdf_save.h specification (0 = full, 1 = incremental, 2 = no incremental, 4 = remove security).
- Codebase usage: No code in the repository is using the broken flag 1; the implementation correctly uses flag 0 as warned.
The warning about FPDF_INCREMENTAL is necessary and well-documented, preventing users from introducing corrupt output. The documentation is consistent with explanations in PdfSaver.java.
build.gradle.kts (1)
16-16: LGTM!Version bump to
0.10.0appropriately reflects the scope of changes in this PR (refactored save pipeline, cross-reference stream support).src/main/java/org/grimmory/pdfium4j/PdfDocument.java (3)
869-890: LGTM!The
readNativeMetadatahelper correctly implements the double-call pattern forFPDF_GetMetaTextand properly handles the UTF-16LE null terminator case.
1273-1293: LGTM!The merge logic correctly:
- Preserves existing metadata entries not being modified
- Allows pending values to override existing entries
- Supports clearing tags by setting empty string values
1248-1265: LGTM!The unified save path through
saveToBytes()simplifies the implementation and ensures consistent metadata handling across all save methods.src/test/java/org/grimmory/pdfium4j/PdfDocumentTest.java (2)
1461-1515: Well-structured xref stream PDF generator.The implementation correctly follows PDF spec §7.5.8 for cross-reference streams with proper W=[1,4,0] encoding and Index specification. This provides valuable test coverage for the new xref stream handling in
PdfSaver.
1522-1594: LGTM!Comprehensive test coverage for cross-reference stream PDFs covering:
- Basic metadata persistence
- Combined XMP + Info dictionary writes
- Double write scenarios (incremental updates on incrementally updated PDFs)
src/main/java/org/grimmory/pdfium4j/PdfSaver.java (5)
131-181: LGTM!Robust validation strategy that guarantees no corrupt output by re-opening the produced bytes with PDFium. The graceful fallback to base bytes on validation failure is a sound approach for production reliability.
470-516: LGTM!Tail-focused parsing is a sound approach that:
- Is immune to false matches in binary stream data
- Handles both traditional trailers and xref streams
- Uses a generous 4096-byte window (spec requires startxref within last 1024)
419-462: LGTM!Correct implementation of cross-reference streams per PDF spec §7.5.8:
- Valid W=[1,4,0] encoding scheme
- Proper /Index specification for non-contiguous entries
- Byte-transparent ISO-8859-1 encoding for binary stream data
212-326: LGTM!The incremental update logic correctly:
- Detects PDF format (xref table vs xref stream) and generates appropriate output
- Assigns object numbers sequentially from trailer /Size
- Automatically updates ModDate timestamp
- Chains a Catalog update when XMP metadata is added
331-410: Test coverage for chained incremental updates is comprehensive.The test suite already covers simultaneous Info dictionary and XMP metadata updates across multiple scenarios: single saves with both Info and XMP (testInfoDictAndXmpSimultaneous), chained updates (testMultipleSavesChainingBothInfoAndXmp, testXrefStreamChainedUpdatesWithBothMetadataAndXmp), and same-file overwrites (testLargeContentPreservationWithMetadata). The
appendCatalogUpdatemethod correctly re-parses the tail to obtain the current/Prevpointer after the initial incremental update, ensuring proper chaining. The concern about this subtle behavior is valid, and verification confirms it is adequately tested.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests