feat(pdf): integrate PDFium4j for PDF processing and metadata extraction #281

Open
balazs-szucs wants to merge 10 commits into grimmory-tools:develop from balazs-szucs:pdfium4j
Conversation

@balazs-szucs
Member

@balazs-szucs balazs-szucs commented Mar 30, 2026

Description

Linked Issue: Fixes #

Changes

This pull request migrates PDF processing in the booklore-api module from Apache PDFBox to PDFium4j, updating both the build configuration and the main codebase to use the new library for PDF reading, rendering, and metadata extraction. It also introduces improvements for native library resolution, dependency management, and PDF date parsing. Additionally, the custom XMP schema class for PDFBox is removed as it is no longer needed.

Migration from PDFBox to PDFium4j:

  • Replaced all usages of PDFBox in PdfProcessor and PdfMetadataExtractor with PDFium4j equivalents, including document loading, cover extraction, page rendering, and metadata access. Methods and logic were updated to use PDFium4j's API and types.

  • Removed the now-unused custom XMP schema class BookLoreSchema.java from the codebase.

Build and Dependency Management:

  • Updated build.gradle.kts to:
    • Add PDFium4j and its native libraries as dependencies, with logic to resolve native classifier based on OS and architecture.
    • Retain PDFBox as a test dependency for generating test PDFs only.

Summary by CodeRabbit

  • New Features

    • Improved PDF processing: more reliable cover generation, metadata extraction, and outline/page handling using a native-backed renderer.
    • More deterministic metadata embedding to reduce spurious metadata changes.
  • Bug Fixes

    • Safer XML parsing and EPUB streaming for fewer parsing/streaming failures.
  • Tests

    • Conditional tests for platforms without the native PDF runtime.
  • Chores

    • Docker/runtime updated to enable native library usage.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Migrates PDF processing from Apache PDFBox to PDFium4j, adds Gradle/native-library gating and platform detection, removes BookLore XMP schema class and its tests, introduces conditional test execution for native PDFium availability, updates XML factory caching and minor IO/refactor tweaks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & Native Support: booklore-api/build.gradle.kts, Dockerfile | Added useLocalLibs Gradle gate, new pdfiumNativesClassifier() for platform-native classifier, switched PDF/EPUB dependencies to Grimmory/Pdfium variants with runtime native classifier, added test/bootRun JVM native-access flag and passed TARGETARCH to Gradle in Docker build. |
| PDF Processing Core: booklore-api/src/main/java/org/booklore/service/.../PdfProcessor.java, .../PdfReaderService.java, .../metadata/extractor/PdfMetadataExtractor.java, .../metadata/writer/PdfMetadataWriter.java | Replaced PDFBox usage with PDFium4j across rendering, metadata/XMP handling, outline/bookmark extraction and page rendering; introduced new helpers (e.g., parsePdfDate), switched XMP generation to string-based construction and compare-by-stripped-timestamps, and adjusted save/metadata APIs to Pdfium4j equivalents. |
| Removed XMP Schema: booklore-api/src/main/java/org/booklore/service/metadata/writer/BookLoreSchema.java, booklore-api/src/test/java/.../BookLoreSchemaTest.java | Removed BookLoreSchema class and its corresponding test class (namespace/prefix constants and setters deleted, tests removed). |
| Test infra & Guards for Pdfium: booklore-api/src/test/java/org/booklore/test/PdfiumAvailableCondition.java, .../RequiresPdfium.java, .../PdfMetadataExtractorTest.java, .../PdfMetadataWriterTest.java | Added JUnit ExecutionCondition probing Pdfium native availability and @RequiresPdfium annotation; annotated/updated tests to use Pdfium4j utilities and gate execution when native libs are absent. |
| XML & IO Utilities: booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.java, booklore-api/src/main/java/org/booklore/service/reader/EpubReaderService.java, booklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.java | Cached preconfigured DocumentBuilderFactory instances and created builders from them, switched EPUB ZIP streaming from IOUtils.copy to InputStream.transferTo, and simplified FB2 extractor to use the shared secure builder. |
| Frontend Lifecycle: frontend/src/app/features/metadata/component/book-metadata-center/metadata-editor/metadata-editor.component.ts | Moved syncProviderFieldsEffect initialization from ngOnInit() to the component constructor for earlier effect registration. |
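
The summary above names a parsePdfDate helper without showing it. The following is only a minimal sketch of what such a helper might look like, assuming the standard PDF date layout D:YYYYMMDDHHmmSSOHH'mm' in which everything after the year is optional. The class name, the Optional return type, and the decision to treat a missing offset as UTC are assumptions of this sketch and are not taken from the PR.

```java
import java.time.DateTimeException;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the implementation from this PR.
final class PdfDateSketch {
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    // 7 literal Z, 8 offset sign, 9 offset hours, 10 offset minutes.
    private static final Pattern PDF_DATE = Pattern.compile(
            "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
            + "(?:(Z)|([+-])(\\d{2})'?(\\d{2})?'?)?$");

    static Optional<OffsetDateTime> parsePdfDate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher m = PDF_DATE.matcher(raw.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            // Assumption: a missing offset is read as UTC.
            ZoneOffset offset = ZoneOffset.UTC;
            if (m.group(8) != null) {
                int sign = m.group(8).equals("-") ? -1 : 1;
                int hours = Integer.parseInt(m.group(9));
                int minutes = m.group(10) == null ? 0 : Integer.parseInt(m.group(10));
                offset = ZoneOffset.ofHoursMinutes(sign * hours, sign * minutes);
            }
            // Missing month/day default to 1, missing time fields to 0.
            return Optional.of(OffsetDateTime.of(
                    Integer.parseInt(m.group(1)), part(m, 2, 1), part(m, 3, 1),
                    part(m, 4, 0), part(m, 5, 0), part(m, 6, 0), 0, offset));
        } catch (DateTimeException e) {
            // Out-of-range fields such as month 13 are treated as unparseable.
            return Optional.empty();
        }
    }

    private static int part(Matcher m, int group, int fallback) {
        String s = m.group(group);
        return s == null ? fallback : Integer.parseInt(s);
    }
}
```

Returning an empty Optional instead of throwing keeps a malformed date in one PDF from aborting extraction of the remaining metadata fields; whether the real helper makes the same choice is not stated in the PR.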

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant PdfProcessor
    participant PdfiumLib
    participant Filesystem
    Client->>PdfProcessor: request cover/metadata for `book.pdf` path
    PdfProcessor->>Filesystem: open file stream (Path)
    PdfProcessor->>PdfiumLib: PdfDocument.open(path)
    PdfiumLib-->>PdfProcessor: PdfDocument instance / metadata API
    PdfProcessor->>PdfiumLib: doc.page(0) / renderPageToBytes(dpi, "jpeg")
    PdfiumLib-->>PdfProcessor: JPEG bytes / metadata strings
    PdfProcessor->>Filesystem: write cover JPEG / update PDF (setMetadata/setXmpMetadata)
    PdfProcessor-->>Client: return cover path / metadata DTO

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰
I nibble bytes and hop through stacks,
Swapped PDFBox trails for Pdfium tracks.
Native claws linked by Gradle's art,
XMP trimmed down — a brand new start.
Hooray — metadata snug in every part!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title follows the conventional commit format with type 'feat', scope 'pdf', and a clear description of the main change: integrating PDFium4j for PDF processing. |
| Description check | ✅ Passed | The PR description includes all required template sections with substantial content explaining the migration, changes, and reasoning, though the 'Changes' section could be more structured. |

@balazs-szucs balazs-szucs marked this pull request as ready for review March 30, 2026 16:44
coderabbitai[bot]

This comment was marked as resolved.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java`:
- Around line 149-162: The catch in PdfMetadataWriter that logs XMP embedding
failures currently swallows the exception and allows saveMetadataToFile() to
continue; change this so failures bubble out instead of being ignored: after
logging the error in the catch block (surrounding buildXmpPacket /
isXmpMetadataDifferent / doc.setXmpMetadata), rethrow the exception (or throw a
new RuntimeException with the caught exception as cause) so the outer
backup/restore path can detect the failure and skip the final save.
- Around line 141-145: The current truncation in PdfMetadataWriter uses
String.length() and substring on keywords (and references keywordsBuilder) which
splits characters and ignores a 255-byte UTF-8 limit (MAX_INFO_KEYWORDS_LENGTH);
change the logic to measure and truncate by UTF-8 bytes: iterate over the
keyword string by code points (or build via keywordsBuilder.codePoints()),
append code points to a StringBuilder while measuring bytes of the accumulated
string in UTF-8 until adding the next code point would exceed
MAX_INFO_KEYWORDS_LENGTH - 3, then append "..." and use that result; ensure you
never call substring on a potential surrogate pair and update the debug log to
report the original UTF-8 byte length vs truncated byte length.
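
The byte-aware truncation described in the second comment can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the 255-byte limit is passed in as a parameter rather than read from MAX_INFO_KEYWORDS_LENGTH. It walks the string by code point, so a surrogate pair is either kept whole or dropped whole.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating truncation by UTF-8 byte budget.
final class Utf8TruncateSketch {
    private static final String ELLIPSIS = "..."; // 3 bytes in UTF-8

    static String truncateUtf8(String value, int maxBytes) {
        if (maxBytes < ELLIPSIS.length()) {
            throw new IllegalArgumentException("maxBytes must be at least 3");
        }
        if (value.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
            return value; // already fits, nothing to cut
        }
        int budget = maxBytes - ELLIPSIS.length();
        StringBuilder out = new StringBuilder();
        int used = 0;
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            int size = utf8Length(cp);
            if (used + size > budget) {
                break; // adding this code point would exceed the budget
            }
            out.appendCodePoint(cp);
            used += size;
            i += Character.charCount(cp); // advances by 2 for surrogate pairs
        }
        return out.append(ELLIPSIS).toString();
    }

    // Number of bytes a single code point occupies in UTF-8.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

For example, truncating a string of five "é" characters (10 bytes) to 7 bytes keeps two of them and appends the ellipsis, for exactly 7 bytes. A String.length()-based cut would have counted characters and could overshoot the byte limit.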
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca448278-1644-4955-8845-338b8a0b0308

📥 Commits

Reviewing files that changed from the base of the PR and between 31bc58f and baa1564.

📒 Files selected for processing (7)
  • Dockerfile
  • booklore-api/build.gradle.kts
  • booklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.java
  • booklore-api/src/main/java/org/booklore/service/metadata/extractor/PdfMetadataExtractor.java
  • booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java
  • booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.java
  • dev.local-libs.docker-compose.yml
✅ Files skipped from review due to trivial changes (1)
  • dev.local-libs.docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • booklore-api/src/main/java/org/booklore/service/metadata/extractor/PdfMetadataExtractor.java
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (java-kotlin)
🧰 Additional context used
📓 Path-based instructions (1)
booklore-api/src/**/*.java

📄 CodeRabbit inference engine (AGENTS.md)

booklore-api/src/**/*.java: Use 4-space indentation in Java files and match surrounding Java style
Prefer constructor injection via Lombok patterns already used in the codebase. Do not introduce @Autowired field injection
Use MapStruct for entity/DTO mapping

Files:

  • booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.java
  • booklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.java
  • booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java
🧠 Learnings (1)
📚 Learning: 2026-03-31T06:22:25.311Z
Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 113
File: booklore-api/build.gradle.kts:89-90
Timestamp: 2026-03-31T06:22:25.311Z
Learning: When reviewing build logic or CI checks that call the JitPack build status API, do not treat a response like {"status":"none"} as evidence that a dependency is unavailable or that a build has failed. JitPack typically builds packages on-demand the first time they’re requested (e.g., via Gradle); "none" means the package isn’t pre-built/cached yet. Only raise a build failure concern when the status explicitly indicates an error/failure (e.g., failed/cancelled), or when dependent resolution actually fails.

Applied to files:

  • booklore-api/build.gradle.kts
🔇 Additional comments (8)
booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.java (2)

28-40: Centralized XML hardening looks good.

Keeping the namespace-aware toggle and XXE-related flags in one helper reduces parser-config drift across the XML entry points.
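
A rough sketch of what a centralized helper of this kind could look like, using only standard JAXP calls. The class name, method names, and the exact set of flags are assumptions of this sketch, not the contents of SecureXmlUtils. The sketch also synchronizes on the cached factory as a conservative choice, since the JAXP documentation does not guarantee that a factory implementation is thread-safe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a shared, hardened XML helper.
final class SecureXmlSketch {
    // Factories are configured once and reused; builders are created per call.
    private static final DocumentBuilderFactory NS_AWARE = configure(true);
    private static final DocumentBuilderFactory NS_UNAWARE = configure(false);

    private static DocumentBuilderFactory configure(boolean namespaceAware) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(namespaceAware);
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Rejecting DOCTYPE declarations outright blocks XXE via entities.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            f.setXIncludeAware(false);
            f.setExpandEntityReferences(false);
            return f;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser lacks required features", e);
        }
    }

    static DocumentBuilder newBuilder(boolean namespaceAware)
            throws ParserConfigurationException {
        DocumentBuilderFactory f = namespaceAware ? NS_AWARE : NS_UNAWARE;
        synchronized (f) { // conservative: factory thread safety is not guaranteed
            return f.newDocumentBuilder();
        }
    }

    // Convenience wrapper that reports any parse failure as an unchecked exception.
    static Document parse(String xml) {
        try {
            return newBuilder(true).parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("XML rejected: " + e.getMessage(), e);
        }
    }
}
```

With these flags, a plain document parses normally, while any input carrying a DOCTYPE declaration is rejected before entity expansion can happen.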


15-17: The implementation is correct and follows JAXP best practices. Concurrent calls to newDocumentBuilder() on the same configured DocumentBuilderFactory instance are thread-safe per the JAXP specification. The pattern of configuring the factory once and caching it for reuse is the recommended approach for performance and safety. Each thread receives its own fresh DocumentBuilder instance, which is used locally and never shared.

The comment at line 15 correctly states the factory is thread-safe; the misleading comment at line 49 should be clarified that each call to newDocumentBuilder() does create a new builder instance from the thread-safe factory, which is the correct behavior. No changes are needed to the implementation.

			> Likely an incorrect or invalid review comment.
booklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.java (1)

41-42: Nice consolidation.

Routing both FB2 parse paths through SecureXmlUtils keeps XML parser hardening centralized instead of duplicating factory setup here.

Also applies to: 94-95

Dockerfile (2)

19-34: Good multi-arch handoff into the Gradle build.

Plumbing TARGETARCH through the backend build keeps PDFium native resolution tied to the target image instead of the builder host.


68-74: Runtime JVM flags are now consistent with the PDFium path.

Adding --enable-native-access=ALL-UNNAMED to JAVA_TOOL_OPTIONS brings the packaged container in line with the test and bootRun paths.

booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java (1)

319-355: The RDF Bag output lines up with the extractor contract.

Emitting booklore:{field} -> rdf:Bag -> rdf:li with sorted values matches what PdfMetadataExtractor.extractBookloreBag() expects, so tags and moods should round-trip cleanly.
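
The booklore:{field} -> rdf:Bag -> rdf:li shape can be illustrated with a small string-building sketch. The class and method names are hypothetical, the surrounding XMP packet and namespace declarations are omitted, and the use of a TreeSet (which also removes duplicates) is an assumption of this sketch rather than something the PR states.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical illustration of the nested Bag structure; not the PR's writer code.
final class XmpBagSketch {
    static String bag(String field, Collection<String> values) {
        StringBuilder sb = new StringBuilder();
        sb.append("<booklore:").append(field).append("><rdf:Bag>");
        // TreeSet yields natural (sorted) order, so output is stable across runs.
        for (String value : new TreeSet<>(values)) {
            sb.append("<rdf:li>").append(escape(value)).append("</rdf:li>");
        }
        sb.append("</rdf:Bag></booklore:").append(field).append('>');
        return sb.toString();
    }

    // Minimal escaping for element text; '&' must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Sorting before emitting is what makes repeated writes of the same tag set byte-identical, which fits the goal of avoiding spurious metadata changes.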

booklore-api/build.gradle.kts (2)

37-81: Classifier-based native wiring looks solid.

Resolving the PDFium runtime artifact through pdfiumNativesClassifier() makes the native selection explicit instead of relying on whatever transitive binary happens to land on the classpath.

Also applies to: 119-121


190-193: The native-access requirements are now consistent in dev/test.

Aligning test and bootRun with PDFium's native-access requirement should keep local runs and CI much closer to the packaged container behavior.

Also applies to: 219-223

Member

@zachyale zachyale left a comment

LGTM pending merge conflict resolution + sufficient testing following the conflict resolution

chernesk pushed a commit to chernesk/grimmory that referenced this pull request Apr 1, 2026
@balazs-szucs
Member Author

I think the original code had enough test coverage. That's also the reason I had to keep PDFBox in: the tests generate sample PDFs via PDFBox (it's only kept as testImplementation, so no worries there).

However, I'm not against more testing. Any particular place you are thinking of?

Metadata has round-trip tests, so that's covered really well. As for the others, e.g. cover generation, that's hard to mock/test, and a regression there is generally rather in-your-face. But 🤷‍♂️ I can be flexible.

@coderabbitai coderabbitai bot requested a review from imajes April 1, 2026 19:35
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
booklore-api/build.gradle.kts (2)

29-34: useLocalLibs can silently stop testing local publishes.

With mavenLocal() just added to the repo list and both org.grimmory overrides using +, -PuseLocalLibs does not actually guarantee the local artifacts are the ones being exercised. If ~/.m2 is missing one of them, Gradle will happily resolve a remote version and the build still looks “local”. I’d rather require explicit local versions (for example -Ppdfium4jVersion / -Pepub4jVersion) or switch this mode to dependency substitution/composite builds so it fails fast instead of falling back.

Also applies to: 119-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@booklore-api/build.gradle.kts` around lines 29 - 34, The current useLocalLibs
boolean (val useLocalLibs) simply adds mavenLocal() to repositories which can
silently fall back to remote artifacts when local coordinates are absent; change
the approach so that enabling "local mode" requires explicit local version
properties (e.g. -Ppdfium4jVersion and -Pepub4jVersion) or switch to dependency
substitution/composite builds instead of toggling mavenLocal(). Concretely: stop
relying on useLocalLibs to add mavenLocal(), introduce/read explicit Gradle
project properties (pdfium4jVersion, epub4jVersion) and fail the build if those
properties are set but corresponding artifacts cannot be resolved, or implement
dependencySubstitution / includeBuild for the local modules to force resolution
to local projects and fail fast when missing; update references to useLocalLibs,
the repositories block, and any org.grimmory dependency declarations to use the
chosen explicit-version or substitution strategy so resolution cannot silently
fall back to remote.

39-42: Use Gradle providers for TARGET* inputs.

System.getenv() here bypasses Gradle's provider-based input tracking, so TARGETPLATFORM / TARGETARCH changes are invisible to configuration caching. Using providers.environmentVariable(...).orElse(providers.gradleProperty(...)) keeps this consistent with the rest of the script.

Proposed change
-    val targetPlatform = System.getenv("TARGETPLATFORM")
-        ?: project.findProperty("targetPlatform")?.toString()
-    val targetArch = System.getenv("TARGETARCH")
-        ?: project.findProperty("targetArch")?.toString()
+    val targetPlatform = providers.environmentVariable("TARGETPLATFORM")
+        .orElse(providers.gradleProperty("targetPlatform"))
+        .orNull
+    val targetArch = providers.environmentVariable("TARGETARCH")
+        .orElse(providers.gradleProperty("targetArch"))
+        .orNull
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@booklore-api/build.gradle.kts` around lines 39 - 42, Replace direct
System.getenv calls for targetPlatform and targetArch with Gradle providers so
configuration cache sees changes: use
providers.environmentVariable("TARGETPLATFORM").orElse(providers.gradleProperty("targetPlatform"))
for targetPlatform and similarly
providers.environmentVariable("TARGETARCH").orElse(providers.gradleProperty("targetArch"))
for targetArch (then convert the provider value to a String or nullable String
as your code expects). Update the references to the existing targetPlatform and
targetArch vals to use these providers' resolved value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@booklore-api/build.gradle.kts`:
- Around line 60-69: Replace the runtime probing logic that computes isMusl from
osName and filesystem checks with an explicit libc configuration: add support
for reading a build property or env var (e.g., project.findProperty("libc") ?:
System.getenv("LIBC")) and use that value to choose between "linux-musl" and
"linux" instead of the current isMusl probe; update the branch that currently
references osName and the isMusl boolean so it selects "linux-musl" when the
explicit libc equals "musl" and "linux" otherwise, and document that callers
should pass ARG LIBC=musl during Docker/CI builds.
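
The explicit-libc selection suggested here can be sketched as a pure function. It is written in Java for illustration rather than in the Kotlin DSL of the build script. Only the "linux-musl" and "linux" families come from the comment above; the architecture suffixes, the non-Linux names, and the class itself are hypothetical and do not reflect the actual pdfiumNativesClassifier() output.

```java
import java.util.Locale;

// Hypothetical classifier resolution driven by an explicit libc value.
final class NativeClassifierSketch {
    // libc is expected to come from a build property or env var (e.g. LIBC=musl);
    // null or any other value selects the default "linux" family.
    static String classifier(String osName, String osArch, String libc) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = normalizeArch(osArch);
        if (os.contains("linux")) {
            String family = "musl".equalsIgnoreCase(libc) ? "linux-musl" : "linux";
            return family + "-" + arch;
        }
        // "darwin" contains "win", so macOS must be checked before Windows.
        if (os.contains("mac") || os.contains("darwin")) {
            return "darwin-" + arch;
        }
        if (os.contains("win")) {
            return "windows-" + arch;
        }
        throw new IllegalArgumentException("Unsupported OS: " + osName);
    }

    // Maps JVM os.arch values and Docker TARGETARCH values to one name each.
    static String normalizeArch(String osArch) {
        switch (osArch.toLowerCase(Locale.ROOT)) {
            case "amd64":
            case "x86_64":
                return "x64";
            case "aarch64":
            case "arm64":
                return "arm64";
            default:
                throw new IllegalArgumentException("Unsupported architecture: " + osArch);
        }
    }
}
```

Because the function takes all three inputs as arguments and never touches the filesystem, a cross-build on a glibc host can still select the musl artifact, and an unknown platform fails the build instead of silently picking a wrong binary.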


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73579fef-d856-45c7-82f6-88d2455812e8

📥 Commits

Reviewing files that changed from the base of the PR and between f03d13b and d2029c2.

📒 Files selected for processing (3)
  • Dockerfile
  • booklore-api/build.gradle.kts
  • frontend/src/app/features/metadata/component/book-metadata-center/metadata-editor/metadata-editor.component.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • frontend/src/app/features/metadata/component/book-metadata-center/metadata-editor/metadata-editor.component.ts
  • Dockerfile
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test Suite / Backend Tests
  • GitHub Check: Test Suite / Frontend Tests
  • GitHub Check: Analyze (java-kotlin)
