feat(pdf): integrate PDFium4j for PDF processing and metadata extraction#281
feat(pdf): integrate PDFium4j for PDF processing and metadata extraction#281balazs-szucs wants to merge 10 commits intogrimmory-tools:developfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMigrates PDF processing from Apache PDFBox to PDFium4j, adds Gradle/native-library gating and platform detection, removes BookLore XMP schema class and its tests, introduces conditional test execution for native PDFium availability, updates XML factory caching and minor IO/refactor tweaks. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant PdfProcessor
participant PdfiumLib
participant Filesystem
Client->>PdfProcessor: request cover/metadata for `book.pdf` path
PdfProcessor->>Filesystem: open file stream (Path)
PdfProcessor->>PdfiumLib: PdfDocument.open(path)
PdfiumLib-->>PdfProcessor: PdfDocument instance / metadata API
PdfProcessor->>PdfiumLib: doc.page(0) / renderPageToBytes(dpi, "jpeg")
PdfiumLib-->>PdfProcessor: JPEG bytes / metadata strings
PdfProcessor->>Filesystem: write cover JPEG / update PDF (setMetadata/setXmpMetadata)
PdfProcessor-->>Client: return cover path / metadata DTO
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…in metadata extraction
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java`:
- Around line 149-162: The catch in PdfMetadataWriter that logs XMP embedding
failures currently swallows the exception and allows saveMetadataToFile() to
continue; change this so failures bubble out instead of being ignored: after
logging the error in the catch block (surrounding buildXmpPacket /
isXmpMetadataDifferent / doc.setXmpMetadata), rethrow the exception (or throw a
new RuntimeException with the caught exception as cause) so the outer
backup/restore path can detect the failure and skip the final save.
- Around line 141-145: The current truncation in PdfMetadataWriter uses
String.length() and substring on keywords (and references keywordsBuilder) which
splits characters and ignores a 255-byte UTF-8 limit (MAX_INFO_KEYWORDS_LENGTH);
change the logic to measure and truncate by UTF-8 bytes: iterate over the
keyword string by code points (or build via keywordsBuilder.codePoints()),
append code points to a StringBuilder while measuring bytes of the accumulated
string in UTF-8 until adding the next code point would exceed
MAX_INFO_KEYWORDS_LENGTH - 3, then append "..." and use that result; ensure you
never call substring on a potential surrogate pair and update the debug log to
report the original UTF-8 byte length vs truncated byte length.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ca448278-1644-4955-8845-338b8a0b0308
📒 Files selected for processing (7)
Dockerfilebooklore-api/build.gradle.ktsbooklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.javabooklore-api/src/main/java/org/booklore/service/metadata/extractor/PdfMetadataExtractor.javabooklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.javabooklore-api/src/main/java/org/booklore/util/SecureXmlUtils.javadev.local-libs.docker-compose.yml
✅ Files skipped from review due to trivial changes (1)
- dev.local-libs.docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- booklore-api/src/main/java/org/booklore/service/metadata/extractor/PdfMetadataExtractor.java
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (java-kotlin)
🧰 Additional context used
📓 Path-based instructions (1)
booklore-api/src/**/*.java
📄 CodeRabbit inference engine (AGENTS.md)
booklore-api/src/**/*.java: Use 4-space indentation in Java files and match surrounding Java style
Prefer constructor injection via Lombok patterns already used in the codebase. Do not introduce@Autowiredfield injection
Use MapStruct for entity/DTO mapping
Files:
booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.javabooklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.javabooklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java
🧠 Learnings (1)
📚 Learning: 2026-03-31T06:22:25.311Z
Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 113
File: booklore-api/build.gradle.kts:89-90
Timestamp: 2026-03-31T06:22:25.311Z
Learning: When reviewing build logic or CI checks that call the JitPack build status API, do not treat a response like {"status":"none"} as evidence that a dependency is unavailable or that a build has failed. JitPack typically builds packages on-demand the first time they’re requested (e.g., via Gradle); "none" means the package isn’t pre-built/cached yet. Only raise a build failure concern when the status explicitly indicates an error/failure (e.g., failed/cancelled), or when dependent resolution actually fails.
Applied to files:
booklore-api/build.gradle.kts
🔇 Additional comments (8)
booklore-api/src/main/java/org/booklore/util/SecureXmlUtils.java (2)
28-40: Centralized XML hardening looks good.Keeping the namespace-aware toggle and XXE-related flags in one helper reduces parser-config drift across the XML entry points.
15-17: The implementation is correct and follows JAXP best practices. Concurrent calls tonewDocumentBuilder()on the same configuredDocumentBuilderFactoryinstance are thread-safe per the JAXP specification. The pattern of configuring the factory once and caching it for reuse is the recommended approach for performance and safety. Each thread receives its own freshDocumentBuilderinstance, which is used locally and never shared.The comment at line 15 correctly states the factory is thread-safe; the misleading comment at line 49 should be clarified that each call to
newDocumentBuilder()does create a new builder instance from the thread-safe factory, which is the correct behavior. No changes are needed to the implementation.> Likely an incorrect or invalid review comment.booklore-api/src/main/java/org/booklore/service/metadata/extractor/Fb2MetadataExtractor.java (1)
41-42: Nice consolidation.Routing both FB2 parse paths through
SecureXmlUtilskeeps XML parser hardening centralized instead of duplicating factory setup here.Also applies to: 94-95
Dockerfile (2)
19-34: Good multi-arch handoff into the Gradle build.Plumbing
TARGETARCHthrough the backend build keeps PDFium native resolution tied to the target image instead of the builder host.
68-74: Runtime JVM flags are now consistent with the PDFium path.Adding
--enable-native-access=ALL-UNNAMEDtoJAVA_TOOL_OPTIONSbrings the packaged container in line with thetestandbootRunpaths.booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java (1)
319-355: The RDF Bag output lines up with the extractor contract.Emitting
booklore:{field} -> rdf:Bag -> rdf:liwith sorted values matches whatPdfMetadataExtractor.extractBookloreBag()expects, so tags and moods should round-trip cleanly.booklore-api/build.gradle.kts (2)
37-81: Classifier-based native wiring looks solid.Resolving the PDFium runtime artifact through
pdfiumNativesClassifier()makes the native selection explicit instead of relying on whatever transitive binary happens to land on the classpath.Also applies to: 119-121
190-193: The native-access requirements are now consistent in dev/test.Aligning
testandbootRunwith PDFium's native-access requirement should keep local runs and CI much closer to the packaged container behavior.Also applies to: 219-223
booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java
Show resolved
Hide resolved
booklore-api/src/main/java/org/booklore/service/metadata/writer/PdfMetadataWriter.java
Show resolved
Hide resolved
zachyale
left a comment
There was a problem hiding this comment.
LGTM pending merge conflict resolution + sufficient testing following the conflict resolution
Merge Develop into Master for Release
|
I think the original code had enough testing coverage, that's also reason I had to keep PDFBox in, because tests generate sample PDFs for testing via PDFBox (ony kept is testImplementation so no worries there) However, I'm not against testing more testing, any particular place you are thinking of? Metadata have round trip tests, so that's covered really well, and for the others e.g., cover generation that's hard to mock/test and generally it's a rather in your face regression. But 🤷♂️ I can be flexible |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
booklore-api/build.gradle.kts (2)
29-34:useLocalLibscan silently stop testing local publishes.With
mavenLocal()just added to the repo list and bothorg.grimmoryoverrides using+,-PuseLocalLibsdoes not actually guarantee the local artifacts are the ones being exercised. If~/.m2is missing one of them, Gradle will happily resolve a remote version and the build still looks “local”. I’d rather require explicit local versions (for example-Ppdfium4jVersion/-Pepub4jVersion) or switch this mode to dependency substitution/composite builds so it fails fast instead of falling back.Also applies to: 119-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@booklore-api/build.gradle.kts` around lines 29 - 34, The current useLocalLibs boolean (val useLocalLibs) simply adds mavenLocal() to repositories which can silently fall back to remote artifacts when local coordinates are absent; change the approach so that enabling "local mode" requires explicit local version properties (e.g. -Ppdfium4jVersion and -Pepub4jVersion) or switch to dependency substitution/composite builds instead of toggling mavenLocal(). Concretely: stop relying on useLocalLibs to add mavenLocal(), introduce/read explicit Gradle project properties (pdfium4jVersion, epub4jVersion) and fail the build if those properties are set but corresponding artifacts cannot be resolved, or implement dependencySubstitution / includeBuild for the local modules to force resolution to local projects and fail fast when missing; update references to useLocalLibs, the repositories block, and any org.grimmory dependency declarations to use the chosen explicit-version or substitution strategy so resolution cannot silently fall back to remote.
39-42: Use Gradle providers forTARGET*inputs.
System.getenv()here bypasses Gradle's provider-based input tracking, soTARGETPLATFORM/TARGETARCHchanges are invisible to configuration caching. Usingproviders.environmentVariable(...).orElse(providers.gradleProperty(...))keeps this consistent with the rest of the script.Proposed change
- val targetPlatform = System.getenv("TARGETPLATFORM") - ?: project.findProperty("targetPlatform")?.toString() - val targetArch = System.getenv("TARGETARCH") - ?: project.findProperty("targetArch")?.toString() + val targetPlatform = providers.environmentVariable("TARGETPLATFORM") + .orElse(providers.gradleProperty("targetPlatform")) + .orNull + val targetArch = providers.environmentVariable("TARGETARCH") + .orElse(providers.gradleProperty("targetArch")) + .orNull🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@booklore-api/build.gradle.kts` around lines 39 - 42, Replace direct System.getenv calls for targetPlatform and targetArch with Gradle providers so configuration cache sees changes: use providers.environmentVariable("TARGETPLATFORM").orElse(providers.gradleProperty("targetPlatform")) for targetPlatform and similarly providers.environmentVariable("TARGETARCH").orElse(providers.gradleProperty("targetArch")) for targetArch (then convert the provider value to a String or nullable String as your code expects). Update the references to the existing targetPlatform and targetArch vals to use these providers' resolved value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@booklore-api/build.gradle.kts`:
- Around line 60-69: Replace the runtime probing logic that computes isMusl from
osName and filesystem checks with an explicit libc configuration: add support
for reading a build property or env var (e.g., project.findProperty("libc") ?:
System.getenv("LIBC")) and use that value to choose between "linux-musl" and
"linux" instead of the current isMusl probe; update the branch that currently
references osName and the isMusl boolean so it selects "linux-musl" when the
explicit libc equals "musl" and "linux" otherwise, and document that callers
should pass ARG LIBC=musl during Docker/CI builds.
---
Nitpick comments:
In `@booklore-api/build.gradle.kts`:
- Around line 29-34: The current useLocalLibs boolean (val useLocalLibs) simply
adds mavenLocal() to repositories which can silently fall back to remote
artifacts when local coordinates are absent; change the approach so that
enabling "local mode" requires explicit local version properties (e.g.
-Ppdfium4jVersion and -Pepub4jVersion) or switch to dependency
substitution/composite builds instead of toggling mavenLocal(). Concretely: stop
relying on useLocalLibs to add mavenLocal(), introduce/read explicit Gradle
project properties (pdfium4jVersion, epub4jVersion) and fail the build if those
properties are set but corresponding artifacts cannot be resolved, or implement
dependencySubstitution / includeBuild for the local modules to force resolution
to local projects and fail fast when missing; update references to useLocalLibs,
the repositories block, and any org.grimmory dependency declarations to use the
chosen explicit-version or substitution strategy so resolution cannot silently
fall back to remote.
- Around line 39-42: Replace direct System.getenv calls for targetPlatform and
targetArch with Gradle providers so configuration cache sees changes: use
providers.environmentVariable("TARGETPLATFORM").orElse(providers.gradleProperty("targetPlatform"))
for targetPlatform and similarly
providers.environmentVariable("TARGETARCH").orElse(providers.gradleProperty("targetArch"))
for targetArch (then convert the provider value to a String or nullable String
as your code expects). Update the references to the existing targetPlatform and
targetArch vals to use these providers' resolved value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 73579fef-d856-45c7-82f6-88d2455812e8
📒 Files selected for processing (3)
Dockerfilebooklore-api/build.gradle.ktsfrontend/src/app/features/metadata/component/book-metadata-center/metadata-editor/metadata-editor.component.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- frontend/src/app/features/metadata/component/book-metadata-center/metadata-editor/metadata-editor.component.ts
- Dockerfile
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test Suite / Backend Tests
- GitHub Check: Test Suite / Frontend Tests
- GitHub Check: Analyze (java-kotlin)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-03-31T06:22:25.311Z
Learnt from: imnotjames
Repo: grimmory-tools/grimmory PR: 113
File: booklore-api/build.gradle.kts:89-90
Timestamp: 2026-03-31T06:22:25.311Z
Learning: When reviewing build logic or CI checks that call the JitPack build status API, do not treat a response like {"status":"none"} as evidence that a dependency is unavailable or that a build has failed. JitPack typically builds packages on-demand the first time they’re requested (e.g., via Gradle); "none" means the package isn’t pre-built/cached yet. Only raise a build failure concern when the status explicitly indicates an error/failure (e.g., failed/cancelled), or when dependent resolution actually fails.
Applied to files:
booklore-api/build.gradle.kts
Description
Linked Issue: Fixes #
Changes
This pull request migrates PDF processing in the
booklore-apimodule from Apache PDFBox to PDFium4j, updating both the build configuration and the main codebase to use the new library for PDF reading, rendering, and metadata extraction. It also introduces improvements for native library resolution, dependency management, and PDF date parsing. Additionally, the custom XMP schema class for PDFBox is removed as it is no longer needed.Migration from PDFBox to PDFium4j:
Replaced all usages of PDFBox in
PdfProcessorandPdfMetadataExtractorwith PDFium4j equivalents, including document loading, cover extraction, page rendering, and metadata access. Methods and logic were updated to use PDFium4j's API and types.Removed the now-unused custom XMP schema class
BookLoreSchema.javafrom the codebase.Build and Dependency Management:
build.gradle.ktsto:Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores