extract_data() from PDF doesn't work #302

andrie · 2025-02-05T20:03:36Z

PR #265 adds support for PDFs in Claude, and #301 fixes a missing as_json() method needed to use PDFs with AWS Bedrock.

This means I can successfully extract information from a PDF when using a custom prompt. For example, this pseudocode works:

library(ellmer)

# `chat` is assumed to be an existing ellmer chat object
pdf <- content_pdf_file("~/path/to.pdf")
prompt <- "Extract the vendor name and invoice amount from the PDF."
chat$chat(prompt, pdf)

However, extract_data() throws an error:

pdf <- content_pdf_file("~/path/to.pdf")

type_invoice <- type_object(
  vendor_name = type_string("vendor name"),
  amount = type_string("amount")
)

chat$extract_data(pdf, type = type_invoice)

results in:

Error in `req_perform()`:
! HTTP 400 Bad Request.
• Messages can’t contain duplicate document names. Rename the document and retry your request.

cc @atheriel


walkerke · 2025-02-05T22:46:52Z

@andrie I've never seen that error in my fork. For example:

library(ellmer)

pdf <- content_pdf_url("https://cran.r-project.org/web/packages/ellmer/ellmer.pdf")

chat <- chat_claude()

schema <- type_object(
  package_name = type_string("The name of the R package"),
  authors = type_array("The authors of the R package", items = type_string())
)

chat$extract_data(pdf, type = schema)

$package_name
[1] "ellmer"

$authors
[1] "Hadley Wickham" "Joe Cheng"

Is that specific to Bedrock? I know that getting extract_data() to work with PDFs for Gemini was a little tricky; it requires an additional text prompt:

library(ellmer)

pdf <- content_pdf_url("https://cran.r-project.org/web/packages/ellmer/ellmer.pdf")

chat <- chat_gemini()

schema <- type_object(
  package_name = type_string("The name of the R package"),
  authors = type_array("The authors of the R package", items = type_string())
)

chat$extract_data("Extract data from this PDF", pdf, type = schema)

atheriel · 2025-02-05T23:06:34Z

From some testing I think ellmer may also need to generate unique document names during JSON serialization for Bedrock.
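To illustrate the idea of unique document names, here is a minimal sketch in base R. The helper `unique_document_names()` is hypothetical and not part of ellmer; it only shows one way duplicates could be disambiguated before serialization. The hyphen separator is an assumption about which characters are accepted in document names.

```r
# Hypothetical helper (not an ellmer function): make repeated document
# names distinct by appending a numeric suffix to each duplicate.
unique_document_names <- function(names) {
  # make.unique() leaves the first occurrence alone and suffixes later ones.
  # The "-" separator is an assumption about what the provider accepts.
  make.unique(names, sep = "-")
}

unique_document_names(c("invoice", "invoice", "report"))
#> [1] "invoice"   "invoice-1" "report"
```

Uploading the same PDF twice in one session would then yield two distinct names rather than a duplicate.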

As for the second issue, I think Bedrock may require a text prompt as well. Maybe there is some way to signal a more informative error in that case.
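As a rough sketch of what a more informative error could look like, the check below fails early when no non-empty text prompt accompanies the other inputs. `check_has_text_prompt()` is a hypothetical function, not ellmer's implementation, and it assumes that text prompts are passed as character vectors while PDFs are passed as some other kind of object.

```r
# Hypothetical pre-flight check (not part of ellmer): require at least one
# non-empty character string among the inputs before sending a request.
check_has_text_prompt <- function(...) {
  inputs <- list(...)
  # Assumption: text prompts are character vectors; PDFs are other objects.
  has_text <- any(vapply(
    inputs,
    function(x) is.character(x) && any(nzchar(x)),
    logical(1)
  ))
  if (!has_text) {
    stop(
      "This provider appears to need a text prompt alongside the PDF, ",
      "for example \"Extract data from this PDF\".",
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Calling it with only a PDF object would stop with the message above instead of surfacing an opaque HTTP 400 from the provider.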

andrie · 2025-02-05T23:12:01Z

I can confirm that passing the prompt to Bedrock made a difference. Thank you for the hint.

And as @atheriel mentioned, the duplicate document name error occurs when you upload the same PDF twice in a chat session.

But further experimentation revealed that I can send an extract_data() request with a PDF, then make a second request on the same PDF, without uploading the PDF again.

hadley · 2025-02-06T16:47:22Z

Should be fixed now in #265

andrie mentioned this issue Feb 5, 2025

Add support for PDFs in Claude and Gemini #265

Merged

hadley closed this as completed Feb 6, 2025
