extract_data() from PDF doesn't work #302

Closed
andrie opened this issue Feb 5, 2025 · 4 comments

andrie commented Feb 5, 2025

PR #265 adds PDF support for Claude, and #301 fixes a missing as_json() method that is needed to use PDFs with AWS Bedrock.

This means I can successfully extract information from a PDF when using a custom prompt. For example, this pseudocode works:

pdf <- content_pdf_file("~/path/to.pdf")
prompt <- "Extract the vendor name and invoice amount from the PDF."
chat$chat(prompt, pdf)

However, extract_data() throws an error:

pdf <- content_pdf_file("~/path/to.pdf")

type_invoice <- type_object(
  vendor_name = type_string("vendor name"),
  amount = type_string("amount")
)

chat$extract_data(pdf, type = type_invoice)

results in:

Error in `req_perform()`:
! HTTP 400 Bad Request.
• Messages can’t contain duplicate document names. Rename the document and retry your request.

cc @atheriel

Contributor

walkerke commented Feb 5, 2025

@andrie I've never seen that error in my fork. For example:

library(ellmer)

pdf <- content_pdf_url("https://cran.r-project.org/web/packages/ellmer/ellmer.pdf")

chat <- chat_claude()

schema <- type_object(
  package_name = type_string("The name of the R package"),
  authors = type_array("The authors of the R package", items = type_string())
)

chat$extract_data(pdf, type = schema)
#> $package_name
#> [1] "ellmer"
#>
#> $authors
#> [1] "Hadley Wickham" "Joe Cheng"

Is that specific to Bedrock? I know that getting extract_data() to work with PDFs on Gemini was a little tricky; it requires an additional text prompt:

library(ellmer)

pdf <- content_pdf_url("https://cran.r-project.org/web/packages/ellmer/ellmer.pdf")

chat <- chat_gemini()

schema <- type_object(
  package_name = type_string("The name of the R package"),
  authors = type_array("The authors of the R package", items = type_string())
)

chat$extract_data("Extract data from this PDF", pdf, type = schema)

Collaborator

atheriel commented Feb 5, 2025

From some testing I think ellmer may also need to generate unique document names during JSON serialization for Bedrock.

As for the second issue, I think Bedrock may require a text prompt as well. Maybe there is some way to signal a more informative error in that case.
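
For illustration only, here is a minimal base R sketch of what generating unique document names could look like. unique_document_names() is a hypothetical helper and not part of ellmer. It assumes that document names are derived from file names, that a hyphen is an acceptable character in a Bedrock document name, and that the caller passes every PDF sent so far in the chat session, since the duplicate check appears to span the whole conversation.

```r
# Hypothetical helper (not part of ellmer): derive a distinct document
# name for every PDF path supplied.
unique_document_names <- function(paths) {
  # Drop the directory and the file extension, e.g. "a/invoice.pdf" -> "invoice"
  stems <- tools::file_path_sans_ext(basename(paths))
  # Append "-1", "-2", ... to any name that has already been used
  make.unique(stems, sep = "-")
}

unique_document_names(c("a/invoice.pdf", "b/invoice.pdf", "a/report.pdf"))
```

The first occurrence keeps its plain name and later repeats get a numeric suffix, so a PDF that is uploaded twice would no longer collide with itself.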

Author

andrie commented Feb 5, 2025

I can confirm that passing a text prompt to Bedrock made a difference. Thank you for the hint.

And as @atheriel mentioned, the duplicate document name error occurs when you upload the same PDF twice in a chat session.

But further experimentation revealed that I can send an extract_data() request with a PDF, then make a second request on the same PDF, without uploading the PDF again.

Member

hadley commented Feb 6, 2025

Should be fixed now in #265

hadley closed this as completed Feb 6, 2025