
Consider filtering out duplicated emission table data before we save to the vector DB #274

Open
Greenheart opened this issue Nov 21, 2024 · 1 comment
Labels
question Further information is requested

Comments

@Greenheart
Contributor

In nlmExtractTables, we store the emission tables in the vector DB twice: once as part of markdownText, and once more in the appended section of higher-precision tables.

markdown:
  markdownText +
  '\n\n This is some of the important tables from the markdown with more precision:' +
  tables
    .map(
      ({ page_idx, name, markdown }) =>
        `\n#### ${name} (Extracted table with more precision from PAGE ${page_idx})\n\n${markdown}`
    )
    .join('\n'),

It might be worth de-duplicating the table data from the report, since we have the page numbers of those tables. Maybe we could filter the tables out of the JSON data before we format it as markdown.

Perhaps this could improve the quality of the context returned from the vector DB by reducing the likelihood of duplicated table data. For example, today we might get one paragraph from the regular PDF extraction, and then get the same table data again, extracted via nlmExtractTables.ts and the Vision API.
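
A minimal sketch of the page-based filtering idea, to make the discussion concrete. Only `page_idx`, `name` and `markdown` come from the snippet above; the `Block` type, its `tag` and `content` fields, and the function name `dropDuplicatedTableBlocks` are hypothetical and would need to be adapted to the real JSON structure.

```typescript
// Shape of a table returned by the vision extraction (fields taken from the snippet above).
type ExtractedTable = { page_idx: number; name: string; markdown: string }

// Hypothetical shape of a parsed JSON block; the real structure may differ.
type Block = {
  page_idx: number
  tag: 'para' | 'header' | 'table'
  content: string
}

// Drop low-precision table blocks on pages where a higher-precision
// extracted table exists, before the blocks are formatted to markdown.
function dropDuplicatedTableBlocks(
  blocks: Block[],
  tables: ExtractedTable[]
): Block[] {
  const pagesWithExtractedTables = new Set(tables.map((t) => t.page_idx))
  return blocks.filter(
    (block) =>
      !(block.tag === 'table' && pagesWithExtractedTables.has(block.page_idx))
  )
}
```

One caveat with filtering purely by page number: if a page contains several tables and only some of them were extracted with higher precision, this would drop the others too.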

@Greenheart Greenheart added the question Further information is requested label Nov 21, 2024
@Greenheart Greenheart added this to Garbo Nov 21, 2024
@Greenheart
Contributor Author

Idea: What if we could detect table images in the markdown and replace the markdown content with the data extracted by the vision API?

We add an image for every table when we render the document as markdown, which means we could find and replace sections of the markdown string to de-duplicate the table content.

const image = `![table image]({page: ${block.page_idx}, x: ${Math.round(
  bbox[0]
)}}, {y: ${Math.round(bbox[1])}, {width: ${Math.round(
  bbox[2] - bbox[0]
)}}, {height: ${Math.round(bbox[3] - bbox[1])}})`
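
A rough sketch of how the find-and-replace could work, assuming the placeholder always starts with `![table image]({page: N,` and contains no `)` before its final closing parenthesis (which holds for the template above as long as the values are rounded numbers). The function name `replaceTableImages` and the regex are illustrative, not existing code. It only swaps the image placeholder for the extracted table; removing the low-precision table text next to the image depends on how the renderer lays it out and is not covered here.

```typescript
// Fields taken from the tables used in nlmExtractTables.
type ExtractedTable = { page_idx: number; name: string; markdown: string }

// Matches one table image placeholder and captures its page number.
const TABLE_IMAGE = /!\[table image\]\(\{page: (\d+),[^)\n]*\)/g

// Replace each table image with the next extracted table from the same page.
// Placeholders without a matching extracted table are left untouched.
function replaceTableImages(
  markdownText: string,
  tables: ExtractedTable[]
): string {
  // Group tables per page, keeping their original order.
  const byPage = new Map<number, ExtractedTable[]>()
  for (const table of tables) {
    const queue = byPage.get(table.page_idx) ?? []
    queue.push(table)
    byPage.set(table.page_idx, queue)
  }

  return markdownText.replace(
    TABLE_IMAGE,
    (placeholder: string, page: string) => {
      const next = byPage.get(Number(page))?.shift()
      return next
        ? `#### ${next.name} (Extracted table with more precision from PAGE ${next.page_idx})\n\n${next.markdown}`
        : placeholder
    }
  )
}
```

Matching tables to images in page order is an assumption; if the vision API returns tables in a different order than they appear on the page, the bounding box values in the placeholder could be used for matching instead.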
