```typescript
'\n\n This is some of the important tables from the markdown with more precision:' +
  tables
    .map(
      ({ page_idx, name, markdown }) =>
        `\n#### ${name} (Extracted table with more precision from PAGE ${page_idx})\n\n${markdown}`
    )
    .join('\n'),
```
It might be worth de-duplicating the table data from the report, since we know the page numbers of those tables. We could filter the JSON data before formatting it to markdown, removing table blocks on pages where a higher-precision extraction already exists.

This could improve the quality of the context returned from the vector DB, since it would reduce the likelihood of duplicated table data, for example when we get a paragraph from the regular PDF extraction and then get the same table data extracted via `nlmExtractTables.ts` and the Vision API.
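The filtering step above could be sketched roughly like this. Note that the `Block` and `ExtractedTable` shapes here are assumptions for illustration, not the actual types in garbo; the real parsed-PDF JSON may label blocks differently:

```typescript
// Hypothetical shapes: the parsed PDF JSON is assumed to carry a page index
// and a block type, and the precision-extracted tables their source page.
interface Block {
  page_idx: number
  type: 'paragraph' | 'table'
  content: string
}

interface ExtractedTable {
  page_idx: number
  name: string
  markdown: string
}

// Drop table blocks on pages where a higher-precision extraction exists,
// so the same table is not embedded twice in the vector DB.
function dedupeBlocks(blocks: Block[], tables: ExtractedTable[]): Block[] {
  const tablePages = new Set(tables.map((t) => t.page_idx))
  return blocks.filter(
    (block) => !(block.type === 'table' && tablePages.has(block.page_idx))
  )
}
```

Paragraph blocks on the same page are kept, so surrounding context is preserved; only the redundant table representation is dropped.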
Idea: what if we could detect table images in the markdown and replace that markdown content with the data extracted by the Vision API?

We already add an image for every table when we render the document as markdown, which means we could find and replace those sections of the markdown string to de-duplicate the table content.
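A minimal sketch of that find-and-replace, assuming each table image link's filename encodes the page index (e.g. `table-3.png`); the actual naming convention in garbo may differ:

```typescript
// Replace markdown image links pointing at table renderings with the
// higher-precision table markdown extracted by the Vision API.
// Assumption: filenames contain `table-<page_idx>` — adjust to the real scheme.
function replaceTableImages(
  markdown: string,
  tables: { page_idx: number; name: string; markdown: string }[]
): string {
  return tables.reduce((doc, table) => {
    const imagePattern = new RegExp(
      `!\\[[^\\]]*\\]\\([^)]*table-${table.page_idx}[^)]*\\)`,
      'g'
    )
    return doc.replace(
      imagePattern,
      `#### ${table.name} (page ${table.page_idx})\n\n${table.markdown}`
    )
  }, markdown)
}
```

This keeps the table content at its original position in the document, so chunking for the vector DB still sees it in context instead of as a detached appendix.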
In `nlmExtractTables`, we store the emission tables twice in the vector DB:

garbo/src/workers/nlmExtractTables.ts, lines 110 to 118 in 649e8c4