Consider filtering out duplicated emission table data before we save to the vector DB #274

Greenheart · 2024-11-21T15:52:55Z

In nlmExtractTables, we store the emission tables twice in the vector DB.

Lines 110 to 118 in 649e8c4

    
    markdown:
      markdownText +
      '\n\n This is some of the important tables from the markdown with more precision:' +
      tables
        .map(
          ({ page_idx, name, markdown }) =>
            `\n#### ${name} (Extracted table with more precision from PAGE ${page_idx})\n\n${markdown}`
        )
        .join('\n'),

It might be worth de-duplicating the table data from the report, since we have the page numbers of those tables. Maybe we could filter the duplicated tables out of the JSON data before we format it as markdown.

This could improve the quality of the context returned from the vector DB, since it would reduce the likelihood of duplicated table data. For example, we might get one paragraph from the regular PDF extraction, and then get the same table data extracted via nlmExtractTables.ts and the Vision API.
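
A minimal sketch of the page-number filtering idea, to make it concrete. The `Block` shape (including the `tag === 'table'` check) and the helper name `dropDuplicatedTableBlocks` are assumptions for illustration, not the actual types in garbo:

```typescript
// Hypothetical shapes; the real JSON blocks and table objects may differ.
type Block = { page_idx: number; tag: string; text: string }
type ExtractedTable = { page_idx: number; name: string; markdown: string }

// Drop table blocks on pages where a higher-precision extracted table exists,
// so the same table is not stored twice once everything is formatted to markdown.
function dropDuplicatedTableBlocks(
  blocks: Block[],
  tables: ExtractedTable[]
): Block[] {
  const pagesWithExtractedTables = new Set(tables.map((t) => t.page_idx))
  return blocks.filter(
    (block) =>
      !(block.tag === 'table' && pagesWithExtractedTables.has(block.page_idx))
  )
}
```

This filters per page, so it would also drop a table block that the Vision API did not extract if another table on the same page was extracted. Matching on bounding boxes would be more precise, if that data is available at this point.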


Greenheart · 2024-11-26T11:39:11Z

Idea: What if we could detect table images in the markdown and replace the markdown content with the data extracted by the vision API?

We add an image for every table when we render the document as markdown, which means we could find and replace sections of the markdown string to de-duplicate the table content.

garbo/src/lib/jsonExtraction.ts

Lines 54 to 58 in 27bcfca

    
    const image = `![table image]({page: ${block.page_idx}, x: ${Math.round(
      bbox[0]
    )}}, {y: ${Math.round(bbox[1])}, {width: ${Math.round(
      bbox[2] - bbox[0]
    )}}, {height: ${Math.round(bbox[3] - bbox[1])}})`
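
A rough sketch of the find-and-replace idea, assuming the image placeholders keep the format shown above and that each extracted table carries `page_idx`, `name` and `markdown`. The regex and the helper name `replaceTableImages` are hypothetical:

```typescript
// Hypothetical shape, mirroring the fields used when formatting the tables.
type ExtractedTable = { page_idx: number; name: string; markdown: string }

// Matches `![table image]({page: N, ...})` and captures the page number.
// Assumes the placeholder contains no `)` before its closing parenthesis.
const TABLE_IMAGE = /!\[table image\]\(\{page: (\d+)[^)]*\)/g

function replaceTableImages(
  markdownText: string,
  tables: ExtractedTable[]
): string {
  // Queue tables per page so several tables on one page are used in order.
  const byPage = new Map<number, ExtractedTable[]>()
  for (const table of tables) {
    const queue = byPage.get(table.page_idx) ?? []
    queue.push(table)
    byPage.set(table.page_idx, queue)
  }

  return markdownText.replace(TABLE_IMAGE, (image: string, page: string) => {
    const next = byPage.get(Number(page))?.shift()
    // Keep the original placeholder when no extracted table exists for the page.
    return next ? `#### ${next.name}\n\n${next.markdown}` : image
  })
}
```

Note that this only swaps the image placeholder itself. If the lower-precision table text is rendered next to the image, removing that text would need extra handling.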

Greenheart added the question (Further information is requested) label Nov 21, 2024

Greenheart added this to Garbo Nov 21, 2024

Greenheart mentioned this issue Nov 27, 2024

RangeError: Invalid string length #276

Closed

kaylawoodbury added this to Klimatkollen main Feb 13, 2025
