Updated Code for PDF Parsing Affects HTML Parsing #4

alexander-singh · 2024-03-08T17:55:18Z

The updated embedding code in vectorizor.py assumes a different train.jsonl structure than the one chunker.py produces. It appears the code was updated to match the new pdf-muncher.py file, but the two scripts do not write the same record structure:

chunker.py creates items with a {id: "id", text: "text", source: "source"} structure:

{
    'id': f'{uid}-{i}',
    'text': chunk,
    'source': file_path
}

pdf-muncher.py creates items with this structure:

{
    'id': f'{uid}-{i}',
    'pageContent': chunk,  # Use the key 'pageContent' instead of 'text'
    'metadata': {
        'txtPath': file_path
    }
}

vectorizor.py expects the latter format and raises an error when no PDFs are parsed.
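One possible workaround, sketched below, is to normalize every record to the pdf-muncher.py shape before embedding, so that output from either script is accepted. This is only a sketch under assumptions: `normalize_record` and `load_records` are hypothetical helper names that do not exist in the repository, and it assumes vectorizor.py only needs the `id`, `pageContent`, and `metadata['txtPath']` keys shown above.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a record in the pdf-muncher.py shape, whichever script wrote it.

    Hypothetical helper, not part of the repository.
    """
    if "pageContent" in record:
        # Already in the pdf-muncher.py shape; pass it through unchanged.
        return record
    if "text" in record:
        # chunker.py shape: move 'text' and 'source' under the newer keys.
        return {
            "id": record["id"],
            "pageContent": record["text"],
            "metadata": {"txtPath": record.get("source")},
        }
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


def load_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse JSON Lines text and yield normalized records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield normalize_record(json.loads(line))
```

If something like this were used, the open file handle for train.jsonl could be passed straight to `load_records`, since iterating a text file yields one line at a time. The alternative fix is to make chunker.py write the same keys as pdf-muncher.py, which avoids the conversion step but requires regenerating existing train.jsonl files.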


d-neri · 2024-03-12T04:25:25Z

FYI I got around this for now by using the older code in this commit: 5b6121e

Sstobo · 2024-03-14T15:54:12Z

Thanks for the feedback! I'll get on it as soon as possible.
