Updated Code for PDF Parsing Affects HTML Parsing #4

Open
alexander-singh opened this issue Mar 8, 2024 · 2 comments

The updated embed code in vectorizor.py assumes a different train.jsonl structure than the one chunker.py produces. It appears the code was updated to match the new pdf-muncher.py, but the two structures are not consistent:

chunker.py creates items with a {id:"id",text:"text",source:"source"} structure:

{
    'id': f'{uid}-{i}',
    'text': chunk,
    'source': file_path
}

pdf-muncher.py creates items with this structure:

{
    'id': f'{uid}-{i}',
    'pageContent': chunk,  # Use the key 'pageContent' instead of 'text'
    'metadata': {
        'txtPath': file_path
    }
}

vectorizor.py expects the latter format and returns an error when no PDFs are parsed.
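
One possible workaround, sketched here under stated assumptions, is to normalize chunker.py items into the pdf-muncher.py shape before they reach vectorizor.py. The helper names `normalize_record` and `normalize_jsonl` are hypothetical and not part of the repository. The sketch also assumes train.jsonl holds one JSON object per line, and that mapping 'source' to metadata.txtPath is acceptable for the embed step:

```python
import json


def normalize_record(item: dict) -> dict:
    """Return an item in the pdf-muncher.py shape (hypothetical helper)."""
    if "pageContent" in item:
        # Already in the pdf-muncher.py shape; return a shallow copy unchanged.
        return dict(item)
    # chunker.py shape: map 'text' -> 'pageContent' and 'source' -> metadata.txtPath.
    return {
        "id": item["id"],
        "pageContent": item["text"],
        "metadata": {"txtPath": item["source"]},
    }


def normalize_jsonl(lines):
    """Normalize an iterable of JSON Lines strings, skipping blank lines."""
    return [normalize_record(json.loads(line)) for line in lines if line.strip()]


# Example with a chunker.py-style item (values are made up for illustration).
html_item = {"id": "abc-0", "text": "hello", "source": "docs/page.html"}
print(normalize_record(html_item))
```

Running both kinds of items through the same function would let the embed step see a single structure regardless of whether any PDFs were parsed. Whether this matches what vectorizor.py actually reads beyond the keys quoted above is not confirmed here.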

d-neri commented Mar 12, 2024

FYI I got around this for now by using the older code in this commit: 5b6121e

Sstobo (Owner) commented Mar 14, 2024

Thanks for the feedback! I'll get on it as soon as possible.
