Passing custom metadata per document #8

Open
r-gg opened this issue Jan 28, 2025 · 0 comments

r-gg commented Jan 28, 2025

Issue Description

When converting multiple documents, I want to pass metadata fields that differ per document. Several built-in Haystack converters (e.g. MarkdownToDocument) already support this. As with those converters, it should be possible to pass either:

  1. a single dictionary whose fields will be added to the metadata of all chunks or
  2. a list of dictionaries having the same length as the list of passed documents (mapping fields of each dictionary to the metadata fields of the chunks of the respective document).

This is not available in the current implementation. The workaround of setting the metadata after conversion (with export type DOC_CHUNKS) does not work: when converting multiple documents (i.e. len(paths) > 1), it is difficult to track which chunks belong to which document. Several documents can share the same filename and binary_hash, so chunks from those documents cannot be traced back to the original document they came from.
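To illustrate the ambiguity, here is a minimal sketch. The flat chunk layout, the sample values, and the helper `group_by_origin` are assumptions made for illustration, not the converter's actual output format. Two different source paths that hold a file with the same name and the same content produce chunks whose identifying metadata is indistinguishable:

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical, simplified chunks as they might look after conversion.
# Both come from different paths, but filename and binary_hash are equal.
chunks: List[Dict[str, Any]] = [
    {"content": "chunk from customer_a/report.pdf",
     "meta": {"filename": "report.pdf", "binary_hash": 1234567890}},
    {"content": "chunk from customer_b/report.pdf",
     "meta": {"filename": "report.pdf", "binary_hash": 1234567890}},
]


def group_by_origin(items: List[Dict[str, Any]]) -> Dict[Tuple[str, int], List[str]]:
    """Group chunk contents by the only identifying fields available."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for item in items:
        key = (item["meta"]["filename"], item["meta"]["binary_hash"])
        groups[key].append(item["content"])
    return dict(groups)


groups = group_by_origin(chunks)
# Both chunks land under the same key, so per-document metadata
# cannot be attached afterwards.
print(len(groups))
```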

Possible Solution

Add an optional meta parameter to the component's DoclingConverter.run() method and extend the existing meta dictionaries (returned by the _meta_extractor) with the dictionary or dictionaries passed in the new meta parameter.
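A rough sketch of how this could work, independent of the actual component code. The helper names `normalize_meta` and `merge_meta` are hypothetical, and letting user-supplied fields override extracted ones on key collisions is an assumption that would need to be decided:

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def normalize_meta(meta: Optional[Union[Meta, List[Meta]]], n_sources: int) -> List[Meta]:
    """Turn None, a single dict, or a list of dicts into one dict per source."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dict applies to every source; copy it so the
        # per-source dicts stay independent of each other.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        raise ValueError(
            f"Got {len(meta)} meta dicts for {n_sources} sources; the lengths must match."
        )
    return [dict(m) for m in meta]


def merge_meta(extracted: Meta, user: Meta) -> Meta:
    """Extend the extracted meta with user fields (user fields win on collision)."""
    return {**extracted, **user}


# Example: two sources, one meta dict each.
per_source = normalize_meta([{"customer": "a"}, {"customer": "b"}], n_sources=2)
chunk_meta = merge_meta({"filename": "report.pdf"}, per_source[1])
print(chunk_meta)
```

Inside run(), the normalized list would be iterated alongside the paths, so every chunk produced from a given path receives that path's dictionary.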
