Issue Description
When converting multiple documents, I want to pass several metadata fields that differ for each document. This functionality is available in several of the default Haystack converters (e.g. MarkdownToDocument). Just like in the default Haystack converters, one should be able to pass either:
a single dictionary whose fields will be added to the metadata of all chunks or
a list of dictionaries having the same length as the list of passed documents (mapping fields of each dictionary to the metadata fields of the chunks of the respective document).
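To make the two accepted shapes concrete, here is a minimal sketch of how a `meta` argument could be normalized into one dictionary per input document. The helper name `normalize_meta` and its signature are hypothetical and written for illustration only; this is not the code used by Haystack or by this component.

```python
from typing import Any, Dict, List, Optional, Union

MetaArg = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: MetaArg, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dict per source document (hypothetical helper)."""
    if meta is None:
        # No user metadata: every document gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied for every document.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            f"Got {len(meta)} meta dicts for {sources_count} documents; "
            "the lengths must match."
        )
    # A list is mapped position by position onto the documents.
    return [dict(m) for m in meta]


# One dict shared by all documents
shared = normalize_meta({"project": "demo"}, sources_count=2)

# One dict per document, in the same order as the passed paths
per_doc = normalize_meta([{"owner": "alice"}, {"owner": "bob"}], sources_count=2)
```

Raising on a length mismatch is a design choice made in this sketch: it surfaces misaligned inputs early instead of silently dropping or repeating metadata.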
This is, however, not present in the current implementation. A workaround in which the metadata is set after conversion (with export type DOC_CHUNKS) is not possible, for the following reason: when working with multiple documents (i.e. len(paths) > 1), it is difficult to track which chunks belong to which document. Some documents can have the same filename and binary_hash, so for chunks belonging to these documents it is impossible to tell which original document a chunk came from.
Possible Solution
Add an optional meta parameter to the component's DoclingConverter.run() method and extend the existing meta dictionaries (returned by the _meta_extractor) with the dictionary or dictionaries passed in the new meta parameter.
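The merge step could look roughly like the sketch below. It assumes the extracted metadata is available grouped per input document, which is an assumption about the internals and not something confirmed here. The function name `extend_chunk_meta` and the example field `source_id` are hypothetical. Letting user-supplied keys override extracted ones on a collision is also an assumption.

```python
from typing import Any, Dict, List


def extend_chunk_meta(
    extracted_per_doc: List[List[Dict[str, Any]]],
    user_meta_per_doc: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge per-document user meta into every chunk's extracted meta.

    Hypothetical helper: extracted_per_doc[i] holds the extracted meta
    dicts of all chunks of document i.
    """
    if len(extracted_per_doc) != len(user_meta_per_doc):
        raise ValueError("Need exactly one user meta dict per document.")
    merged: List[Dict[str, Any]] = []
    for chunk_metas, user_meta in zip(extracted_per_doc, user_meta_per_doc):
        for chunk_meta in chunk_metas:
            # Assumption: user-supplied keys win on a key collision.
            merged.append({**chunk_meta, **user_meta})
    return merged


# Two documents that share filename and binary_hash become distinguishable
# once a per-document field (here the hypothetical "source_id") is attached.
extracted = [
    [{"filename": "report.pdf", "binary_hash": 123}],
    [{"filename": "report.pdf", "binary_hash": 123}],
]
result = extend_chunk_meta(extracted, [{"source_id": "a"}, {"source_id": "b"}])
```

Because the user metadata is attached while the chunks are still grouped by their source document, the ambiguity described above never arises.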