Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add content_layer property to items to address body, furniture and other roles #735

Merged
merged 10 commits into from
Feb 10, 2025

Conversation

cau-git
Copy link
Contributor

@cau-git cau-git commented Jan 13, 2025

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Copy link

mergify bot commented Jan 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@cau-git cau-git changed the title feat: Populate page-headers and page-footers in DoclingDocument.furniture fix: Populate page-headers and page-footers in DoclingDocument.furniture Jan 14, 2025
@cau-git cau-git requested review from vagenas and dolfim-ibm January 14, 2025 12:39
@dolfim-ibm dolfim-ibm marked this pull request as draft January 21, 2025 07:30
@dolfim-ibm
Copy link
Contributor

One proposal to consider would also be:

furniture:
  children:
  - $ref: '#/texts/0'
  - $ref: '#/texts/1'
  label: unspecified
  name: _root_
  parent: null
  self_ref: '#/furniture'

body:
  children:
  - $ref: '#/furniture/0'
  - $ref: '#/texts/2'
  - $ref: '#/texts/3'
  - $ref: '#/furniture/1'
  label: unspecified
  name: _root_
  parent: null
  self_ref: '#/body'

This would require changes in how the DoclingDocument.add_text() works, and the document.iterate_items() would add an option to skip or not the #/furniture references.

@vagenas
Copy link
Contributor

vagenas commented Jan 30, 2025

For reference, this can be related to #667.

@cau-git
Copy link
Contributor Author

cau-git commented Feb 3, 2025

One proposal to consider would also be:

furniture:
  children:
  - $ref: '#/texts/0'
  - $ref: '#/texts/1'
  label: unspecified
  name: _root_
  parent: null
  self_ref: '#/furniture'

body:
  children:
  - $ref: '#/furniture/0'
  - $ref: '#/texts/2'
  - $ref: '#/texts/3'
  - $ref: '#/furniture/1'
  label: unspecified
  name: _root_
  parent: null
  self_ref: '#/body'

This would require changes in how the DoclingDocument.add_text() works, and the document.iterate_items() would add an option to skip or not the #/furniture references.

This would not work since there is no such thing as #/furniture/0 or #/furniture/1. It would only be the case if we change to something like furnitures as a field (similar to texts), which is a list. It would also make it unclear where to expect some content. E.g. a table in the furniture, should it be in tables or in furnitures (not both!)

@cau-git cau-git marked this pull request as ready for review February 3, 2025 15:01
@cau-git
Copy link
Contributor Author

cau-git commented Feb 3, 2025

This is updated to use a new field in NodeItem called content_layer. It will be set as furniture if a page-header or page-footer is detected in the PDF pipeline.

See DS4SD/docling-core#148

Signed-off-by: Christoph Auer <[email protected]>
@cau-git cau-git changed the title fix: Populate page-headers and page-footers in DoclingDocument.furniture feat: Add content_layer property to items to address body, furniture and other roles Feb 4, 2025
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Copy link
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@cau-git cau-git merged commit cf78d5b into main Feb 10, 2025
10 checks passed
@cau-git cau-git deleted the cau/handle-furniture branch February 10, 2025 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants