Unexpected References Appearing in Docling OCR JSON Results #1991

@manikrishna-m

Hi Everyone,

I'm using Docling's basic OCR to convert a PDF into text, and I save the output as a JSON file. In one of the PDFs I'm processing, the JSON result includes a group (Group 4), which lists its children as references to texts 61 through 77. Here's a simplified snippet of what the raw JSON looks like:

{
  "children": [
    {"$ref": "#/texts/61"},
    {"$ref": "#/texts/62"},
    {"$ref": "#/texts/63"},
    {"$ref": "#/texts/64"},
    {"$ref": "#/texts/65"},
    {"$ref": "#/texts/66"},
    {"$ref": "#/texts/67"},
    {"$ref": "#/texts/68"},
    {"$ref": "#/texts/69"},
    {"$ref": "#/texts/70"},
    {"$ref": "#/texts/71"},
    {"$ref": "#/texts/72"},
    {"$ref": "#/texts/73"},
    {"$ref": "#/texts/74"},
    {"$ref": "#/texts/75"},
    {"$ref": "#/texts/76"},
    {"$ref": "#/texts/77"}
  ]
}

However, after loading this JSON using:

self.docling_doc = DoclingDocument.model_validate(doc_dict)

the children list in Group 4 unexpectedly changes. Here's what I get:

[
  RefItem(cref='#/texts/61'),
  RefItem(cref='#/texts/62'),
  RefItem(cref='#/texts/63'),
  RefItem(cref='#/texts/64'),
  RefItem(cref='#/texts/65'),
  RefItem(cref='#/texts/66'),
  RefItem(cref='#/texts/67'),
  RefItem(cref='#/texts/68'),
  RefItem(cref='#/groups/45'),  # Unexpected
  RefItem(cref='#/texts/69'),
  RefItem(cref='#/groups/44'),  # Unexpected
  RefItem(cref='#/texts/70'),
  RefItem(cref='#/texts/71'),
  RefItem(cref='#/texts/72'),
  RefItem(cref='#/texts/73'),
  RefItem(cref='#/texts/74'),
  RefItem(cref='#/texts/75')
]

As you can see, the parsed list now contains references to #/groups/44 and #/groups/45, which were not present in the original JSON. In addition, #/texts/76 and #/texts/77 are missing from the parsed result, so the list still has 17 entries.
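To make the discrepancy easier to reproduce, here is a minimal sketch that diffs the two listings above. It uses only plain Python and the values copied from this post; the helper names `raw_refs` and `diff_refs` are hypothetical and not part of Docling.

```python
from typing import Iterable


def raw_refs(children: Iterable[dict]) -> list[str]:
    """Collect the "$ref" strings from a raw JSON children list."""
    return [child["$ref"] for child in children]


def diff_refs(before: list[str], after: list[str]) -> dict[str, list[str]]:
    """Report refs that appear only after validation and refs that disappeared."""
    before_set, after_set = set(before), set(after)
    return {
        "added": [ref for ref in after if ref not in before_set],
        "missing": [ref for ref in before if ref not in after_set],
    }


# Children of Group 4 as they appear in the raw JSON (texts 61 through 77).
raw_children = [{"$ref": f"#/texts/{i}"} for i in range(61, 78)]

# cref values as they appear after model_validate, copied from the listing above.
parsed_crefs = (
    [f"#/texts/{i}" for i in range(61, 69)]
    + ["#/groups/45", "#/texts/69", "#/groups/44"]
    + [f"#/texts/{i}" for i in range(70, 76)]
)

report = diff_refs(raw_refs(raw_children), parsed_crefs)
print(report)
```

On a real document, the second list could presumably be built with something like `[c.cref for c in self.docling_doc.groups[4].children]`, and the first from `doc_dict["groups"][4]["children"]`. Both assume that "Group 4" sits at index 4 of `groups`, which I have not verified.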

Can anyone help me understand why this is happening? Is it a parsing issue with model_validate, or could the input JSON be getting altered during validation?

Happy to provide more details if needed.

Thanks in advance!
