Hi Everyone,
I'm using Docling's basic OCR to convert a PDF into text, and I save the output as a JSON file. For one of the PDFs I'm processing, the JSON contains a group (Group 4) whose children are references to texts 61 through 77. Here's a simplified snippet of the raw JSON:
{
"children": [
{"$ref": "#/texts/61"},
{"$ref": "#/texts/62"},
{"$ref": "#/texts/63"},
{"$ref": "#/texts/64"},
{"$ref": "#/texts/65"},
{"$ref": "#/texts/66"},
{"$ref": "#/texts/67"},
{"$ref": "#/texts/68"},
{"$ref": "#/texts/69"},
{"$ref": "#/texts/70"},
{"$ref": "#/texts/71"},
{"$ref": "#/texts/72"},
{"$ref": "#/texts/73"},
{"$ref": "#/texts/74"},
{"$ref": "#/texts/75"},
{"$ref": "#/texts/76"},
{"$ref": "#/texts/77"}
]
}
However, after loading this JSON with:
self.docling_doc = DoclingDocument.model_validate(doc_dict)
the children list of Group 4 in the loaded document no longer matches the raw JSON. Here's what I get:
[
RefItem(cref='#/texts/61'),
RefItem(cref='#/texts/62'),
RefItem(cref='#/texts/63'),
RefItem(cref='#/texts/64'),
RefItem(cref='#/texts/65'),
RefItem(cref='#/texts/66'),
RefItem(cref='#/texts/67'),
RefItem(cref='#/texts/68'),
RefItem(cref='#/groups/45'), # Unexpected
RefItem(cref='#/texts/69'),
RefItem(cref='#/groups/44'), # Unexpected
RefItem(cref='#/texts/70'),
RefItem(cref='#/texts/71'),
RefItem(cref='#/texts/72'),
RefItem(cref='#/texts/73'),
RefItem(cref='#/texts/74'),
RefItem(cref='#/texts/75')
]
As you can see, there are now references to #/groups/44 and #/groups/45, which were not present in the original JSON. Also, texts/76 and texts/77 are missing from the parsed result.
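To make the difference explicit, here is a minimal stdlib-only sketch that compares the two lists shown above. The helper names `raw_refs` and `diff_refs` are hypothetical and not part of Docling; with a real document, the parsed list would presumably be built from the `cref` attribute of each `RefItem`, as it appears in the output above.

```python
def raw_refs(children):
    """Extract the "$ref" strings from a raw JSON children list."""
    return [child["$ref"] for child in children]


def diff_refs(raw, parsed):
    """Return (missing, unexpected) references, preserving their order."""
    raw_set, parsed_set = set(raw), set(parsed)
    missing = [ref for ref in raw if ref not in parsed_set]
    unexpected = [ref for ref in parsed if ref not in raw_set]
    return missing, unexpected


# Children of Group 4 as they appear in the raw JSON (texts 61 through 77).
raw_children = [{"$ref": f"#/texts/{i}"} for i in range(61, 78)]

# Children of Group 4 as reported after model_validate (copied from above).
parsed = (
    [f"#/texts/{i}" for i in range(61, 69)]
    + ["#/groups/45", "#/texts/69", "#/groups/44"]
    + [f"#/texts/{i}" for i in range(70, 76)]
)

missing, unexpected = diff_refs(raw_refs(raw_children), parsed)
print("missing:", missing)        # ['#/texts/76', '#/texts/77']
print("unexpected:", unexpected)  # ['#/groups/45', '#/groups/44']
```

Both lists have 17 entries, so the two group references take the place of the two missing text references rather than being appended.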
Can anyone help me understand why this is happening? Is it a parsing issue with model_validate, or could the input JSON be getting altered during validation?
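One way to test the second possibility is to snapshot the dict before validation and compare afterwards. The sketch below is only an illustration: `mutates_input` is a hypothetical helper, and the two stand-in functions take the place of `DoclingDocument.model_validate`, which would be passed as `validate` in the real check.

```python
import copy


def mutates_input(data, validate):
    """Return True if calling validate(data) changes data in place."""
    snapshot = copy.deepcopy(data)
    validate(data)
    return snapshot != data


# Stand-ins for the real validator (assumptions for illustration only).
def pure_validate(d):
    # Builds a new object and leaves the input untouched.
    return copy.deepcopy(d)


def mutating_validate(d):
    # Simulates a validator that rewrites the input in place.
    d["children"].append({"$ref": "#/groups/44"})
    return d


doc = {"children": [{"$ref": "#/texts/61"}]}
print(mutates_input(copy.deepcopy(doc), pure_validate))      # False
print(mutates_input(copy.deepcopy(doc), mutating_validate))  # True
```

If the check returns False for the real call, the input dict is unchanged, and the difference has to come from how the model is built or from which group is being inspected after loading.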
Happy to provide more details if needed.
Thanks in advance!