Semantic Chunking with Layout Extraction
Modern documents mix complex layouts with diverse content types. By using both geometric and logical information, you can segment documents into semantically meaningful chunks, which improves data extraction accuracy.
- Geometric Roles: Information about spatial position (e.g., bounding boxes for paragraphs, tables, figures).
- Logical Roles: Structural cues (e.g., headings, titles, footers) that indicate section boundaries.
- Key-Value Pairs: Structured data found in forms that link labels (keys) with values.

Chunking relies on a few layout cues:

- Proximity & Alignment: Spatially adjacent and aligned elements are likely related (a small heuristic sketch follows this list).
- Logical Boundaries: Headings and titles naturally delimit sections.
- Enhanced Merging: Combining geometric and logical cues helps merge key-value pairs and group related elements (e.g., merging tables with embedded key-value pairs).
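Below is a minimal sketch of the proximity-and-alignment idea, assuming axis-aligned bounding boxes given as `(x0, y0, x1, y1)` tuples. The function names (`are_related`, `horizontal_overlap`, `vertical_gap`) and the thresholds are illustrative, not part of any library:

```python
# Heuristic: two layout elements are likely related if they are vertically
# close (proximity) and share horizontal extent (alignment).
# Boxes are (x0, y0, x1, y1); thresholds are in page units and should be tuned.

def horizontal_overlap(a, b):
    """Width of the horizontal overlap between two boxes (0 if disjoint)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))

def vertical_gap(a, b):
    """Vertical distance between two boxes (0 if they overlap vertically)."""
    if a[3] < b[1]:          # a is entirely above b
        return b[1] - a[3]
    if b[3] < a[1]:          # b is entirely above a
        return a[1] - b[3]
    return 0.0

def are_related(a, b, max_gap=0.3, min_overlap_ratio=0.5):
    """Related if vertically close and horizontally aligned."""
    narrower = min(a[2] - a[0], b[2] - b[0])
    aligned = narrower > 0 and horizontal_overlap(a, b) / narrower >= min_overlap_ratio
    return vertical_gap(a, b) <= max_gap and aligned

# Example: a label line sitting just above its value line.
print(are_related((1.0, 1.0, 3.0, 1.3), (1.0, 1.35, 3.0, 1.7)))  # True
```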
The processing pipeline:

- OCR & Layout Extraction: Use an OCR engine (e.g., Azure OCR) to obtain text along with bounding boxes/polygons and logical roles.
- Preprocessing: Normalize and convert polygon data; extract elements from keys such as `pages`, `paragraphs`, `tables`, `keyValuePairs`, and `sections` (a minimal extraction sketch follows this list).
- Semantic Chunking: Group content using headings and spatial proximity, and merge key-value pairs by combining text and bounding boxes.
- Downstream Processing: Process each semantic chunk using NLP (e.g., summarization, entity extraction) with improved context from layout data.
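The sketch below shows the preprocessing and key-value merging steps, assuming the OCR JSON follows the Azure layout shape (an `analyzeResult` object whose `paragraphs` and `keyValuePairs` entries carry `content` and `boundingRegions` with flat `polygon` lists). The key names and the `preprocess` helper are assumptions; adjust them to the actual OCR output:

```python
import json

def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon to (x0, y0, x1, y1)."""
    xs, ys = polygon[0::2], polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))

def union_bbox(boxes):
    """Smallest box enclosing all boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def preprocess(layout_json_path):
    with open(layout_json_path, encoding="utf-8") as f:
        result = json.load(f)["analyzeResult"]   # assumed top-level key

    # Paragraphs: text, logical role (title, sectionHeading, footer, ...), geometry.
    paragraphs = [
        {
            "text": p.get("content", ""),
            "role": p.get("role"),                # None for plain body text
            "bbox": polygon_to_bbox(p["boundingRegions"][0]["polygon"]),
        }
        for p in result.get("paragraphs", [])
    ]

    # Key-value pairs: merge key + value into one chunk with a combined bbox.
    kv_chunks = []
    for kv in result.get("keyValuePairs", []):
        parts = [kv["key"]] + ([kv["value"]] if kv.get("value") else [])
        boxes = [polygon_to_bbox(part["boundingRegions"][0]["polygon"]) for part in parts]
        kv_chunks.append({
            "text": ": ".join(part["content"] for part in parts),
            "bbox": union_bbox(boxes),
        })

    return paragraphs, kv_chunks
```

The merged key-value chunks can then be fed to downstream NLP alongside the paragraph chunks, each carrying its combined bounding box for context.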
This approach provides:

- Higher Accuracy: Combining text with layout cues improves data extraction.
- Better Context: Semantic chunks align with the document’s visual structure.
- Modular & Extensible: Easily adapt to various document formats and content types.
Output from the agent using the following prompt (a sketch of how such a call might be made appears after the output):
LAYOUT_PROMPT = """Analyze this document's layout structure following these specific steps.
<instructions>
1. First analyze the visual structure and spatial relationships
2. Then identify the logical reading order
3. Finally return a JSON structure describing the complete layout
</instructions>
<example>
{
"document_type": "form",
"layout": {
"regions": [
{
"id": "r1",
"name": "header",
"type": "header",
"order": 1,
"position": "top of page"
}
],
"reading_flow": ["r1", "r2"],
"spatial_relationships": [
{
"from_region": "r1",
"to_region": "r2",
"relationship": "above"
}
]
}
}
</example>
Return ONLY the JSON structure with no additional text."""
```python
{'regions': [{'id': 'r1',
'name': 'header_text',
'type': 'header',
'order': 1,
'position': 'top center of page',
'bounds': '',
'explanation_of_decision': "The header 'FINAL REPORT AMENDMENT' is at the top and centered.",
'decision_text': 'FINAL REPORT AMENDMENT',
'next_regions': [],
'contains': []},
{'id': 'r2',
'name': 'study_information',
'type': 'information',
'order': 2,
'position': 'below header',
'bounds': '',
'explanation_of_decision': 'Contains details about the study, immediately under the header.',
'decision_text': 'Study Name Acute Toxicity of Reference Cigarette Smoke ...',
'next_regions': [],
'contains': []},
{'id': 'r3',
'name': 'reason_for_amendment',
'type': 'paragraph',
'order': 3,
'position': 'below study information',
'bounds': '',
'explanation_of_decision': 'A section detailing the reason for the amendment follows the study information.',
'decision_text': 'Reason for the Amendment...',
'next_regions': [],
'contains': []},
{'id': 'r4',
'name': 'amendment_details',
'type': 'paragraph',
'order': 4,
'position': 'below reason for amendment',
'bounds': '',
'explanation_of_decision': 'Details about the amendment are provided under the reason for the amendment paragraph.',
'decision_text': 'Amendment (Attach additional sheets as necessary)...',
'next_regions': [],
'contains': []},
{'id': 'r5',
'name': 'approvals_section',
'type': 'table',
'order': 5,
'position': 'below amendment details',
'bounds': '',
'explanation_of_decision': 'The section containing approvals and signatures is at the bottom.',
'decision_text': 'APPROVALS ACCEPT/REJECT...',
'next_regions': [],
'contains': []},
{'id': 'r6',
'name': 'received_by',
'type': 'footer',
'order': 6,
'position': 'bottom of page',
'bounds': '',
'explanation_of_decision': 'Footer at the very bottom indicating received by information.',
'decision_text': 'Received by REGULATORY AFFAIRS...',
'next_regions': [],
'contains': []}],
'reading_flow': []}
```
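A call producing this kind of output could look roughly like the sketch below. It assumes the OpenAI Python SDK, a base64-encoded page image, and the `gpt-4o` model name; the repository's actual agent wiring may differ, and the `analyze_layout` helper is illustrative:

```python
import base64
import json
from openai import OpenAI  # assumption: OpenAI Python SDK is the agent backend

client = OpenAI()

def analyze_layout(image_path, prompt=LAYOUT_PROMPT, model="gpt-4o"):
    """Send one page image plus the layout prompt and parse the JSON reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # The prompt asks for JSON only, but strip stray code fences defensively.
    raw = response.choices[0].message.content.strip().strip("`")
    if raw.startswith("json"):
        raw = raw[len("json"):]
    return json.loads(raw)
```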
The locations of the bounding boxes are then based on output from the following prompt:
LAYOUT_PROMPT = """Analyze this document's layout structure following these specific steps.
<instructions>
1. First analyze the visual structure and spatial relationships
2. Then add an explanation for your reasoning for each region in the field "explanation_of_decision"
3. Then identify any text that you used for your reasoning in the field "decision_text"
3. Then identify the logical reading order
4. Finally return a JSON structure describing the complete layout
</instructions>
<example>
{
"document_type": "form",
"layout": {
"regions": [
{
"id": "r1",
"name": "header",
"type": "header",
"order": 1,
"position": "top of page",
"reading_flow": ["r1", "r2"],
"explanation_of_decision": "At top of page",
"spatial_relationships": [
{
"from_region": "r1",
"to_region": "r2",
"relationship": "above"
}
]
}
}
</example>
Return ONLY the JSON structure with no additional text."""
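Since the agent returns empty `bounds`, one way to locate each region is to match its `decision_text` back to the OCR paragraphs and take the union of the matching boxes. The sketch below is an assumption about how that mapping could work; it reuses `union_bbox` and the paragraph list from the preprocessing sketch above, and `locate_region_bounds` and the naive substring matching are hypothetical:

```python
def locate_region_bounds(regions, paragraphs):
    """Fill each region's 'bounds' by matching its decision_text to OCR paragraphs.

    `paragraphs` is the list produced by the preprocessing sketch
    (dicts with 'text' and 'bbox'). Matching is a normalized substring
    test; fuzzy matching would be more robust in practice.
    """
    def norm(s):
        return " ".join(s.lower().split())

    for region in regions:
        anchor = norm(region.get("decision_text", "")).rstrip(". ")
        matches = [
            p["bbox"] for p in paragraphs
            if anchor and norm(p["text"])
            and (anchor in norm(p["text"]) or norm(p["text"]) in anchor)
        ]
        if matches:
            region["bounds"] = union_bbox(matches)
    return regions
```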
Below are results from a prototype Python script that:

- Reads your OCR JSON (the sample you provided).
- Sorts lines (or paragraphs) by bounding box y-coordinate.
- Heuristically detects headings vs. body lines.
- Groups lines under the last-detected heading (or in a “Misc” section if no heading is detected).
- Marks any line as handwritten if it overlaps a style’s `offset` and `length`.
- Outputs a hierarchical JSON structure containing headings, lines, bounding boxes, and handwriting flags.
You can adapt these heuristics or add domain-specific logic to suit your documents.
- `is_heading` function (see the sketch after this list):
  - Checks for lines that are mostly uppercase or that end with “:”.
  - You can tweak this to better match your headings.
- `check_handwriting` function:
  - For each paragraph line, we check whether its `(offset, length)` span overlaps with any style `(offset, length)` that has `is_handwritten=True`.
  - Overlap means they are not disjoint; if an overlap is found, it returns True.
- Sorting & Grouping:
  - We approximate each paragraph’s position by the minimal `y` in its bounding polygon (`approx_y`) and sort from top to bottom.
  - If `is_heading` is True, start a new node in `hierarchy`. Otherwise, add to the last heading’s `lines`.
- Output:
  - The final data structure is a list of headings, each with a `lines` array.
  - Each line has `text`, `bounding_box`, and `is_handwritten`.
  - If a line appears before any heading is found, it goes into a default “Misc” heading.
- Enhancements:
  - You could detect subheadings vs. main headings (e.g., bigger gaps or indentation).
  - You could store OCR confidences or per-word bounding boxes similarly.
  - You could unify multiple lines that are close in y-coordinate to form paragraphs.
  - If you want to handle more advanced layouts (like columns or tables), you would add more sophisticated heuristics or domain logic.
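A consolidated sketch of these heuristics is shown below. It assumes the OCR JSON exposes `paragraphs` (with `content`, `boundingRegions`/`polygon`, and `spans`) and `styles` (with `isHandwritten` and `spans`), as in Azure's layout output; those key names, and the `build_hierarchy` entry point, are assumptions rather than the repository's exact script:

```python
import json

def is_heading(text):
    """Heuristic: mostly-uppercase lines or lines ending with ':' are headings."""
    stripped = text.strip()
    letters = [c for c in stripped if c.isalpha()]
    mostly_upper = bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.8
    return bool(stripped) and (mostly_upper or stripped.endswith(":"))

def check_handwriting(span, styles):
    """True if (offset, length) overlaps any handwritten style span."""
    start, end = span["offset"], span["offset"] + span["length"]
    for style in styles:
        if not style.get("isHandwritten"):
            continue
        for s in style.get("spans", []):
            s_start, s_end = s["offset"], s["offset"] + s["length"]
            if start < s_end and s_start < end:   # spans are not disjoint
                return True
    return False

def approx_y(polygon):
    """Approximate vertical position: minimal y in the bounding polygon."""
    return min(polygon[1::2])

def build_hierarchy(ocr_path):
    with open(ocr_path, encoding="utf-8") as f:
        result = json.load(f)["analyzeResult"]     # assumed top-level key

    styles = result.get("styles", [])
    paragraphs = sorted(
        result.get("paragraphs", []),
        key=lambda p: approx_y(p["boundingRegions"][0]["polygon"]),
    )

    hierarchy, current = [], None
    for p in paragraphs:
        line = {
            "text": p["content"],
            "bounding_box": p["boundingRegions"][0]["polygon"],
            "is_handwritten": any(check_handwriting(s, styles) for s in p.get("spans", [])),
        }
        if is_heading(p["content"]):
            current = {"heading": p["content"], "lines": []}
            hierarchy.append(current)
        else:
            if current is None:                    # content before any heading
                current = {"heading": "Misc", "lines": []}
                hierarchy.append(current)
            current["lines"].append(line)
    return hierarchy
```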
With this script, you do hierarchical grouping and handwriting annotation before calling an LLM—giving you more control and a structured starting point. Then, if needed, you can pass the resulting JSON to an LLM to refine headings or annotate fields with domain-specific labels.
Document structure layout analysis is the process of analyzing a document to extract regions of interest and their inter-relationships. The goal is to extract text and structural elements from the page to build better semantic understanding models. There are two types of roles in a document layout:

- Geometric roles: Text, tables, figures, and selection marks are examples of geometric roles.
- Logical roles: Titles, headings, and footers are examples of logical roles of text.

The following illustration shows the typical components in an image of a sample page.