Semantic Chunking with Layout Extraction

Chris Sweet edited this page Feb 22, 2025 · 8 revisions

Leveraging Document Layout for Semantic Chunking & Data Extraction

Modern documents mix complex layouts with diverse content types. By using both geometric and logical information, you can segment documents into semantically meaningful chunks, which improves data extraction accuracy.

Key Concepts

  • Geometric Roles:
    Information about spatial position (e.g., bounding boxes for paragraphs, tables, figures).

  • Logical Roles:
    Structural cues (e.g., headings, titles, footers) that indicate section boundaries.

  • Key-Value Pairs:
    Structured data found in forms that link labels (keys) with values.
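These three concepts can be captured in a small data model. The sketch below is illustrative only; the class and field names are assumptions, not part of any extraction SDK:

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    """One extracted element carrying both geometric and logical information."""
    text: str
    bbox: tuple          # geometric role: (x_min, y_min, x_max, y_max)
    role: str = "body"   # logical role: "title", "heading", "footer", ...

@dataclass
class KeyValuePair:
    """A form field linking a label (key) with its value."""
    key: LayoutElement
    value: LayoutElement

header = LayoutElement("FINAL REPORT AMENDMENT", (100, 20, 500, 50), role="heading")
kv = KeyValuePair(
    key=LayoutElement("Study Name", (40, 80, 140, 100)),
    value=LayoutElement("Acute Toxicity of Reference Cigarette Smoke", (150, 80, 560, 100)),
)
print(header.role, kv.key.text)
```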

How Layout Aids Extraction

  • Proximity & Alignment:
    Spatially adjacent and aligned elements are likely related.

  • Logical Boundaries:
    Headings and titles naturally delimit sections.

  • Enhanced Merging:
    Combining geometric and logical cues helps merge key-value pairs and group related elements (e.g., merging tables with embedded key-pairs).
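The proximity and alignment cues above can be turned into a simple heuristic: treat two boxes as related when they overlap vertically (i.e., sit on the same visual row) and the horizontal gap between them is small. A sketch; the gap threshold is an illustrative assumption:

```python
def likely_related(box_a, box_b, max_gap=30):
    """Heuristic: boxes (x_min, y_min, x_max, y_max) are likely related if
    they share vertical extent (roughly row-aligned) and sit close together
    horizontally. max_gap is in the same units as the box coordinates."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    vertical_overlap = min(ay1, by1) - max(ay0, by0)
    if vertical_overlap <= 0:
        return False  # not on the same row
    horizontal_gap = max(bx0 - ax1, ax0 - bx1, 0)
    return horizontal_gap <= max_gap

# A label box and its value box on the same row:
print(likely_related((40, 80, 140, 100), (150, 80, 560, 100)))  # True
```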

Workflow Overview

  1. OCR & Layout Extraction:
    Use an OCR engine (e.g., Azure OCR) to obtain text along with bounding boxes/polygons and logical roles.

  2. Preprocessing:
    Normalize and convert polygon data; extract elements from keys such as pages, paragraphs, tables, keyValuePairs, and sections.

  3. Semantic Chunking:

    • Group content using headings and spatial proximity.
    • Merge key-value pairs by combining text and bounding boxes.
  4. Downstream Processing:
    Process each semantic chunk using NLP (e.g., summarization, entity extraction) with improved context from layout data.
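Step 3's merging of key-value pairs can be sketched as combining the texts and taking the union of the two bounding boxes (illustrative code, not a specific library's API):

```python
def union_bbox(a, b):
    """Smallest box (x_min, y_min, x_max, y_max) enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_key_value(key, value):
    """key/value are dicts with 'text' and 'bbox'; returns one merged chunk."""
    return {
        "text": f"{key['text']}: {value['text']}",
        "bbox": union_bbox(key["bbox"], value["bbox"]),
    }

chunk = merge_key_value(
    {"text": "Study Name", "bbox": (40, 80, 140, 100)},
    {"text": "Acute Toxicity of Reference Cigarette Smoke", "bbox": (150, 80, 560, 100)},
)
print(chunk["text"])
print(chunk["bbox"])  # (40, 80, 560, 100)
```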

Benefits

  • Higher Accuracy:
    Combining text with layout cues improves data extraction.
  • Better Context:
    Semantic chunks align with the document’s visual structure.
  • Modular & Extensible:
    Easily adapt to various document formats and content types.

Experiments

Experiment 1: Using LLMs (Chuck's Funsd_Lexical_Graph.ipynb gist)

Output from the agent using the following prompt:

LAYOUT_PROMPT = """Analyze this document's layout structure following these specific steps.

<instructions>
1. First analyze the visual structure and spatial relationships
2. Then identify the logical reading order
3. Finally return a JSON structure describing the complete layout
</instructions>

<example>
{
    "document_type": "form",
    "layout": {
        "regions": [
            {
                "id": "r1",
                "name": "header",
                "type": "header",
                "order": 1,
                "position": "top of page"
            }
        ],
        "reading_flow": ["r1", "r2"],
        "spatial_relationships": [
            {
                "from_region": "r1",
                "to_region": "r2",
                "relationship": "above"
            }
        ]
    }
}
</example>

Return ONLY the JSON structure with no additional text."""
{'regions': [{'id': 'r1',
   'name': 'header_text',
   'type': 'header',
   'order': 1,
   'position': 'top center of page',
   'bounds': '',
   'explanation_of_decision': "The header 'FINAL REPORT AMENDMENT' is at the top and centered.",
   'decision_text': 'FINAL REPORT AMENDMENT',
   'next_regions': [],
   'contains': []},
  {'id': 'r2',
   'name': 'study_information',
   'type': 'information',
   'order': 2,
   'position': 'below header',
   'bounds': '',
   'explanation_of_decision': 'Contains details about the study, immediately under the header.',
   'decision_text': 'Study Name Acute Toxicity of Reference Cigarette Smoke ...',
   'next_regions': [],
   'contains': []},
  {'id': 'r3',
   'name': 'reason_for_amendment',
   'type': 'paragraph',
   'order': 3,
   'position': 'below study information',
   'bounds': '',
   'explanation_of_decision': 'A section detailing the reason for the amendment follows the study information.',
   'decision_text': 'Reason for the Amendment...',
   'next_regions': [],
   'contains': []},
  {'id': 'r4',
   'name': 'amendment_details',
   'type': 'paragraph',
   'order': 4,
   'position': 'below reason for amendment',
   'bounds': '',
   'explanation_of_decision': 'Details about the amendment are provided under the reason for the amendment paragraph.',
   'decision_text': 'Amendment (Attach additional sheets as necessary)...',
   'next_regions': [],
   'contains': []},
  {'id': 'r5',
   'name': 'approvals_section',
   'type': 'table',
   'order': 5,
   'position': 'below amendment details',
   'bounds': '',
   'explanation_of_decision': 'The section containing approvals and signatures is at the bottom.',
   'decision_text': 'APPROVALS ACCEPT/REJECT...',
   'next_regions': [],
   'contains': []},
  {'id': 'r6',
   'name': 'received_by',
   'type': 'footer',
   'order': 6,
   'position': 'bottom of page',
   'bounds': '',
   'explanation_of_decision': 'Footer at the very bottom indicating received by information.',
   'decision_text': 'Received by REGULATORY AFFAIRS...',
   'next_regions': [],
   'contains': []}],
 'reading_flow': [],

Experiment 2: Using LLMs to find region bounding boxes

Bounding-box locations based on output from the following prompt:

LAYOUT_PROMPT = """Analyze this document's layout structure following these specific steps.

<instructions>
1. First analyze the visual structure and spatial relationships
2. Then add an explanation of your reasoning for each region in the field "explanation_of_decision"
3. Then identify any text that you used for your reasoning in the field "decision_text"
4. Then identify the logical reading order
5. Finally return a JSON structure describing the complete layout
</instructions>

<example>
{
    "document_type": "form",
    "layout": {
        "regions": [
            {
                "id": "r1",
                "name": "header",
                "type": "header",
                "order": 1,
                "position": "top of page",
                "explanation_of_decision": "At top of page",
                "spatial_relationships": [
                    {
                        "from_region": "r1",
                        "to_region": "r2",
                        "relationship": "above"
                    }
                ]
            }
        ],
        "reading_flow": ["r1", "r2"]
    }
}
</example>

Return ONLY the JSON structure with no additional text."""

[image]

Experiment 3: Use OCR bounding boxes to analyze layout

Below are the results of a prototype Python script that:

  1. Reads the OCR JSON.
  2. Sorts lines (or paragraphs) by bounding box y-coordinate.
  3. Heuristically detects headings vs. body lines.
  4. Groups lines under the last-detected heading (or in a “Misc” section if no heading is detected).
  5. Marks any line as handwritten if it overlaps a style’s offset and length.
  6. Outputs a hierarchical JSON structure containing headings, lines, bounding boxes, and handwriting flags.

You can adapt these heuristics or add domain-specific logic to suit your documents.
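The steps above can be sketched as follows. The field names (content, boundingPolygon with {x, y} points, span with {offset, length}, and styles with isHandwritten and spans) follow Azure Document Intelligence's layout output; for any other OCR engine they are assumptions to be adapted:

```python
def is_heading(text):
    """Heuristic: mostly-uppercase lines or lines ending in ':' are headings."""
    letters = [c for c in text if c.isalpha()]
    mostly_upper = letters and sum(c.isupper() for c in letters) / len(letters) > 0.8
    return bool(mostly_upper or text.rstrip().endswith(":"))

def check_handwriting(span, styles):
    """True if the line's (offset, length) overlaps any handwritten style span."""
    start, end = span["offset"], span["offset"] + span["length"]
    for style in styles:
        if not style.get("isHandwritten"):
            continue
        for s in style.get("spans", []):
            if start < s["offset"] + s["length"] and s["offset"] < end:
                return True
    return False

def group_by_headings(paragraphs, styles):
    """Sort top-to-bottom by minimal y, start a node per heading,
    and attach other lines to the last heading (or 'Misc')."""
    def approx_y(p):
        return min(pt["y"] for pt in p["boundingPolygon"])
    hierarchy = []
    for p in sorted(paragraphs, key=approx_y):
        line = {
            "text": p["content"],
            "bounding_box": p["boundingPolygon"],
            "is_handwritten": check_handwriting(p["span"], styles),
        }
        if is_heading(p["content"]):
            hierarchy.append({"heading": p["content"], "lines": []})
        else:
            if not hierarchy:
                hierarchy.append({"heading": "Misc", "lines": []})
            hierarchy[-1]["lines"].append(line)
    return hierarchy

paragraphs = [
    {"content": "APPROVALS:", "boundingPolygon": [{"x": 0, "y": 200}], "span": {"offset": 0, "length": 10}},
    {"content": "Signed by J. Doe", "boundingPolygon": [{"x": 0, "y": 220}], "span": {"offset": 11, "length": 16}},
]
styles = [{"isHandwritten": True, "spans": [{"offset": 11, "length": 16}]}]
print(group_by_headings(paragraphs, styles))
```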


Explanation & Customization

  1. is_heading function:

    • Checks for lines that are mostly uppercase or that end with “:”.
    • You can tweak this to better match your headings.
  2. check_handwriting function:

    • For each paragraph line, we see if (offset,length) overlaps with any style (offset,length) that has is_handwritten=True.
    • Overlap means they’re not disjoint. If overlap is found, returns True.
  3. Sorting & Grouping:

    • We approximate each paragraph’s position by the minimal y in its bounding polygon (approx_y).
    • Sort from top to bottom.
    • If is_heading, start a new node in hierarchy. Otherwise, add to the last heading’s lines.
  4. Output:

    • The final data structure is a list of headings, each with a lines array.
    • Each line has text, bounding_box, and is_handwritten.
    • If a line appears before any heading is found, it goes into a default “Misc” heading.
  5. Enhancements:

    • You could detect subheadings vs. main headings (e.g., bigger gaps or indentation).
    • You could store OCR confidences or per-word bounding boxes similarly.
    • You could unify multiple lines if they’re close in y-coordinates to form paragraphs.
    • If you want to handle more advanced layout (like columns or tables), you’d add more sophisticated heuristics or domain logic.
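The enhancement of unifying lines that sit close together vertically can be sketched as a single pass that merges consecutive lines whose gap falls below a threshold (the threshold value here is an illustrative assumption):

```python
def merge_close_lines(lines, max_gap=12):
    """lines: list of (text, y_top, y_bottom) tuples.
    Merge consecutive lines whose vertical gap is below max_gap,
    returning the resulting paragraph texts top-to-bottom."""
    paragraphs = []
    for text, top, bottom in sorted(lines, key=lambda l: l[1]):
        if paragraphs and top - paragraphs[-1][2] < max_gap:
            prev_text, prev_top, _ = paragraphs[-1]
            paragraphs[-1] = (prev_text + " " + text, prev_top, bottom)
        else:
            paragraphs.append((text, top, bottom))
    return [p[0] for p in paragraphs]

lines = [
    ("Reason for the", 100, 112),
    ("Amendment follows.", 118, 130),
    ("APPROVALS", 180, 192),
]
print(merge_close_lines(lines))  # ['Reason for the Amendment follows.', 'APPROVALS']
```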

With this script, you do hierarchical grouping and handwriting annotation before calling an LLM—giving you more control and a structured starting point. Then, if needed, you can pass the resulting JSON to an LLM to refine headings or annotate fields with domain-specific labels.

[image]

Experiment 5: Leveraging latest Azure Document Intelligence features

Document structure layout analysis is the process of analyzing a document to extract regions of interest and their inter-relationships. The goal is to extract text and structural elements from the page to build better semantic understanding models. There are two types of roles in a document layout:

  • Geometric roles: Text, tables, figures, and selection marks are examples of geometric roles.
  • Logical roles: Titles, headings, and footers are examples of logical roles of text.

The following illustration shows the typical components in an image of a sample page.

[image: IMG_0418]
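The layout model's analyze result can be walked to separate logical from geometric elements. A sketch, assuming the JSON shape of a prebuilt-layout result (paragraphs with an optional role field, plus tables and keyValuePairs keys); adapt the key names to your service version:

```python
def split_roles(layout_result):
    """Partition a prebuilt-layout style result dict into logical elements
    (paragraphs carrying a role such as 'title' or 'sectionHeading') and
    geometric elements (plain paragraphs, tables, key-value pairs)."""
    logical, geometric = [], []
    for para in layout_result.get("paragraphs", []):
        item = {"kind": "paragraph", "role": para.get("role"), "content": para["content"]}
        (logical if para.get("role") else geometric).append(item)
    for table in layout_result.get("tables", []):
        geometric.append({"kind": "table", "cells": len(table.get("cells", []))})
    for kv in layout_result.get("keyValuePairs", []):
        geometric.append({"kind": "keyValuePair", "key": kv["key"]["content"]})
    return logical, geometric

sample = {
    "paragraphs": [
        {"role": "title", "content": "FINAL REPORT AMENDMENT"},
        {"content": "Reason for the Amendment..."},
    ],
    "tables": [{"cells": [{}, {}]}],
    "keyValuePairs": [{"key": {"content": "Study Name"}, "value": {"content": "Acute Toxicity..."}}],
}
logical, geometric = split_roles(sample)
print(len(logical), len(geometric))  # 1 3
```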

Segmentation/Regions for Atomic Key-Pairs

[image]

Segmentation/Regions for Atomic Key-Pairs and Atomic Tables

[image]

Segmentation/Regions for Atomic Key-Pairs and Atomic Tables with role paragraphs as clusters

[image]

Same as above but using GPT's semantic designations for the regions

image