| 1 | +# How to Configure LandingAI ADE |
| 2 | + |
| 3 | +This guide shows you how to configure the LandingAI ADE (Agentic Document Extraction) driver for document processing: setting the API key, choosing a hosted environment or a custom endpoint, and reading the metadata and content the driver returns. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +- Parxy installed with LandingAI support: `pip install "parxy[landingai]"`, or with uv: `uv add "parxy[landingai]"` (the quotes stop shells such as zsh from expanding the brackets) |
| 8 | +- A LandingAI API key from [LandingAI](https://landing.ai/) |
| 9 | + |
| 10 | +## Quick Start |
| 11 | + |
| 12 | +### Step 1: Set Your API Key |
| 13 | + |
| 14 | +Create a `.env` file in your project directory: |
| 15 | + |
| 16 | +```bash |
| 17 | +PARXY_LANDINGAI_API_KEY=your-api-key-here |
| 18 | +``` |
| 19 | + |
| 20 | +Or set it as an environment variable: |
| 21 | + |
| 22 | +```bash |
| 23 | +export PARXY_LANDINGAI_API_KEY=your-api-key-here |
| 24 | +``` |
| 25 | + |
| 26 | +### Step 2: Parse a Document |
| 27 | + |
| 28 | +```python |
| 29 | +from parxy_core.facade.parxy import Parxy |
| 30 | + |
| 31 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 32 | +print(f"Processed {len(doc.pages)} pages") |
| 33 | +``` |
| 34 | + |
| 35 | +## Configuration Options |
| 36 | + |
| 37 | +LandingAI ADE supports configuration options that control API connectivity. These can be set via environment variables or programmatic configuration. |
| 38 | + |
| 39 | +### Environment Variables |
| 40 | + |
| 41 | +All LandingAI configuration uses environment variables with the `PARXY_LANDINGAI_` prefix: |
| 42 | + |
| 43 | +| Variable | Type | Default | Description | |
| 44 | +|----------|------|---------|-------------| |
| 45 | +| `PARXY_LANDINGAI_API_KEY` | string | None | Your LandingAI API key | |
| 46 | +| `PARXY_LANDINGAI_ENVIRONMENT` | string | `eu` | API environment (`production` or `eu`) | |
| 47 | +| `PARXY_LANDINGAI_BASE_URL` | string | None | Custom API endpoint (overrides environment) | |
| 48 | + |
| 49 | +### Environment Options |
| 50 | + |
| 51 | +LandingAI offers two hosted environments: |
| 52 | + |
| 53 | +| Environment | API Endpoint | Description | |
| 54 | +|-------------|--------------|-------------| |
| 55 | +| `production` | `https://api.va.landing.ai` | US-based production environment | |
| 56 | +| `eu` | `https://api.va.eu-west-1.landing.ai` | EU-based environment (default) | |
| 57 | + |
| 58 | +To use the US production environment: |
| 59 | + |
| 60 | +```bash |
| 61 | +PARXY_LANDINGAI_ENVIRONMENT=production |
| 62 | +``` |
| 63 | + |
| 64 | +### Custom Base URL |
| 65 | + |
| 66 | +If you need to use a custom endpoint (e.g., self-hosted or enterprise deployment), set the base URL directly and clear the environment setting (leave it empty in `.env`, or pass `None` in Python): |
| 67 | + |
| 68 | +```bash |
| 69 | +PARXY_LANDINGAI_BASE_URL=https://your-custom-endpoint.example.com |
| 70 | +PARXY_LANDINGAI_ENVIRONMENT= |
| 71 | +``` |
| 72 | + |
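The precedence described above (a base URL wins over the named environment, and `eu` is the default) can be sketched as a small lookup. This is only an illustration built from the two tables in this guide: `resolve_endpoint` is a hypothetical helper, not part of Parxy, and raising an error when neither value is given is an assumption of the sketch.

```python
# Hosted endpoints, copied from the "Environment Options" table.
ENDPOINTS = {
    "production": "https://api.va.landing.ai",
    "eu": "https://api.va.eu-west-1.landing.ai",
}
DEFAULT_ENVIRONMENT = "eu"


def resolve_endpoint(environment=DEFAULT_ENVIRONMENT, base_url=None):
    """Hypothetical helper: pick the URL a client would talk to."""
    # A custom base URL overrides whatever environment is configured.
    if base_url:
        return base_url.rstrip("/")
    # Assumption of this sketch: no base URL and no environment is an error.
    if not environment:
        raise ValueError("set either an environment or a base URL")
    try:
        return ENDPOINTS[environment]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(
            f"unknown environment {environment!r}; expected one of: {known}"
        ) from None
```

Calling `resolve_endpoint("production")` returns the US endpoint, while `resolve_endpoint(None, "https://your-custom-endpoint.example.com")` returns the custom one.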
| 73 | +## Document Structure and Roles |
| 74 | + |
| 75 | +LandingAI ADE extracts structured content from documents and categorizes each chunk by type. Parxy maps these types to WAI-ARIA document structure roles for semantic understanding. |
| 76 | + |
| 77 | +### Chunk Type Mappings |
| 78 | + |
| 79 | +| LandingAI Type | Parxy Role | Description | |
| 80 | +|----------------|------------|-------------| |
| 81 | +| `text` | `paragraph` | Regular text content | |
| 82 | +| `table` | `table` | Tabular data | |
| 83 | +| `figure` | `figure` | Images and diagrams | |
| 84 | +| `logo` | `figure` | Company logos (DPT-2 model) | |
| 85 | +| `card` | `figure` | ID cards, driver licenses (DPT-2 model) | |
| 86 | +| `attestation` | `figure` | Signatures, stamps, seals (DPT-2 model) | |
| 87 | +| `scan_code` | `figure` | QR codes, barcodes (DPT-2 model) | |
| 88 | +| `marginalia` | `generic` | Mixed content in margins | |
| 89 | +| `heading` | `heading` | Section headings | |
| 90 | +| `title` | `doc-title` | Document title | |
| 91 | +| `subtitle` | `doc-subtitle` | Document subtitle | |
| 92 | +| `chapter` | `doc-chapter` | Chapter markers | |
| 93 | +| `page-header` / `header` | `doc-pageheader` | Page headers | |
| 94 | +| `page-footer` / `footer` | `doc-pagefooter` | Page footers | |
| 95 | +| `page-number` | `doc-pagefooter` | Page numbers | |
| 96 | +| `footnote` / `note` | `doc-footnote` | Footnotes | |
| 97 | +| `endnote` | `doc-endnotes` | Endnotes | |
| 98 | + |
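For quick reference in your own code, the table above can be written as a plain dictionary. This is a sketch transcribed from the table, not the driver's internal mapping: the name `CHUNK_TYPE_TO_ROLE`, the helper `role_for`, the lower-casing of the input, and the `generic` fallback for unlisted types are all assumptions.

```python
# Transcribed from the "Chunk Type Mappings" table; not Parxy's own data structure.
CHUNK_TYPE_TO_ROLE = {
    "text": "paragraph",
    "table": "table",
    "figure": "figure",
    "logo": "figure",
    "card": "figure",
    "attestation": "figure",
    "scan_code": "figure",
    "marginalia": "generic",
    "heading": "heading",
    "title": "doc-title",
    "subtitle": "doc-subtitle",
    "chapter": "doc-chapter",
    "page-header": "doc-pageheader",
    "header": "doc-pageheader",
    "page-footer": "doc-pagefooter",
    "footer": "doc-pagefooter",
    "page-number": "doc-pagefooter",
    "footnote": "doc-footnote",
    "note": "doc-footnote",
    "endnote": "doc-endnotes",
}


def role_for(chunk_type, fallback="generic"):
    """Hypothetical lookup; the fallback for unlisted types is an assumption."""
    return CHUNK_TYPE_TO_ROLE.get(chunk_type.strip().lower(), fallback)
```

Note that several LandingAI types collapse into one role (for example, `logo`, `card`, `attestation`, and `scan_code` all become `figure`), so the original type is only recoverable from the chunk's source data.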
| 99 | +## Programmatic Configuration |
| 100 | + |
| 101 | +You can also configure the driver in code by passing a `LandingAIConfig` to `Parxy.driver`: |
| 102 | + |
| 103 | +```python |
| 104 | +from parxy_core.facade.parxy import Parxy |
| 105 | +from parxy_core.models.config import LandingAIConfig |
| 106 | + |
| 107 | +# Create custom configuration for EU environment |
| 108 | +config = LandingAIConfig( |
| 109 | +    api_key="your-api-key", |
| 110 | +    environment="eu", |
| 111 | +) |
| 112 | + |
| 113 | +# Get driver with custom config |
| 114 | +driver = Parxy.driver("landingai", config=config) |
| 115 | + |
| 116 | +# Parse documents |
| 117 | +doc = driver.handle("document.pdf", level="block") |
| 118 | +``` |
| 119 | + |
| 120 | +### Using US Production Environment |
| 121 | + |
| 122 | +```python |
| 123 | +from parxy_core.facade.parxy import Parxy |
| 124 | +from parxy_core.models.config import LandingAIConfig |
| 125 | + |
| 126 | +config = LandingAIConfig( |
| 127 | +    api_key="your-api-key", |
| 128 | +    environment="production",  # Use US endpoint |
| 129 | +) |
| 130 | + |
| 131 | +driver = Parxy.driver("landingai", config=config) |
| 132 | +doc = driver.handle("document.pdf") |
| 133 | +``` |
| 134 | + |
| 135 | +### Using Custom Endpoint |
| 136 | + |
| 137 | +```python |
| 138 | +from parxy_core.facade.parxy import Parxy |
| 139 | +from parxy_core.models.config import LandingAIConfig |
| 140 | + |
| 141 | +config = LandingAIConfig( |
| 142 | +    api_key="your-api-key", |
| 143 | +    environment=None,  # Disable default environment |
| 144 | +    base_url="https://your-custom-endpoint.example.com", |
| 145 | +) |
| 146 | + |
| 147 | +driver = Parxy.driver("landingai", config=config) |
| 148 | +doc = driver.handle("document.pdf") |
| 149 | +``` |
| 150 | + |
| 151 | +## Cost Estimation |
| 152 | + |
| 153 | +Parxy records the estimated cost of each parse in the document's `parsing_metadata`: |
| 154 | + |
| 155 | +```python |
| 156 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 157 | + |
| 158 | +# Access cost information |
| 159 | +metadata = doc.parsing_metadata |
| 160 | +print(f"Credit usage: {metadata.get('cost_estimation')} {metadata.get('cost_estimation_unit')}") |
| 161 | +``` |
| 162 | + |
| 163 | +## Document Metadata |
| 164 | + |
| 165 | +After parsing, the document contains additional metadata from LandingAI ADE: |
| 166 | + |
| 167 | +```python |
| 168 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 169 | + |
| 170 | +metadata = doc.parsing_metadata |
| 171 | + |
| 172 | +# ADE-specific details |
| 173 | +details = metadata.get('ade_details', {}) |
| 174 | +print(f"Processing time: {details.get('duration_ms')} ms") |
| 175 | +print(f"Job ID: {details.get('job_id')}") |
| 176 | +print(f"Page count: {details.get('page_count')}") |
| 177 | +print(f"API version: {details.get('version')}") |
| 178 | +print(f"Filename: {details.get('filename')}") |
| 179 | + |
| 180 | +# Check for any failed pages (partial content) |
| 181 | +if 'failed_pages' in details: |
| 182 | +    print(f"Failed pages: {details.get('failed_pages')}") |
| 183 | +``` |
| 184 | + |
| 185 | +## Working with Extracted Content |
| 186 | + |
| 187 | +### Accessing Blocks by Role |
| 188 | + |
| 189 | +```python |
| 190 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 191 | + |
| 192 | +for page in doc.pages: |
| 193 | +    # Get all tables |
| 194 | +    tables = [b for b in page.blocks if b.role == 'table'] |
| 195 | + |
| 196 | +    # Get all headings |
| 197 | +    headings = [b for b in page.blocks if b.role == 'heading'] |
| 198 | + |
| 199 | +    # Get document title |
| 200 | +    titles = [b for b in page.blocks if b.role == 'doc-title'] |
| 201 | + |
| 202 | +    # Get figures (images, logos, etc.) |
| 203 | +    figures = [b for b in page.blocks if b.role == 'figure'] |
| 204 | + |
| 205 | +    print(f"Page {page.number}: {len(tables)} tables, {len(headings)} headings") |
| 206 | +``` |
| 207 | + |
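When you need several roles at once, a single pass that groups blocks by role avoids filtering the same list repeatedly. The sketch below uses a stand-in `FakeBlock` class so it runs without Parxy installed; it assumes only that each block exposes a `role` attribute, as in the example above. `group_by_role` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeBlock:
    """Stand-in for a parsed block; only the attributes used here."""
    role: str
    text: str


def group_by_role(blocks):
    """Return {role: [blocks...]}, keeping each role's blocks in document order."""
    grouped = defaultdict(list)
    for block in blocks:
        grouped[block.role].append(block)
    return dict(grouped)


blocks = [
    FakeBlock("heading", "Introduction"),
    FakeBlock("table", "a | b"),
    FakeBlock("heading", "Results"),
]
grouped = group_by_role(blocks)
print({role: len(items) for role, items in grouped.items()})
```

With real output you would call `group_by_role(page.blocks)` inside the page loop and read `grouped.get('table', [])` for roles that may be absent.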
| 208 | +### Accessing Bounding Boxes |
| 209 | + |
| 210 | +LandingAI ADE provides bounding box coordinates for each extracted chunk: |
| 211 | + |
| 212 | +```python |
| 213 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 214 | + |
| 215 | +for page in doc.pages: |
| 216 | +    for block in page.blocks: |
| 217 | +        if block.bbox: |
| 218 | +            print(f"Block at ({block.bbox.x0}, {block.bbox.y0}) - ({block.bbox.x1}, {block.bbox.y1})") |
| 219 | +            print(f" Type: {block.category}") |
| 220 | +            print(f" Role: {block.role}") |
| 221 | +            print(f" Text: {block.text[:50]}...") |
| 222 | +``` |
| 223 | + |
| 224 | +### Accessing Original Chunk Data |
| 225 | + |
| 226 | +The original LandingAI chunk data is preserved in `source_data`: |
| 227 | + |
| 228 | +```python |
| 229 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 230 | + |
| 231 | +for page in doc.pages: |
| 232 | +    for block in page.blocks: |
| 233 | +        original = block.source_data or {}  # guard against blocks without source data |
| 234 | +        # Access any LandingAI-specific fields |
| 235 | +        print(f"Original type: {original.get('type')}") |
| 236 | +        print(f"Markdown: {original.get('markdown')}") |
| 237 | +``` |
| 238 | + |
| 239 | +## Troubleshooting |
| 240 | + |
| 241 | +### Authentication Errors |
| 242 | + |
| 243 | +If you see authentication errors: |
| 244 | + |
| 245 | +1. Verify your API key is correct |
| 246 | +2. Check the key has not expired |
| 247 | +3. Ensure you're using the correct environment for your account |
| 248 | + |
| 249 | +```python |
| 250 | +# Test authentication |
| 251 | +from parxy_core.facade.parxy import Parxy |
| 252 | +from parxy_core.models.config import LandingAIConfig |
| 253 | + |
| 254 | +config = LandingAIConfig(api_key="your-key", environment="eu") |
| 255 | +driver = Parxy.driver("landingai", config=config) |
| 256 | +# If no error, authentication is working |
| 257 | +``` |
| 258 | + |
| 259 | +### Rate Limiting |
| 260 | + |
| 261 | +If you encounter 429 errors (rate limiting): |
| 262 | + |
| 263 | +1. Reduce the frequency of API calls |
| 264 | +2. Implement retry logic with exponential backoff |
| 265 | +3. Contact LandingAI for higher rate limits if needed |
| 266 | + |
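Step 2 above can be sketched as a small wrapper. This is a generic illustration rather than anything Parxy provides: `retry_with_backoff` is a hypothetical helper, the delay values are arbitrary, and `is_retryable` is supplied by the caller because this guide does not specify which exception the driver raises on a 429.

```python
import random
import time


def retry_with_backoff(call, is_retryable, max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on a retryable error wait 1s, 2s, 4s, ... (capped) and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            # Give up on the final attempt or on errors that are not rate limits.
            if last_attempt or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A call might look like `retry_with_backoff(lambda: Parxy.parse("document.pdf", driver_name="landingai"), is_retryable=my_check)`, where `my_check` is your own function that recognizes a rate-limit error.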
| 267 | +### Quota Exceeded |
| 268 | + |
| 269 | +If you see 402 errors (quota exceeded): |
| 270 | + |
| 271 | +1. Check your account's remaining credits |
| 272 | +2. Purchase additional credits from LandingAI |
| 273 | + |
| 274 | +### Input Validation Errors |
| 275 | + |
| 276 | +If you see 422 errors (input validation): |
| 277 | + |
| 278 | +1. Ensure the file format is supported (PDF, images) |
| 279 | +2. Check the file is not corrupted |
| 280 | +3. Verify the file size is within limits |
| 281 | + |
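The checks above can be run locally before spending an API call. The sketch below is illustrative only: `preflight` is a hypothetical helper, and both the extension list and the 50 MB size limit are placeholder assumptions, so replace them with the formats and limits documented by LandingAI for your plan.

```python
from pathlib import Path

# Placeholder assumptions; check LandingAI's documentation for the real values.
ASSUMED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}
ASSUMED_MAX_BYTES = 50 * 1024 * 1024


def preflight(path, extensions=ASSUMED_EXTENSIONS, max_bytes=ASSUMED_MAX_BYTES):
    """Return a list of problems found; an empty list means the file looks usable."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist or is not a file"]
    problems = []
    if path.suffix.lower() not in extensions:
        problems.append(f"unexpected file extension {path.suffix!r}")
    size = path.stat().st_size
    if size == 0:
        problems.append("file is empty")
    elif size > max_bytes:
        problems.append(f"file is {size} bytes, above the {max_bytes}-byte limit")
    return problems
```

This only catches obvious problems; a file can pass these checks and still be rejected by the API, for example if its contents are corrupted.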
| 282 | +### Partial Content / Failed Pages |
| 283 | + |
| 284 | +If some pages fail to process: |
| 285 | + |
| 286 | +```python |
| 287 | +doc = Parxy.parse("document.pdf", driver_name="landingai") |
| 288 | + |
| 289 | +details = doc.parsing_metadata.get('ade_details', {}) |
| 290 | +if 'failed_pages' in details: |
| 291 | +    failed = details['failed_pages'] |
| 292 | +    print(f"Warning: Pages {failed} failed to process") |
| 293 | +``` |
| 294 | + |
| 295 | +This can happen with: |
| 296 | +- Corrupted pages |
| 297 | +- Pages with unsupported content |
| 298 | +- Processing timeouts on complex pages |
| 299 | + |
| 300 | +### Wrong Environment |
| 301 | + |
| 302 | +If API calls fail with connection errors: |
| 303 | + |
| 304 | +1. Verify the environment setting matches your account region |
| 305 | +2. Try explicitly setting the base URL |
| 306 | +3. Check network connectivity to the LandingAI API |
| 307 | + |
| 308 | +## See Also |
| 309 | + |
| 310 | +- [LandingAI ADE Documentation](https://docs.landing.ai/ade/) |
| 311 | +- [LandingAI ADE JSON Response](https://docs.landing.ai/ade/ade-json-response.md) |
| 312 | +- [Document Structure Roles](../explanation/document-roles.md) |
| 313 | +- [Getting Started Tutorial](../tutorials/getting_started.md) |