49 documentloaderconfig #191

Merged
merged 2 commits into from
Jan 13, 2025
116 changes: 67 additions & 49 deletions docs/core-concepts/document-loaders/aws-textract.md
@@ -1,74 +1,92 @@
# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.

## Installation

Install the required dependencies:

```bash
pip install boto3
```

## Prerequisites

1. An AWS account
2. AWS credentials with access to Textract service
3. AWS region where Textract is available
The AWS Textract loader uses Amazon's Textract service to extract text, forms, and tables from documents. It supports both image files and PDFs.

## Supported Formats

- Images: jpeg/jpg, png, tiff
- Documents: pdf
- pdf
- jpeg
- png
- tiff

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderAWSTextract

# Initialize the loader with AWS credentials
# Initialize with AWS credentials
loader = DocumentLoaderAWSTextract(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="your-region"
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region"
)

# Load document content
result = loader.load_content_from_file("document.pdf")
```
# Load document
pages = loader.load("path/to/your/document.pdf")

## Response Structure
# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if extracted
    tables = page.get("tables", [])
```

The loader returns a dictionary with the following structure:
### Configuration-based Usage

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
from extract_thinker import DocumentLoaderAWSTextract, TextractConfig

# Create configuration
config = TextractConfig(
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
    region_name="your_region",
    feature_types=["TABLES", "FORMS", "SIGNATURES"],  # Specify features to extract
    cache_ttl=600,  # Cache results for 10 minutes
    max_retries=3  # Number of retry attempts
)

# Initialize loader with configuration
loader = DocumentLoaderAWSTextract(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
```

## Supported Formats
## Configuration Options

The `TextractConfig` class supports the following options:

`PDF`, `JPEG`, `PNG`
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `aws_access_key_id` | str | None | AWS access key ID |
| `aws_secret_access_key` | str | None | AWS secret access key |
| `region_name` | str | None | AWS region name |
| `textract_client` | boto3.client | None | Pre-configured Textract client |
| `feature_types` | List[str] | [] | Features to extract (TABLES, FORMS, LAYOUT, SIGNATURES) |
| `max_retries` | int | 3 | Maximum number of retry attempts |
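
The `max_retries` option can be pictured as a bounded retry loop around the Textract call. The following is a minimal sketch only, not the loader's actual implementation: `call_with_retries` and `base_delay` are hypothetical names, and it assumes `max_retries` counts retries made after the first failed attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.0,
) -> T:
    """Call func, retrying up to max_retries times after the first failure.

    Hypothetical helper for illustration; not part of extract_thinker.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            # Give up once the retry budget is spent and re-raise the last error.
            if attempt >= max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * (2 ** attempt))
            attempt += 1
```

In practice a non-zero `base_delay` would be used so that repeated calls do not immediately hit the same rate limit again.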

## Features

- Text extraction with layout preservation
- Text extraction from images and PDFs
- Table detection and extraction
- Support for multiple document formats
- Automatic retries on API failures
- Form field detection
- Layout analysis
- Signature detection
- Configurable feature selection
- Automatic retry on failure
- Caching support
- Support for pre-configured clients

## Notes

- Raw text extraction is the default when no feature types are specified
- "QUERIES" feature type is not supported
- Vision mode is supported for image formats
- AWS credentials are required unless using a pre-configured client
- Rate limits and quotas apply based on your AWS account
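
The feature-type rules in the notes above can be checked before a configuration is built. This is a sketch under stated assumptions: `validate_feature_types` and `SUPPORTED_FEATURE_TYPES` are hypothetical names that do not come from extract_thinker, and the accepted values are taken from the options table (`TABLES`, `FORMS`, `LAYOUT`, `SIGNATURES`).

```python
from typing import Iterable, List

# Values listed in the configuration table; "QUERIES" is documented as unsupported.
SUPPORTED_FEATURE_TYPES = {"TABLES", "FORMS", "LAYOUT", "SIGNATURES"}


def validate_feature_types(feature_types: Iterable[str]) -> List[str]:
    """Return upper-cased feature types, rejecting anything unsupported.

    Hypothetical helper for illustration; an empty result corresponds to
    the documented default of raw text extraction.
    """
    normalized = [feature.upper() for feature in feature_types]
    unsupported = [f for f in normalized if f not in SUPPORTED_FEATURE_TYPES]
    if unsupported:
        raise ValueError(f"Unsupported feature types: {unsupported}")
    return normalized
```

Validating up front turns a rejected API request into an immediate, local error.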
85 changes: 61 additions & 24 deletions docs/core-concepts/document-loaders/azure-form.md
@@ -1,34 +1,23 @@
# Azure Document Intelligence Document Loader
# Azure Document Intelligence Loader

The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.

## Installation

Install the required dependencies:

```bash
pip install azure-ai-formrecognizer
```

## Prerequisites

1. An Azure subscription
2. A Document Intelligence resource created in your Azure portal
3. The endpoint URL and subscription key from your Azure resource
The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.

## Supported Formats

Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderAzureForm

# Initialize the loader
# Initialize with Azure credentials
loader = DocumentLoaderAzureForm(
    subscription_key="your-subscription-key",
    endpoint="your-endpoint-url"
    endpoint="your_endpoint",
    key="your_api_key",
    model="prebuilt-document"  # Use prebuilt document model
)

# Load document
@@ -38,14 +27,62 @@ pages = loader.load("path/to/your/document.pdf")
for page in pages:
    # Access text content
    text = page["content"]

    # Access tables (if any)
    tables = page["tables"]
    # Access tables if available
    tables = page.get("tables", [])
```

### Configuration-based Usage

```python
from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Create configuration
config = AzureConfig(
    endpoint="your_endpoint",
    key="your_api_key",
    model="prebuilt-read",  # Read model; use "prebuilt-layout" for enhanced layout analysis
    language="en",  # Specify document language
    pages=[1, 2, 3],  # Process specific pages
    cache_ttl=600  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderAzureForm(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
```

## Configuration Options

The `AzureConfig` class supports the following options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `endpoint` | str | None | Azure endpoint URL |
| `key` | str | None | Azure API key |
| `model` | str | "prebuilt-document" | Model ID to use |
| `language` | str | None | Document language code |
| `pages` | List[int] | None | Specific pages to process |
| `reading_order` | str | "natural" | Text reading order |
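
The `pages` option takes a list of integers, whereas the Azure SDK's analyze call accepts page selections as a string of numbers and ranges such as `"1-3,5"`. How the loader converts between the two is not shown here. The sketch below illustrates one possible conversion with a hypothetical helper, `pages_to_range_string`, which is an assumption and not part of extract_thinker.

```python
from typing import Iterable, List, Tuple


def pages_to_range_string(pages: Iterable[int]) -> str:
    """Collapse page numbers into a compact range string, e.g. [1, 2, 3, 5] -> "1-3,5".

    Hypothetical helper for illustration only.
    """
    unique = sorted(set(pages))
    if not unique:
        return ""

    ranges: List[Tuple[int, int]] = []
    start = prev = unique[0]
    for page in unique[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))

    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```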

## Features

- Text extraction with layout preservation
- Table detection and extraction
- Support for multiple document formats
- Automatic table content deduplication from text
- Form field recognition
- Multiple model support (document, layout, read)
- Language specification
- Page selection
- Reading order control
- Caching support
- Support for pre-configured clients

## Notes

- Available models: "prebuilt-document", "prebuilt-layout", "prebuilt-read"
- Vision mode is supported for image formats
- Azure credentials are required
- Rate limits and quotas apply based on your Azure subscription
65 changes: 51 additions & 14 deletions docs/core-concepts/document-loaders/doc2txt.md
@@ -1,14 +1,6 @@
# Microsoft Word Document Loader (Doc2txt)
# Doc2txt Document Loader

The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.

## Installation

Install the required dependencies:

```bash
pip install docx2txt
```
The Doc2txt loader extracts text from Microsoft Word documents. It supports both legacy (.doc) and modern (.docx) file formats.

## Supported Formats

@@ -17,10 +9,12 @@ pip install docx2txt

## Usage

### Basic Usage

```python
from extract_thinker import DocumentLoaderDoc2txt

# Initialize the loader
# Initialize with default settings
loader = DocumentLoaderDoc2txt()

# Load document
@@ -32,9 +26,52 @@ for page in pages:
    text = page["content"]
```

### Configuration-based Usage

```python
from extract_thinker import DocumentLoaderDoc2txt, Doc2txtConfig

# Create configuration
config = Doc2txtConfig(
    page_separator="\n\n---\n\n",  # Custom page separator
    preserve_whitespace=True,  # Preserve original whitespace
    extract_images=True,  # Extract embedded images
    cache_ttl=600  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderDoc2txt(config)

# Load and process document
pages = loader.load("path/to/your/document.docx")
```

## Configuration Options

The `Doc2txtConfig` class supports the following options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `content` | Any | None | Initial content to process |
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
| `page_separator` | str | "\n\n" | Text to use as page separator |
| `preserve_whitespace` | bool | False | Whether to preserve whitespace |
| `extract_images` | bool | False | Whether to extract embedded images |
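
The `cache_ttl` option shared by these configs describes a time-to-live in seconds. The following is a minimal sketch of that idea, not the loaders' actual cache: `TTLCache` and its injectable `clock` parameter are hypothetical, and the sketch assumes an entry stops being served once `ttl_seconds` have elapsed since it was stored.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds.

    Hypothetical class for illustration; not part of extract_thinker.
    """

    def __init__(
        self,
        ttl_seconds: float = 300,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry timestamp, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value
```

With `cache_ttl=600`, loading the same file twice within ten minutes would be served from such a cache, while a later call would process the file again.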

## Features

- Text extraction from Word documents
- Support for both .doc and .docx formats
- Automatic page detection
- Preserves basic text formatting
- Support for both .doc and .docx
- Custom page separation
- Whitespace preservation
- Image extraction (optional)
- Caching support
- No cloud service required

## Notes

- Vision mode is not supported
- Image extraction requires additional memory
- Local processing with no external service calls
- May not preserve complex formatting
- Handles both legacy and modern Word formats