Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: document Docling JSON parsing #819

Merged
merged 2 commits into from
Jan 28, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 12 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,22 +22,21 @@
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)

Docling parses documents and exports them to the desired format with ease and speed.
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI

Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!

### Coming soon

* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language

## Installation
Expand Down Expand Up @@ -120,3 +119,7 @@ For individual model usage, please refer to the model licenses found in the orig
## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.

[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
[integrations]: https://ds4sd.github.io/docling/integrations/
19 changes: 12 additions & 7 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,20 +14,21 @@
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)

Docling parses documents and exports them to the desired format with ease and speed.
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI

### Coming soon

* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language

## Get started
Expand All @@ -42,3 +43,7 @@ Docling parses documents and exports them to the desired format with ease and sp
## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.

[supported_formats]: ./supported_formats.md
[docling_document]: ./concepts/docling_document.md
[integrations]: ./integrations/index.md
34 changes: 34 additions & 0 deletions docs/supported_formats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
Docling can parse various documents formats into a unified representation (Docling
Document), which it can export to different formats too — check out
[Architecture](./concepts/architecture.md) for more details.

Below you can find a listing of all supported input and output formats.

## Supported input formats

| Format | Description |
|--------|-------------|
| PDF | |
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
| Markdown | |
| AsciiDoc | |
| HTML, XHTML | |
| PNG, JPEG, TIFF, BMP | Image formats |

Schema-specific support:

| Format | Description |
|--------|-------------|
| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
| PMC XML | XML format followed by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |

## Supported output formats

| Format | Description |
|--------|-------------|
| HTML | Both image embedding and referencing are supported |
| Markdown | |
| JSON | Lossless serialization of Docling Document |
| Text | Plain text, i.e. without Markdown markers |
| Doctags | |
27 changes: 10 additions & 17 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,20 +24,6 @@ docling https://arxiv.org/pdf/2206.01062

To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).

### Supported formats

The document conversion in Docling supports several popular formats, including:

- **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats suppored by Microsof Office.
- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
- **AsciiDoc**: a plain text markup language for writing technical content.
- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Due to its flexibility, Docling requires custom implementations to identify the
semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.


### Advanced options

#### Adjust pipeline features
Expand Down Expand Up @@ -142,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl

#### Use specific backend converters

By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
!!! note

This section discusses directly invoking a [backend](./concepts/architecture.md),
i.e. using a low-level API. This should only be done when necessary. For most cases,
using a `DocumentConverter` (high-level API) as discussed in the sections above
should suffice — and is the recommended way.

By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:

Expand All @@ -162,8 +155,8 @@ in_doc = InputDocument(
filename="duck.html",
)
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
result = backend.convert()
print(result.export_to_markdown())
dl_doc = backend.convert()
print(dl_doc.export_to_markdown())
```

## Chunking
Expand Down
3 changes: 2 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ nav:
- "Docling": index.md
- Installation: installation.md
- Usage: usage.md
- Supported formats: supported_formats.md
- FAQ: faq.md
- Docling v2: v2.md
- Concepts:
Expand All @@ -77,7 +78,7 @@ nav:
- "Force full page OCR": examples/full_page_ocr.py
- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
- "Accelerator options": examples/run_with_accelerator.py
- "Simple translation": examples/translate.py
- "Simple translation": examples/translate.py
- examples/backend_xml_rag.ipynb
- ✂️ Chunking:
- examples/hybrid_chunking.ipynb
Expand Down
Loading