Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: document Docling JSON parsing #819

Merged
merged 2 commits into from
Jan 28, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 11 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,11 +26,13 @@ Docling parses documents and exports them to the desired format with ease and sp

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
* 📑 Advanced PDF understanding including page layout, reading order & table structure
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI

Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
Expand Down Expand Up @@ -120,3 +122,7 @@ For individual model usage, please refer to the model licenses found in the orig
## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.

[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
[integrations]: https://ds4sd.github.io/docling/integrations/
16 changes: 11 additions & 5 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,13 @@ Docling parses documents and exports them to the desired format with ease and sp

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
* 📑 Advanced PDF understanding including page layout, reading order & table structure
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs and images
* 💻 Simple and convenient CLI

### Coming soon
Expand All @@ -42,3 +44,7 @@ Docling parses documents and exports them to the desired format with ease and sp
## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.

[supported_formats]: ./supported_formats.md
[docling_document]: ./concepts/docling_document.md
[integrations]: ./integrations/index.md
33 changes: 33 additions & 0 deletions docs/supported_formats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
Docling can parse various documents formats into a unified representation (Docling
document), which it can export to different formats too — check out
[Architecture](./concepts/architecture.md) for more details.

Below you can find a listing of all supported input and output formats.

## Supported input formats

| Format | Description |
|--------|-------------|
| PDF | |
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
| Markdown | |
| AsciiDoc | |
| HTML, XHTML | |
| PNG, JPEG, TIFF, BMP | Image formats |

Schema-specific support:

| Format | Description |
|--------|-------------|
| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
| PMC XML | XML format followed by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |

## Supported output formats

| Format | Description |
|--------|-------------|
| HTML | Docling supports both image embedding and referencing |
| Markdown | |
| JSON | Lossless serialization of Docling Document |
| Doctags | |
14 changes: 0 additions & 14 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,20 +24,6 @@ docling https://arxiv.org/pdf/2206.01062

To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).

### Supported formats

The document conversion in Docling supports several popular formats, including:

- **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats suppored by Microsof Office.
- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
- **AsciiDoc**: a plain text markup language for writing technical content.
- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Due to its flexibility, Docling requires custom implementations to identify the
semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.


### Advanced options

#### Adjust pipeline features
Expand Down
3 changes: 2 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ nav:
- "Docling": index.md
- Installation: installation.md
- Usage: usage.md
- Supported formats: supported_formats.md
- FAQ: faq.md
- Docling v2: v2.md
- Concepts:
Expand All @@ -77,7 +78,7 @@ nav:
- "Force full page OCR": examples/full_page_ocr.py
- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
- "Accelerator options": examples/run_with_accelerator.py
- "Simple translation": examples/translate.py
- "Simple translation": examples/translate.py
- examples/backend_xml_rag.ipynb
- ✂️ Chunking:
- examples/hybrid_chunking.ipynb
Expand Down
Loading