docs: document Docling JSON parsing (#819)

* docs: document Docling JSON parsing

  Also:
  - factored out and expanded supported formats
  - reorged feature list

  Signed-off-by: Panos Vagenas <[email protected]>

* update feature list, minor fixes

  Signed-off-by: Panos Vagenas <[email protected]>

---------

Signed-off-by: Panos Vagenas <[email protected]>
DS4SD · Jan 28, 2025 · 6875913
1 parent 5139b48
commit 6875913

Showing 5 changed files with 70 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -22,22 +22,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding including page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Installation
@@ -120,3 +119,7 @@ For individual model usage, please refer to the model licenses found in the orig
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
+[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
+[integrations]: https://ds4sd.github.io/docling/integrations/
diff --git a/docs/index.md b/docs/index.md
@@ -14,20 +14,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Get started
@@ -42,3 +43,7 @@ Docling parses documents and exports them to the desired format with ease and sp
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: ./supported_formats.md
+[docling_document]: ./concepts/docling_document.md
+[integrations]: ./integrations/index.md
diff --git a/docs/supported_formats.md b/docs/supported_formats.md
@@ -0,0 +1,34 @@
+Docling can parse various documents formats into a unified representation (Docling
+Document), which it can export to different formats too — check out
+[Architecture](./concepts/architecture.md) for more details.
+
+Below you can find a listing of all supported input and output formats.
+
+## Supported input formats
+
+| Format | Description |
+|--------|-------------|
+| PDF | |
+| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
+| Markdown | |
+| AsciiDoc | |
+| HTML, XHTML | |
+| PNG, JPEG, TIFF, BMP | Image formats |
+
+Schema-specific support:
+
+| Format | Description |
+|--------|-------------|
+| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
+| PMC XML | XML format followed by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
+| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
+
+## Supported output formats
+
+| Format | Description |
+|--------|-------------|
+| HTML | Both image embedding and referencing are supported |
+| Markdown | |
+| JSON | Lossless serialization of Docling Document |
+| Text | Plain text, i.e. without Markdown markers |
+| Doctags | |
diff --git a/docs/usage.md b/docs/usage.md
@@ -24,20 +24,6 @@ docling https://arxiv.org/pdf/2206.01062
 
 To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
-### Supported formats
-
-The document conversion in Docling supports several popular formats, including:
-
-- **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
-- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats suppored by Microsof Office.
-- **Markdown**:  a lightweight markup language to add formatting elements to plain text documents.
-- **AsciiDoc**: a plain text markup language for writing technical content.
-- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
-- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
-- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Due to its flexibility, Docling requires custom implementations to identify the
-semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
-
-
 ### Advanced options
 
 #### Adjust pipeline features
@@ -142,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 #### Use specific backend converters
 
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+!!! note
+
+    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice — and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 
@@ -162,8 +155,8 @@ in_doc = InputDocument(
     filename="duck.html",
 )
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-result = backend.convert()
-print(result.export_to_markdown())
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
 ```
 
 ## Chunking

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -56,6 +56,7 @@ nav:
     - "Docling": index.md
     - Installation: installation.md
     - Usage: usage.md
+    - Supported formats: supported_formats.md
     - FAQ: faq.md
     - Docling v2: v2.md
   - Concepts:
@@ -77,7 +78,7 @@ nav:
       - "Force full page OCR": examples/full_page_ocr.py
       - "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
       - "Accelerator options": examples/run_with_accelerator.py
-      - "Simple translation": examples/translate.py   
+      - "Simple translation": examples/translate.py
       - examples/backend_xml_rag.ipynb
     - ✂️ Chunking:
       - examples/hybrid_chunking.ipynb