update feature list, minor fixes

Signed-off-by: Panos Vagenas <[email protected]>
DS4SD · Jan 28, 2025 · e7930b5
1 parent 68272b9

Showing 4 changed files with 21 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -22,24 +22,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Installation

diff --git a/docs/index.md b/docs/index.md
@@ -14,22 +14,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Parsing of [multiple documents formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, & more
-* 📑 Advanced PDF understanding including page layout, reading order & table structure
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs and images
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Get started

diff --git a/docs/supported_formats.md b/docs/supported_formats.md
@@ -1,5 +1,5 @@
 Docling can parse various documents formats into a unified representation (Docling
-document), which it can export to different formats too — check out
+Document), which it can export to different formats too — check out
 [Architecture](./concepts/architecture.md) for more details.
 
 Below you can find a listing of all supported input and output formats.
@@ -27,7 +27,8 @@ Schema-specific support:
 
 | Format | Description |
 |--------|-------------|
-| HTML | Docling supports both image embedding and referencing |
+| HTML | Both image embedding and referencing are supported |
 | Markdown | |
 | JSON | Lossless serialization of Docling Document |
+| Text | Plain text, i.e. without Markdown markers |
 | Doctags | |
diff --git a/docs/usage.md b/docs/usage.md
@@ -128,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 #### Use specific backend converters
 
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+!!! note
+
+    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice — and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 
@@ -148,8 +155,8 @@ in_doc = InputDocument(
     filename="duck.html",
 )
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-result = backend.convert()
-print(result.export_to_markdown())
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
 ```
 
 ## Chunking