diff --git a/README.md b/README.md
index 78acb592..8050365f 100644
--- a/README.md
+++ b/README.md
@@ -22,22 +22,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding including page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Installation
@@ -120,3 +119,7 @@ For individual model usage, please refer to the model licenses found in the orig
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
+[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
+[integrations]: https://ds4sd.github.io/docling/integrations/
diff --git a/docs/index.md b/docs/index.md
index c88ee7c6..f44e6dba 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,20 +14,21 @@
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
 
 ## Get started
@@ -42,3 +43,7 @@ Docling parses documents and exports them to the desired format with ease and sp
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: ./supported_formats.md
+[docling_document]: ./concepts/docling_document.md
+[integrations]: ./integrations/index.md
diff --git a/docs/supported_formats.md b/docs/supported_formats.md
new file mode 100644
index 00000000..e217bb19
--- /dev/null
+++ b/docs/supported_formats.md
@@ -0,0 +1,34 @@
+Docling can parse various documents formats into a unified representation (Docling
+Document), which it can export to different formats too — check out
+[Architecture](./concepts/architecture.md) for more details.
+
+Below you can find a listing of all supported input and output formats.
+
+## Supported input formats
+
+| Format | Description |
+|--------|-------------|
+| PDF | |
+| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
+| Markdown | |
+| AsciiDoc | |
+| HTML, XHTML | |
+| PNG, JPEG, TIFF, BMP | Image formats |
+
+Schema-specific support:
+
+| Format | Description |
+|--------|-------------|
+| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
+| PMC XML | XML format followed by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
+| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
+
+## Supported output formats
+
+| Format | Description |
+|--------|-------------|
+| HTML | Both image embedding and referencing are supported |
+| Markdown | |
+| JSON | Lossless serialization of Docling Document |
+| Text | Plain text, i.e. without Markdown markers |
+| Doctags | |
diff --git a/docs/usage.md b/docs/usage.md
index 824f0f22..a577a3e3 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -24,20 +24,6 @@ docling https://arxiv.org/pdf/2206.01062
 
 To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
-### Supported formats
-
-The document conversion in Docling supports several popular formats, including:
-
-- **PDF** (Portable Document Format): the format developed by Adobe to present documents compatible across application software, hardware, and operating systems.
-- **.docx**, **.xlsx**, **.pptx** (Word, Excel, and PowerPoint): the Open XML formats suppored by Microsof Office.
-- **Markdown**: a lightweight markup language to add formatting elements to plain text documents.
-- **AsciiDoc**: a plain text markup language for writing technical content.
-- **HTML** (Hypertext Markup Language): the standard markup language for creating web pages.
-- **XHTML** (Extensible Hypertext Markup Language): the XML-based version of HTML.
-- **XML** (Extensible Markup Language): a markup format for storing and transmitting data. Due to its flexibility, Docling requires custom implementations to identify the
-semantics of the data. Currently, Docling supports the parsing of [USPTO](https://www.uspto.gov/patents) patents and [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/) articles.
-
-
 ### Advanced options
 
 #### Adjust pipeline features
@@ -142,7 +128,14 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 #### Use specific backend converters
 
-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](#supported-formats)).
+!!! note
+
+    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    i.e. using a low-level API. This should only be done when necessary. For most cases,
+    using a `DocumentConverter` (high-level API) as discussed in the sections above
+    should suffice — and is the recommended way.
+
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
 You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
 
@@ -162,8 +155,8 @@ in_doc = InputDocument(
     filename="duck.html",
 )
 backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
-result = backend.convert()
-print(result.export_to_markdown())
+dl_doc = backend.convert()
+print(dl_doc.export_to_markdown())
 ```
 
 ## Chunking
diff --git a/mkdocs.yml b/mkdocs.yml
index bbff382e..0fcc2ca4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -56,6 +56,7 @@ nav:
 - "Docling": index.md
 - Installation: installation.md
 - Usage: usage.md
+- Supported formats: supported_formats.md
 - FAQ: faq.md
 - Docling v2: v2.md
 - Concepts:
@@ -77,7 +78,7 @@ nav:
 - "Force full page OCR": examples/full_page_ocr.py
 - "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
 - "Accelerator options": examples/run_with_accelerator.py
-- "Simple translation": examples/translate.py
+- "Simple translation": examples/translate.py
 - examples/backend_xml_rag.ipynb
 - ✂️ Chunking:
 - examples/hybrid_chunking.ipynb