From 4fa8028bd8120d7557e1d45ba31e200e130af698 Mon Sep 17 00:00:00 2001 From: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Date: Thu, 9 Jan 2025 14:12:05 +0100 Subject: [PATCH] docs: add LangChain docs (#717) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --- README.md | 3 +- docs/examples/hybrid_chunking.ipynb | 2 +- docs/examples/rag_langchain.ipynb | 350 ++++++++++++++++------------ docs/index.md | 3 +- docs/integrations/haystack.md | 4 +- docs/integrations/langchain.md | 9 + mkdocs.yml | 2 +- 7 files changed, 222 insertions(+), 151 deletions(-) create mode 100644 docs/integrations/langchain.md diff --git a/README.md b/README.md index 98222f83..78acb592 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Docling parses documents and exports them to the desired format with ease and sp * πŸ—‚οΈ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images) * πŸ“‘ Advanced PDF document understanding including page layout, reading order & table structures * 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format -* πŸ€– Easy integration with πŸ¦™ LlamaIndex & πŸ¦œπŸ”— LangChain for powerful RAG / QA applications +* πŸ€– Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI * πŸ” OCR support for scanned PDFs * πŸ’» Simple and convenient CLI @@ -39,7 +39,6 @@ Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty * ♾️ Equation & code extraction * πŸ“ Metadata extraction, including title, authors, references & language -* πŸ¦œπŸ”— Native LangChain extension ## Installation diff --git a/docs/examples/hybrid_chunking.ipynb b/docs/examples/hybrid_chunking.ipynb index 6e3b930c..2b7861aa 100644 --- a/docs/examples/hybrid_chunking.ipynb +++ b/docs/examples/hybrid_chunking.ipynb @@ -44,7 +44,7 @@ } ], "source": [ - "%pip install -qU 'docling-core[chunking]' sentence-transformers transformers" + "%pip install -qU docling transformers" ] }, { diff --git a/docs/examples/rag_langchain.ipynb b/docs/examples/rag_langchain.ipynb index 68ad884f..ef8374aa 100644 --- a/docs/examples/rag_langchain.ipynb +++ b/docs/examples/rag_langchain.ipynb @@ -1,5 +1,12 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -15,7 +22,29 @@ "| --- | --- | --- |\n", "| Embedding | Hugging Face / Sentence Transformers | πŸ’» Local |\n", "| Vector store | Milvus | πŸ’» Local |\n", - "| Gen AI | Hugging Face Inference API | 🌐 Remote |" + "| Gen AI | Hugging Face Inference API | 🌐 Remote | " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This example leverages the\n", + "[LangChain Docling integration](../../integrations/langchain/), along with a Milvus\n", + "vector store, as well as sentence-transformers embeddings.\n", + "\n", + "The presented `DoclingLoader` component enables you to:\n", + "- use various document types in your LLM applications with ease and speed, and\n", + "- leverage Docling's rich format for advanced, document-native grounding.\n", + "\n", + "`DoclingLoader` supports two different export modes:\n", + "- `ExportType.MARKDOWN`: if you want to capture each input document as a separate\n", + " LangChain document, or\n", + "- `ExportType.DOC_CHUNKS` (default): if you want to 
have each input document chunked and\n", + " to then capture each individual chunk as a separate LangChain document downstream.\n", + "\n", + "The example allows exploring both modes via parameter `EXPORT_TYPE`; depending on the\n", + "value set, the example pipeline is then set up accordingly." ] }, { @@ -25,6 +54,15 @@ "## Setup" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- πŸ‘‰ For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n", + "- Notebook uses HuggingFace's Inference API; for increased LLM quota, token can be provided via env var `HF_TOKEN`.\n", + "- Requirements can be installed as shown below (`--no-warn-conflicts` meant for Colab's pre-populated Python env; feel free to remove for stricter usage):" + ] + }, { "cell_type": "code", "execution_count": 1, @@ -39,162 +77,189 @@ } ], "source": [ - "# requirements for this example:\n", - "%pip install -qq docling docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus" + "%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import os\n", + "from pathlib import Path\n", + "from tempfile import mkdtemp\n", "\n", "from dotenv import load_dotenv\n", + "from langchain_core.prompts import PromptTemplate\n", + "from langchain_docling.loader import ExportType\n", "\n", - "load_dotenv()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Loader and splitter" + "\n", + "def _get_env_from_colab_or_os(key):\n", + " try:\n", + " from google.colab import userdata\n", + "\n", + " try:\n", + " return userdata.get(key)\n", + " except userdata.SecretNotFoundError:\n", + " pass\n", + " except ImportError:\n", + " pass\n", + " return os.getenv(key)\n", + "\n", + "\n", + "load_dotenv()\n", + "\n", + "# https://github.com/huggingface/transformers/issues/5486:\n", + "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", + "\n", + "HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\n", + "FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\n", + "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", + "GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n", + "EXPORT_TYPE = ExportType.DOC_CHUNKS\n", + "QUESTION = \"Which are the main AI models in Docling?\"\n", + "PROMPT = PromptTemplate.from_template(\n", + " \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n", + ")\n", + "TOP_K = 3\n", + "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Below we set up:\n", - "- a `Loader` which will be used to create LangChain documents, and\n", - "- a splitter, which will be used to split these documents" + "## Document loading\n", + "\n", + "Now we can instantiate our loader and load documents." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n" + ] + } + ], "source": [ - "from typing import Iterator\n", - "\n", - "from langchain_core.document_loaders import BaseLoader\n", - "from langchain_core.documents import Document as LCDocument\n", - "\n", - "from docling.document_converter import DocumentConverter\n", - "\n", + "from langchain_docling import DoclingLoader\n", "\n", - "class DoclingPDFLoader(BaseLoader):\n", + "from docling.chunking import HybridChunker\n", "\n", - " def __init__(self, file_path: str | list[str]) -> None:\n", - " self._file_paths = file_path if isinstance(file_path, list) else [file_path]\n", - " self._converter = DocumentConverter()\n", + "loader = DoclingLoader(\n", + " file_path=FILE_PATH,\n", + " export_type=EXPORT_TYPE,\n", + " chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n", + ")\n", "\n", - " def lazy_load(self) -> Iterator[LCDocument]:\n", - " for source in self._file_paths:\n", - " dl_doc = self._converter.convert(source).document\n", - " text = dl_doc.export_to_markdown()\n", - " yield LCDocument(page_content=text)" + "docs = loader.load()" ] }, { - "cell_type": "code", - "execution_count": 4, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "FILE_PATH = \"https://raw.githubusercontent.com/DS4SD/docling/main/tests/data/2206.01062.pdf\" # DocLayNet paper" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", - "\n", - "loader = DoclingPDFLoader(file_path=FILE_PATH)\n", - "text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=1000,\n", - " chunk_overlap=200,\n", - ")" + "> Note: a message saying `\"Token indices sequence length is longer than the specified\n", + "maximum sequence length...\"` can be ignored in this case β€” details\n", + "[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We now used the above-defined objects to get the document splits:" + "Determining the splits:" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "docs = loader.load()\n", - "splits = text_splitter.split_documents(docs)" + "if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n", + " splits = docs\n", + "elif EXPORT_TYPE == ExportType.MARKDOWN:\n", + " from langchain_text_splitters import MarkdownHeaderTextSplitter\n", + "\n", + " splitter = MarkdownHeaderTextSplitter(\n", + " headers_to_split_on=[\n", + " (\"#\", \"Header_1\"),\n", + " (\"##\", \"Header_2\"),\n", + " (\"###\", \"Header_3\"),\n", + " ],\n", + " )\n", + " splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\n", + "else:\n", + " raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Embeddings" + "Inspecting some sample splits:" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n", + "- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research RΒ¨uschlikon, Switzerland'\n", + "- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. 
The code interface allows for easy extensibility and addition of new features and models.'\n", + "...\n" + ] + } + ], "source": [ - "from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n", - "\n", - "HF_EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\n", - "embeddings = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)" + "for d in splits[:3]:\n", + " print(f\"- {d.page_content=}\")\n", + "print(\"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Vector store" + "## Ingestion" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ - "from tempfile import TemporaryDirectory\n", + "import json\n", + "from pathlib import Path\n", + "from tempfile import mkdtemp\n", "\n", + "from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n", "from langchain_milvus import Milvus\n", "\n", - "MILVUS_URI = os.environ.get(\n", - " \"MILVUS_URI\", f\"{(tmp_dir := TemporaryDirectory()).name}/milvus_demo.db\"\n", - ")\n", + "embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n", + "\n", "\n", + "milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\n", "vectorstore = Milvus.from_documents(\n", - " splits,\n", - " embeddings,\n", - " connection_args={\"uri\": MILVUS_URI},\n", + " documents=splits,\n", + " embedding=embedding,\n", + " collection_name=\"docling_demo\",\n", + " connection_args={\"uri\": milvus_uri},\n", + " index_params={\"index_type\": \"FLAT\"},\n", " drop_old=True,\n", ")" ] @@ -203,95 +268,94 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### LLM" + "## RAG" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 7, "metadata": {}, "outputs": [ { - "name": "stdout", + "name": "stderr", "output_type": "stream", "text": [ - "The token has not been saved to the git credentials helper. 
Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.\n", - "Token is valid (permission: write).\n", - "Your token has been saved to /Users/pva/.cache/huggingface/token\n", - "Login successful\n" + "Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n" ] } ], "source": [ + "from langchain.chains import create_retrieval_chain\n", + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", "from langchain_huggingface import HuggingFaceEndpoint\n", "\n", - "HF_API_KEY = os.environ.get(\"HF_API_KEY\")\n", - "HF_LLM_MODEL_ID = \"mistralai/Mistral-7B-Instruct-v0.3\"\n", - "\n", + "retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\n", "llm = HuggingFaceEndpoint(\n", - " repo_id=HF_LLM_MODEL_ID,\n", - " huggingfacehub_api_token=HF_API_KEY,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## RAG" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Iterable\n", - "\n", - "from langchain_core.documents import Document as LCDocument\n", - "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import PromptTemplate\n", - "from langchain_core.runnables import RunnablePassthrough\n", - "\n", - "\n", - "def format_docs(docs: Iterable[LCDocument]):\n", - " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", - "\n", - "\n", - "retriever = vectorstore.as_retriever()\n", - "\n", - "prompt = PromptTemplate.from_template(\n", - " \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {question}\\nAnswer:\\n\"\n", + " repo_id=GEN_MODEL_ID,\n", + " huggingfacehub_api_token=HF_TOKEN,\n", ")\n", "\n", - "rag_chain = (\n", - " {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", - " | prompt\n", - " | llm\n", - " | StrOutputParser()\n", - ")" + "\n", + "def clip_text(text, threshold=100):\n", + " return f\"{text[:threshold]}...\" if len(text) > threshold else text" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 8, "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "'- 80,863 pages were human annotated for DocLayNet.'" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Question:\n", + "Which are the main AI models in Docling?\n", + "\n", + "Answer:\n", + "Docling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...\n", + "\n", + "Source 1:\n", + " text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. 
The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n", + " dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n", + " source: https://arxiv.org/pdf/2408.09869\n", + "\n", + "Source 2:\n", + " text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n", + " dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n", + " source: https://arxiv.org/pdf/2408.09869\n", + "\n", + "Source 3:\n", + " text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. 
This will help improve the quality of conversion for specific types of ...\"\n", + " dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n", + " source: https://arxiv.org/pdf/2408.09869\n" + ] } ], "source": [ - "rag_chain.invoke(\"How many pages were human annotated for DocLayNet?\")" + "question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\n", + "rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n", + "resp_dict = rag_chain.invoke({\"input\": QUESTION})\n", + "\n", + "clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\n", + "print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n", + "for i, doc in enumerate(resp_dict[\"context\"]):\n", + " print()\n", + " print(f\"Source {i+1}:\")\n", + " print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n", + " for key in doc.metadata:\n", + " if key != \"pk\":\n", + " val = doc.metadata.get(key)\n", + " clipped_val = clip_text(val) if isinstance(val, str) else val\n", + " print(f\" {key}: {clipped_val}\")" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -310,7 +374,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.7" + "version": "3.12.8" } }, "nbformat": 4, diff --git a/docs/index.md b/docs/index.md index cb76c62f..c88ee7c6 100644 --- a/docs/index.md +++ b/docs/index.md @@ -21,7 +21,7 @@ Docling parses documents and exports them to the desired format with ease and sp * πŸ—‚οΈ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images) * πŸ“‘ Advanced PDF document understanding incl. page layout, reading order & table structures * 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format -* πŸ€– Easy integration with πŸ¦™ LlamaIndex & πŸ¦œπŸ”— LangChain for powerful RAG / QA applications +* πŸ€– Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. 
LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * πŸ” OCR support for scanned PDFs
 * πŸ’» Simple and convenient CLI
 
@@ -29,7 +29,6 @@ Docling parses documents and exports them to the desired format with ease and sp
 
 * ♾️ Equation & code extraction
 * πŸ“ Metadata extraction, including title, authors, references & language
-* πŸ¦œπŸ”— Native LangChain extension
 
 ## Get started
 
diff --git a/docs/integrations/haystack.md b/docs/integrations/haystack.md
index 08d051f0..77507e21 100644
--- a/docs/integrations/haystack.md
+++ b/docs/integrations/haystack.md
@@ -1,6 +1,6 @@
 Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
 
-- πŸ“– [Docling Haystack integration docs](https://haystack.deepset.ai/integrations/docling)
+- πŸ“– [Docling Haystack integration docs][docs]
 - πŸ’» [Docling Haystack integration GitHub][github]
 - πŸ§‘πŸ½β€πŸ³ [Docling Haystack integration example][example]
 - πŸ“¦ [Docling Haystack integration PyPI][pypi]
@@ -8,4 +8,4 @@ Docling is available as a converter in [Haystack](https://haystack.deepset.ai/):
 [github]: https://github.com/DS4SD/docling-haystack
 [docs]: https://haystack.deepset.ai/integrations/docling
 [pypi]: https://pypi.org/project/docling-haystack
-[example]: https://ds4sd.github.io/docling/examples/rag_haystack/
+[example]: ../examples/rag_haystack.ipynb
diff --git a/docs/integrations/langchain.md b/docs/integrations/langchain.md
new file mode 100644
index 00000000..baa2695e
--- /dev/null
+++ b/docs/integrations/langchain.md
@@ -0,0 +1,9 @@
+Docling is available as a [LangChain](https://www.langchain.com/) document loader:
+
+- πŸ’» [LangChain Docling integration GitHub][github]
+- πŸ§‘πŸ½β€πŸ³ [LangChain Docling integration example][example]
+- πŸ“¦ [LangChain Docling integration PyPI][pypi]
+
+[github]: https://github.com/DS4SD/docling-langchain
+[example]: ../examples/rag_langchain.ipynb
+[pypi]: https://pypi.org/project/langchain-docling/
diff --git a/mkdocs.yml b/mkdocs.yml
index 01de6e7d..8d9f6591 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -91,7 +91,7 @@ nav:
       - "Bee Agent Framework": integrations/bee.md
       - "Crew AI": integrations/crewai.md
       - "Haystack": integrations/haystack.md
-      # - "LangChain": integrations/langchain.md
+      - "LangChain": integrations/langchain.md
       - "LlamaIndex": integrations/llamaindex.md
       - "txtai": integrations/txtai.md
     - ⭐️ Featured:
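
For quick reference, the loader documented by this patch can also be exercised on its own. The following is a minimal sketch condensed from the `docs/examples/rag_langchain.ipynb` example above; the file path, export type, chunker, and embedding model ID are simply the values the notebook uses, and the packages from its `%pip install` cell are assumed to be available:

```python
# Minimal sketch condensed from docs/examples/rag_langchain.ipynb (added in this patch).
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

from docling.chunking import HybridChunker

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

loader = DoclingLoader(
    file_path=["https://arxiv.org/pdf/2408.09869"],  # Docling Technical Report
    export_type=ExportType.DOC_CHUNKS,  # default: one LangChain document per chunk
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)
docs = loader.load()  # LangChain documents, ready for embedding and ingestion
```

With `ExportType.MARKDOWN` instead, each input file comes back as a single Markdown document, which the notebook then splits with `MarkdownHeaderTextSplitter` before ingestion.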