Merge pull request #16 from davidhou17/DOCSP-48649

davidhou17 · web-flow · commit f47874f4f8e5 · 2025-04-08T10:40:21.000-05:00
(DOCSP-48649): Create notebook for GraphRAG with Langchain and MongoDB
diff --git a/ai-integrations/langchain-graphrag.ipynb b/ai-integrations/langchain-graphrag.ipynb
@@ -0,0 +1,278 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b5dcbf95-9a30-416d-afed-d5b2bf0e8651",
+   "metadata": {},
+   "source": [
+    "# GraphRAG with MongoDB and LangChain\n",
+    "\n",
+    "This notebook is a companion to the [GraphRAG with MongoDB and LangChain](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/graph-rag/) tutorial. Refer to the page for set-up instructions and detailed explanations.\n",
+    "\n",
+    "This notebook demonstrates a GraphRAG implementation using MongoDB Atlas and LangChain. Compared to vector-based RAG, which structures your data as vector embeddings, GraphRAG structures data as a knowledge graph with entities and their relationships. This enables relationship-aware retrieval and multi-hop reasoning.\n",
+    "\n",
+    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/ai-integrations/langchain-graphrag.ipynb\">\n",
+    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
+    "</a>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "23f70093-83ea-4ecc-87db-2f2f89e546d7",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "pip install --quiet --upgrade pymongo langchain_community wikipedia langchain_openai langchain_mongodb pyvis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d96955f9-a370-4f45-970d-ef187ee6195c",
+   "metadata": {},
+   "source": [
+    "## Set up your environment\n",
+    "\n",
+    "Before you begin, make sure you have the following:\n",
+    "\n",
+    "- An Atlas cluster up and running (you'll need the [connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/))\n",
+    "- An API key to access an LLM (This tutorial uses a model from OpenAI, but you can use any model [supported by LangChain](https://python.langchain.com/docs/integrations/chat/))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0119b58d-f14e-4f36-a284-345d94478537",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = \"<api-key>\"\n",
+    "ATLAS_CONNECTION_STRING = \"<connection-string>\"\n",
+    "ATLAS_DB_NAME = \"langchain_db\"    # MongoDB database to store the knowledge graph\n",
+    "ATLAS_COLLECTION = \"wikipedia\"    # MongoDB collection to store the knowledge graph"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0adf66a8",
+   "metadata": {},
+   "source": [
+    "## Use MongoDB Atlas as a knowledge graph\n",
+    "\n",
+    "Use the `MongoDBGraphStore` component to store your data as a knowledge graph. This component allows you to implement GraphRAG by storing entities (nodes) and their relationships (edges) in a MongoDB collection. It stores each entity as a document with relationship fields that reference other documents in your collection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f4e8db2f-d918-41aa-92f8-41f80a6d747a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_openai import OpenAI\n",
+    "from langchain.chat_models import init_chat_model\n",
+    "\n",
+    "# For best results, use latest models such as gpt-4o and Claude Sonnet 3.5+, etc.\n",
+    "chat_model = init_chat_model(\"gpt-4o\", model_provider=\"openai\", temperature=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "72cd5c08-e17b-4f47-bca7-ded0fb25fb85",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import WikipediaLoader\n",
+    "from langchain.text_splitter import TokenTextSplitter\n",
+    "\n",
+    "# Load Wikipedia pages corresponding to the query \"Sherlock Holmes\"\n",
+    "wikipedia_pages = WikipediaLoader(query=\"Sherlock Holmes\", load_max_docs=3).load()\n",
+    "\n",
+    "# Split the documents into chunks for efficient downstream processing (graph creation)\n",
+    "text_splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)\n",
+    "wikipedia_docs = text_splitter.split_documents(wikipedia_pages)\n",
+    "\n",
+    "# Print the first document\n",
+    "wikipedia_docs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2dc8f05b-0f9a-4293-b9ea-761030c98dca",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_mongodb.graphrag.graph import MongoDBGraphStore\n",
+    "\n",
+    "graph_store = MongoDBGraphStore(\n",
+    "    connection_string = ATLAS_CONNECTION_STRING,\n",
+    "    database_name = ATLAS_DB_NAME,\n",
+    "    collection_name = ATLAS_COLLECTION,\n",
+    "    entity_extraction_model = chat_model\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3664189e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Extract entities and create knowledge graph in Atlas\n",
+    "# This might take a few minutes; you can ignore any warnings\n",
+    "graph_store.add_documents(wikipedia_docs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b167c2eb-b2c5-45ef-bdc9-8230f7da4c52",
+   "metadata": {},
+   "source": [
+    "## Visualize the knowledge graph\n",
+    "\n",
+    "To visualize the knowledge graph, you can export the structured data to a visualization library like `pyvis`.\n",
+    "This helps you to explore and understand the relationships and hierarchies within your data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b515723-a8a4-435b-b386-5cb3244c2745",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import networkx as nx\n",
+    "from pyvis.network import Network\n",
+    "\n",
+    "def visualize_graph(collection):\n",
+    "    docs = list(collection.find())\n",
+    "    \n",
+    "    def format_attributes(attrs):\n",
+    "        return \"<br>\".join(f\"{k}: {', '.join(v)}\" for k, v in attrs.items()) if attrs else \"\"\n",
+    "    \n",
+    "    G = nx.DiGraph()\n",
+    "\n",
+    "    # Create nodes\n",
+    "    for doc in docs:\n",
+    "        node_id = str(doc[\"_id\"])\n",
+    "        info = f\"Type: {doc.get('type', '')}\"\n",
+    "        if \"attributes\" in doc:\n",
+    "            attr_info = format_attributes(doc[\"attributes\"])\n",
+    "            if attr_info:\n",
+    "                info += \"<br>\" + attr_info\n",
+    "        G.add_node(node_id, label=node_id, title=info.replace(\"<br>\", \"\\n\"))\n",
+    "\n",
+    "    # Create edges\n",
+    "    for doc in docs:\n",
+    "        source = str(doc[\"_id\"])\n",
+    "        rels = doc.get(\"relationships\", {})\n",
+    "        targets = rels.get(\"target_ids\", [])\n",
+    "        types = rels.get(\"types\", [])\n",
+    "        attrs = rels.get(\"attributes\", [])\n",
+    "        \n",
+    "        for i, target in enumerate(targets):\n",
+    "            edge_type = types[i] if i < len(types) else \"\"\n",
+    "            extra = attrs[i] if i < len(attrs) else {}\n",
+    "            edge_info = f\"Relationship: {edge_type}\"\n",
+    "            if extra:\n",
+    "                edge_info += \"<br>\" + format_attributes(extra)\n",
+    "            G.add_edge(source, str(target), label=edge_type, title=edge_info.replace(\"<br>\", \"\\n\"))\n",
+    "\n",
+    "    # Build and configure network\n",
+    "    nt = Network(notebook=True, cdn_resources='in_line', width=\"800px\", height=\"600px\", directed=True)\n",
+    "    nt.from_nx(G)\n",
+    "    nt.set_options('''\n",
+    "    var options = {\n",
+    "      \"interaction\": {\n",
+    "        \"hover\": true,\n",
+    "        \"tooltipDelay\": 200\n",
+    "      },\n",
+    "      \"nodes\": {\n",
+    "        \"font\": {\"multi\": \"html\"}\n",
+    "      },\n",
+    "      \"physics\": {\n",
+    "        \"repulsion\": {\n",
+    "          \"nodeDistance\": 300,\n",
+    "          \"centralGravity\": 0.2,\n",
+    "          \"springLength\": 200,\n",
+    "          \"springStrength\": 0.05,\n",
+    "          \"damping\": 0.09\n",
+    "        }\n",
+    "      }\n",
+    "    }\n",
+    "    ''')\n",
+    "\n",
+    "    return nt.generate_html()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "62f9040e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from IPython.display import HTML, display\n",
+    "from pymongo import MongoClient\n",
+    "\n",
+    "client = MongoClient(ATLAS_CONNECTION_STRING)\n",
+    "\n",
+    "collection = client[ATLAS_DB_NAME][ATLAS_COLLECTION]\n",
+    "html = visualize_graph(collection)\n",
+    "\n",
+    "display(HTML(html))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fbea568d-c656-4271-9e40-6ee01292255e",
+   "metadata": {},
+   "source": [
+    "## Answer questions on your data\n",
+    "\n",
+    "The `MongoDBGraphStore` class provides a `chat_response` method that you can use to answer questions on your data. It executes queries by using the `$graphLookup` aggregation stage."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "506c7366-972c-4e50-88c4-3d5b0151e363",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"Who inspired Sherlock Holmes?\"\n",
+    "\n",
+    "answer = graph_store.chat_response(query)\n",
+    "answer.content"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ai-integrations/llamaindex.ipynb b/ai-integrations/llamaindex.ipynb
@@ -84,7 +84,7 @@
    "source": [
     "# Load the sample data\n",
     "mkdir -p 'data/'\n",
-    "wget 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP' -O 'data/atlas_best_practices.pdf'"
+    "wget 'https://investors.mongodb.com/node/13176/pdf' -O 'data/mongodb-earnings-report.pdf'"
    ]
   },
   {
@@ -93,7 +93,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "sample_data = SimpleDirectoryReader(input_files=[\"./data/atlas_best_practices.pdf\"]).load_data()\n",
+    "sample_data = SimpleDirectoryReader(input_files=[\"./data/mongodb-earnings-report.pdf\"]).load_data()\n",
     "\n",
     "# Print the first document\n",
     "sample_data[0]"
@@ -176,7 +176,7 @@
    "outputs": [],
    "source": [
     "retriever = vector_store_index.as_retriever(similarity_top_k=3)\n",
-    "nodes = retriever.retrieve(\"MongoDB Atlas security\")\n",
+    "nodes = retriever.retrieve(\"MongoDB acquisition\")\n",
     "\n",
     "for node in nodes:\n",
     "    print(node)"
@@ -197,10 +197,10 @@
    "source": [
     "# Specify metadata filters\n",
     "metadata_filters = MetadataFilters(\n",
-    "   filters=[ExactMatchFilter(key=\"metadata.page_label\", value=\"17\")]\n",
+    "   filters=[ExactMatchFilter(key=\"metadata.page_label\", value=\"2\")]\n",
     ")\n",
     "retriever = vector_store_index.as_retriever(similarity_top_k=3, filters=metadata_filters)\n",
-    "nodes = retriever.retrieve(\"MongoDB Atlas security\")\n",
+    "nodes = retriever.retrieve(\"MongoDB acquisition\")\n",
     "\n",
     "for node in nodes:\n",
     "    print(node)"
@@ -226,7 +226,7 @@
     "query_engine = RetrieverQueryEngine(retriever=vector_store_retriever)\n",
     "\n",
     "# Prompt the LLM\n",
-    "response = query_engine.query('How can I secure my MongoDB Atlas cluster?')\n",
+    "response = query_engine.query(\"What was MongoDB's latest acquisition?\")\n",
     "\n",
     "print(response)\n",
     "print(\"\\nSource documents: \")\n",
@@ -248,7 +248,7 @@
    "source": [
     "# Specify metadata filters\n",
     "metadata_filters = MetadataFilters(\n",
-    "   filters=[ExactMatchFilter(key=\"metadata.page_label\", value=\"17\")]\n",
+    "   filters=[ExactMatchFilter(key=\"metadata.page_label\", value=\"2\")]\n",
     ")\n",
     "\n",
     "# Instantiate Atlas Vector Search as a retriever\n",
@@ -258,7 +258,7 @@
     "query_engine = RetrieverQueryEngine(retriever=vector_store_retriever)\n",
     "\n",
     "# Prompt the LLM\n",
-    "response = query_engine.query('How can I secure my MongoDB Atlas cluster?')\n",
+    "response = query_engine.query(\"What was MongoDB's latest acquisition?\")\n",
     "\n",
     "print(response)\n",
     "print(\"\\nSource documents: \")\n",