diff --git a/README.md b/README.md
index 79fc9c32..9ec5d28d 100644
--- a/README.md
+++ b/README.md
@@ -94,9 +94,9 @@ Looking to get started with LLMs, vectorDBs, and the world of Generative AI? The
| --------- | -------------------------- | ----------- |
| | | |
| [Build RAG from Scratch](./tutorials/RAG-from-Scratch) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/RAG-from-Scratch/RAG_from_Scratch.ipynb) [![LLM](https://img.shields.io/badge/openai-api-white)](#) | |
+| [Langchain LlamaIndex Chunking](./tutorials/Langchain-LlamaIndex-Chunking) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/Langchain-LlamaIndex-Chunking/Langchain_Llamaindex_chunking.ipynb)| [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/product-quantization-compress-high-dimensional-vectors-dfcba98fab47) |
| [Product Quantization: Compress High Dimensional Vectors](https://blog.lancedb.com/product-quantization-compress-high-dimensional-vectors-dfcba98fab47) | | [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/product-quantization-compress-high-dimensional-vectors-dfcba98fab47) |
| [Corrective RAG with Langgraph](./tutorials/Corrective-RAG-with_Langgraph/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/Corrective-RAG-with_Langgraph/CRAG_with_Langgraph.ipynb) [![LLM](https://img.shields.io/badge/openai-api-white)](#) | [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/implementing-corrective-rag-in-the-easiest-way-2/)|
-| [Product Quantization: Compress High Dimensional Vectors](https://blog.lancedb.com/product-quantization-compress-high-dimensional-vectors-dfcba98fab47) | | [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/product-quantization-compress-high-dimensional-vectors-dfcba98fab47) |
| [LLMs, RAG, & the missing storage layer for AI](https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984) | | [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984/) |
| [Fine-Tuning LLM using PEFT & QLoRA](./tutorials/fine-tuning_LLM_with_PEFT_QLoRA) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/fine-tuning_LLM_with_PEFT_QLoRA/main.ipynb) [![local LLM](https://img.shields.io/badge/local-llm-green)](#)| [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/optimizing-llms-a-step-by-step-guide-to-fine-tuning-with-peft-and-qlora-22eddd13d25b) |
| [Context-Aware Chatbot using Llama 2 & LanceDB](./tutorials/chatbot_using_Llama2_&_lanceDB) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB/main.ipynb) [![local LLM](https://img.shields.io/badge/local-llm-green)](#)| [![Ghost](https://img.shields.io/badge/ghost-000?style=for-the-badge&logo=ghost&logoColor=%23F7DF1E)](https://blog.lancedb.com/context-aware-chatbot-using-llama-2-lancedb-as-vector-database-4d771d95c755) |
diff --git a/assets/chunking.png b/assets/chunking.png
new file mode 100644
index 00000000..5acd5823
Binary files /dev/null and b/assets/chunking.png differ
diff --git a/tutorials/Langchain-LlamaIndex-Chunking/Langchain_Llamaindex_chunking.ipynb b/tutorials/Langchain-LlamaIndex-Chunking/Langchain_Llamaindex_chunking.ipynb
new file mode 100644
index 00000000..badab977
--- /dev/null
+++ b/tutorials/Langchain-LlamaIndex-Chunking/Langchain_Llamaindex_chunking.ipynb
@@ -0,0 +1,1148 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# LlamaIndex Text Chunking Strategies\n"
+ ],
+ "metadata": {
+ "id": "1o6oVH_fNA0N"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "lCgSu4wS5L02",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "c9cef73c-2e7b-4415-f1e0-3dad04cb76c6"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ ]
+ }
+ ],
+ "source": [
+ "!pip install llama_index tree_sitter tree_sitter_languages -q"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Download the sample Markdown and text files used in the examples below\n",
+ "!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/README.md\n",
+ "!wget https://frontiernerds.com/files/state_of_the_union.txt"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "xYx0PGZTD2xs",
+ "outputId": "042bb5d1-cb90-45c9-8183-0c986193cdb1"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "--2024-04-15 10:06:43-- https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/README.md\n",
+ "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
+ "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 29701 (29K) [text/plain]\n",
+ "Saving to: ‘README.md’\n",
+ "\n",
+ "\rREADME.md 0%[ ] 0 --.-KB/s \rREADME.md 100%[===================>] 29.00K --.-KB/s in 0.002s \n",
+ "\n",
+ "2024-04-15 10:06:43 (12.1 MB/s) - ‘README.md’ saved [29701/29701]\n",
+ "\n",
+ "--2024-04-15 10:06:43-- https://frontiernerds.com/files/state_of_the_union.txt\n",
+ "Resolving frontiernerds.com (frontiernerds.com)... 172.67.180.189, 104.21.31.232, 2606:4700:3036::6815:1fe8, ...\n",
+ "Connecting to frontiernerds.com (frontiernerds.com)|172.67.180.189|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: unspecified [text/plain]\n",
+ "Saving to: ‘state_of_the_union.txt’\n",
+ "\n",
+ "state_of_the_union. [ <=> ] 39.91K --.-KB/s in 0.001s \n",
+ "\n",
+ "2024-04-15 10:06:43 (64.5 MB/s) - ‘state_of_the_union.txt’ saved [40864]\n",
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## File-based Node Parsers"
+ ],
+ "metadata": {
+ "id": "2olIZB2unXwz"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - Simple File\n",
+ "`SimpleFileNodeParser` picks a suitable node parser for each document based on its file type."
+ ],
+ "metadata": {
+ "id": "HtW5uzownkVP"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Simple File\n",
+ "from llama_index.core.node_parser import SimpleFileNodeParser\n",
+ "from llama_index.readers.file import FlatReader\n",
+ "from pathlib import Path\n",
+ "\n",
+ "md_docs = FlatReader().load_data(Path(\"README.md\"))\n",
+ "\n",
+ "parser = SimpleFileNodeParser()\n",
+ "\n",
+ "# Optionally, chain a text-based splitter after this parser to control the length of each node\n",
+ "md_nodes = parser.get_nodes_from_documents(md_docs)\n",
+ "md_nodes[0].text"
+ ],
+ "metadata": {
+ "id": "GqWdmhBdWrB4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 105
+ },
+ "outputId": "172b7873-06cc-4a71-fd5b-acdcd24aeb0d"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'VectorDB-recipes\\n\\nDive into building GenAI applications!\\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\\n\\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \\n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\\n\\n\\n\\n\\nJoin our community for support - Discord •\\nTwitter\\n\\n---\\n\\nThis repository is divided into 3 sections:\\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 4
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - HTML"
+ ],
+ "metadata": {
+ "id": "-au7BAS2nvBC"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# HTML\n",
+ "\n",
+ "import requests\n",
+ "from llama_index.core import Document\n",
+ "from llama_index.core.node_parser import HTMLNodeParser\n",
+ "\n",
+ "# URL of the website to fetch HTML from\n",
+ "url = \"https://www.utoronto.ca/\"\n",
+ "\n",
+ "# Send a GET request to the URL\n",
+ "response = requests.get(url)\n",
+ "print(response)\n",
+ "\n",
+ "# Check if the request was successful (status code 200)\n",
+ "if response.status_code == 200:\n",
+ " # Extract the HTML content from the response\n",
+ " html_doc = response.text\n",
+ " document = Document(id_=url, text=html_doc)\n",
+ "\n",
+ " parser = HTMLNodeParser(tags=[\"p\", \"h1\"])\n",
+ " nodes = parser.get_nodes_from_documents([document])\n",
+ " print(nodes)\n",
+ "else:\n",
+ " # Print an error message if the request was unsuccessful\n",
+ " print(\"Failed to fetch HTML content:\", response.status_code)"
+ ],
+ "metadata": {
+ "id": "Zhe7xYJtXw4l",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "7f32b9e5-225e-4a3e-b9a0-9a6287615f5a"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "<Response [200]>\n",
+ "[TextNode(id_='bf308ea9-b937-4746-8645-c8023e2087d7', embedding=None, metadata={'tag': 'h1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), : RelatedNodeInfo(node_id='7c280bdf-7373-4be8-8e70-6360848581e9', node_type=, metadata={'tag': 'p'}, hash='3e989bb32b04814d486ed9edeefb1b0ce580ba7fc8c375f64473ddd95ca3e824')}, text='Welcome to University of Toronto', start_char_idx=2784, end_char_idx=2816, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), TextNode(id_='7c280bdf-7373-4be8-8e70-6360848581e9', embedding=None, metadata={'tag': 'p'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), : RelatedNodeInfo(node_id='bf308ea9-b937-4746-8645-c8023e2087d7', node_type=, metadata={'tag': 'h1'}, hash='e1e6af749b6a40a4055c80ca6b821ed841f1d20972e878ca1881e508e4446c26')}, text='In photos: Under cloudy skies, U of T community gathers to experience near-total solar eclipse\\nYour guide to the U of T community\\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. U of T Celebrates recognizes their award-winning accomplishments.\\nDavid Dyzenhaus recognized with Gold Medal from Social Sciences and Humanities Research Council\\nOur latest issue is all about feeling good: the only diet you really need to know about, the science behind cold plunges, a uniquely modern way to quit smoking, the “sex, drugs and rock ‘n’ roll” of university classes, how to become a better workplace leader, and more.\\nFaculty and Staff\\nHis course about the body is a workout for the mind\\nProfessor Doug Richards teaches his students the secret to living a longer – and healthier – life\\n\\nStatement of Land Acknowledgement\\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\\nRead about U of T’s Statement of Land Acknowledgement.\\nUNIVERSITY OF TORONTO - SINCE 1827', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - JSON"
+ ],
+ "metadata": {
+ "id": "rbn4Rvt-n4Zr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# JSON\n",
+ "\n",
+ "import requests\n",
+ "from llama_index.core import Document\n",
+ "from llama_index.core.node_parser import JSONNodeParser\n",
+ "\n",
+ "url = \"https://housesigma.com/bkv2/api/search/address_v2/suggest\"\n",
+ "\n",
+ "payload = {\"lang\": \"en_US\", \"province\": \"ON\", \"search_term\": \"Mississauga, ontario\"}\n",
+ "\n",
+ "headers = {\"Authorization\": \"Bearer 20240127frk5hls1ba07nsb8idfdg577qa\"}\n",
+ "\n",
+ "response = requests.post(url, headers=headers, data=payload)\n",
+ "\n",
+ "if response.status_code == 200:\n",
+ " document = Document(id_=url, text=response.text)\n",
+ " parser = JSONNodeParser()\n",
+ "\n",
+ " nodes = parser.get_nodes_from_documents([document])\n",
+ " print(nodes[0])\n",
+ "else:\n",
+ " print(\"Failed to fetch JSON content:\", response.status_code)"
+ ],
+ "metadata": {
+ "id": "CW8pTEsEYdgL",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "28dae2de-f880-4874-95bb-5de82a716019"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Node ID: 05325093-16a2-41ac-b952-3882c817ac4d\n",
+ "Text: status True data house_list id_listing owJKR7PNnP9YXeLP data\n",
+ "house_list house_type_in_map D data house_list price_abbr 0.75M data\n",
+ "house_list price 749,000 data house_list price_sold 690,000 data\n",
+ "house_list tags Sold data house_list list_status public 1 data\n",
+ "house_list list_status live 0 data house_list list_status s_r Sale\n",
+ "data house_list list_s...\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - Markdown"
+ ],
+ "metadata": {
+ "id": "VYBTqzmJn9Z5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Markdown\n",
+ "from llama_index.core.node_parser import MarkdownNodeParser\n",
+ "\n",
+ "md_docs = FlatReader().load_data(Path(\"README.md\"))\n",
+ "parser = MarkdownNodeParser()\n",
+ "\n",
+ "nodes = parser.get_nodes_from_documents(md_docs)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "55f43LgJYkok",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 105
+ },
+ "outputId": "3b9a3865-0a58-4e53-cff5-17cb8d024631"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'VectorDB-recipes\\n\\nDive into building GenAI applications!\\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\\n\\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \\n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\\n\\n\\n\\n\\nJoin our community for support - Discord •\\nTwitter\\n\\n---\\n\\nThis repository is divided into 3 sections:\\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 10
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Chunking"
+ ],
+ "metadata": {
+ "id": "gCFoPc1PZFI5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Download a sample Python file for the code-splitting example\n",
+ "!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/applications/talk-with-podcast/app.py"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "rVkeWuwvDwu-",
+ "outputId": "1ddb950a-0c0f-4be4-88fd-3029d53e6640"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "--2024-04-15 10:22:58-- https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/applications/talk-with-podcast/app.py\n",
+ "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...\n",
+ "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 1582 (1.5K) [text/plain]\n",
+ "Saving to: ‘app.py’\n",
+ "\n",
+ "\rapp.py 0%[ ] 0 --.-KB/s \rapp.py 100%[===================>] 1.54K --.-KB/s in 0s \n",
+ "\n",
+ "2024-04-15 10:22:58 (12.1 MB/s) - ‘app.py’ saved [1582/1582]\n",
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Code Splitting"
+ ],
+ "metadata": {
+ "id": "spibLOthoCsK"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Code Splitting\n",
+ "\n",
+ "from llama_index.core.node_parser import CodeSplitter\n",
+ "\n",
+ "documents = FlatReader().load_data(Path(\"app.py\"))\n",
+ "splitter = CodeSplitter(\n",
+ " language=\"python\",\n",
+ " chunk_lines=40, # lines per chunk\n",
+ " chunk_lines_overlap=15, # lines overlap between chunks\n",
+ " max_chars=1500, # max chars per chunk\n",
+ ")\n",
+ "nodes = splitter.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "IDoDzDeiYqpL",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 140
+ },
+ "outputId": "5ac4578a-c5de-4060-cf9c-420a9078652b"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.\n",
+ " warn(\"{} is deprecated. Use {} instead.\".format(old, new), FutureWarning)\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'from youtube_podcast_download import podcast_audio_retreival\\nfrom transcribe_podcast import transcribe\\nfrom chat_retreival import retrieverSetup, chat\\nfrom langroid_utils import configure, agent\\n\\nimport os\\nimport glob\\nimport json\\nimport streamlit as st\\n\\nOPENAI_KEY = os.environ[\"OPENAI_API_KEY\"]\\n\\n\\n@st.cache_resource\\ndef video_data_retreival(framework):\\n f = open(\"output.json\")\\n data = json.load(f)\\n\\n # setting up reteriver\\n if framework == \"Langchain\":\\n qa = retrieverSetup(data[\"text\"], OPENAI_KEY)\\n return qa\\n elif framework == \"Langroid\":\\n langroid_file = open(\"langroid_doc.txt\", \"w\") # write mode\\n langroid_file.write(data[\"text\"])\\n cfg = configure(\"langroid_doc.txt\")\\n return cfg\\n\\n\\nst.header(\"Talk with Youtube Podcasts\", divider=\"rainbow\")\\n\\nurl = st.text_input(\"Youtube Link\")\\nframework = st.radio(\\n \"**Select Framework 👇**\",\\n [\"Langchain\", \"Langroid\"],\\n key=\"Langchain\",\\n)\\n\\nif url:\\n st.video(url)\\n # Podcast Audio Retreival from Youtube\\n podcast_audio_retreival(url)\\n\\n # Trascribing podcast audio\\n filename = glob.glob(\"*.mp3\")[0]\\n transcribe(filename)\\n\\n st.markdown(f\"##### `{framework}` Framework Selected for talking with Podcast\")\\n # Chat Agent getting ready\\n qa = video_data_retreival(framework)\\n\\n\\nprompt = st.chat_input(\"Talk with Podcast\")\\n\\ni'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 13
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Sentence Splitting"
+ ],
+ "metadata": {
+ "id": "On40RuqBoGNL"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Sentence Splitting\n",
+ "\n",
+ "from llama_index.core.node_parser import SentenceSplitter\n",
+ "\n",
+ "documents = FlatReader().load_data(Path(\"state_of_the_union.txt\"))\n",
+ "splitter = SentenceSplitter(\n",
+ " chunk_size=254,\n",
+ " chunk_overlap=20,\n",
+ ")\n",
+ "nodes = splitter.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "iNKuiCNrZOHl",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 105
+ },
+ "outputId": "47064bf4-4079-42df-83b6-d519ba92a135"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\""
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 15
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - Sentence Window"
+ ],
+ "metadata": {
+ "id": "h5zc7_YmoJHP"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# SentenceWindowNodeParser\n",
+ "\n",
+ "from llama_index.core.node_parser import SentenceWindowNodeParser\n",
+ "\n",
+ "node_parser = SentenceWindowNodeParser.from_defaults(\n",
+ " window_size=3,\n",
+ " window_metadata_key=\"window\",\n",
+ " original_text_metadata_key=\"original_sentence\",\n",
+ ")\n",
+ "nodes = node_parser.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "76tbzUrMZRFF",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "outputId": "7f57513d-da7b-45f9-96c0-5698e06f1562"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. '"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 16
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - Semantic Splitting"
+ ],
+ "metadata": {
+ "id": "zj45BKLMoRgp"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# SemanticSplitterNodeParser\n",
+ "\n",
+ "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
+ "from llama_index.embeddings.openai import OpenAIEmbedding\n",
+ "import os\n",
+ "\n",
+ "# Add OpenAI API key as environment variable\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"sk-****\"\n",
+ "\n",
+ "embed_model = OpenAIEmbedding()\n",
+ "splitter = SemanticSplitterNodeParser(\n",
+ " buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model\n",
+ ")\n",
+ "\n",
+ "nodes = splitter.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "wAp7BU25ZdRt",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "outputId": "3f51c53b-7617-4c67-c247-125f7a6b84be"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. '"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 17
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Token Text Splitting"
+ ],
+ "metadata": {
+ "id": "vH9xni1SoWYE"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TokenTextSplitter\n",
+ "\n",
+ "from llama_index.core.node_parser import TokenTextSplitter\n",
+ "\n",
+ "splitter = TokenTextSplitter(\n",
+ " chunk_size=254,\n",
+ " chunk_overlap=20,\n",
+ " separator=\" \",\n",
+ ")\n",
+ "nodes = splitter.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "9G61og__Ziec",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 105
+ },
+ "outputId": "5b11b5f9-94d1-4b6d-a7d5-a58025f58f2a"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\\n\\nOne year ago, I took office amid two wars, an economy\""
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 18
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Relation-based Node Parser"
+ ],
+ "metadata": {
+ "id": "rpbqXxeaawOt"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Node Parser - Hierarchical"
+ ],
+ "metadata": {
+ "id": "z_HuwzzAoabc"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# HierarchicalNodeParser\n",
+ "\n",
+ "from llama_index.core.node_parser import HierarchicalNodeParser\n",
+ "\n",
+ "node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[512, 254, 128])\n",
+ "\n",
+ "nodes = node_parser.get_nodes_from_documents(documents)\n",
+ "nodes[0].text"
+ ],
+ "metadata": {
+ "id": "qwZFEDlaZpKT",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 105
+ },
+ "outputId": "12888840-daf5-45f5-f934-e253b6036621"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\\n\\nOne year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed.\\n\\nBut the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. For those who had already known poverty, life has become that much harder.\\n\\nThis recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.\\n\\nSo I know the anxieties that are out there right now. They're not new. These struggles are the reason I ran for president. These struggles are what I've witnessed for years in places like Elkhart, Ind., and Galesburg, Ill. I hear about them in the letters that I read each night.\""
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 19
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# LangChain Text Chunking Strategies"
+ ],
+ "metadata": {
+ "id": "64yUhjV_a9dk"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install -qU langchain-text-splitters\n",
+ "!pip install requests"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "8OijyBvqbCLC",
+ "outputId": "8c5d46c4-0435-4f56-fe50-34c28d7846fd"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m287.5/287.5 kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m113.0/113.0 kB\u001b[0m \u001b[31m11.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.0/53.0 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m144.8/144.8 kB\u001b[0m \u001b[31m11.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.6)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2024.2.2)\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Text Splitting - Character"
+ ],
+ "metadata": {
+ "id": "QVKmz3Rvok9Y"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Split on a fixed separator (here, the blank line between paragraphs)\n",
+ "\n",
+ "with open(\"state_of_the_union.txt\") as f:\n",
+ " state_of_the_union = f.read()\n",
+ "\n",
+ "\n",
+ "from langchain_text_splitters import CharacterTextSplitter\n",
+ "\n",
+ "text_splitter = CharacterTextSplitter(\n",
+ " separator=\"\\n\\n\",\n",
+ " chunk_size=1000,\n",
+ " chunk_overlap=200,\n",
+ " length_function=len,\n",
+ " is_separator_regex=False,\n",
+ ")\n",
+ "\n",
+ "texts = text_splitter.create_documents([state_of_the_union])\n",
+ "print(texts[0].page_content)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "EdjiGEitI4La",
+ "outputId": "29bb4bfd-e198-4902-e7c0-40c6df7d488b"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 1163, which is longer than the specified 1000\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 1015, which is longer than the specified 1000\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
+ "\n",
+ "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Text Splitting - Recursive Character"
+ ],
+ "metadata": {
+ "id": "ivNYVKPZowKh"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Recursive Split Character\n",
+ "\n",
+ "# This is a long document we can split up.\n",
+ "with open(\"state_of_the_union.txt\") as f:\n",
+ " state_of_the_union = f.read()\n",
+ "\n",
+ "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
+ "\n",
+ "text_splitter = RecursiveCharacterTextSplitter(\n",
+ " # Aim for chunks of at most 1000 characters, with a 100-character overlap.\n",
+ " chunk_size=1000,\n",
+ " chunk_overlap=100,\n",
+ " length_function=len,\n",
+ " is_separator_regex=False,\n",
+ ")\n",
+ "\n",
+ "texts = text_splitter.create_documents([state_of_the_union])\n",
+ "print(\"Chunk 2: \", texts[1].page_content)\n",
+ "print(\"Chunk 3: \", texts[2].page_content)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "9X6_duxwN3nI",
+ "outputId": "d6ce1302-1a9a-4887-9505-1c206390ab2f"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Chunk 2: It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\n",
+ "\n",
+ "Again, we are tested. And again, we must answer history's call.\n",
+ "Chunk 3: Again, we are tested. And again, we must answer history's call.\n",
+ "\n",
+ "One year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed.\n",
+ "\n",
+ "But the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. For those who had already known poverty, life has become that much harder.\n",
+ "\n",
+ "This recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Text Splitting - HTML Header"
+ ],
+ "metadata": {
+ "id": "I1nKMkm4o1Ft"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Split with HTML Tags\n",
+ "\n",
+ "from langchain_text_splitters import HTMLHeaderTextSplitter\n",
+ "import requests\n",
+ "\n",
+ "# URL of the website to fetch HTML from\n",
+ "url = \"https://www.utoronto.ca/\"\n",
+ "\n",
+ "# Send a GET request; raise on an HTTP error so html_doc is never undefined\n",
+ "response = requests.get(url, timeout=30)\n",
+ "response.raise_for_status()\n",
+ "html_doc = response.text\n",
+ "\n",
+ "headers_to_split_on = [\n",
+ " (\"h1\", \"Header 1\"),\n",
+ " (\"h2\", \"Header 2\"),\n",
+ " (\"h3\", \"Header 3\"),\n",
+ "]\n",
+ "\n",
+ "html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
+ "html_header_splits = html_splitter.split_text(html_doc)\n",
+ "html_header_splits[0].page_content"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "id": "671M7BEVJ5zL",
+ "outputId": "d6d5df83-98f5-42e5-c50a-de7455a46b93"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'Welcome to University of Toronto \\nMain menu tools'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 29
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Text Splitting - Code"
+ ],
+ "metadata": {
+ "id": "y8utGi0No6tr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Code Splitting\n",
+ "\n",
+ "from langchain_text_splitters import Language, RecursiveCharacterTextSplitter\n",
+ "\n",
+ "\n",
+ "with open(\"app.py\") as f:\n",
+ " code = f.read()\n",
+ "\n",
+ "python_splitter = RecursiveCharacterTextSplitter.from_language(\n",
+ " language=Language.PYTHON, chunk_size=100, chunk_overlap=0\n",
+ ")\n",
+ "python_docs = python_splitter.create_documents([code])\n",
+ "python_docs[0].page_content"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "id": "9nfbAYj1KGQL",
+ "outputId": "92edafe3-7d0c-4e20-90ea-5111c565b232"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'from youtube_podcast_download import podcast_audio_retreival'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 33
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Text Splitting - Recursive JSON"
+ ],
+ "metadata": {
+ "id": "tePMloUspEcX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Recursive Split Json\n",
+ "\n",
+ "import requests\n",
+ "from langchain_text_splitters import RecursiveJsonSplitter\n",
+ "\n",
+ "# Fetch a nested JSON document (the LangSmith OpenAPI spec) to split\n",
+ "json_data = requests.get(\"https://api.smith.langchain.com/openapi.json\").json()\n",
+ "\n",
+ "splitter = RecursiveJsonSplitter(max_chunk_size=300)\n",
+ "json_chunks = splitter.split_json(json_data=json_data)\n",
+ "json_chunks[0]"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "8bW_6wkmMAoR",
+ "outputId": "73dadc8f-30bc-491f-c3f5-a95e75486971"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "{'openapi': '3.1.0',\n",
+ " 'info': {'title': 'LangSmith', 'version': '0.1.0'},\n",
+ " 'servers': [{'url': 'https://api.smith.langchain.com',\n",
+ " 'description': 'LangSmith API endpoint.'}]}"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 33
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Semantic Splitting"
+ ],
+ "metadata": {
+ "id": "a8Rt52AepNNk"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Semantic Chunking\n",
+ "\n",
+ "!pip install --quiet langchain_experimental langchain_openai\n",
+ "\n",
+ "import os\n",
+ "from langchain_experimental.text_splitter import SemanticChunker\n",
+ "from langchain_openai.embeddings import OpenAIEmbeddings\n",
+ "\n",
+ "# Set your OpenAI API key as an environment variable (replace the placeholder; never commit a real key)\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"sk-****\"\n",
+ "\n",
+ "with open(\"state_of_the_union.txt\") as f:\n",
+ " state_of_the_union = f.read()\n",
+ "\n",
+ "text_splitter = SemanticChunker(OpenAIEmbeddings())\n",
+ "\n",
+ "docs = text_splitter.create_documents([state_of_the_union])\n",
+ "print(docs[0].page_content)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "oHDYFAHeOjPA",
+ "outputId": "f7a5bbc2-c432-4370-bbf5-e529f4ff8c77"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
+ "\n",
+ "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Splitting by Tokens"
+ ],
+ "metadata": {
+ "id": "dV7RMi7_pRWn"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
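+ "# Split by token count\n",
+ "\n",
+ "# tiktoken measures chunk length in tokens; splits still happen on the separator, so a chunk can exceed chunk_size\n",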
+ "!pip install --upgrade --quiet tiktoken\n",
+ "\n",
+ "with open(\"state_of_the_union.txt\") as f:\n",
+ " state_of_the_union = f.read()\n",
+ "\n",
+ "from langchain_text_splitters import CharacterTextSplitter\n",
+ "\n",
+ "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
+ " chunk_size=100, chunk_overlap=0\n",
+ ")\n",
+ "texts = text_splitter.split_text(state_of_the_union)\n",
+ "\n",
+ "print(texts[0])"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "7_WVw_kEQmJg",
+ "outputId": "93f501ed-d8a4-4350-a670-a6af25d2879d"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 123, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 104, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 109, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 106, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 129, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 111, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 118, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 132, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 231, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 177, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 112, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 116, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 184, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 139, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 112, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 151, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 203, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 138, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 123, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 213, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 134, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 125, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 139, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 111, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
+ "WARNING:langchain_text_splitters.base:Created a chunk of size 124, which is longer than the specified 100\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
+ "\n",
+ "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "vMYrBTIvvGEg"
+ },
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
diff --git a/tutorials/Langchain-LlamaIndex-Chunking/README.md b/tutorials/Langchain-LlamaIndex-Chunking/README.md
new file mode 100644
index 00000000..3de1b089
--- /dev/null
+++ b/tutorials/Langchain-LlamaIndex-Chunking/README.md
@@ -0,0 +1,7 @@
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/tutorials/Langchain-LlamaIndex-Chunking/Langchain_Llamaindex_chunking.ipynb)
+
+![Chunking](../../assets/chunking.png)
+
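+This notebook walks through the text chunking strategies available in LangChain and LlamaIndex. The LangChain examples cover character, recursive character, HTML header, code, recursive JSON, semantic, and token-based splitting.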
+
+[Read more on the blog](https://blog.lancedb.com/chunking-techiniques-with-langchain-and-llamaindex/)
\ No newline at end of file