diff --git a/examples/cookbook/firecrawl/.env.example b/examples/cookbook/firecrawl/.env.example new file mode 100644 index 00000000..07a84e0d --- /dev/null +++ b/examples/cookbook/firecrawl/.env.example @@ -0,0 +1,9 @@ +# Example env for Firecrawl + Moss cookbook +# Copy to .env and fill in values before running the notebook. + +# Moss credentials +MOSS_PROJECT_ID=your_moss_project_id +MOSS_PROJECT_KEY=your_moss_project_key + +# Firecrawl API key +FIRECRAWL_API_KEY=your_firecrawl_api_key \ No newline at end of file diff --git a/examples/cookbook/firecrawl/README.md b/examples/cookbook/firecrawl/README.md new file mode 100644 index 00000000..04b050c3 --- /dev/null +++ b/examples/cookbook/firecrawl/README.md @@ -0,0 +1,98 @@ +# Firecrawl + Moss Cookbook Example + +Use Firecrawl to turn one or more URLs into clean markdown, then index the results into Moss and query them semantically from a notebook. + +> This is a cookbook example, not a packaged integration. Open [firecrawl_moss.ipynb](firecrawl_moss.ipynb) to follow the full URL-to-query pipeline. + +## Installation + +```bash +pip install firecrawl-py moss python-dotenv +``` + +## Setup + +Set these environment variables in your shell or a `.env` file: + +```bash +FIRECRAWL_API_KEY=your-firecrawl-api-key +MOSS_PROJECT_ID=your-project-id +MOSS_PROJECT_KEY=your-project-key +``` + +## Quick Start + +1. Open [firecrawl_moss.ipynb](firecrawl_moss.ipynb) in Jupyter or VS Code. +2. Run the setup and helper cells. +3. Set `urls` to the pages you want to ingest. +4. Run `await build_and_query_knowledge_base(urls)` to crawl, index, and query the content. + +## Workflow + +The notebook is structured for efficiency: + +1. **Prepare** (one-time): Crawl URLs → normalize markdown → index into Moss +2. **Query** (repeated): Run semantic queries against the indexed knowledge base without re-crawling + +This design lets you crawl once (which can be slow/expensive) and then iterate on queries quickly. 
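The "normalize markdown" step in the workflow above is where crawler output gets cleaned before indexing. This README does not prescribe an implementation, but a minimal sketch might strip page chrome such as skip links, which otherwise leak into indexed passages (the `normalize_markdown` helper and the pattern it filters are illustrative assumptions, not part of the notebook):

```python
import re


def normalize_markdown(markdown: str) -> str:
    """Drop common page chrome (e.g. skip links) from crawled markdown.

    Illustrative only; extend the filtered patterns to match the sites you crawl.
    """
    kept = [
        line
        for line in markdown.splitlines()
        # Filter accessibility/navigation boilerplate like "[Skip to main content](...)"
        if not line.strip().startswith("[Skip to main content]")
    ]
    # Collapse blank-line runs left behind by removed chrome
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


page = "[Skip to main content](https://example.com#content)\n\n# Title\n\nBody text."
print(normalize_markdown(page))  # prints the page with the skip link removed
```

Running cleaning like this before building `DocumentInfo` objects keeps navigation boilerplate out of the index, so query results surface page content rather than layout residue.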
+ +## Architecture + +``` +┌─────────────┐ +│ URLs │ +└──────┬──────┘ + │ + ├──> Firecrawl (crawl + scrape) + │ +┌──────▼─────────────────┐ +│ Crawled Pages │ +│ (raw HTML/markdown) │ +└──────┬─────────────────┘ + │ + ├──> Markdown Normalization + │ (clean text, remove chrome) + │ +┌──────▼─────────────────┐ +│ Cleaned Markdown │ +│ (one DocumentInfo │ +│ per page) │ +└──────┬─────────────────┘ + │ + ├──> Moss Create Index + │ +┌──────▼─────────────────┐ +│ Indexed Knowledge │ +│ Base (local or cloud) │ +└──────┬─────────────────┘ + │ + ├──> Semantic Query (reusable) + │ (no re-crawling needed) + │ +┌──────▼─────────────────┐ +│ Top-K Results │ +│ (scored passages) │ +└─────────────────────────┘ +``` + +## What the notebook does + +```python +from firecrawl import Firecrawl +from moss import DocumentInfo, MossClient, QueryOptions + +job = Firecrawl(api_key=FIRECRAWL_API_KEY).crawl( + url="https://example.com", + limit=3, + scrape_options={"formats": ["markdown"]}, +) + +documents = [DocumentInfo(id="1", text=job.data[0].markdown, metadata={"source_url": "https://example.com"})] +await MossClient(MOSS_PROJECT_ID, MOSS_PROJECT_KEY).create_index("firecrawl-demo", documents) +``` + +## Files + +| File | Description | +|------|-------------| +| `firecrawl_moss.ipynb` | Notebook that crawls URLs, indexes markdown into Moss, and runs semantic search | diff --git a/examples/cookbook/firecrawl/firecrawl_moss.ipynb b/examples/cookbook/firecrawl/firecrawl_moss.ipynb new file mode 100644 index 00000000..06c95ec9 --- /dev/null +++ b/examples/cookbook/firecrawl/firecrawl_moss.ipynb @@ -0,0 +1,308 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "15050d77", + "metadata": {}, + "source": [ + "# Firecrawl + Moss Cookbook\n", + "\n", + "Crawl one or more URLs with Firecrawl, convert the results to clean markdown, index them into Moss, and query the knowledge base semantically." 
+ ] + }, + { + "cell_type": "markdown", + "id": "ca524d0e", + "metadata": {}, + "source": [ + "## 1. Set Up Project Environment\n", + "\n", + "Install the SDKs and set your credentials before running the notebook." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "6da4124d", + "metadata": {}, + "outputs": [], + "source": [ + "#pip install firecrawl-py moss python-dotenv" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c22437b8", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import os\n", + "import uuid\n", + "from dataclasses import dataclass\n", + "from typing import Any\n", + "\n", + "from dotenv import load_dotenv\n", + "from firecrawl import Firecrawl\n", + "from moss import DocumentInfo, MossClient, QueryOptions\n", + "\n", + "load_dotenv()\n", + "\n", + "FIRECRAWL_API_KEY = os.getenv(\"FIRECRAWL_API_KEY\")\n", + "MOSS_PROJECT_ID = os.getenv(\"MOSS_PROJECT_ID\")\n", + "MOSS_PROJECT_KEY = os.getenv(\"MOSS_PROJECT_KEY\")\n", + "DEFAULT_QUERY = \"What does the knowledge base say about the topic?\"" + ] + }, + { + "cell_type": "markdown", + "id": "ce4dae7b", + "metadata": {}, + "source": [ + "## 2. Define Core Data Structures\n", + "\n", + "Normalize each crawled page into a small Python structure before converting it into Moss documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "bf5da039", + "metadata": {}, + "outputs": [], + "source": [ + "@dataclass\n", + "class CrawledPage:\n", + " url: str\n", + " markdown: str\n", + " title: str | None = None\n", + "\n", + "\n", + "def page_to_crawled_page(page: Any) -> CrawledPage:\n", + " markdown = getattr(page, \"markdown\", None)\n", + " if markdown is None and isinstance(page, dict):\n", + " markdown = page.get(\"markdown\")\n", + "\n", + " metadata = getattr(page, \"metadata\", None)\n", + " if metadata is None and isinstance(page, dict):\n", + " metadata = page.get(\"metadata\", {})\n", + "\n", + " url = None\n", + " title = None\n", + " if isinstance(metadata, dict):\n", + " url = metadata.get(\"source_url\") or metadata.get(\"sourceURL\") or metadata.get(\"url\")\n", + " title = metadata.get(\"title\") or metadata.get(\"og_title\")\n", + " elif metadata is not None:\n", + " url = getattr(metadata, \"source_url\", None) or getattr(metadata, \"sourceURL\", None) or getattr(metadata, \"url\", None)\n", + " title = getattr(metadata, \"title\", None) or getattr(metadata, \"og_title\", None)\n", + "\n", + " return CrawledPage(url=url or \"unknown\", markdown=markdown or \"\", title=title)\n", + "\n", + "\n", + "def crawled_pages_to_moss_docs(pages: list[CrawledPage]) -> list[DocumentInfo]:\n", + " docs: list[DocumentInfo] = []\n", + " for index, page in enumerate(pages, start=1):\n", + " docs.append(\n", + " DocumentInfo(\n", + " id=f\"firecrawl-{index}\",\n", + " text=page.markdown,\n", + " metadata={\"source_url\": page.url, \"title\": page.title or \"\"},\n", + " )\n", + " )\n", + " return docs" + ] + }, + { + "cell_type": "markdown", + "id": "fcb41889", + "metadata": {}, + "source": [ + "## 3. Implement Main Functionality\n", + "\n", + "Firecrawl handles URL-to-markdown extraction. Moss handles indexing and semantic search." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42c24a13", + "metadata": {}, + "outputs": [], + "source": [ + "def validate_configuration(urls: list[str]) -> None:\n", + " if not urls:\n", + " raise ValueError(\"Provide at least one URL to crawl.\")\n", + " if not FIRECRAWL_API_KEY:\n", + " raise ValueError(\"Set FIRECRAWL_API_KEY before running the notebook.\")\n", + " if not MOSS_PROJECT_ID or not MOSS_PROJECT_KEY:\n", + " raise ValueError(\"Set MOSS_PROJECT_ID and MOSS_PROJECT_KEY before running the notebook.\")\n", + "\n", + "\n", + "def crawl_urls(urls: list[str], limit: int = 3) -> list[CrawledPage]:\n", + " firecrawl = Firecrawl(api_key=FIRECRAWL_API_KEY)\n", + " pages: list[CrawledPage] = []\n", + "\n", + " for url in urls:\n", + " job = firecrawl.crawl(url=url, limit=limit, scrape_options={\"formats\": [\"markdown\"]})\n", + " raw_pages = getattr(job, \"data\", None) or (job.get(\"data\") if isinstance(job, dict) else []) or []\n", + " pages.extend(page_to_crawled_page(page) for page in raw_pages)\n", + "\n", + " return [page for page in pages if page.markdown.strip()]\n", + "\n", + "\n", + "async def prepare_knowledge_base(urls: list[str], limit: int = 10) -> tuple[MossClient, str]:\n", + " validate_configuration(urls)\n", + " crawled_pages = crawl_urls(urls, limit=limit)\n", + " documents = crawled_pages_to_moss_docs(crawled_pages)\n", + "\n", + " if not documents:\n", + " raise RuntimeError(\"Firecrawl returned no markdown content to index.\")\n", + "\n", + " index_name = f\"firecrawl-cookbook-{uuid.uuid4().hex[:8]}\"\n", + " client = MossClient(MOSS_PROJECT_ID, MOSS_PROJECT_KEY)\n", + "\n", + " await client.create_index(index_name, documents)\n", + " await client.load_index(index_name)\n", + "\n", + " print(f\"Indexed {len(documents)} documents into {index_name}\")\n", + " return client, index_name\n", + "\n", + "\n", + "async def query_knowledge_base(client: MossClient, index_name: str, query: str = DEFAULT_QUERY) -> None:\n", 
+ " results = await client.query(index_name, query, QueryOptions(top_k=3, alpha=0.8))\n", + "\n", + " print(f\"Query: {query}\")\n", + " for item in results.docs:\n", + " source_url = item.metadata.get(\"source_url\", \"unknown\") if item.metadata else \"unknown\"\n", + " print(f\"- [{item.score:.3f}] {source_url}\")\n", + " print(f\" {item.text[:200].strip()}\")\n", + "\n", + "\n", + "# Build knowledge base and query it in one step\n", + "async def build_and_query_knowledge_base(urls: list[str], query: str = DEFAULT_QUERY) -> None:\n", + " client, index_name = await prepare_knowledge_base(urls)\n", + " await query_knowledge_base(client, index_name, query)" + ] + }, + { + "cell_type": "markdown", + "id": "b47066ee", + "metadata": {}, + "source": [ + "## 4. Full Firecrawl + Moss Test (Crawl, Index, and Query)\n", + "\n", + "\n", + "Enter URLs and a question to run end-to-end Firecrawl ingestion and Moss semantic search." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "bb2790da", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Indexed 10 documents into firecrawl-cookbook-af681b7b\n" + ] + } + ], + "source": [ + "urls = [\"https://docs.moss.dev\"]\n", + "\n", + "# Crawl + index once\n", + "client, index_name = await prepare_knowledge_base(urls)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1bfe1d30", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is Moss used for?\n", + "- [1.000] https://docs.moss.dev/docs/start/what-is-moss\n", + " [Skip to main content](https://docs.moss.dev/docs/start/what-is-moss#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/logo/moss-wordmark-light.svg?fi\n", + "- [0.939] https://docs.moss.dev/docs/api-reference/v1/getting-started/introduction\n", + " [Skip to main 
content](https://docs.moss.dev/docs/api-reference/v1/getting-started/introduction#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/logo\n", + "- [0.912] https://docs.moss.dev/docs/reference/python/interfaces/JobStatus\n", + " [Skip to main content](https://docs.moss.dev/docs/reference/python/interfaces/JobStatus#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/logo/moss-wo\n" + ] + } + ], + "source": [ + "# Query multiple times without crawling again\n", + "await query_knowledge_base(client, index_name, \"What is Moss used for?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "6956e7a8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What evidence in the docs supports the claim of sub-10 ms search, and what assumptions or caveats should an engineering team validate before adoption?\n", + "- [0.952] https://docs.moss.dev/docs/start/what-is-moss\n", + " [Skip to main content](https://docs.moss.dev/docs/start/what-is-moss#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/logo/moss-wordmark-light.svg?fi\n", + "- [0.907] https://docs.moss.dev/docs/api-reference/v1/document-operations/getDocs\n", + " [Skip to main content](https://docs.moss.dev/docs/api-reference/v1/document-operations/getDocs#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/logo/\n", + "- [0.891] https://docs.moss.dev/docs/api-reference/v1/document-operations/deleteDocs\n", + " [Skip to main content](https://docs.moss.dev/docs/api-reference/v1/document-operations/deleteDocs#content-area)\n", + "\n", + "[Moss Docs home page![light logo](https://mintcdn.com/moss-afcfb0b6/b460p8xEydp14WML/lo\n" + ] + } + ], + "source": [ + "await query_knowledge_base(\n", + " client,\n", 
+ " index_name,\n", + " \"What evidence in the docs supports the claim of sub-10 ms search, and what assumptions or caveats should an engineering team validate before adoption?\",\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}