Merge pull request #12 from Cyb3rWard0g/feature/fetcher-arxiv

Integrate Arxiv Module and Refactor Core Components for Improved Functionality

Cyb3rWard0g authored Dec 31, 2024
2 parents 1db177d + 6b8fbe5 commit 5677230

Showing 19 changed files with 1,710 additions and 134 deletions.
1,037 changes: 1,037 additions & 0 deletions cookbook/arxiv_search.ipynb

Large diffs are not rendered by default.

167 changes: 167 additions & 0 deletions docs/concepts/arxiv_fetcher.md
@@ -0,0 +1,167 @@
# Arxiv Fetcher

The Arxiv Fetcher module in `Floki` provides an interface for interacting with the [arXiv API](https://info.arxiv.org/help/api/index.html). It lets users programmatically search for, retrieve, and download scientific papers from arXiv. With advanced querying, metadata extraction, and support for downloading PDF files, the Arxiv Fetcher is well suited to researchers, developers, and teams working with academic literature.

## Why Use the Arxiv Fetcher?

The Arxiv Fetcher simplifies the process of accessing research papers, offering features like:

* Automated Literature Search: Query arXiv for specific topics, keywords, or authors.
* Metadata Retrieval: Extract structured metadata, such as titles, abstracts, authors, categories, and submission dates.
* Precise Filtering: Limit search results by date ranges (e.g., retrieve the latest research in a field).
* PDF Downloading: Fetch full-text PDFs of papers for offline use.

## How to Use the Arxiv Fetcher

### Step 1: Install Required Modules

!!! info
The Arxiv Fetcher relies on a [lightweight Python wrapper](https://github.com/lukasschwab/arxiv.py) for the arXiv API, which is not included in the Floki core module. This design choice helps maintain modularity and avoids adding unnecessary dependencies for users who may not require this functionality. To use the Arxiv Fetcher, ensure you install the [library](https://pypi.org/project/arxiv/) separately.

```bash
pip install arxiv
```

### Step 2: Initialize the Fetcher

Set up the `ArxivFetcher` to begin interacting with the arXiv API.

```python
from floki.document import ArxivFetcher

# Initialize the fetcher
fetcher = ArxivFetcher()
```

### Step 3: Perform Searches

**Basic Search by Query String**

Search for papers using simple keywords. The results are returned as `Document` objects, each containing:

* `text`: The abstract of the paper.
* `metadata`: Structured metadata such as title, authors, categories, and submission dates.

```python
# Search for papers related to "machine learning"
results = fetcher.search(query="machine learning", max_results=5)

# Display metadata and summaries
for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")
```
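
Because each result is a plain `Document`, the list can be post-processed with ordinary Python. As a small illustration (not part of the fetcher itself), the snippet below sorts results by the `published` metadata field that also appears in the date-filtering example further down; the exact type of that field is an assumption here, so adjust the sort key if your metadata differs.

```python
# Sort results by submission date, most recent first.
# Assumes `published` is present and sortable as a string; adjust if needed.
sorted_docs = sorted(results, key=lambda d: str(d.metadata.get("published", "")), reverse=True)

for doc in sorted_docs:
    print(f"{doc.metadata.get('published', 'n/a')}  {doc.metadata['title']}")
```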

**Advanced Querying**

Refine searches with the logical operators `AND`, `OR`, and `NOT`, or perform field-specific searches, such as by author.

Examples:

Search for papers on "agents" and "cybersecurity":

```python
results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=10)
```

Match papers that mention "quantum" while excluding "computing":

```python
results = fetcher.search(query="all:(quantum NOT computing)", max_results=10)
```

Search for papers by a specific author:

```python
results = fetcher.search(query='au:"John Doe"', max_results=10)
```
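
arXiv's query language also supports additional field prefixes such as `ti:` (title) and `cat:` (subject category), which can be combined with the operators above. Assuming the fetcher passes the query string through to the arXiv API unchanged, a category-scoped search looks like this:

```python
# Restrict results to the Cryptography and Security category (cs.CR)
# while still matching "agents" anywhere in the record.
results = fetcher.search(query="cat:cs.CR AND all:agents", max_results=10)
```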

**Filter Papers by Date**

Limit search results to a specific time range, such as papers submitted in the last 24 hours.

```python
from datetime import datetime, timedelta

# Calculate the date range
last_24_hours = (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")
today = datetime.now().strftime("%Y%m%d")

# Search for recent papers
recent_results = fetcher.search(
query="all:(agents AND cybersecurity)",
from_date=last_24_hours,
to_date=today,
max_results=5
)

# Display metadata
for doc in recent_results:
print(f"Title: {doc.metadata['title']}")
print(f"Authors: {', '.join(doc.metadata['authors'])}")
print(f"Published: {doc.metadata['published']}")
print(f"Summary: {doc.text}\n")
```

### Step 4: Download PDFs

Fetch the full-text PDFs of papers for offline use. Metadata is preserved alongside the downloaded files.

```python
import os
from pathlib import Path

# Create a directory for downloads
os.makedirs("arxiv_papers", exist_ok=True)

# Download PDFs
download_results = fetcher.search(
query="all:(agents AND cybersecurity)",
max_results=5,
download=True,
dirpath=Path("arxiv_papers")
)

for paper in download_results:
print(f"Downloaded Paper: {paper['title']}")
print(f"File Path: {paper['file_path']}\n")
```
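
For a quick sanity check that the files actually landed on disk, a few lines of standard-library code are enough; this uses only `pathlib` and is independent of Floki:

```python
# Confirm each reported file exists and print its size in kilobytes.
for paper in download_results:
    pdf_path = Path(paper["file_path"])
    if pdf_path.exists():
        print(f"{pdf_path.name}: {pdf_path.stat().st_size / 1024:.1f} KB")
    else:
        print(f"Missing file: {pdf_path}")
```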

### Step 5: Extract and Process PDF Content

Use `PyPDFReader` from `Floki` to extract content from downloaded PDFs. Each page is treated as a separate `Document` object with metadata.

```python
from pathlib import Path
from floki.document import PyPDFReader

reader = PyPDFReader()
docs_read = []

for paper in download_results:
    local_pdf_path = Path(paper["file_path"])
    documents = reader.load(local_pdf_path, additional_metadata=paper)
    docs_read.extend(documents)

# Verify results
print(f"Extracted {len(docs_read)} documents.")
print(f"First document text: {docs_read[0].text}")
print(f"Metadata: {docs_read[0].metadata}")
```
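
Because the reader returns one `Document` per page, you may want to stitch pages back together before further processing. Below is a minimal sketch that assumes each page's metadata still carries the paper's `title` from the `additional_metadata` passed above:

```python
from collections import defaultdict

# Recombine per-page documents into one text blob per paper title.
pages_by_title = defaultdict(list)
for doc in docs_read:
    pages_by_title[doc.metadata.get("title", "untitled")].append(doc.text)

full_texts = {title: "\n".join(pages) for title, pages in pages_by_title.items()}
print(f"Reassembled {len(full_texts)} paper(s).")
```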

## Practical Applications

The Arxiv Fetcher enables various use cases for researchers and developers:

* Literature Reviews: Quickly retrieve and organize relevant papers on a given topic or by a specific author.
* Trend Analysis: Identify the latest research in a domain by filtering for recent submissions.
* Offline Research Workflows: Download and process PDFs for local analysis and archiving.

## Next Steps

While the Arxiv Fetcher provides robust functionality for retrieving and processing research papers, its output can be integrated into advanced workflows:

* Building a Searchable Knowledge Base: Combine fetched papers with tools like text splitting and vector embeddings for advanced search capabilities (a simple chunking sketch follows this list).
* Retrieval-Augmented Generation (RAG): Use processed papers as inputs for RAG pipelines to power question-answering systems.
* Automated Literature Surveys: Generate summaries or insights based on the fetched and processed research.
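
As a rough, library-agnostic illustration of the text-splitting step mentioned in the first item above, the helper below chunks the extracted text into overlapping character windows before embedding and indexing. It is purely illustrative; in practice you would use Floki's `TextSplitter` (see the Text Splitter concepts page) and an embedding model of your choice rather than this function.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into overlapping character chunks (illustrative only)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Chunk every extracted page; the chunks would then be embedded and indexed.
chunks = [chunk for doc in docs_read for chunk in chunk_text(doc.text)]
print(f"Produced {len(chunks)} chunks for indexing.")
```
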
11 changes: 11 additions & 0 deletions docs/concepts/text_splitter.md
@@ -89,6 +89,17 @@ if not local_pdf_path.exists():

### Step 2: Read the Document

For this example, we use Floki's `PyPDFReader`.

!!! info
The PyPDF Reader relies on the [pypdf python library](https://pypi.org/project/pypdf/), which is not included in the Floki core module. This design choice helps maintain modularity and avoids adding unnecessary dependencies for users who may not require this functionality. To use the PyPDF Reader, ensure that you install the library separately.

```bash
pip install pypdf
```

Then, initialize the reader to load the PDF file.

```python
from floki.document.reader.pdf.pypdf import PyPDFReader
```
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -141,4 +141,5 @@ nav:
 - Core Concepts:
   - Agents: concepts/agents.md
   - Messaging: concepts/messaging.md
-  - Text Splitter: concepts/text_splitter.md
+  - Text Splitter: concepts/text_splitter.md
+  - Arxiv Fetcher: concepts/arxiv_fetcher.md
3 changes: 1 addition & 2 deletions src/floki/__init__.py
@@ -1,9 +1,8 @@
from floki.agent import (
Agent, AgentService, AgentServiceBase,
Agent, AgentService,
AgenticWorkflowService, RoundRobinWorkflowService, RandomWorkflowService,
LLMWorkflowService, ReActAgent, ToolCallAgent, OpenAPIReActAgent
)
from floki.llm import LLMClientBase, ChatClientBase
from floki.llm.openai import OpenAIChatClient, OpenAIAudioClient
from floki.llm.huggingface import HFHubChatClient
from floki.tool import AgentTool, tool
8 changes: 6 additions & 2 deletions src/floki/agent/utils/factory.py
@@ -51,8 +51,8 @@ def __new__(
         role: str,
         name: Optional[str] = None,
         pattern: str = "toolcalling",
-        llm: Optional[LLMClientBase] = OpenAIChatClient(),
-        memory: Optional[MemoryBase] = ConversationListMemory(),
+        llm: Optional[LLMClientBase] = None,
+        memory: Optional[MemoryBase] = None,
         tools: Optional[List[AgentTool]] = [],
         **kwargs
     ) -> Union[ToolCallAgent, ReActAgent, OpenAPIReActAgent]:
@@ -72,6 +72,10 @@
"""
agent_class = AgentFactory.create_agent_class(pattern)

# Lazy initialization
llm = llm or OpenAIChatClient()
memory = memory or ConversationListMemory()

if pattern == "openapireact":
kwargs.update({
"spec_parser": kwargs.get('spec_parser', OpenAPISpecParser()),
3 changes: 2 additions & 1 deletion src/floki/document/__init__.py
@@ -1,2 +1,3 @@
+from .fetcher import ArxivFetcher
 from .reader import PyMuPDFReader, PyPDFReader
-from .splitter import SplitterBase, TextSplitter
+from .splitter import TextSplitter
1 change: 1 addition & 0 deletions src/floki/document/fetcher/__init__.py
@@ -0,0 +1 @@
from .arxiv import ArxivFetcher