Firecrawl tool for accessing web pages #251

Merged
merged 3 commits into from
Feb 12, 2025

Conversation

kinshuksinghbist
Collaborator

@kinshuksinghbist kinshuksinghbist commented Feb 11, 2025

Requires a FIRECRAWL_API_KEY. If no key is set, the tool is not added.

Summary by CodeRabbit

  • New Features
    • Integrated a conditional webpage extraction capability across several functionalities, enhancing content gathering when a valid API key is configured.
  • Chores
    • Introduced a new configuration option for the API key.
    • Updated system dependencies to support the enhanced webpage extraction feature.

Contributor

coderabbitai bot commented Feb 11, 2025

Walkthrough

This pull request introduces a new environment variable, FIRECRAWL_API_KEY, in the .env.template file and integrates a new tool, webpage_extractor_tool, into multiple agent modules. Each agent (such as BlastRadiusAgent, CodeGenerationAgent, DebugRAGAgent, IntegrationTestAgent, LowLevelDesignAgent, RAGAgent, and UnitTestAgent) now conditionally initializes and includes this tool if the API key is present. The changes extend to the ToolService class, which is updated to add the new tool to its tools dictionary, and a new file provides the implementation for the webpage extraction logic. Additionally, a new dependency is declared in requirements.txt.
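
The per-agent check described above could also be centralized in one place. The following is a minimal sketch under stated assumptions, not the code from this PR: `firecrawl_key` and `with_optional_tools` are hypothetical helper names, and the extractor is passed in as a factory so the sketch stays independent of the real `webpage_extractor_tool` signature.

```python
import os
from typing import Callable, List, Mapping, Optional

ENV_VAR = "FIRECRAWL_API_KEY"


def firecrawl_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured key, or None when it is unset or blank."""
    source = os.environ if env is None else env
    key = source.get(ENV_VAR, "").strip()
    return key or None


def with_optional_tools(
    base_tools: List[object],
    make_extractor: Callable[[], object],
    env: Optional[Mapping[str, str]] = None,
) -> List[object]:
    """Return a copy of base_tools, plus the extractor when a key is configured."""
    tools = list(base_tools)
    if firecrawl_key(env) is not None:
        # The factory is only called when the key exists, so nothing is
        # constructed in environments without Firecrawl access.
        tools.append(make_extractor())
    return tools
```

An agent could then build its list with something like `with_optional_tools([...], lambda: webpage_extractor_tool(sql_db, user_id))`, which keeps the environment lookup in a single function instead of repeating it in every agent module.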

Changes

| File(s) | Change Summary |
| --- | --- |
| .env.template | Added new environment variable FIRECRAWL_API_KEY while retaining POTPIE_PLUS_HMAC_KEY. |
| .../agents/agents/{blast_radius,code_gen,debug_rag,integration_test,low_level_design,rag,unit_test}_agent.py | Added conditional initialization and inclusion of webpage_extractor_tool in the agents’ tool lists based on the presence of FIRECRAWL_API_KEY. |
| .../tools/tool_service.py | Updated _initialize_tools to conditionally add webpage_extractor_tool to the tools dictionary. |
| .../tools/web_tools/webpage_extractor_tool.py | New file implementing WebpageExtractorTool with input validation via Pydantic, synchronous and asynchronous extraction methods, and API integration with Firecrawl. |
| requirements.txt | Added new dependency firecrawl-py==1.11.1. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent
    participant Env
    participant Tool as WebpageExtractorTool

    Agent->>Env: Check FIRECRAWL_API_KEY
    alt API key is set
        Env-->>Agent: API key available
        Agent->>Tool: Initialize webpage_extractor_tool
        Agent->>Agent: Include tool in tools list
    else API key is not set
        Env-->>Agent: No API key found
        Agent->>Agent: Skip tool initialization
    end
sequenceDiagram
    participant Client
    participant WET as WebpageExtractorTool
    participant API as Firecrawl API

    Client->>WET: Request extraction (arun/run)
    WET->>WET: Validate URL and API key
    alt API key present
        WET->>API: Send extraction request
        API-->>WET: Return extracted content & metadata
    else API key missing
        WET-->>Client: Return error message (failure)
    end
    WET-->>Client: Provide extraction result
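
The request flow in this diagram can be sketched roughly as follows. This is an illustration only: the real tool's return shape is not shown in this conversation, so `ExtractionResult`, `extract_webpage`, and the injected `scrape` callable are hypothetical stand-ins, and the `metadata` key in the response is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlparse


@dataclass
class ExtractionResult:
    success: bool
    content: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


def extract_webpage(
    url: str,
    api_key: Optional[str],
    scrape: Callable[[str], Dict[str, Any]],
) -> ExtractionResult:
    """Validate the inputs, call the scraper, and normalize the outcome."""
    # Fail fast, before any network call, when the key is missing.
    if not api_key:
        return ExtractionResult(success=False, error="FIRECRAWL_API_KEY is not set")

    # Accept only absolute http(s) URLs.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return ExtractionResult(success=False, error=f"Invalid URL: {url!r}")

    # `scrape` stands in for the Firecrawl client call (assumed to return a dict).
    response = scrape(url)
    markdown = response.get("markdown")
    if not markdown:
        return ExtractionResult(success=False, error="No content returned")
    return ExtractionResult(
        success=True, content=markdown, metadata=response.get("metadata", {})
    )
```

Returning an explicit failure object rather than `None` is a design choice of this sketch; it lets the caller distinguish a missing key from an empty page.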

Poem

Hopping through the code with glee,
I found a key for you and me.
FIRECRAWL_API_KEY now unlocks the door,
Letting webpage extraction soar.
With a twitch of my whiskers and a joyful beat,
I, the coding rabbit, rap with hop-filled feet!
🐇✨


Quality Gate failed

Failed conditions
4.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (12)
app/modules/intelligence/agents/agents/code_gen_agent.py (2)

55-56: Add validation for FIRECRAWL_API_KEY.

While the conditional initialization is good, consider validating the API key's value to ensure it's not empty or malformed before initializing the tool.

-        if os.getenv("FIRECRAWL_API_KEY"):
+        firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
+        if firecrawl_api_key and firecrawl_api_key.strip():
             self.webpage_extractor_tool = webpage_extractor_tool(sql_db, user_id)

88-88: Improve readability and safety of tool addition.

The current implementation could raise an AttributeError if the tool initialization fails. Consider using getattr for safer access and improving readability.

-            ]+ ([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") else []),
+            ] + ([getattr(self, 'webpage_extractor_tool')] if hasattr(self, 'webpage_extractor_tool') else []),
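
An alternative to the `hasattr` check, sketched here under assumptions, is to always define the attribute and filter out `None` when building the list. `AgentSketch` is a hypothetical stand-in class, not one of the agents in this PR, and the string value stands in for a real tool object.

```python
from typing import List, Optional


class AgentSketch:
    def __init__(self, extractor: Optional[object] = None) -> None:
        # Stand-in for a tool that is always available.
        self.get_code_from_node_id = "get_code_from_node_id"
        # Always define the attribute, so later access can never raise
        # AttributeError; None means "not configured".
        self.webpage_extractor_tool = extractor

    def tools(self) -> List[object]:
        candidates = [self.get_code_from_node_id, self.webpage_extractor_tool]
        # Drop unconfigured optional tools in one place.
        return [tool for tool in candidates if tool is not None]
```

With this shape the environment variable is read once, at construction time, and the tool list no longer needs its own `os.getenv` call.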
app/modules/intelligence/agents/agents/unit_test_agent.py (2)

43-46: Improve list formatting for better readability.

While the conditional addition of the tool is correct, the formatting could be improved for better readability.

Consider this formatting:

-            tools=[
-                self.get_code_from_probable_node_name,
-                self.get_code_from_node_id,
-                ]+ ([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") else []),
+            tools=[
+                self.get_code_from_probable_node_name,
+                self.get_code_from_node_id,
+            ] + (
+                [self.webpage_extractor_tool]
+                if os.getenv("FIRECRAWL_API_KEY")
+                else []
+            ),

74-132: Update task description to include webpage extraction capability.

The task description should be updated to include information about the webpage extraction capability when FIRECRAWL_API_KEY is available. This will help users understand when and how to use this feature in their test plans and unit tests.

Consider adding a section about webpage extraction in the "Process" section of the task description, such as:

            Process:
            1. **Code Retrieval:**
            - If not already present in the history, Fetch the docstrings and code for the provided node IDs using the get_code_from_node_id tool.
            - Node IDs: {', '.join(node_ids_list)}
            - Project ID: {project_id}
            - Fetch the code for the file path of the function/class mentioned in the user's query using the get code from probable node name tool. This is needed for correct inport of class name in the unit test file.
+           - When FIRECRAWL_API_KEY is configured, you can extract content from webpages to enhance test scenarios with real-world data or documentation.
app/modules/intelligence/agents/agents/rag_agent.py (1)

107-107: Consider improving readability of the conditional tool addition.

While the logic is correct, the inline list concatenation could be made more readable.

Consider this alternative approach for better readability:

-            ]+ ([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") else []),
+                # Add webpage extractor tool if API key is available
+                *([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") else []),
+            ],
app/modules/intelligence/tools/web_tools/webpage_extractor_tool.py (3)

1-114: Apply PEP 8 formatting using Black
The pipeline has flagged that the code does not adhere to PEP 8 standards.

Please reformat this file with Black to resolve the style warnings.

🧰 Tools
🪛 GitHub Actions: Pre-commit

[warning] 1-1: Code does not adhere to PEP 8 standards. Please format the code using Black.


60-67: Assess user-provided URLs for potential SSRF issues
Because users may supply arbitrary URLs, confirm that the Firecrawl service implementation properly handles or restricts requests to avoid SSRF or other malicious redirects.
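
A client-side pre-check along these lines might look like the sketch below. It is an assumption-laden illustration, not part of the PR: `is_public_http_url` and `BLOCKED_HOSTNAMES` are hypothetical names, and the check only inspects the URL text. It does not resolve DNS, so a hostname that points at an internal address would still pass, and it says nothing about how Firecrawl itself handles redirects.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames rejected outright; extend as needed (assumption, not exhaustive).
BLOCKED_HOSTNAMES = {"localhost"}


def is_public_http_url(url: str) -> bool:
    """Return True for http(s) URLs that do not obviously target internal hosts."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # urlparse raises on some malformed inputs, e.g. a broken IPv6 literal.
        return False
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host.lower() in BLOCKED_HOSTNAMES:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a DNS name; resolution is not checked here.
        return True
    # Rejects loopback, private, link-local, and other non-global ranges.
    return address.is_global
```

A check like this would run before the URL is handed to the scraper, with rejected URLs reported back to the caller as a validation error.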


68-90: Handle partial or failed responses more robustly
Currently, if response.get("markdown") is missing, the method returns None. Consider adding timeout or exception handling for scrape_url calls, plus logic to handle partial data in response to improve resilience.
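
One way to add the timeout and exception handling suggested here is sketched below. It assumes the scrape call is a blocking function that returns a dict with a `markdown` key, as the comment implies; `scrape_with_timeout`, the `metadata` key, and the 30-second default are hypothetical. Note that `asyncio.wait_for` stops waiting on timeout, but the worker thread started by `asyncio.to_thread` keeps running until the blocking call returns.

```python
import asyncio
from typing import Any, Callable, Dict


async def scrape_with_timeout(
    scrape: Callable[[str], Dict[str, Any]],
    url: str,
    timeout_seconds: float = 30.0,
) -> Dict[str, Any]:
    """Run a blocking scrape call in a thread and normalize every outcome."""
    try:
        response = await asyncio.wait_for(
            asyncio.to_thread(scrape, url), timeout=timeout_seconds
        )
    except asyncio.TimeoutError:
        return {"success": False, "error": f"Timed out after {timeout_seconds}s"}
    except Exception as exc:  # network errors, API errors, bad responses
        return {"success": False, "error": f"Scrape failed: {exc}"}

    # Treat a missing or empty markdown field as a failure instead of
    # silently returning None.
    markdown = (response or {}).get("markdown")
    if not markdown:
        return {"success": False, "error": "Response contained no markdown"}
    return {
        "success": True,
        "content": markdown,
        "metadata": response.get("metadata", {}),
    }
```

Every path returns a dict with a `success` flag, so the calling agent can report the failure reason rather than handling `None`.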

app/modules/intelligence/tools/tool_service.py (1)

39-39: Reformat import for PEP 8 compliance
The pipeline flagged these lines for formatting issues. Please run Black to fix any style discrepancies.

🧰 Tools
🪛 GitHub Actions: Pre-commit

[warning] 39-39: Code does not adhere to PEP 8 standards. Please format the code using Black.

app/modules/intelligence/agents/agents/blast_radius_agent.py (1)

22-24: Address PEP 8 style feedback
The pipeline detected style noncompliance here. Please run Black to format the code consistently.

app/modules/intelligence/agents/agents/low_level_design_agent.py (1)

99-99: Consider adding error handling for API key validation.

While the conditional check for FIRECRAWL_API_KEY existence is good, consider validating the API key format and handling potential initialization errors.

-            ]+ ([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") else []),
+            ]+ ([self.webpage_extractor_tool] if os.getenv("FIRECRAWL_API_KEY") and self._validate_api_key() else []),

Add the validation method:

def _validate_api_key(self) -> bool:
    api_key = os.getenv("FIRECRAWL_API_KEY")
    try:
        # Add validation logic (e.g., check format, length)
        return bool(api_key and len(api_key) > 0)
    except Exception as e:
        logger.warning(f"Failed to validate FIRECRAWL_API_KEY: {str(e)}")
        return False
app/modules/intelligence/agents/agents/debug_rag_agent.py (1)

39-41: Consider reordering tool initialization.

The webpage extractor tool initialization is placed between other tool initializations. Consider grouping all conditional tool initializations together for better maintainability.

-        self.get_node_neighbours_from_node_id = get_node_neighbours_from_node_id_tool(
-            sql_db
-        )
-        if os.getenv("FIRECRAWL_API_KEY"):
-            self.webpage_extractor_tool = webpage_extractor_tool(sql_db, user_id)
-        self.get_code_file_structure = get_code_file_structure_tool(sql_db)
+        self.get_node_neighbours_from_node_id = get_node_neighbours_from_node_id_tool(
+            sql_db
+        )
+        self.get_code_file_structure = get_code_file_structure_tool(sql_db)
+
+        # Initialize conditional tools
+        if os.getenv("FIRECRAWL_API_KEY"):
+            self.webpage_extractor_tool = webpage_extractor_tool(sql_db, user_id)

Also applies to: 76-78

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24bebe9 and 6590bc4.

📒 Files selected for processing (11)
  • .env.template (1 hunks)
  • app/modules/intelligence/agents/agents/blast_radius_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/code_gen_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/debug_rag_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/integration_test_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/low_level_design_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/rag_agent.py (3 hunks)
  • app/modules/intelligence/agents/agents/unit_test_agent.py (2 hunks)
  • app/modules/intelligence/tools/tool_service.py (2 hunks)
  • app/modules/intelligence/tools/web_tools/webpage_extractor_tool.py (1 hunks)
  • requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • .env.template
  • requirements.txt
🧰 Additional context used
🪛 GitHub Actions: Pre-commit
app/modules/intelligence/tools/web_tools/webpage_extractor_tool.py

[warning] 1-1: Code does not adhere to PEP 8 standards. Please format the code using Black.

app/modules/intelligence/tools/tool_service.py

[warning] 39-39: Code does not adhere to PEP 8 standards. Please format the code using Black.

🔇 Additional comments (15)
app/modules/intelligence/agents/agents/code_gen_agent.py (1)

33-35: LGTM!

The import statement follows the project's import organization pattern.

app/modules/intelligence/agents/agents/unit_test_agent.py (2)

18-20: LGTM!

The import statement follows the project's import structure and is correctly placed with other tool imports.


35-36: LGTM! Verify environment variable setup.

The conditional initialization is correctly implemented. The tool is only initialized when FIRECRAWL_API_KEY is present, which aligns with the PR requirements.

Run this script to verify the environment variable setup:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check if FIRECRAWL_API_KEY is properly documented in .env.template

# Test: Check if FIRECRAWL_API_KEY is present in .env.template
if ! grep -q "FIRECRAWL_API_KEY" .env.template; then
  echo "Warning: FIRECRAWL_API_KEY not found in .env.template"
  exit 1
fi

Length of output: 41


Environment Variable Setup Confirmed

The execution of the script (with no warning output) confirms that FIRECRAWL_API_KEY is present in the .env.template file. This verifies that the conditional initialization for webpage_extractor_tool is set up correctly as intended.

app/modules/intelligence/agents/agents/rag_agent.py (3)

39-41: LGTM!

The import statement follows the existing pattern and is properly organized with other tool imports.


77-78: LGTM!

The conditional initialization of the webpage extractor tool is well-implemented, following the pattern of other tool initializations and properly checking for the required API key.


77-107: Verify error handling for missing API key.

The code correctly checks for the presence of FIRECRAWL_API_KEY, but we should verify that the absence of the key is properly handled and doesn't cause issues.

Let's verify the error handling:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check for proper error handling when FIRECRAWL_API_KEY is missing

# Test 1: Search for any error handling related to FIRECRAWL_API_KEY
rg -A 5 "FIRECRAWL_API_KEY"

# Test 2: Search for any try-except blocks around webpage_extractor_tool
ast-grep --pattern 'try {
  $$$
  webpage_extractor_tool
  $$$
} catch {
  $$$
}'

Length of output: 9709


Missing API Key Handling Verified

  • The conditional check for FIRECRAWL_API_KEY in the agent files prevents instantiation of the webpage extractor tool without a valid key.
  • In webpage_extractor_tool.py, the missing API key is handled by logging a warning and returning None, which is consistent with the expected behavior.
app/modules/intelligence/tools/web_tools/webpage_extractor_tool.py (1)

38-40: Confirm thread-safety of FirecrawlApp usage
You are invoking firecrawl.scrape_url from within a background thread (to_thread). If FirecrawlApp is not thread-safe, consider using additional synchronization or creating a separate app instance for each thread to avoid potential data races.
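
If the client turns out not to be thread-safe, one option is a separate instance per thread. The sketch below uses `threading.local` for that; `PerThreadClient` is a hypothetical helper, and the factory argument stands in for whatever constructs the Firecrawl client. Whether `FirecrawlApp` actually needs this is not established in this review.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class PerThreadClient(Generic[T]):
    """Lazily create one client per thread so no instance is shared."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> T:
        client = getattr(self._local, "client", None)
        if client is None:
            # First use on this thread: build a dedicated instance.
            client = self._factory()
            self._local.client = client
        return client
```

Because `asyncio.to_thread` reuses threads from the default executor, this creates at most one client per worker thread rather than one per request.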

app/modules/intelligence/tools/tool_service.py (2)

46-46: Review usage of self.webpage_extractor_tool
self.webpage_extractor_tool is assigned here but not referenced within ToolService outside _initialize_tools. Consider removing it if unneeded or ensure it’s actively used.


72-75: Nice conditional approach
Conditionally adding the webpage extractor if the environment variable is present keeps the service flexible and avoids redundant initialization. Good job!

app/modules/intelligence/agents/agents/blast_radius_agent.py (2)

36-38: Efficient conditional tool initialization
Constructing the webpage extractor tool only when the API key exists is a clean design choice that avoids unnecessary resource usage.


48-48: Good pattern for optional tooling
Including webpage_extractor_tool in the agent's tool list only if the key is present helps maintain consistent behavior across environments.

app/modules/intelligence/agents/agents/low_level_design_agent.py (1)

33-35: LGTM! Secure initialization of webpage extractor tool.

The conditional initialization based on environment variable presence is a good security practice.

Also applies to: 83-84

app/modules/intelligence/agents/agents/integration_test_agent.py (2)

22-24: LGTM! Consistent implementation with other agents.

The initialization pattern matches the implementation in other agent classes.

Also applies to: 38-39


157-157: Consider adding error handling for API key validation.

Similar to other agents, consider adding API key validation and error handling.

app/modules/intelligence/agents/agents/debug_rag_agent.py (1)

107-107: Consider adding error handling for API key validation.

Similar to other agents, consider adding API key validation and error handling.

@dhirenmathur changed the title from "added a tool that extracts webpages for the LLM" to "Firecrawl tool for accessing web pages" on Feb 12, 2025
@dhirenmathur merged commit fa48009 into main on Feb 12, 2025
2 of 4 checks passed