@HKanoje HKanoje commented Nov 18, 2025

Description

Adds comprehensive domain filtering capability for all web search tools in smolagents, allowing users to control which websites can appear in search results through blocklists and allowlists.

Problem Solved

Agents using web search tools could inadvertently access or return results from:

  • Malicious or phishing websites
  • Low-quality content farms
  • Ad networks and tracking domains
  • Policy-violating domains
  • Paywalled or irrelevant sources

Solution

Features

  • Blocklist: exclude specific domains (spam, malicious, tracking)
  • Allowlist: restrict results to trusted sources only (.edu, .gov)
  • Wildcard patterns: *.edu, *.ads.*, tracker.*
  • Automatic subdomain handling: blocking example.com also blocks its subdomains (e.g. www.example.com)
  • Combined filtering: use an allowlist with blocklist refinements
  • Case-insensitive: domain matching ignores case
  • Backward compatible: both parameters are optional, no breaking changes
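
To make the matching rules above concrete, here is a minimal sketch of how a single domain could be compared against a single pattern. It is not the DomainFilter code from this PR: the function name domain_matches and the use of fnmatch for wildcards are assumptions made for illustration only.

```python
from fnmatch import fnmatch


def domain_matches(domain: str, pattern: str) -> bool:
    """Hypothetical helper: does `domain` match `pattern`?"""
    # Case-insensitive comparison; ignore a trailing dot on either side.
    domain = domain.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if any(ch in pattern for ch in "*?["):
        # Wildcard pattern: shell-style match against the whole host name.
        return fnmatch(domain, pattern)
    # Plain domain: match the domain itself or any of its subdomains.
    return domain == pattern or domain.endswith("." + pattern)


print(domain_matches("www.example.com", "example.com"))  # True (subdomain)
print(domain_matches("notexample.com", "example.com"))   # False (different domain)
print(domain_matches("cs.MIT.edu", "*.edu"))             # True (wildcard, case-insensitive)
print(domain_matches("cdn.ads.tracker.net", "*.ads.*"))  # True (wildcard in the middle)
```

Checking for a leading "." before the suffix is what keeps notexample.com from being treated as a subdomain of example.com.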

Implementation

Core Components:

  • Created DomainFilter utility class (src/smolagents/domain_filter.py)
  • Integrated filtering into all 4 search tools:
    • DuckDuckGoSearchTool
    • WebSearchTool
    • ApiWebSearchTool (Brave Search)
    • GoogleSearchTool

API Changes:

  • Added blocked_domains parameter to all search tools
  • Added allowed_domains parameter to all search tools
  • Both parameters are optional (default: no filtering)

Usage Examples

from smolagents import DuckDuckGoSearchTool

# Block specific domains
tool = DuckDuckGoSearchTool(
    blocked_domains=["spam.com", "*.ads.*"]
)

# Allow only trusted sources
tool = DuckDuckGoSearchTool(
    allowed_domains=["*.edu", "*.gov", "wikipedia.org"]
)

# Combined filtering
tool = DuckDuckGoSearchTool(
    allowed_domains=["*.edu"],
    blocked_domains=["spam-university.edu"]
)
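
The combined example above suggests that a blocked domain is dropped even when it also matches the allowlist. One plausible way to apply that rule to a list of search results is sketched below. The filter_results helper, the _matches helper, and the result shape (dicts with "title" and "link" keys) are assumptions for illustration, not the PR's actual implementation.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def _matches(host, patterns):
    """Hypothetical helper: True if `host` matches any pattern."""
    for pattern in patterns:
        pattern = pattern.lower()
        if "*" in pattern:
            # Wildcard pattern: shell-style match on the full host name.
            if fnmatch(host, pattern):
                return True
        elif host == pattern or host.endswith("." + pattern):
            # Plain domain: the domain itself or any subdomain.
            return True
    return False


def filter_results(results, allowed_domains=None, blocked_domains=None):
    """Keep results whose host passes the blocklist, then the allowlist."""
    kept = []
    for result in results:
        # urlparse(...).hostname is already lowercased; it is None for bad URLs.
        host = urlparse(result["link"]).hostname or ""
        if blocked_domains and _matches(host, blocked_domains):
            continue  # blocklist takes precedence over the allowlist
        if allowed_domains and not _matches(host, allowed_domains):
            continue  # an allowlist is set and this host is not on it
        kept.append(result)
    return kept


results = [
    {"title": "Stanford AI", "link": "https://cs.stanford.edu/ai"},
    {"title": "Special offer", "link": "https://spam-university.edu/offer"},
    {"title": "Some blog", "link": "https://example.com/blog"},
]
filtered = filter_results(
    results,
    allowed_domains=["*.edu"],
    blocked_domains=["spam-university.edu"],
)
print([r["title"] for r in filtered])  # ['Stanford AI']
```

With both parameters left as None the function returns every result unchanged, which mirrors the "default: no filtering" behavior described under API Changes.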

Testing

37 tests passing (100%)

  • 26 unit tests for DomainFilter class
  • 11 integration tests for search tools
  • All tool validation tests passing

Code quality checks passing

  • make quality
  • make style
  • Follows contributor guidelines

Use Cases

  • Security — Block malicious, phishing, and tracking websites
  • Quality Control — Filter out content farms and spam
  • Research — Restrict to academic and peer-reviewed sources
  • Compliance — Enforce organizational policies
  • Privacy — Block ad networks and analytics services

Files Changed

  • domain_filter.py (NEW) — Core filtering utility
  • __init__.py — Export DomainFilter
  • default_tools.py — Integrated filtering into search tools
  • test_domain_filter.py (NEW) — Unit tests
  • test_search_tools_domain_filtering.py (NEW) — Integration tests
  • domain_filtering.py (NEW) — Example script with 6 use cases

Checklist

  • Follows OOP principles
  • Pythonic code style
  • Comprehensive unit tests
  • Integration tests
  • Documentation with examples
  • Backward compatible
  • Code quality checks passing
  • All existing tests still passing

Related Issue

Addresses the problem of agents accessing undesirable websites during web searches, improving security, quality, and compliance.

closes #1857

Add comprehensive domain filtering capability for all web search tools in smolagents,
allowing users to control which websites can appear in search results through blocklists
and allowlists.

Features:
- Blocklist support to exclude specific domains (e.g., spam, malicious, tracking)
- Allowlist support to restrict results to trusted sources only (e.g., .edu, .gov)
- Wildcard pattern matching (*.edu, *.ads.*, tracker.*)
- Automatic subdomain handling (blocking example.com also blocks www.example.com)
- Combined filtering (allowlist with blocklist refinements)
- Case-insensitive domain matching
- Backward compatible (optional parameters)

Implementation:
- Created DomainFilter utility class (src/smolagents/domain_filter.py)
- Integrated filtering into all 4 search tools:
  * DuckDuckGoSearchTool
  * WebSearchTool
  * ApiWebSearchTool (Brave Search)
  * GoogleSearchTool
- Added blocked_domains and allowed_domains parameters to tool constructors
- Tools filter results before returning, with updated error messages

Testing:
- 26 unit tests for DomainFilter class (100% passing)
- 11 integration tests for search tools (100% passing)
- All tool validation tests passing
- Example script demonstrating 6 practical use cases

Documentation:
- Comprehensive docstrings with examples
- Example script (examples/domain_filtering.py)
- Follows OOP principles and Python best practices

Use Cases:
- Security: Block malicious, phishing, and tracking websites
- Quality Control: Filter out content farms and spam
- Research: Restrict to academic and peer-reviewed sources
- Compliance: Enforce organizational policies on information sources
- Privacy: Block ad networks and analytics services
Development

Successfully merging this pull request may close these issues.

ENH: Feature Request: Add sites_to_avoid parameter to WebSearchTool