feat: add local document indexing and harden fetch validation#1

Open
gqf2008 wants to merge 1 commit into lispking:main from gqf2008:feat/local-index-cli-and-url-hardening

gqf2008 commented Mar 31, 2026

What

  • add a clap-based CLI for search, fetch, config inspection, local indexing, and MCP startup
  • add Tantivy-backed local document indexing and search for markdown, text, HTML, and PDF files
  • tighten URL safety checks and host validation in fetchers
  • update README and skill docs for the new CLI workflow

Why

  • make ferris-search usable as both an MCP server and a standalone CLI
  • support local knowledge search without depending only on remote engines
  • reduce fetch-time safety and URL-matching risks

Validation

  • cargo test (80 passed)

Copilot AI review requested due to automatic review settings March 31, 2026 16:01

Copilot AI left a comment

Pull request overview

This PR adds a Clap-based CLI entrypoint (while keeping MCP/stdin auto-detection), introduces Tantivy-backed local document indexing + search, and tightens fetch URL validation / host matching across tools and docs.

Changes:

  • Add CLI subcommands for web search, fetch, config inspection, engine listing, MCP startup, and local index/search workflows.
  • Implement local document collection, Tantivy indexing, and local search (Markdown/TXT/HTML/PDF).
  • Harden URL validation by adding host-matching helpers and improving safe truncation; update docs and add CLI/integration tests.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/cli.rs | Adds end-to-end CLI smoke tests (help/version, list/show-config, index-local/search-local, JSON validity). |
| src/utils/url_safety.rs | Adds is_url_from_host and expands URL safety unit tests. |
| src/tools/mod.rs | Exposes helpers publicly for reuse (e.g., CLI). |
| src/tools/helpers.rs | Centralizes engine list + engine resolution and adds multi-engine concurrent search helper + tests. |
| src/tools/handlers.rs | Updates MCP tool handler to use shared engine resolution and host-based URL checks. |
| src/main.rs | Replaces single-mode main with Clap CLI + MCP auto-detection; adds local indexing/search commands and config/engine listing. |
| src/index/searcher.rs | Implements Tantivy searcher + result struct and unit tests. |
| src/index/schema.rs | Defines Tantivy schema for local docs and tests schema shape. |
| src/index/mod.rs | Adds index module exports. |
| src/index/indexer.rs | Implements Tantivy index creation/open, document add, commit, and tests. |
| src/index/collector.rs | Adds document collector/extractors for md/txt/html/pdf, title extraction, normalization, and tests. |
| src/fetchers/zhihu.rs | Switches to assert_public_http_url for SSRF protection. |
| src/fetchers/web.rs | Fixes truncation to be UTF-8 safe via floor_char_boundary. |
| src/fetchers/juejin.rs | Switches to assert_public_http_url and simplifies URL validation. |
| src/fetchers/github.rs | Improves URL parsing (strip query/fragment) and adds unit tests. |
| src/fetchers/csdn.rs | Adds assert_public_http_url SSRF protection. |
| src/config.rs | Adds env-driven config for local index path and allowed extensions. |
| src/cli.rs | Introduces Clap CLI definition (subcommands + output formats). |
| skills/ferris-search-tools/SKILL.md | Updates skill docs for CLI + MCP clarification and host-based validation guidance. |
| skills/ferris-search-tools/references/tools-api.md | Refreshes tool API docs and URL safety rules for host-based validation. |
| skills/ferris-search-setup/SKILL.md | Updates setup skill to reflect CLI+MCP behavior and removed legacy env vars. |
| skills/ferris-search-setup/references/configuration.md | Updates configuration docs (adds LOCAL_DOCS_*; removes old transport env var section). |
| skills/ferris-search-cli/SKILL.md | Adds dedicated CLI skill documentation. |
| skills/ferris-search-cli/references/cli-reference.md | Adds comprehensive CLI reference (commands, routing, exit codes, behaviors). |
| README.md | Updates README to document CLI workflows, local indexing, and new configuration. |
| Cargo.toml | Adds dependencies for CLI, indexing, PDF extraction, and CLI integration testing. |
| Cargo.lock | Locks new dependency graph for clap/tantivy/pdf-extract and test deps. |

Comments suppressed due to low confidence (1)

src/fetchers/csdn.rs:36

  • fetch_csdn_article now only calls assert_public_http_url and no longer verifies that the URL host is actually csdn.net (or a subdomain). Because this function also forces a Host: blog.csdn.net header, calling it with a non-CSDN URL will send a mismatched Host header and almost certainly produce incorrect behavior; it also weakens the intended fetcher hardening. Please add a host-based check (e.g., is_url_from_host(url, "csdn.net")) and fail fast when it doesn’t match.
pub async fn fetch_csdn_article(url: &str) -> anyhow::Result<String> {
    assert_public_http_url(url)?;
    let client = build_client()?;
    let resp = client
        .get(url)
        .headers(chrome_headers())
        .header("Host", "blog.csdn.net")
        .send()
        .await?;

Comment thread src/utils/url_safety.rs
Comment on lines +71 to +79
    /// Check if a URL's host matches the given domain (or is a subdomain of it).
    pub fn is_url_from_host(raw: &str, domain: &str) -> bool {
        let Ok(url) = Url::parse(raw) else {
            return false;
        };
        match url.host_str() {
            Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
            None => false,
        }

Copilot AI Mar 31, 2026

is_url_from_host only checks the parsed host string and ignores the URL scheme, so values like file://github.com/... (or other non-HTTP(S) schemes that still have a host component) will be treated as “from github.com”. Since this helper is used for fetcher validation, it should also enforce http/https (or alternatively call is_public_http_url / assert_public_http_url internally) before doing the host/subdomain comparison.
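
A minimal sketch of the fix this comment asks for, assuming the scheme and host have already been extracted (for example via `Url::scheme()` and `Url::host_str()` from the `url` crate). The helper name `is_http_url_from_host` is hypothetical and not part of the PR; it only illustrates checking the scheme before the host/subdomain comparison.

```rust
/// Hypothetical helper (not in the PR): scheme-aware variant of the host check.
/// `scheme` and `host` are assumed to come from an already-parsed URL.
fn is_http_url_from_host(scheme: &str, host: &str, domain: &str) -> bool {
    // Reject anything that is not HTTP(S) before looking at the host,
    // so that e.g. file://github.com/... is not accepted.
    if scheme != "http" && scheme != "https" {
        return false;
    }
    // Same comparison as the PR: exact match, or a subdomain of `domain`.
    host == domain || host.ends_with(&format!(".{domain}"))
}

fn main() {
    assert!(is_http_url_from_host("https", "github.com", "github.com"));
    assert!(is_http_url_from_host("https", "api.github.com", "github.com"));
    // A host component alone is not enough when the scheme is wrong.
    assert!(!is_http_url_from_host("file", "github.com", "github.com"));
    // The leading "." keeps look-alike domains from matching.
    assert!(!is_http_url_from_host("https", "notgithub.com", "github.com"));
    println!("all checks passed");
}
```

The alternative named in the comment, calling assert_public_http_url inside the helper, would achieve the same effect if that function already rejects non-HTTP(S) schemes.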

Comment thread src/tools/handlers.rs
Comment on lines 121 to 125
     pub async fn fetch_github_readme_tool(&self, p: Parameters<GithubReadmeParams>) -> String {
         let url = p.0.url;
    -    if !url.contains("github.com") {
    +    if !is_url_from_host(&url, "github.com") {
             return "URL must be from github.com".into();
         }

Copilot AI Mar 31, 2026

The domain-specific fetch tools validate the host via is_url_from_host, but they do not enforce the SSRF/public-URL checks (is_public_http_url/assert_public_http_url). Given the docs state “All fetch tools enforce SSRF protection”, these handlers should reject non-public and non-HTTP(S) URLs before doing host/path validation (same applies to the other domain-specific fetch tools below).
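
The ordering described above (scheme and public-address checks first, host validation second) could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `is_public_ip` and `validate_fetch_target` are hypothetical names, the inputs are assumed to be pre-parsed, and the address check only covers IP-literal hosts using standard-library predicates. It does not resolve DNS names, so it says nothing about hostnames that point at private addresses.

```rust
use std::net::IpAddr;

/// Hypothetical stand-in for the "public address" part of an SSRF check.
fn is_public_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_loopback()
                || v4.is_private()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast())
        }
        IpAddr::V6(v6) => !(v6.is_loopback() || v6.is_unspecified()),
    }
}

/// Hypothetical validation order for a domain-specific fetch tool.
fn validate_fetch_target(scheme: &str, host: &str, domain: &str) -> Result<(), String> {
    // 1. Only HTTP(S) is allowed.
    if scheme != "http" && scheme != "https" {
        return Err(format!("unsupported scheme: {scheme}"));
    }
    // 2. IP-literal hosts must be public (IPv6 literals may arrive in brackets).
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    if let Ok(ip) = bare.parse::<IpAddr>() {
        if !is_public_ip(ip) {
            return Err(format!("non-public address: {host}"));
        }
    }
    // 3. Only then apply the host check for the specific fetcher.
    if host != domain && !host.ends_with(&format!(".{domain}")) {
        return Err(format!("URL must be from {domain}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_fetch_target("https", "github.com", "github.com").is_ok());
    assert!(validate_fetch_target("http", "127.0.0.1", "github.com").is_err());
    assert!(validate_fetch_target("file", "github.com", "github.com").is_err());
    println!("all checks passed");
}
```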

Comment thread src/fetchers/juejin.rs
Comment on lines 41 to 45
     pub async fn fetch_juejin_article(url: &str) -> anyhow::Result<String> {
    -    // Validate it's a juejin post URL
    -    if !url.contains("juejin.cn") || !url.contains("/post/") {
    -        anyhow::bail!("URL must be from juejin.cn and contain /post/ path");
    +    assert_public_http_url(url)?;
    +    if !url.contains("/post/") {
    +        anyhow::bail!("URL must contain /post/ path");
         }

Copilot AI Mar 31, 2026

fetch_juejin_article validates that the URL is public HTTP(S) and contains /post/, but it no longer validates that the host is juejin.cn (or a subdomain). This makes it possible to route arbitrary public sites through the Juejin extractor and contradicts the “host-based” validation described in the docs. Add a host check (preferably using is_url_from_host(url, "juejin.cn")) before fetching.
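
To illustrate why a host comparison is preferable to a substring test here, the sketch below contrasts the two. Both function names are hypothetical, and `host_and_path_check` assumes the host and path were already extracted from a parsed URL; the example URL is made up for the demonstration.

```rust
/// The old-style check: substring tests on the raw URL string.
fn substring_check(url: &str) -> bool {
    url.contains("juejin.cn") && url.contains("/post/")
}

/// Hypothetical host-based check on pre-parsed components.
fn host_and_path_check(host: &str, path: &str) -> bool {
    let host_ok = host == "juejin.cn" || host.ends_with(".juejin.cn");
    host_ok && path.contains("/post/")
}

fn main() {
    // A URL on an unrelated host that merely mentions juejin.cn in its query string.
    let raw = "https://evil.example/post/1?ref=juejin.cn";
    assert!(substring_check(raw)); // passes the substring test
    assert!(!host_and_path_check("evil.example", "/post/1")); // rejected by the host test

    // A genuine article URL is accepted by the host-based check.
    assert!(host_and_path_check("juejin.cn", "/post/1"));
    println!("all checks passed");
}
```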

Comment thread src/fetchers/zhihu.rs
Comment on lines 34 to 36
     pub async fn fetch_zhihu_article(url: &str) -> anyhow::Result<String> {
    -    if !url.contains("zhihu.com") {
    -        anyhow::bail!("URL must be from zhihu.com");
    -    }
    +    assert_public_http_url(url)?;
         let client = build_client()?;

Copilot AI Mar 31, 2026

fetch_zhihu_article now only enforces assert_public_http_url and no longer checks that the URL is actually for zhihu.com. Since this is a domain-specific fetcher with specialized extraction and headers, it should reject URLs whose host isn’t zhihu.com (or a subdomain) to avoid incorrect behavior and align with the host-validation hardening.

Comment thread src/index/searcher.rs
Comment on lines +69 to +70
let snippet = if body.len() > 200 {
format!("{}...", &body[..body.floor_char_boundary(200)])

Copilot AI Mar 31, 2026

The snippet logic is documented as “first 200 chars of body”, but the implementation uses body.len() > 200 (bytes) and slices at a UTF-8 boundary. For non-ASCII text this will often produce fewer than 200 characters, so either update the comment to say “bytes” or implement a true character-count truncation.

Suggested change
    -    let snippet = if body.len() > 200 {
    -        format!("{}...", &body[..body.floor_char_boundary(200)])
    +    let snippet = if body.chars().count() > 200 {
    +        let truncated: String = body.chars().take(200).collect();
    +        format!("{}...", truncated)
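
The byte/character gap is easy to see with multi-byte text. The sketch below is an illustrative variant of the suggested change, not code from the PR: `snippet_by_chars` is a hypothetical name, and it uses `char_indices().nth(..)` so the body is walked only once. With 3-byte characters, a 200-byte cut that is floored to a character boundary keeps only 66 characters, whereas the character-based cut keeps all 200.

```rust
/// Hypothetical helper: truncate to at most `max_chars` characters, appending "..."
/// only when something was actually cut off.
fn snippet_by_chars(body: &str, max_chars: usize) -> String {
    match body.char_indices().nth(max_chars) {
        // A character exists at index `max_chars`, so the body exceeds the limit;
        // `byte_idx` is the byte offset where that character starts.
        Some((byte_idx, _)) => format!("{}...", &body[..byte_idx]),
        None => body.to_string(),
    }
}

fn main() {
    let body = "你好".repeat(150); // 300 characters, 3 bytes each
    assert_eq!(body.chars().count(), 300);
    assert_eq!(body.len(), 900);

    let snippet = snippet_by_chars(&body, 200);
    // 200 characters of text plus the three-character "..." suffix.
    assert_eq!(snippet.chars().count(), 203);

    // Short input is returned unchanged.
    assert_eq!(snippet_by_chars("short", 200), "short");
    println!("all checks passed");
}
```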

- `github`: aliases `github repos`, `github repo`; optional `GITHUB_TOKEN`; searches repositories
- `github_code`: aliases `github code`; optional `GITHUB_TOKEN`; searches code files

### fetch_web_content Output Format

Copilot AI Mar 31, 2026

In the web_search section, the heading says fetch_web_content Output Format, but the examples shown are for web search output (single-engine / multi-engine search results). Renaming this heading to web_search Output Format would prevent confusion.

Suggested change
    - ### fetch_web_content Output Format
    + ### web_search Output Format

Comment thread README.md