feat: add local document indexing and harden fetch validation by gqf2008 · Pull Request #1 · lispking/ferris-search

gqf2008 · 2026-03-31T16:01:48Z

What

- add a clap-based CLI for search, fetch, config inspection, local indexing, and MCP startup
- add Tantivy-backed local document indexing and search for Markdown, text, HTML, and PDF files
- tighten URL safety checks and host validation in fetchers
- update README and skill docs for the new CLI workflow

Why

- make ferris-search usable as both an MCP server and a standalone CLI
- support local knowledge search without depending only on remote engines
- reduce fetch-time safety and URL-matching risks

Validation

cargo test (80 passed)

Copilot

Pull request overview

This PR adds a Clap-based CLI entrypoint (while keeping MCP/stdin auto-detection), introduces Tantivy-backed local document indexing + search, and tightens fetch URL validation / host matching across tools and docs.

Changes:

- Add CLI subcommands for web search, fetch, config inspection, engine listing, MCP startup, and local index/search workflows.
- Implement local document collection, Tantivy indexing, and local search (Markdown/TXT/HTML/PDF).
- Harden URL validation by adding host-matching helpers and improving safe truncation; update docs and add CLI/integration tests.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `tests/cli.rs` | Adds end-to-end CLI smoke tests (help/version, list/show-config, index-local/search-local, JSON validity). |
| `src/utils/url_safety.rs` | Adds `is_url_from_host` and expands URL safety unit tests. |
| `src/tools/mod.rs` | Exposes helpers publicly for reuse (e.g., CLI). |
| `src/tools/helpers.rs` | Centralizes engine list + engine resolution and adds multi-engine concurrent search helper + tests. |
| `src/tools/handlers.rs` | Updates MCP tool handler to use shared engine resolution and host-based URL checks. |
| `src/main.rs` | Replaces single-mode main with Clap CLI + MCP auto-detection; adds local indexing/search commands and config/engine listing. |
| `src/index/searcher.rs` | Implements Tantivy searcher + result struct and unit tests. |
| `src/index/schema.rs` | Defines Tantivy schema for local docs and tests schema shape. |
| `src/index/mod.rs` | Adds index module exports. |
| `src/index/indexer.rs` | Implements Tantivy index creation/open, document add, commit, and tests. |
| `src/index/collector.rs` | Adds document collector/extractors for md/txt/html/pdf, title extraction, normalization, and tests. |
| `src/fetchers/zhihu.rs` | Switches to `assert_public_http_url` for SSRF protection. |
| `src/fetchers/web.rs` | Fixes truncation to be UTF-8 safe via `floor_char_boundary`. |
| `src/fetchers/juejin.rs` | Switches to `assert_public_http_url` and simplifies URL validation. |
| `src/fetchers/github.rs` | Improves URL parsing (strip query/fragment) and adds unit tests. |
| `src/fetchers/csdn.rs` | Adds `assert_public_http_url` SSRF protection. |
| `src/config.rs` | Adds env-driven config for local index path and allowed extensions. |
| `src/cli.rs` | Introduces Clap CLI definition (subcommands + output formats). |
| `skills/ferris-search-tools/SKILL.md` | Updates skill docs for CLI + MCP clarification and host-based validation guidance. |
| `skills/ferris-search-tools/references/tools-api.md` | Refreshes tool API docs and URL safety rules for host-based validation. |
| `skills/ferris-search-setup/SKILL.md` | Updates setup skill to reflect CLI+MCP behavior and removed legacy env vars. |
| `skills/ferris-search-setup/references/configuration.md` | Updates configuration docs (adds LOCAL_DOCS_*; removes old transport env var section). |
| `skills/ferris-search-cli/SKILL.md` | Adds dedicated CLI skill documentation. |
| `skills/ferris-search-cli/references/cli-reference.md` | Adds comprehensive CLI reference (commands, routing, exit codes, behaviors). |
| `README.md` | Updates README to document CLI workflows, local indexing, and new configuration. |
| `Cargo.toml` | Adds dependencies for CLI, indexing, PDF extraction, and CLI integration testing. |
| `Cargo.lock` | Locks new dependency graph for clap/tantivy/pdf-extract and test deps. |

Comments suppressed due to low confidence (1)

src/fetchers/csdn.rs:36

`fetch_csdn_article` now only calls `assert_public_http_url` and no longer verifies that the URL host is actually `csdn.net` (or a subdomain). Because this function also forces a `Host: blog.csdn.net` header, calling it with a non-CSDN URL will send a mismatched `Host` header and almost certainly produce incorrect behavior; it also weakens the intended fetcher hardening. Please add a host-based check (e.g., `is_url_from_host(url, "csdn.net")`) and fail fast when it doesn’t match.

pub async fn fetch_csdn_article(url: &str) -> anyhow::Result<String> {
    assert_public_http_url(url)?;
    let client = build_client()?;
    let resp = client
        .get(url)
        .headers(chrome_headers())
        .header("Host", "blog.csdn.net")
        .send()
        .await?;

Copilot · 2026-03-31T16:07:41Z

+/// Check if a URL's host matches the given domain (or is a subdomain of it).
+pub fn is_url_from_host(raw: &str, domain: &str) -> bool {
+    let Ok(url) = Url::parse(raw) else {
+        return false;
+    };
+    match url.host_str() {
+        Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
+        None => false,
+    }


`is_url_from_host` only checks the parsed host string and ignores the URL scheme, so values like `file://github.com/...` (or other non-HTTP(S) schemes that still have a host component) will be treated as “from github.com”. Since this helper is used for fetcher validation, it should also enforce `http`/`https` (or alternatively call `is_public_http_url` / `assert_public_http_url` internally) before doing the host/subdomain comparison.
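
The scheme concern can be illustrated with a minimal, std-only sketch. It assumes the scheme and host have already been extracted by a URL parser (for example via the `url` crate's `Url::scheme` and `Url::host_str`); the name `is_http_url_from_host` is hypothetical and not part of this PR.

```rust
/// Hypothetical helper (not from the PR): accept only http/https URLs whose
/// host is `domain` or a subdomain of it. `scheme` and `host` are assumed to
/// come from an already-parsed URL.
fn is_http_url_from_host(scheme: &str, host: &str, domain: &str) -> bool {
    // Reject non-HTTP(S) schemes first, e.g. file:// or ftp://.
    if scheme != "http" && scheme != "https" {
        return false;
    }
    // Exact match, or a subdomain separated from `domain` by a dot.
    host == domain
        || host
            .strip_suffix(domain)
            .is_some_and(|prefix| prefix.ends_with('.'))
}

fn main() {
    assert!(is_http_url_from_host("https", "github.com", "github.com"));
    assert!(is_http_url_from_host("https", "api.github.com", "github.com"));
    // A host component alone is not enough: file://github.com/... is rejected.
    assert!(!is_http_url_from_host("file", "github.com", "github.com"));
    // Look-alike domains are not subdomains.
    assert!(!is_http_url_from_host("https", "notgithub.com", "github.com"));
    println!("all checks passed");
}
```

Checking the scheme before the host keeps the helper safe to call on its own, even if a caller forgets to run the public-URL check first.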

Copilot · 2026-03-31T16:07:41Z

    pub async fn fetch_github_readme_tool(&self, p: Parameters<GithubReadmeParams>) -> String {
        let url = p.0.url;
-        if !url.contains("github.com") {
+        if !is_url_from_host(&url, "github.com") {
            return "URL must be from github.com".into();
        }


The domain-specific fetch tools validate the host via `is_url_from_host`, but they do not enforce the SSRF/public-URL checks (`is_public_http_url`/`assert_public_http_url`). Given the docs state “All fetch tools enforce SSRF protection”, these handlers should reject non-public and non-HTTP(S) URLs before doing host/path validation (same applies to the other domain-specific fetch tools below).
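
The ordering suggested here could look roughly like the sketch below: scheme first, then a public-address check, then the host comparison. It is std-only and entirely hypothetical. `is_public_ip` and `validate_fetch_target` are illustrative names, the inputs are assumed to be already parsed from the URL (IPv6 literals without brackets), and the actual checks performed by `assert_public_http_url` are not shown in this PR excerpt.

```rust
use std::net::IpAddr;

/// Hypothetical check: true if the address is not loopback, private,
/// link-local, or unspecified.
fn is_public_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified())
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
            !(v6.is_loopback() || v6.is_unspecified() || unique_local || link_local)
        }
    }
}

/// Hypothetical ordering of checks for a domain-specific fetch handler.
fn validate_fetch_target(scheme: &str, host: &str, domain: &str) -> Result<(), String> {
    // 1. Only HTTP(S) is allowed.
    if scheme != "http" && scheme != "https" {
        return Err(format!("unsupported scheme: {}", scheme));
    }
    // 2. Literal IP hosts must be public. Resolving hostnames and checking
    //    the resolved addresses is out of scope for this sketch.
    if let Ok(ip) = host.parse::<IpAddr>() {
        if !is_public_ip(ip) {
            return Err(format!("non-public address: {}", ip));
        }
    }
    // 3. Host must be `domain` or a subdomain of it.
    if host != domain && !host.ends_with(&format!(".{}", domain)) {
        return Err(format!("URL must be from {}", domain));
    }
    Ok(())
}

fn main() {
    assert!(validate_fetch_target("https", "github.com", "github.com").is_ok());
    assert!(validate_fetch_target("file", "github.com", "github.com").is_err());
    assert!(validate_fetch_target("http", "127.0.0.1", "github.com").is_err());
    println!("all checks passed");
}
```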

Copilot · 2026-03-31T16:07:41Z

 pub async fn fetch_juejin_article(url: &str) -> anyhow::Result<String> {
-    // Validate it's a juejin post URL
-    if !url.contains("juejin.cn") || !url.contains("/post/") {
-        anyhow::bail!("URL must be from juejin.cn and contain /post/ path");
+    assert_public_http_url(url)?;
+    if !url.contains("/post/") {
+        anyhow::bail!("URL must contain /post/ path");
    }


`fetch_juejin_article` validates that the URL is public HTTP(S) and contains `/post/`, but it no longer validates that the host is `juejin.cn` (or a subdomain). This makes it possible to route arbitrary public sites through the Juejin extractor and contradicts the “host-based” validation described in the docs. Add a host check (preferably using `is_url_from_host(url, "juejin.cn")`) before fetching.

Copilot · 2026-03-31T16:07:42Z

 pub async fn fetch_zhihu_article(url: &str) -> anyhow::Result<String> {
-    if !url.contains("zhihu.com") {
-        anyhow::bail!("URL must be from zhihu.com");
-    }
+    assert_public_http_url(url)?;
    let client = build_client()?;


`fetch_zhihu_article` now only enforces `assert_public_http_url` and no longer checks that the URL is actually for `zhihu.com`. Since this is a domain-specific fetcher with specialized extraction and headers, it should reject URLs whose host isn’t `zhihu.com` (or a subdomain) to avoid incorrect behavior and align with the host-validation hardening.

Copilot · 2026-03-31T16:07:42Z

+            let snippet = if body.len() > 200 {
+                format!("{}...", &body[..body.floor_char_boundary(200)])


The snippet logic is documented as “first 200 chars of body”, but the implementation uses `body.len() > 200` (bytes) and slices at a UTF-8 boundary. For non-ASCII text this will often produce fewer than 200 characters, so either update the comment to say “bytes” or implement a true character-count truncation.

Suggested change

-            let snippet = if body.len() > 200 {
-                format!("{}...", &body[..body.floor_char_boundary(200)])
+            let snippet = if body.chars().count() > 200 {
+                let truncated: String = body.chars().take(200).collect();
+                format!("{}...", truncated)
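
The difference between byte-based and character-based truncation can be seen in a small std-only sketch. The helper names `truncate_bytes` and `truncate_chars` are hypothetical; `truncate_bytes` approximates the boundary-aware byte slicing in the PR using only `is_char_boundary`, and `truncate_chars` is one possible way to count characters without allocating a new `String`.

```rust
/// Byte-oriented truncation: keep at most `max_bytes` bytes, backing up to
/// the nearest UTF-8 character boundary so the slice is always valid.
fn truncate_bytes(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Character-oriented truncation: keep at most `max_chars` Unicode scalar values.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // Byte offset of the first character past the limit.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string has `max_chars` characters or fewer.
        None => s,
    }
}

fn main() {
    let body = "日本語のテキスト"; // 8 characters, 24 bytes in UTF-8
    // A 7-byte budget only fits two 3-byte characters.
    assert_eq!(truncate_bytes(body, 7), "日本");
    // A 7-character budget keeps seven characters regardless of byte width.
    assert_eq!(truncate_chars(body, 7), "日本語のテキス");
    println!("all checks passed");
}
```

For ASCII-only input the two helpers agree; they diverge as soon as multi-byte characters appear, which is the case the review comment describes.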

Copilot · 2026-03-31T16:07:42Z

+- `github`: aliases `github repos`, `github repo`; optional `GITHUB_TOKEN`; searches repositories
+- `github_code`: aliases `github code`; optional `GITHUB_TOKEN`; searches code files
+
+### fetch_web_content Output Format


In the `web_search` section, the heading says `fetch_web_content Output Format`, but the examples shown are for web search output (single-engine / multi-engine search results). Renaming this heading to `web_search Output Format` would prevent confusion.

Suggested change

-### fetch_web_content Output Format
+### web_search Output Format

feat: add local document indexing and harden fetch validation

0deaf18

Copilot AI review requested due to automatic review settings March 31, 2026 16:01

Copilot started reviewing on behalf of gqf2008 March 31, 2026 16:02

gqf2008 wants to merge 1 commit into lispking:main from gqf2008:feat/local-index-cli-and-url-hardening

2 participants
