feat: add local document indexing and harden fetch validation #1
Pull request overview
This PR adds a Clap-based CLI entrypoint (while keeping MCP/stdin auto-detection), introduces Tantivy-backed local document indexing + search, and tightens fetch URL validation / host matching across tools and docs.
Changes:
- Add CLI subcommands for web search, fetch, config inspection, engine listing, MCP startup, and local index/search workflows.
- Implement local document collection, Tantivy indexing, and local search (Markdown/TXT/HTML/PDF).
- Harden URL validation by adding host-matching helpers and improving safe truncation; update docs and add CLI/integration tests.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/cli.rs | Adds end-to-end CLI smoke tests (help/version, list/show-config, index-local/search-local, JSON validity). |
| src/utils/url_safety.rs | Adds is_url_from_host and expands URL safety unit tests. |
| src/tools/mod.rs | Exposes helpers publicly for reuse (e.g., CLI). |
| src/tools/helpers.rs | Centralizes engine list + engine resolution and adds multi-engine concurrent search helper + tests. |
| src/tools/handlers.rs | Updates MCP tool handler to use shared engine resolution and host-based URL checks. |
| src/main.rs | Replaces single-mode main with Clap CLI + MCP auto-detection; adds local indexing/search commands and config/engine listing. |
| src/index/searcher.rs | Implements Tantivy searcher + result struct and unit tests. |
| src/index/schema.rs | Defines Tantivy schema for local docs and tests schema shape. |
| src/index/mod.rs | Adds index module exports. |
| src/index/indexer.rs | Implements Tantivy index creation/open, document add, commit, and tests. |
| src/index/collector.rs | Adds document collector/extractors for md/txt/html/pdf, title extraction, normalization, and tests. |
| src/fetchers/zhihu.rs | Switches to assert_public_http_url for SSRF protection. |
| src/fetchers/web.rs | Fixes truncation to be UTF-8 safe via floor_char_boundary. |
| src/fetchers/juejin.rs | Switches to assert_public_http_url and simplifies URL validation. |
| src/fetchers/github.rs | Improves URL parsing (strip query/fragment) and adds unit tests. |
| src/fetchers/csdn.rs | Adds assert_public_http_url SSRF protection. |
| src/config.rs | Adds env-driven config for local index path and allowed extensions. |
| src/cli.rs | Introduces Clap CLI definition (subcommands + output formats). |
| skills/ferris-search-tools/SKILL.md | Updates skill docs for CLI + MCP clarification and host-based validation guidance. |
| skills/ferris-search-tools/references/tools-api.md | Refreshes tool API docs and URL safety rules for host-based validation. |
| skills/ferris-search-setup/SKILL.md | Updates setup skill to reflect CLI+MCP behavior and removed legacy env vars. |
| skills/ferris-search-setup/references/configuration.md | Updates configuration docs (adds LOCAL_DOCS_*; removes old transport env var section). |
| skills/ferris-search-cli/SKILL.md | Adds dedicated CLI skill documentation. |
| skills/ferris-search-cli/references/cli-reference.md | Adds comprehensive CLI reference (commands, routing, exit codes, behaviors). |
| README.md | Updates README to document CLI workflows, local indexing, and new configuration. |
| Cargo.toml | Adds dependencies for CLI, indexing, PDF extraction, and CLI integration testing. |
| Cargo.lock | Locks new dependency graph for clap/tantivy/pdf-extract and test deps. |
Comments suppressed due to low confidence (1)
src/fetchers/csdn.rs:36
fetch_csdn_article now only calls assert_public_http_url and no longer verifies that the URL host is actually csdn.net (or a subdomain). Because this function also forces a Host: blog.csdn.net header, calling it with a non-CSDN URL will send a mismatched Host header and almost certainly produce incorrect behavior; it also weakens the intended fetcher hardening. Please add a host-based check (e.g., is_url_from_host(url, "csdn.net")) and fail fast when it doesn’t match.
    pub async fn fetch_csdn_article(url: &str) -> anyhow::Result<String> {
        assert_public_http_url(url)?;
        let client = build_client()?;
        let resp = client
            .get(url)
            .headers(chrome_headers())
            .header("Host", "blog.csdn.net")
            .send()
            .await?;
    /// Check if a URL's host matches the given domain (or is a subdomain of it).
    pub fn is_url_from_host(raw: &str, domain: &str) -> bool {
        let Ok(url) = Url::parse(raw) else {
            return false;
        };
        match url.host_str() {
            Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
            None => false,
        }
is_url_from_host only checks the parsed host string and ignores the URL scheme, so values like file://github.com/... (or other non-HTTP(S) schemes that still have a host component) will be treated as “from github.com”. Since this helper is used for fetcher validation, it should also enforce http/https (or alternatively call is_public_http_url / assert_public_http_url internally) before doing the host/subdomain comparison.
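One way to read this suggestion, as a minimal sketch rather than the project's actual helper: check the scheme before comparing hosts. The function name `is_http_host_match` is hypothetical, and the sketch assumes `scheme` and `host` were already extracted from a parsed URL (for instance via `url.scheme()` and `url.host_str()` from the `url` crate), so it stays standard-library only.

```rust
/// Hypothetical variant of the host check that also enforces http/https.
/// `scheme` and `host` are assumed to come from an already-parsed URL.
pub fn is_http_host_match(scheme: &str, host: Option<&str>, domain: &str) -> bool {
    // Reject file://, ftp://, and any other non-web scheme up front.
    if scheme != "http" && scheme != "https" {
        return false;
    }
    match host {
        Some(h) => {
            // Compare case-insensitively; DNS names are not case-sensitive.
            let h = h.to_ascii_lowercase();
            let d = domain.to_ascii_lowercase();
            // Exact match, or a subdomain ending in ".<domain>".
            h == d || h.ends_with(&format!(".{}", d))
        }
        None => false,
    }
}
```

With this shape, `file://github.com/...` is rejected because of its scheme, while `https://api.github.com/...` still matches `github.com`.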
     pub async fn fetch_github_readme_tool(&self, p: Parameters<GithubReadmeParams>) -> String {
         let url = p.0.url;
    -    if !url.contains("github.com") {
    +    if !is_url_from_host(&url, "github.com") {
             return "URL must be from github.com".into();
         }
The domain-specific fetch tools validate the host via is_url_from_host, but they do not enforce the SSRF/public-URL checks (is_public_http_url/assert_public_http_url). Given the docs state “All fetch tools enforce SSRF protection”, these handlers should reject non-public and non-HTTP(S) URLs before doing host/path validation (same applies to the other domain-specific fetch tools below).
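A rough sketch of the ordering this comment asks for: scheme first, then a public-address check, then the host match. Both function names are hypothetical and this is not the project's `is_public_http_url`. The address check only covers the literal hosts it can recognise without DNS and does not handle IPv4-mapped IPv6 addresses, so a real implementation would also need to vet resolved addresses.

```rust
use std::net::IpAddr;

/// Hypothetical helper: returns false for hosts that obviously point at
/// local or private networks. Hostnames that are not IP literals pass here.
pub fn looks_public_host(host: &str) -> bool {
    // IPv6 literals appear bracketed in URLs, e.g. "[::1]".
    let host = host.trim_start_matches('[').trim_end_matches(']');
    let lower = host.to_ascii_lowercase();
    if lower == "localhost" || lower.ends_with(".localhost") {
        return false;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => !(ip.is_loopback()
            || ip.is_private()
            || ip.is_link_local()
            || ip.is_unspecified()
            || ip.is_broadcast()),
        Ok(IpAddr::V6(ip)) => {
            let first = ip.segments()[0];
            let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
            !(ip.is_loopback() || ip.is_unspecified() || unique_local || link_local)
        }
        // Not an IP literal: treated as a hostname and allowed by this sketch.
        Err(_) => true,
    }
}

/// Hypothetical handler-side validation in the order suggested above.
/// `scheme` and `host` are assumed to come from an already-parsed URL.
pub fn validate_fetch_target(scheme: &str, host: &str, domain: &str) -> Result<(), String> {
    if scheme != "http" && scheme != "https" {
        return Err("URL must use http or https".into());
    }
    if !looks_public_host(host) {
        return Err("URL must point at a public host".into());
    }
    if host != domain && !host.ends_with(&format!(".{}", domain)) {
        return Err(format!("URL must be from {}", domain));
    }
    Ok(())
}
```

Running the generic checks before the domain check keeps every domain-specific tool behind the same gate, which is what the "All fetch tools enforce SSRF protection" wording in the docs implies.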
     pub async fn fetch_juejin_article(url: &str) -> anyhow::Result<String> {
         // Validate it's a juejin post URL
    -    if !url.contains("juejin.cn") || !url.contains("/post/") {
    -        anyhow::bail!("URL must be from juejin.cn and contain /post/ path");
    +    assert_public_http_url(url)?;
    +    if !url.contains("/post/") {
    +        anyhow::bail!("URL must contain /post/ path");
         }
fetch_juejin_article validates that the URL is public HTTP(S) and contains /post/, but it no longer validates that the host is juejin.cn (or a subdomain). This makes it possible to route arbitrary public sites through the Juejin extractor and contradicts the “host-based” validation described in the docs. Add a host check (preferably using is_url_from_host(url, "juejin.cn")) before fetching.
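To illustrate why a host comparison is preferable to a substring test here, the sketch below contrasts the two. Both function names are hypothetical, and `host` and `path` are assumed to have been taken from a parsed URL rather than from the raw string.

```rust
/// Substring test in the style of the removed check: it looks anywhere in
/// the raw URL, so the domain can appear in the path or query instead.
pub fn substring_check(raw_url: &str) -> bool {
    raw_url.contains("juejin.cn") && raw_url.contains("/post/")
}

/// Hypothetical host-and-path check on already-parsed URL components.
pub fn juejin_post_check(host: &str, path: &str) -> Result<(), String> {
    let on_juejin = host == "juejin.cn" || host.ends_with(".juejin.cn");
    if !on_juejin {
        return Err("URL must be from juejin.cn".into());
    }
    if !path.contains("/post/") {
        return Err("URL must contain /post/ path".into());
    }
    Ok(())
}
```

A URL such as `https://evil.example/juejin.cn/post/1` passes `substring_check`, but `juejin_post_check("evil.example", "/juejin.cn/post/1")` rejects it because the host is compared on its own.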
     pub async fn fetch_zhihu_article(url: &str) -> anyhow::Result<String> {
    -    if !url.contains("zhihu.com") {
    -        anyhow::bail!("URL must be from zhihu.com");
    -    }
    +    assert_public_http_url(url)?;
         let client = build_client()?;
fetch_zhihu_article now only enforces assert_public_http_url and no longer checks that the URL is actually for zhihu.com. Since this is a domain-specific fetcher with specialized extraction and headers, it should reject URLs whose host isn’t zhihu.com (or a subdomain) to avoid incorrect behavior and align with the host-validation hardening.
    let snippet = if body.len() > 200 {
        format!("{}...", &body[..body.floor_char_boundary(200)])
The snippet logic is documented as “first 200 chars of body”, but the implementation uses body.len() > 200 (bytes) and slices at a UTF-8 boundary. For non-ASCII text this will often produce fewer than 200 characters, so either update the comment to say “bytes” or implement a true character-count truncation.
    -let snippet = if body.len() > 200 {
    -    format!("{}...", &body[..body.floor_char_boundary(200)])
    +let snippet = if body.chars().count() > 200 {
    +    let truncated: String = body.chars().take(200).collect();
    +    format!("{}...", truncated)
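An alternative shape for character-count truncation, sketched under the assumption that "chars" means Unicode scalar values (not grapheme clusters): walk `char_indices` once and slice at the byte offset of the first character past the limit. The function name `snippet` and its `max_chars` parameter are illustrative, not taken from the PR.

```rust
/// Truncates `body` to at most `max_chars` Unicode scalar values,
/// appending "..." only when something was actually cut off.
pub fn snippet(body: &str, max_chars: usize) -> String {
    match body.char_indices().nth(max_chars) {
        // A character exists at index `max_chars`, so the body is longer
        // than the limit; its byte offset is a valid slice boundary.
        Some((byte_idx, _)) => format!("{}...", &body[..byte_idx]),
        // Fewer than or exactly `max_chars` characters: keep as-is.
        None => body.to_string(),
    }
}
```

This avoids counting the whole string and then collecting a second time, and it never slices inside a multi-byte character because the offset comes from `char_indices`.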
    - `github`: aliases `github repos`, `github repo`; optional `GITHUB_TOKEN`; searches repositories
    - `github_code`: aliases `github code`; optional `GITHUB_TOKEN`; searches code files

    ### fetch_web_content Output Format
In the web_search section, the heading says fetch_web_content Output Format, but the examples shown are for web search output (single-engine / multi-engine search results). Renaming this heading to web_search Output Format would prevent confusion.
    - ### fetch_web_content Output Format
    + ### web_search Output Format