I've noticed that when a search returns Reddit links, Reddit often actively blocks the scrape, even with headless Chrome installed:
ketch search --scrape "ketch web search tool reddit"
warn: https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/ appears JS-rendered; configure browser for full content
---
url: https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/
title: Reddit - Please wait for verification
words: 0
---
However, scraping old.reddit.com works fine. Compare:
ketch scrape 'https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/'
and
ketch scrape 'https://old.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/'
While you can add an instruction to the skill to always use the old.reddit.com URL, that instruction has no effect on search --scrape.
Would it be worth adding a configuration option that applies a regex replacement to URLs before scraping? For news sites, fetching the RSS feed is often cleaner than reading the main page, so a manual transformation would be useful in those cases too.
For example:
https://www.theguardian.com/uk -> https://www.theguardian.com/uk/rss
(ketch's readability processing doesn't work on RSS, but the raw content still makes more sense to the LLM)
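To make the request concrete, here is a rough sketch of the kind of rewrite step I have in mind. It is only an illustration in Python, not ketch's actual code or config format: the names REWRITE_RULES and rewrite_url are hypothetical, and the two patterns are just the Reddit and Guardian cases from above.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement) pairs, tried in order.
# In a real implementation these would come from the user's configuration.
REWRITE_RULES = [
    # Send www.reddit.com / reddit.com links to old.reddit.com.
    (r"^https?://(?:www\.)?reddit\.com/", "https://old.reddit.com/"),
    # Fetch the Guardian UK front page as RSS instead of HTML.
    (r"^(https://www\.theguardian\.com/uk)/?$", r"\1/rss"),
]


def rewrite_url(url, rules=REWRITE_RULES):
    """Apply the first matching rule to url; return it unchanged if none match."""
    for pattern, replacement in rules:
        new_url, count = re.subn(pattern, replacement, url, count=1)
        if count:
            return new_url
    return url


if __name__ == "__main__":
    print(rewrite_url("https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/"))
    print(rewrite_url("https://www.theguardian.com/uk"))
```

Running the rewrite on every URL just before it is fetched would cover both scrape and search --scrape, since the search results would pass through the same step. Stopping at the first matching rule keeps the behaviour predictable when patterns overlap.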