URL transformations in ketch #9

@doctorjames

I've noticed that when searches return Reddit links, Reddit often actively blocks the scrape, even with headless Chrome installed:

ketch search --scrape "ketch web search tool reddit"
warn: https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/ appears JS-rendered; configure browser for full content
---
url: https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/
title: Reddit - Please wait for verification
words: 0
---

However, scraping old.reddit.com works fine. Compare:

ketch scrape 'https://www.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/'

and

ketch scrape 'https://old.reddit.com/r/PiCodingAgent/comments/1sqk92y/what_websearch_webfetch_tool_are_you_using/'

While you can add an instruction to the skill to always use the old URL, that instruction has no effect on search --scrape.

Is it worth adding a configuration option to apply a regex replace to URLs before scraping? For news sites, fetching the RSS feed is often cleaner than reading the main pages, so a manual transformation would be useful in those cases too.

For example:

https://www.theguardian.com/uk -> https://www.theguardian.com/uk/rss

(ketch's readability processing doesn't work on RSS, but the raw content still makes more sense to the LLM)
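
To make the idea concrete, here is a rough sketch of what such a rewrite step could look like. It is written in Python purely for illustration and is not based on ketch's actual code or config format; the names URL_REWRITES and rewrite_url are hypothetical, and the two rules simply encode the Reddit and Guardian examples above.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs, checked in order.
# Neither the name nor the format comes from ketch; both are assumptions.
URL_REWRITES = [
    # Send www.reddit.com and bare reddit.com links to old.reddit.com.
    (r"^https?://(?:www\.)?reddit\.com/", "https://old.reddit.com/"),
    # Fetch the Guardian UK front page as its RSS feed instead.
    (r"^(https://www\.theguardian\.com/uk)/?$", r"\1/rss"),
]


def rewrite_url(url, rules=URL_REWRITES):
    """Return the URL rewritten by the first matching rule, or unchanged."""
    for pattern, replacement in rules:
        new_url, count = re.subn(pattern, replacement, url, count=1)
        if count:
            return new_url
    return url
```

Applying this to every URL just before it is scraped, including the ones that search --scrape finds, would cover both cases. Stopping at the first matching rule keeps the behaviour predictable when patterns overlap.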
