Exclude URLs based on HTML tags #259

MichaIng · 2021-06-11T13:08:26Z

Currently lychee seems to check each an every URL found in the HTML document.

Generally there is nothing bad about it. We previously used liche, which only checks specific tags and attributes.

While generally checking URLs from text is not bad, IMO, it would be great to be able to exclude certain elements, not just based on the URL itself. As HTML files are parsed already as such, it should be cheaply possible to exclude specific tags, e.g. <code> and <pre> tags. Allowing to specify URL excludes based on tags via an option, would be awesome.

The text was updated successfully, but these errors were encountered:

mre · 2021-06-11T13:34:38Z

For HTML that would work as the tags have names. Not sure about Markdown or Plaintext. If we add an option like --exclude-tags=pre,code, what would be the behavior for Markdown? Simply ignore it (which might be confusing) or throw an error (which might be frustrating) or something else?

In general, I wonder if it makes a difference whether a link occurs inside or outside a pre tag for example. Do you have an example for links that predominantly occur in special tags that you couldn't exclude with a regex? I just don't want to make the project unnecessarily complex.

MichaIng · 2021-06-11T16:46:02Z

I personally check links in Markdown only after building the HTML pages from it. But I agree it should be made clear, when adding such an option, that it can and does only apply for HTML input files, e.g. start with naming it --exclude-html-tags, or so. In Markdown things are much more complicated, due to the many extensions and ways to create code blocks and inline code, effective also based on indentation etc. So I guess practically it can only be used on HTML indeed, but that should cover the majority of scanned documents, especially as websites can be scanned wie URLs as well, and as this is currently the only way to scan internal links etc.

In general, I wonder if it makes a difference whether a link occurs inside or outside a pre tag for example.

We have no "links" in any pre or code tag, and you're right that this is not even possible, when you mean clickable links by this. Both tags take all content literally. But we have URLs inside of them, e.g. when telling users how to access a web interface of their locally running application, using loopback IP, localhost, LAN hostnames or IPs. In case of MkDocs, code tags are nice, as they allow easy copy&paste. But also when having a pre tag to show a scripts content, filled with LAN IPs/hostnames or example URLs.

I know the -E option, to exclude them all together (aside of invalid example URLs), but we cannot use it as me must check internal links via locally running webserver and use hence the loopback IP as base URL, which is then excluded as well. Actually, an alternative for at least our case, would be if -E excludes would not be applied to any URL that was generated for internal links with base-url. That actually makes sense anyway, as noone would define a base URL, to then exclude it from checks 😄.

lebensterben · 2021-06-11T22:19:47Z

lychee should first distinguish strict and fuzzy link detections.

strict means a string is a link if and only if it's rendered as a hyperlink in a HTML file. This naturally exclude tags like code or pre.

fuzzy means a string is a link if it's a valid URI of the scheme we support.

after implementing this, it's easier to offer finer control to users.

mre · 2021-06-21T14:20:00Z

Agree with both of your comments. I guess your ideas would fit together quite nicely.
The collector could return links with a tag field, that would be an HTML tag. We'd classify those as strict or fuzzy, so a pre would be fuzzy for example.
We could think about using different words than strict/fuzzy like hyperlink/plaintext or just link/text, but I like the proposal in general.

MichaIng · 2021-06-22T11:37:10Z

While checking our forum for outdated links, I recognised that also shortened URL text within anchor tags is checked. phpBB shortens long URLs. so that such an element is produced:

<a href="https://example.com/loong/looong/url/that/is/shorteneed/index.html" class="postlink">https://example.com/loong/looong/url/th ... index.html</a>

lychee now checks the URL itself but as well the text, where it identifies https://example.com/loong/looong/url/th as URL, which of course fails.

So once a basic functionality exists to exclude URLs/text based on HTML tags, the text content of tags/elements which have an explicit src/href/... attribute could be excluded by default, as AFAIK such shortening is quite common.

mre · 2021-06-22T20:25:42Z

That's odd. If this is a file with a .html extension, it should have skipped that shortened version. Can you double check that this is the case?

MichaIng · 2021-06-22T23:44:19Z

Ah this was when scanning URLs (with .php extensions and query string) and no local files. In case of URLs as input, it's of course harder to derive/guess the type 🤔. content-type: text/html response header could be used to interpret it as HTML, but I'm not sure how reliable this is, compared to a file extension. Otherwise additionally a contained <html> tag? But then it's getting complicated when currently the type/parser is obtained/chosen based on the extension only. Interpreting .php files as HTML in general also doesn't make sense as those could be PHP source scripts as well.

MichaIng · 2021-06-26T09:09:13Z

Adding this idea here: CSS selector syntax based excludes (and exclusive includes, if it turns out there is a wider use case for such) would be awesome. But I guess that's hard to achieve by hand, so only if the underlying HTML parser or other library supports such already.

mre · 2021-11-30T18:06:25Z

As a workaround for HTML tags, I wonder if lychee could be combined with htmlq or something similar.

htmlq --invert 'pre' | htmlq --invert 'code' | lychee

The --invert flag doesn't exist yet, but we could create an issue for that.

There's also pup, which has a PR open for that for a long time: ericchiang/pup#81.

MichaIng · 2021-11-30T19:08:34Z

But that would not work when using lychee's regular input arguments, especially complicated when wanting to scan URLs instead of local files.

It is not a that big issue for us. The -E flag + a small amount of --exclude's does the trick. It was more what we were used to from liche and code tags were hence a simple way to exclude example URLs or emails for web application access and login documentations, like (with Markdown `` converted to HTML code tags):

Access the web interface at: `http://<your.IP>:1234`
Username: `[email protected]`

We have at least one web application where indeed an invalid example email is required for initial login, from where it can be changed 😉.

mre · 2021-11-30T19:14:23Z

Wait, you say liche supported this? Can you post an example that you used before? Can't find much about it in their repo.

mre · 2021-11-30T19:15:43Z

We have at least one web application where indeed an invalid example email is required for initial login, from where it can be changed 😉.

That sounds more like a bug / weird feature on the site no? You could exclude that with --exclude-mail right now? (There is no way to exclude a single mail address yet, but that might be an easier thing to add.)

MichaIng · 2021-11-30T19:24:50Z

liche excludes code and pre tags OOTB. We did just call:

liche -c 64 -x 'https://(github.com|twitter.com|www.spigotmc.org|pi-hole.net|www.php.net)' -d build -r README.md build/docs

without any additional configuration. When switching to lychee we needed to add all URLs, IPs and links from code and pre tags as excludes. But as you can see, with liche we needed some other excludes which are not required with lychee anymore.

That sounds more like a bug / weird feature on the site no?

😄 well it is not uncommon that a web application freshly installed ships with a default login. And the particular one uses an invalid email address as default login, as it generally aims to have email addresses as usernames, that's all.

You could exclude that with --exclude-mail right now? (There is no way to exclude a single mail address yet, but that might be an easier thing to add.)

The regular --exclude works as well for emails, we do not want to exclude all emails but just the one particular case 😉.

mre · 2021-12-03T12:46:18Z

liche excludes code and pre tags OOTB

That's smart. Starting to work on that as part of #414.

mre · 2021-12-17T23:26:36Z

As part of #424 I added element and attribute to the Request struct now. It's not exposed to the CLI yet, but with that we can add exclusion based on tags and set defaults in the future.

I'm thinking to support an option like --exclude-element pre,code as well as setting const DEFAULT_ELEMENTS_EXCLUDED: &'static [&'static str] = &["pre", "code"];.

MichaIng · 2021-12-17T23:31:58Z

That sounds great, many thanks for working on this 👍.

mre · 2022-02-16T15:17:34Z

Examples for CSS selectors, that we could support eventually. I saw these being used in an internal tool.

a[href]
[href=*='*.js']
[src*='.js']

MichaIng · 2022-02-16T15:32:05Z

CSS selectors would be awesome, very flexible, and simple tags are just selected as simple as code/pre without needing to know about further CSS selectors syntax 👍.

mre · 2022-02-16T16:44:43Z

Yeah I think selectors are the better abstraction over filtering by element name. I'm a bit concerned about the overhead.

@untitaker would filtering by CSS selector still be within the scope of html5gum? Would be nice if elements could be filtered out as early as possible to avoid unnecessary allocs. 😅

See also

MichaIng · 2022-02-16T17:00:41Z

I'm a bit concerned about the overhead.

Dito. At least it shouldn't cause overhead when the option is not used and on non-HTML inputs. However, for our projects it was trivial to work around the need for this via -E and a small number of explicit URI excludes. I think it is generally still useful, and excluding <code> and <pre> even a reasonable default (often containing example or schematic URIs), but until issue/request triggers more comments by other users, it's fine to not work on it with high priority 🙂.

lebensterben · 2022-02-16T20:23:22Z

the overhead is proportional to the CSS path.

Either CSS path or XPath are just ways to specify a node in the HTML tree. Implementation wise that's really easy.

MichaIng · 2022-02-16T20:58:40Z

Make sense. I just have no idea how much relative overhead it would be to check expected CSS paths, like code or a.nocheck (I would expect or use myself relatively short ones, code tags and blocks and probably an explicit nocheck class), compared to parsing a tag and the contained URI in general, and doing an offline or online URI check. Basically would it make a notable difference, at least when doing offline-only checks, or would it be in the barely measurable area? Likely something we won't know for sure until doing the implementation 😄.

untitaker · 2022-02-19T16:50:22Z

@untitaker would filtering by CSS selector still be within the scope of html5gum? Would be nice if elements could be filtered out as early as possible to avoid unnecessary allocs. sweat_smile

that requires adding the tree building/dom logic to html5gum, which is all well in scope but a very big task. the spec for assembling tokens into a tree structure is bigger than the tokenization spec.

you could have a css selector engine that does not allow you to probe for hierarchy. e.g. if you disallow p > a, p a etc it should be possible to patch into lychee without changes to html5gum. i am not sure how generally useful that is, and whether it would be a good fit for html5gum depends on the exact api proposal imo

MichaIng · 2022-02-19T20:10:26Z

Probably I need some explanation how this tokenization actually works, especially when comparing html5gum and html5ever.

I mean lychee currently detects and checks specific HTML tags, respectively their src/href attributes (otherwise it wouldn't be able to detect relative/internal URLs), but it also detects URIs in any other tag and raw text. So since it detects tags and their attributes, it should be at least possible to jump over defined tags, or such with a defined class, isn't it? That would not allow full CSS selectors, unless a DOM tree is generated while parsing, but for practical use excluding specific tags or classes, or tags with classes should cover very most cases. So basically a minimal CSS selector syntax like [tag][.class] with both elements optional would be possible without doing the possibly heavy DOM tree generation.

lebensterben · 2022-02-19T20:38:32Z

We don't need upstream packages to incorporate filtering based on CSS path.
We only need to add additional checkings in extract module.

untitaker · 2022-02-20T18:11:56Z

@MichaIng there's no conceptual difference between html5gum tokenizer and html5ever tokenizer. outside of the tokenizer however, html5ever contains a full DOM tree builder while html5gum does not. You're entirely correct with your assessment as to which css selectors would be possible right now. It's also true that html5gum doesn't have to be changed in any way to support the simple version of the feature you described.

I think if somebody wants to seriously pursue this it might be worth checking whether lol-html can be used as parser (i.e. try to remove html5gum again) -- it features full css selector support. i feel like the library is otherwise rather purpose-built around cloudflare's usecase and hard to use for others.

mre · 2022-04-06T15:40:35Z

Quick update for anyone who'll stumble across this issue in the future:
Excluding code blocks and pre elements is now done by default.
There is a flag named --include-verbatim to also include links in these sections.

What's left to be done is support for custom exclusions via CSS selectors.
@MichaIng's proposal for [tag][.class] sounds reasonable.

untitaker · 2022-06-18T15:23:36Z

Excluding code blocks and pre elements is now done by default.

the next version of html5gum will include an option to not parse the <a> tag inside of <script><a href="https://google.com"></a></script>, see untitaker/hyperlink#148. that might be related to what we're doing here

ruzickap · 2022-11-27T16:43:46Z

I have a similar issue.

I'm using Jekyll + Hugo to generate web pages form Markdown:

Install [eksctl](https://eksctl.io/):

```bash
if ! command -v eksctl &> /dev/null; then
  # renovate: datasource=github-tags depName=weaveworks/eksctl
  EKSCTL_VERSION="0.118.0"
  curl -s -L "https://github.com/weaveworks/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_$(uname)_amd64.tar.gz" | sudo tar xz -C /usr/local/bin/
fi
```bash

which generates HTML code like:

<p>Install <a href="https://eksctl.io/">eksctl</a>:</p><div class="language-bash highlighter-rouge"><div class="code-header"> <span data-label-text="Shell"><i class="fas fa-code small"></i></span> <button aria-label="copy" data-title-succeed="Copied!"><i class="far fa-clipboard"></i></button></div><div class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre><td class="rouge-code"><pre><span class="k">if</span> <span class="o">!</span> <span class="nb">command</span> <span class="nt">-v</span> eksctl &amp;&gt; /dev/null<span class="p">;</span> <span class="k">then</span>
  <span class="c"># renovate: datasource=github-tags depName=weaveworks/eksctl</span>
  <span class="nv">EKSCTL_VERSION</span><span class="o">=</span><span class="s2">"0.118.0"</span>
  curl <span class="nt">-s</span> <span class="nt">-L</span> <span class="s2">"https://github.com/weaveworks/eksctl/releases/download/v</span><span class="k">${</span><span class="nv">EKSCTL_VERSION</span><span class="k">}</span><span class="s2">/eksctl_</span><span class="si">$(</span><span class="nb">uname</span><span class="si">)</span><span class="s2">_amd64.tar.gz"</span> | <span class="nb">sudo tar </span>xz <span class="nt">-C</span> /usr/local/bin/
<span class="k">fi</span>
</pre></table></code></div></div>

Real example can be seen here: https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/

When I run lychee it reports URLs inside "code" as invalid for example: https://github.com/weaveworks/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_$(uname)_amd64.tar.gz - which is correct, but I would like to exclude any URLs inside "code" section to be scanned...

❯ lychee https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/
...
Issues found in 1 input. Find details below.

[https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/]:
✗ [404] https://github.com/weaveworks/eksctl/releases/download/v | Failed: Network error: Not Found
✗ [404] https://awscli.amazonaws.com/awscli-exe-linux-x86_64- | Failed: Network error: Not Found
✗ [404] https://stefanprodan.github.io/podinfo | Failed: Network error: Not Found
✗ [403] https://charts.bitnami.com/bitnami | Failed: Network error: Forbidden
✗ [404] https://ruzickap-github-io.pages.dev/assets/img/posts/2022/2022-11-27-cheapest-amazon-eks/https:/raw.githubusercontent.com/aws-samples/eks-workshop/65b766c494a5b4f5420b2912d8373c4957163541/static/images/icon-aws-amazon-eks.svg | Failed: Network error: Not Found
✗ [404] https://storage.googleapis.com/kubernetes-release/release/v | Failed: Network error: Not Found
✗ [ERR] file://tmp/ | Failed: Cannot find file

🔍 63 Total ✅ 52 OK 🚫 7 Errors (HTTP:7) 💤 4 Excluded

Can I instruct lychee to ignore the URLs which are part of the code sections somehow ?

In "code" sections there may be some variables / templates / substitutions (like above) which should be "excluded" from URL checks.

Thank you...

MichaIng · 2022-11-27T17:07:33Z

Actually code and pre tags are excluded automatically since a while. But what I recognised is that this does not apply for other tags inside code or pre tags, like in your case for span. This is often the case for Markdown code highlighting. Probably it's possible without much effort to exclude everything in verbatim tags recursively.

mre · 2022-11-28T17:28:36Z

Working on fix in #847

MichaIng mentioned this issue Jun 25, 2021

Select Markdown links based on their tag #270

Open

mre added the request-for-comments label Sep 4, 2021

MichaIng mentioned this issue Nov 30, 2021

Improve concurrency with streams #330

Merged

6 tasks

untitaker mentioned this issue Jan 31, 2022

Add html5gum as alternative link extractor #480

Merged

mre added enhancement New feature or request and removed request-for-comments labels Feb 4, 2022

mre mentioned this issue Nov 28, 2022

Recursively skip verbatim elements #847

Merged

mre closed this as completed in #847 Dec 12, 2022

jpoehnelt mentioned this issue Apr 4, 2024

a.textContent is checked even if the text is shortened for the link #1399

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude URLs based on HTML tags #259

Exclude URLs based on HTML tags #259

MichaIng commented Jun 11, 2021

mre commented Jun 11, 2021

MichaIng commented Jun 11, 2021

lebensterben commented Jun 11, 2021

mre commented Jun 21, 2021

MichaIng commented Jun 22, 2021 •

edited

Loading

mre commented Jun 22, 2021

MichaIng commented Jun 22, 2021

MichaIng commented Jun 26, 2021

mre commented Nov 30, 2021

MichaIng commented Nov 30, 2021 •

edited

Loading

mre commented Nov 30, 2021

mre commented Nov 30, 2021

MichaIng commented Nov 30, 2021 •

edited

Loading

mre commented Dec 3, 2021

mre commented Dec 17, 2021

MichaIng commented Dec 17, 2021

mre commented Feb 16, 2022

MichaIng commented Feb 16, 2022

mre commented Feb 16, 2022

MichaIng commented Feb 16, 2022 •

edited

Loading

lebensterben commented Feb 16, 2022

MichaIng commented Feb 16, 2022

untitaker commented Feb 19, 2022

MichaIng commented Feb 19, 2022 •

edited

Loading

lebensterben commented Feb 19, 2022

untitaker commented Feb 20, 2022 •

edited

Loading

mre commented Apr 6, 2022

untitaker commented Jun 18, 2022 •

edited

Loading

ruzickap commented Nov 27, 2022

MichaIng commented Nov 27, 2022

mre commented Nov 28, 2022

Exclude URLs based on HTML tags #259

Exclude URLs based on HTML tags #259

Comments

MichaIng commented Jun 11, 2021

mre commented Jun 11, 2021

MichaIng commented Jun 11, 2021

lebensterben commented Jun 11, 2021

mre commented Jun 21, 2021

MichaIng commented Jun 22, 2021 • edited Loading

mre commented Jun 22, 2021

MichaIng commented Jun 22, 2021

MichaIng commented Jun 26, 2021

mre commented Nov 30, 2021

MichaIng commented Nov 30, 2021 • edited Loading

mre commented Nov 30, 2021

mre commented Nov 30, 2021

MichaIng commented Nov 30, 2021 • edited Loading

mre commented Dec 3, 2021

mre commented Dec 17, 2021

MichaIng commented Dec 17, 2021

mre commented Feb 16, 2022

MichaIng commented Feb 16, 2022

mre commented Feb 16, 2022

MichaIng commented Feb 16, 2022 • edited Loading

lebensterben commented Feb 16, 2022

MichaIng commented Feb 16, 2022

untitaker commented Feb 19, 2022

MichaIng commented Feb 19, 2022 • edited Loading

lebensterben commented Feb 19, 2022

untitaker commented Feb 20, 2022 • edited Loading

mre commented Apr 6, 2022

untitaker commented Jun 18, 2022 • edited Loading

ruzickap commented Nov 27, 2022

MichaIng commented Nov 27, 2022

mre commented Nov 28, 2022

MichaIng commented Jun 22, 2021 •

edited

Loading

MichaIng commented Nov 30, 2021 •

edited

Loading

MichaIng commented Nov 30, 2021 •

edited

Loading

MichaIng commented Feb 16, 2022 •

edited

Loading

MichaIng commented Feb 19, 2022 •

edited

Loading

untitaker commented Feb 20, 2022 •

edited

Loading

untitaker commented Jun 18, 2022 •

edited

Loading