Exclude URLs based on HTML tags #259
Comments
For HTML that would work as the tags have names. Not sure about Markdown or Plaintext. If we add an option like In general, I wonder if it makes a difference whether a link occurs inside or outside a |
I personally check links in Markdown only after building the HTML pages from it. But I agree it should be made clear, when adding such an option, that it can and does only apply for HTML input files, e.g. start with naming it
We have no "links" in any pre or code tag, and you're right that this is not even possible if you mean clickable links. Both tags take all content literally. But we have URLs inside them, e.g. when telling users how to access the web interface of their locally running application, using loopback IP, localhost, LAN hostnames or IPs. In the case of MkDocs, code tags are nice, as they allow easy copy&paste. The same applies when using a pre tag to show a script's content, filled with LAN IPs/hostnames or example URLs. I know the |
lychee should first distinguish between strict and fuzzy link detection. Strict means a string is a link if and only if it is rendered as a hyperlink in an HTML file. This naturally excludes tags like code or pre. Fuzzy means a string is a link if it is a valid URI of a scheme we support. After implementing this, it is easier to offer finer control to users.
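To make the strict/fuzzy distinction concrete, here is a minimal sketch in Python using only the standard library. It is not how lychee implements extraction. The names `strict_links`, `fuzzy_links`, `URL_RE` and the sample HTML are hypothetical, and "strict" is narrowed here to `<a href>` only.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern used for the "fuzzy" mode.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class StrictLinks(HTMLParser):
    """Collect only URLs that would be rendered as hyperlinks (<a href=...>)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def strict_links(html):
    """Strict mode: a string is a link only if it is an anchor's href."""
    parser = StrictLinks()
    parser.feed(html)
    parser.close()
    return parser.links


def fuzzy_links(html):
    """Fuzzy mode: anything in the input that looks like an http(s) URL."""
    return URL_RE.findall(html)


# Hypothetical sample: one real hyperlink, one URL shown literally in <code>.
SAMPLE = (
    '<p><a href="https://example.com/docs">docs</a> then open '
    "<code>http://localhost:8080/admin</code></p>"
)

print(strict_links(SAMPLE))
print(fuzzy_links(SAMPLE))
```

In this sample the strict mode only reports the anchor's `href`, while the fuzzy mode also picks up the loopback URL inside the `code` element.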
Agree with both of your comments. I guess your ideas would fit together quite nicely. |
While checking our forum for outdated links, I noticed that shortened URL text within anchor tags is also checked. phpBB shortens long URLs, so that such an element is produced:
So once a basic functionality exists to exclude URLs/text based on HTML tags, the text content of tags/elements which have an explicit src/href/... attribute could be excluded by default, as AFAIK such shortening is quite common. |
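A rough sketch of that default, assuming a fuzzy extractor that scans text nodes: URLs from `href`/`src` attributes are kept, while the text content of an anchor that carries an `href` is skipped. The class name, the regex and the sample markup are hypothetical; the sample is an invented shortened anchor, not the phpBB element referred to above.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern for text nodes.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
LINK_ATTRS = {"href", "src"}


class SkipLinkedText(HTMLParser):
    """Keep href/src values, ignore the text content of <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._linked_depth = 0  # > 0 while inside an anchor that has an href

    def handle_starttag(self, tag, attrs):
        attr_urls = [v for k, v in attrs if k in LINK_ATTRS and v]
        self.urls.extend(attr_urls)
        if tag == "a" and attr_urls:
            self._linked_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._linked_depth:
            self._linked_depth -= 1

    def handle_data(self, data):
        # Text inside a linked anchor is display text, not a new link.
        if not self._linked_depth:
            self.urls.extend(URL_RE.findall(data))


def extract(html):
    parser = SkipLinkedText()
    parser.feed(html)
    parser.close()  # flushes any trailing text node
    return parser.urls


# Hypothetical sample: the anchor text is a shortened form of the real target.
SHORTENED = (
    '<a href="https://example.com/a/very/long/path/page.html">'
    "https://example.com/a/very ... page.html</a>"
    " see https://example.org/plain"
)

print(extract(SHORTENED))
```

Without the depth check, the truncated display text `https://example.com/a/very` would be reported as a separate (and broken) URL.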
That's odd. If this is a file with a |
Ah this was when scanning URLs (with |
Adding this idea here: CSS selector syntax based excludes (and exclusive includes, if it turns out there is a wider use case for such) would be awesome. But I guess that's hard to achieve by hand, so only if the underlying HTML parser or other library supports such already. |
As a workaround for HTML tags, I wonder if lychee could be combined with htmlq or something similar.
The There's also pup, which has a PR open for that for a long time: ericchiang/pup#81. |
But that would not work when using It is not a that big issue for us. The
We have at least one web application where indeed an invalid example email is required for initial login, from where it can be changed 😉. |
Wait, you say liche supported this? Can you post an example that you used before? Can't find much about it in their repo. |
That sounds more like a bug / weird feature on the site no? You could exclude that with |
without any additional configuration. When switching to
😄 well it is not uncommon that a web application freshly installed ships with a default login. And the particular one uses an invalid email address as default login, as it generally aims to have email addresses as usernames, that's all.
The regular |
That's smart. Starting to work on that as part of #414. |
As part of #424 I added I'm thinking to support an option like |
That sounds great, many thanks for working on this 👍. |
Examples for CSS selectors that we could support eventually. I saw these being used in an internal tool:
a[href]
[href=*='*.js']
[src*='.js']
CSS selectors would be awesome, very flexible, and simple tags are just selected as simple as |
Yeah I think selectors are the better abstraction over filtering by element name. I'm a bit concerned about the overhead. @untitaker would filtering by CSS selector still be within the scope of html5gum? Would be nice if elements could be filtered out as early as possible to avoid unnecessary allocs. 😅 See also |
Dito. At least it shouldn't cause overhead when the option is not used and on non-HTML inputs. However, for our projects it was trivial to work around the need for this via |
the overhead is proportional to the CSS path. Either CSS path or XPath are just ways to specify a node in the HTML tree. Implementation wise that's really easy. |
Make sense. I just have no idea how much relative overhead it would be to check expected CSS paths, like |
that requires adding the tree building/dom logic to html5gum, which is all well in scope but a very big task. the spec for assembling tokens into a tree structure is bigger than the tokenization spec. you could have a css selector engine that does not allow you to probe for hierarchy. e.g. if you disallow |
Probably I need some explanation of how this tokenization actually works, especially when comparing html5gum and html5ever. I mean, lychee currently detects and checks specific HTML tags, or rather their src/href attributes (otherwise it wouldn't be able to detect relative/internal URLs), but it also detects URIs in any other tag and in raw text. So since it detects tags and their attributes, it should at least be possible to skip over defined tags, or those with a defined class, shouldn't it? That would not allow full CSS selectors unless a DOM tree is generated while parsing, but for practical use, excluding specific tags, classes, or tags with classes should cover the vast majority of cases. So basically a minimal CSS selector syntax like |
We don't need upstream packages to incorporate filtering based on CSS path. |
@MichaIng there's no conceptual difference between html5gum tokenizer and html5ever tokenizer. outside of the tokenizer however, html5ever contains a full DOM tree builder while html5gum does not. You're entirely correct with your assessment as to which css selectors would be possible right now. It's also true that html5gum doesn't have to be changed in any way to support the simple version of the feature you described. I think if somebody wants to seriously pursue this it might be worth checking whether lol-html can be used as parser (i.e. try to remove html5gum again) -- it features full css selector support. i feel like the library is otherwise rather purpose-built around cloudflare's usecase and hard to use for others. |
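The hierarchy-free subset discussed here (a tag, a class, or a tag with a class) can be matched against a single start-tag token without building a DOM. The following is a small Python sketch under that assumption. `SimpleSelector`, `parse_selector` and `matches` are hypothetical names, not part of lychee, html5gum or lol-html, and the accepted syntax is limited to `tag`, `.class` and `tag.class`.

```python
from typing import NamedTuple, Optional


class SimpleSelector(NamedTuple):
    """A selector without hierarchy: optional tag name plus optional class."""
    tag: Optional[str]
    cls: Optional[str]


def parse_selector(text):
    """Parse 'tag', '.class' or 'tag.class' (hypothetical minimal syntax)."""
    tag, _, cls = text.strip().partition(".")
    if not tag and not cls:
        raise ValueError("empty selector")
    return SimpleSelector(tag.lower() or None, cls or None)


def matches(selector, tag, attrs):
    """Check one start tag (name + attribute pairs) against the selector."""
    if selector.tag is not None and selector.tag != tag.lower():
        return False
    if selector.cls is not None:
        # The class attribute is a whitespace-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if selector.cls not in classes:
            return False
    return True


# Usage with attribute pairs as a tokenizer might emit them.
print(matches(parse_selector("div.highlight"), "div", [("class", "language-bash highlight")]))
print(matches(parse_selector("div.highlight"), "span", [("class", "highlight")]))
```

Because each check only looks at the current token, the cost is one comparison per start tag and nothing at all when no selector is configured.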
Quick update for anyone who'll stumble across this issue in the future: What's left to be done is support for custom exclusions via CSS selectors. |
the next version of html5gum will include an option to not parse the |
I have a similar issue. I'm using Jekyll + Hugo to generate web pages from Markdown:

Install [eksctl](https://eksctl.io/):
```bash
if ! command -v eksctl &> /dev/null; then
# renovate: datasource=github-tags depName=weaveworks/eksctl
EKSCTL_VERSION="0.118.0"
curl -s -L "https://github.com/weaveworks/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_$(uname)_amd64.tar.gz" | sudo tar xz -C /usr/local/bin/
fi
```

which generates HTML code like:

<p>Install <a href="https://eksctl.io/">eksctl</a>:</p><div class="language-bash highlighter-rouge"><div class="code-header"> <span data-label-text="Shell"><i class="fas fa-code small"></i></span> <button aria-label="copy" data-title-succeed="Copied!"><i class="far fa-clipboard"></i></button></div><div class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre><td class="rouge-code"><pre><span class="k">if</span> <span class="o">!</span> <span class="nb">command</span> <span class="nt">-v</span> eksctl &> /dev/null<span class="p">;</span> <span class="k">then</span>
<span class="c"># renovate: datasource=github-tags depName=weaveworks/eksctl</span>
<span class="nv">EKSCTL_VERSION</span><span class="o">=</span><span class="s2">"0.118.0"</span>
curl <span class="nt">-s</span> <span class="nt">-L</span> <span class="s2">"https://github.com/weaveworks/eksctl/releases/download/v</span><span class="k">${</span><span class="nv">EKSCTL_VERSION</span><span class="k">}</span><span class="s2">/eksctl_</span><span class="si">$(</span><span class="nb">uname</span><span class="si">)</span><span class="s2">_amd64.tar.gz"</span> | <span class="nb">sudo tar </span>xz <span class="nt">-C</span> /usr/local/bin/
<span class="k">fi</span>
</pre></table></code></div></div>

A real example can be seen here: https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/

When I run lychee, it reports URLs inside "code" as invalid, for example:

❯ lychee https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/
...
Issues found in 1 input. Find details below.
[https://ruzickap-github-io.pages.dev/posts/cheapest-amazon-eks/]:
✗ [404] https://github.com/weaveworks/eksctl/releases/download/v | Failed: Network error: Not Found
✗ [404] https://awscli.amazonaws.com/awscli-exe-linux-x86_64- | Failed: Network error: Not Found
✗ [404] https://stefanprodan.github.io/podinfo | Failed: Network error: Not Found
✗ [403] https://charts.bitnami.com/bitnami | Failed: Network error: Forbidden
✗ [404] https://ruzickap-github-io.pages.dev/assets/img/posts/2022/2022-11-27-cheapest-amazon-eks/https:/raw.githubusercontent.com/aws-samples/eks-workshop/65b766c494a5b4f5420b2912d8373c4957163541/static/images/icon-aws-amazon-eks.svg | Failed: Network error: Not Found
✗ [404] https://storage.googleapis.com/kubernetes-release/release/v | Failed: Network error: Not Found
✗ [ERR] file://tmp/ | Failed: Cannot find file
🔍 63 Total ✅ 52 OK 🚫 7 Errors (HTTP:7) 💤 4 Excluded

Can I instruct lychee to ignore the URLs that are part of the code sections somehow? In "code" sections there may be some variables / templates / substitutions (like above) which should be "excluded" from URL checks. Thank you...
Actually, code and pre tags have been excluded automatically for a while. But what I noticed is that this does not apply to other tags inside code or pre tags, like span in your case. This is often the case with Markdown code highlighting. It is probably possible without much effort to exclude everything in verbatim tags recursively.
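One way to get the recursive behaviour is a depth counter: it goes up on every opening verbatim tag and down on the matching close, and everything seen while it is non-zero is ignored, including nested `span` elements. The sketch below is in Python with the standard library and is not the change made in lychee itself. The class name, the regex and the shortened sample markup are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern for text nodes.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
VERBATIM_TAGS = {"pre", "code"}


class SkipVerbatim(HTMLParser):
    """Extract URLs, ignoring anything nested inside <pre> or <code>."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._verbatim_depth = 0  # number of currently open verbatim elements

    def handle_starttag(self, tag, attrs):
        if tag in VERBATIM_TAGS:
            self._verbatim_depth += 1
        elif self._verbatim_depth == 0:
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.urls.append(value)

    def handle_endtag(self, tag):
        if tag in VERBATIM_TAGS and self._verbatim_depth:
            self._verbatim_depth -= 1

    def handle_data(self, data):
        # Text inside <span> inside <pre> is still verbatim content.
        if self._verbatim_depth == 0:
            self.urls.extend(URL_RE.findall(data))


def extract(html):
    parser = SkipVerbatim()
    parser.feed(html)
    parser.close()  # flushes any trailing text node
    return parser.urls


# Shortened, hypothetical version of highlighted output like the one above.
HIGHLIGHTED = (
    '<p>Install <a href="https://eksctl.io/">eksctl</a>:</p>'
    '<pre><span class="s2">"https://github.com/weaveworks/eksctl/releases/download/v</span>'
    '<span class="k">${</span></pre>'
)

print(extract(HIGHLIGHTED))
```

With this approach the partial URL ending in `/download/v` is never reported, because the `span` that contains it sits inside an open `pre`.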
Working on a fix in #847
Currently lychee seems to check each and every URL found in the HTML document. Generally there is nothing bad about it. We previously used liche, which only checks specific tags and attributes.

While generally checking URLs from text is not bad, IMO, it would be great to be able to exclude certain elements, not just based on the URL itself. As HTML files are already parsed as such, it should be cheaply possible to exclude specific tags, e.g. <code> and <pre> tags. Allowing URL excludes to be specified based on tags via an option would be awesome.