Select Markdown links based on their tag #270

Open
T145 opened this issue Jun 25, 2021 · 18 comments

Comments

@T145

T145 commented Jun 25, 2021

Something like:

./lychee README.md -t 3 -m 2 --exclude-mail -v --md-tags link

Would then exclusively select all links formatted as such:

[link](https://github.com/lycheeverse/lychee)
@MichaIng
Member

Somewhat similar to #259.

What you call a "tag" here is the link text: <a href="https://github.com/lycheeverse/lychee">text</a>. I wonder whether this is a generic enough component to base an include/exclude option on, as the text is usually different for every link. Alternatively, you use <https://github.com/lycheeverse/lychee> so that the text matches the URL itself, and even the angle brackets are optional for most interpreters.
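To make the "link text" idea concrete, here is a minimal Python sketch that pulls (text, URL) pairs out of raw Markdown and keeps only the URLs whose text matches. It is not how lychee works; the regular expressions are a deliberate simplification (no nested brackets, titles, or reference-style links), and the function names are made up for illustration.

```python
import re

# Inline links: [text](url). Simplified on purpose: no nested brackets,
# no link titles, no reference-style links.
INLINE_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
# Autolinks: <https://example.com>; here the URL doubles as the link text.
AUTOLINK = re.compile(r"<(https?://[^>\s]+)>")


def extract_links(markdown):
    """Return (text, url) pairs for inline links and autolinks."""
    pairs = [(m.group(1), m.group(2)) for m in INLINE_LINK.finditer(markdown)]
    pairs += [(m.group(1), m.group(1)) for m in AUTOLINK.finditer(markdown)]
    return pairs


def select_by_text(markdown, wanted_text):
    """Keep only the URLs whose link text equals wanted_text."""
    return [url for text, url in extract_links(markdown) if text == wanted_text]


sample = (
    "[link](https://github.com/lycheeverse/lychee) and "
    "[docs](https://example.com/docs) and <https://example.com/raw>"
)
selected = select_by_text(sample, "link")
print(selected)
```

The sketch also shows the weakness discussed here: the match depends entirely on the literal text, which tends to differ from link to link.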


EDIT: I'm just collecting a few, probably dumb, ideas below about how to combine both requests, and possibly more similar ones that may arise. But I'm too tired to come up with a clever one, I guess. I'll leave it and review tomorrow, maybe some approach is useful after all 😄 😴.

Using an attribute or class to exclude links would be great IMO, but Markdown has no such thing without extensions. The attribute extension allows something like [link](https://github.com/lycheeverse/lychee){: .exclude }, which adds the HTML element to the "exclude" class in the resulting HTML document. But that is hard to parse, as the colon is optional and class="... exclude ..." can be used as well to add the element to one or multiple classes. It gets difficult if there is no Markdown parser library with support for this extension, and of course not all Markdown files are converted with this extension/syntax in mind, or converted at all.
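On the HTML side, a class-based exclusion is comparatively easy to sketch. The following Python example uses only the standard library's `html.parser`; the class name "exclude" comes from the example above, while `ClassAwareCollector` and `links_without_class` are hypothetical names, not part of lychee.

```python
from html.parser import HTMLParser


class ClassAwareCollector(HTMLParser):
    """Collect href values from <a> tags, skipping a given class."""

    def __init__(self, excluded_class):
        super().__init__()
        self.excluded_class = excluded_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # class="nav exclude" lists several classes separated by whitespace.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.excluded_class not in classes:
            self.urls.append(href)


def links_without_class(html, excluded_class="exclude"):
    """Return all <a href> values except those in excluded_class."""
    collector = ClassAwareCollector(excluded_class)
    collector.feed(html)
    collector.close()
    return collector.urls


page = (
    '<a class="nav exclude" href="https://example.com/skip">skip</a>'
    '<a href="https://example.com/keep">keep</a>'
)
kept = links_without_class(page)
print(kept)
```

This only helps when the Markdown is rendered to HTML first, which is exactly the limitation described above.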

I'm thinking about a single option which covers HTML and Markdown (and probably all types of) documents, else it might get a never ending list of options...

  • Other linters often respect comments given in the document. So a flexible behaviour would be, e.g., to let lychee skip (or explicitly include) the next URL when a <!-- lychee skip --> line or similar is seen before it, with the respective comment syntax for other formats.
  • While this is flexible, it potentially requires adding a lot of comments to the code. A tag/attribute/class-based include/exclude rule would be much nicer IMO, especially for HTML, but it is difficult to find an equivalent for Markdown.
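The comment-directive idea could look roughly like the Python sketch below. The `<!-- lychee skip -->` directive is only the suggestion made here, not an existing lychee feature, and the URL pattern is a rough approximation chosen for the example.

```python
import re

# Rough URL pattern for the example; stops at whitespace, brackets, quotes.
URL = re.compile(r"https?://[^\s<>()\"']+")
# Hypothetical directive, as proposed in this comment.
SKIP = re.compile(r"<!--\s*lychee skip\s*-->")
# Scan directives and URLs together so they are seen in document order.
TOKEN = re.compile(f"({SKIP.pattern})|({URL.pattern})")


def urls_to_check(text):
    """Collect URLs, dropping the first URL that follows each skip comment."""
    kept = []
    skip_next = False
    for match in TOKEN.finditer(text):
        if match.group(1):
            skip_next = True  # directive seen: the next URL is ignored
        elif skip_next:
            skip_next = False  # this URL is consumed by the directive
        else:
            kept.append(match.group(2))
    return kept


doc = """<!-- lychee skip -->
[a](https://example.com/skipped)
[b](https://example.com/checked)
"""
result = urls_to_check(doc)
print(result)
```

Because the directive is a plain HTML comment, the same scan works for HTML and Markdown input alike; other formats would need their own comment syntax.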

And you ask for an include option, while my request was for an exclude option. They could probably be merged by using an inversion flag: something like calling the option --filter 'excludeThis', and using --include --filter 'includeThis' to turn all filter rules into exclusive includes instead of excludes. Code-wise, the filter values would then be used to decide whether a URL is checked or not, and in the presence of the --include flag the result is simply inverted in all cases.
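The inversion logic itself is small. A minimal Python sketch, assuming the filters are regular expressions matched against the URL (the `--filter` and `--include` options are only proposed here, and `should_check` is a hypothetical name):

```python
import re


def should_check(url, filters, include=False):
    """Decide whether a URL is checked.

    By default, a URL matching any filter is excluded. With include=True
    the same decision is inverted, so only matching URLs are checked.
    """
    matches = any(re.search(pattern, url) for pattern in filters)
    return matches if include else not matches


urls = ["https://example.com/a", "https://github.com/lycheeverse/lychee"]
filters = [r"github\.com"]

exclude_mode = [u for u in urls if should_check(u, filters)]
include_mode = [u for u in urls if should_check(u, filters, include=True)]
print(exclude_mode, include_mode)
```

One consequence of a pure inversion: with `include=True` and an empty filter list, nothing is checked at all.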

@T145
Author

T145 commented Jun 25, 2021

@MichaIng There could be a similar --html-tags flag that when present only includes the elements specified. If you want to get gritty about specific elements, a valid tag option could be something like CSS syntax to select an element.
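As a rough illustration of what an --html-tags style selection might do, the Python sketch below collects URLs only from the tag names it is given. The flag does not exist in lychee; the tag-to-attribute table and the names `TagLinkCollector` and `links_from_tags` are assumptions for the example, and full CSS selector support would need a dedicated library.

```python
from html.parser import HTMLParser

# Attribute that carries the URL for each tag; a small illustrative subset.
URL_ATTRS = {"a": "href", "img": "src", "link": "href", "script": "src"}


class TagLinkCollector(HTMLParser):
    """Collect URLs only from the tags named in wanted_tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.wanted_tags or tag not in URL_ATTRS:
            return
        value = dict(attrs).get(URL_ATTRS[tag])
        if value:
            self.urls.append(value)


def links_from_tags(html, wanted_tags):
    """Return the URLs found on the selected tags, in document order."""
    collector = TagLinkCollector(wanted_tags)
    collector.feed(html)
    collector.close()
    return collector.urls


page = (
    '<a href="https://example.com/a">x</a>'
    '<img src="https://example.com/i.png">'
)
only_anchors = links_from_tags(page, ["a"])
only_images = links_from_tags(page, ["img"])
print(only_anchors, only_images)
```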

@MichaIng
Member

CSS selector syntax would be awesome indeed. But I wouldn't want to put the burden on the devs of implementing such a complex parser, so I guess it depends on whether there is a reliable library which can do it nicely.

@lebensterben
Member

The implementation is not hard.
But in what kind of scenario would a typical user want to filter links by link text?

@T145
Author

T145 commented Jun 26, 2021

Grabbing specific links aids automation immensely. It really makes using this in a GitHub Action environment more favorable.

@lebensterben
Member

In CI, the normal use case is to blindly check every link found, with optional filtering based on link pattern (that's already supported).

Adding the suggested function just doesn't add extra utility to normal CI users.

@lebensterben
Member

There's an alternative solution which probably can suit your needs.
We can have lychee log failed links to a file, and add an option to lychee or its CI workflow to 'resume' the previous job by checking only the failed links.
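A minimal Python sketch of that log-and-resume flow, with the actual link check injected as a function so the example stays offline. The one-URL-per-line log format and the names `run_check` and `resume` are assumptions for illustration, not lychee behaviour.

```python
import tempfile
from pathlib import Path


def run_check(urls, is_alive, failed_log):
    """Check urls with is_alive and write the failures to failed_log."""
    failed = [url for url in urls if not is_alive(url)]
    # One URL per line; an empty file means the last run was clean.
    Path(failed_log).write_text("\n".join(failed), encoding="utf-8")
    return failed


def resume(is_alive, failed_log):
    """Re-check only the URLs recorded as failed by the previous run."""
    previous = Path(failed_log).read_text(encoding="utf-8").split()
    return run_check(previous, is_alive, failed_log)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "failed.txt"
    # First run: pretend that one of the two links is broken.
    first = run_check(
        ["https://a.example", "https://b.example"],
        lambda url: url != "https://b.example",
        log,
    )
    # Second run: only the failed link is re-checked, and it now passes.
    second = resume(lambda url: True, log)

print(first, second)
```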

@T145
Author

T145 commented Jun 26, 2021

Other utilities can be used to do that though. My point is that some level of taking away the "blindness" would be good in general. The reason that use case is the most you've seen is b/c anyone who needs just that picks up this utility. Anyone else who needs something different will immediately try to find a better solution. I haven't been able to find another All-In-One utility that can just pick out links that match a specific Markdown or HTML tag.

@MichaIng
Member

MichaIng commented Jun 26, 2021

But is it really the link text that you want to match against? I mean, do you have a lot of "Read more" links and need to check those exclusively? I also can't really imagine a use case without at least a more generic identifier, like a class or some other kind of mark, as mentioned above, which indeed is difficult in pure Markdown. This is also the reason why we do not check the Markdown files but the resulting HTML files after they are generated. But that's not done in everyone's case, I agree.

So it would be interesting to hear or see an example about where and how you'd use this feature, to better understand a possible pattern of use cases.

@T145
Author

T145 commented Jun 26, 2021

The link text wouldn't be what's matched: it'd be the tags.

Another cool thing I thought of could be mixing the html and md tag selection flags, and only have a single comma indicate when you want none. Usually there would be multiples delimited by comma. E.g. this:

lychee README.md --html-tags , --md-tags link

Would select all Markdown links in the [link](mylink) format and ignore all html links included in the document. Redundant yes, but it's just to illustrate the point in one command.
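The "single comma means none" convention can be sketched in a few lines of Python. The flags are only proposed in this thread, and `parse_tag_list` is a hypothetical helper name.

```python
def parse_tag_list(value):
    """Split a comma-delimited option value into tag names.

    Empty entries are dropped, so a lone "," yields an empty list,
    which is read as "select none".
    """
    return [tag.strip() for tag in value.split(",") if tag.strip()]


html_tags = parse_tag_list(",")      # --html-tags ,
md_tags = parse_tag_list("link")     # --md-tags link
several = parse_tag_list("a, img")   # multiple tags, comma-delimited
print(html_tags, md_tags, several)
```

A caller still has to distinguish "flag given with an empty list" from "flag not given at all", for example by using `None` for the latter.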

@lebensterben
Member

This doesn't fit the standard *nix CLI style.

@MichaIng
Member

MichaIng commented Jun 26, 2021

@T145
Did you try it? In [link](mylink), link is the link text. The concept of a "tag" doesn't really exist in Markdown:

  • [link](mylink) => link

In the translated HTML document, "<a>" itself is the tag, but not the text between the start and end tags.

@T145
Author

T145 commented Jun 26, 2021

Yes, I'm aware. However it's handled is fine w/ me, just so long as there can be some level that links can be selected at. Idk if this program parses raw Markdown or converts it into HTML first.

@MichaIng
Member

I'm quite sure that Markdown is not converted by lychee (correct me if I'm wrong), and it is good that it does not even try. As mentioned, Markdown has no hard syntax: it has different flavors and can be extended, which enables plenty of different link syntaxes that lychee practically cannot handle reliably. So URLs are most likely found in the raw text input without interpreting Markdown in any special way (again, correct me if I'm wrong).

just so long as there can be some level that links can be selected at

IMO it does not make sense to implement a feature only until another/better feature has been added; that would be a waste of development time. Without a convincing example of a document where selecting Markdown links by link text makes sense, I would vote against this, as there are IMO requested features with a wider use case. Adding an option to select and/or exclude links in HTML documents based on tags or CSS selectors would find more use, and can help in the case of Markdown documents as well, when those are translated into HTML within the CI/CD pipeline with a defined Markdown parser and extensions.

@T145
Author

T145 commented Jun 26, 2021

You can't see that the use cases are similar? A lot of assumptions are being made on either end about how this program works, so let the developers make their assessments on our respective recommendations.

@lebensterben
Member

@MichaIng
You're right. lychee doesn't convert input file(s).

@mre
Member

mre commented Dec 3, 2021

An alternative approach would be to use a Markdown command-line processor for extracting the tags and only use lychee on its output:

md --md-tags link | lychee -

Note that md does not exist. It would be a utility similar to jq and could be helpful for many use-cases. I wonder if there's a tool out there like this. At least a quick search didn't reveal anything. If not then it's either very hard or somebody should build it.

mre added the enhancement (New feature or request) and workaround labels on Feb 4, 2022
@mre
Member

mre commented Jan 6, 2025

Today I heard of mdq, which is helpful for this task. https://github.com/yshavit/mdq

Perhaps people can try it out to see if they can extract links based on tags and feed that into lychee.
