Simple URI cache #243

Merged — 4 commits from cache merged into master on May 4, 2021
Conversation

@mre (Member) commented May 3, 2021

No description provided.

@mre changed the title from "WIP: First simple version of URI cache" to "Simple URI cache" on May 4, 2021
@mre marked this pull request as ready for review on May 4, 2021 at 11:18
@mre merged commit fe399c0 into master on May 4, 2021
@mre deleted the cache branch on May 4, 2021 at 11:28
@MichaIng (Member) commented Sep 4, 2021

I actually had the impression that repeated links were already checked only once before, i.e. that multiple identical URLs were already handled as a single one, which was also visible as a single entry in the summary. Testing the local files support branch reinforced this assumption: #262 (comment)
Before, a series of links like:

<a class="nav-link" href="/#features">Features</a>
<a class="nav-link" href="/#download">Download</a>
<a class="nav-link" href="/#gettingstarted">Getting Started</a>
<a class="nav-link" href="/#testimonials">Testimonials</a>

was collected as a single URI (already with lychee v0.7.0 and before). Only on the local files branch, where #xyz was URL-encoded and treated as part of the path, were they collected as four individual URIs.

Also, the number of checked links shown in the summary has not changed from v0.7.0 to v0.7.1.

Another point:

These links are all the same and should only be checked once:
https://example.org/
https://example.org/
https://example.org

IMO https://example.org/ and https://example.org are not the same and should be checked individually:

  • Usually https://example.org is redirected to https://example.org/, but not always, and the redirect itself is something one might want to check (and treat as an error). For example, I prefer to link to the redirect target directly rather than to a URL that implies a known/expected redirect, both to reduce unnecessary requests and for transparency.
  • And there are indeed cases where a URL fails with the trailing slash, or fails without it. Since this is a common typo, it is exactly the kind of thing I'd want a link checker to cover.

I did some tests to verify and investigate the v0.7.0 vs v0.7.1 behaviour:

2021-09-04 11:21:10 root@dietpi:/tmp# ./lychee-v0.7.0 test.txt
📝 Summary
---------------------
🔍 Total............1
✅ Successful.......1
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:21:22 root@dietpi:/tmp# ./lychee-v0.7.1 test.txt
📝 Summary
---------------------
🔍 Total............1
✅ Successful.......1
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:21:26 root@dietpi:/tmp# cat test.txt
https://dietpi.com
https://dietpi.com

Even with a trailing slash it wasn't treated as multiple URIs before:

2021-09-04 11:21:58 root@dietpi:/tmp# ./lychee-v0.7.0 test.txt
📝 Summary
---------------------
🔍 Total............1
✅ Successful.......1
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:22:07 root@dietpi:/tmp# ./lychee-v0.7.1 test.txt
📝 Summary
---------------------
🔍 Total............1
✅ Successful.......1
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:22:09 root@dietpi:/tmp# cat test.txt
https://dietpi.com
https://dietpi.com
https://dietpi.com/

Ah, I see: with a trailing slash after a path element, the URLs are treated as different, which covers the biggest concern I had above:

2021-09-04 11:23:05 root@dietpi:/tmp# ./lychee-v0.7.0 test.txt
📝 Summary
---------------------
🔍 Total............3
✅ Successful.......3
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:23:07 root@dietpi:/tmp# ./lychee-v0.7.1 test.txt
📝 Summary
---------------------
🔍 Total............3
✅ Successful.......3
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0
2021-09-04 11:23:11 root@dietpi:/tmp# cat test.txt
https://dietpi.com
https://dietpi.com
https://dietpi.com/
https://dietpi.com/survey/
https://dietpi.com/survey

So after all, nothing seems to have changed. Does the parser crate perhaps already merge matching URLs internally? Or did I misunderstand the goal of this PR?

@mre (Member, Author) commented Sep 4, 2021

The output should stay the same. This is just an internal optimization that speeds up the checking because already checked URLs get ignored.

About the trailing slash at the end: they are indeed technically different, as you pointed out. The way equivalence is currently implemented is more of a design decision of the url crate, however. I guess I'll just follow whatever it considers equivalent.

That said, at least Google and ahrefs say they are the same, so you could get some false positives no matter how you look at it.
https://ahrefs.com/blog/trailing-slash/
https://searchfacts.com/url-trailing-slash/
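
For what it's worth, a minimal check with the url crate (not lychee code) shows exactly that behaviour: an empty path is normalized to "/", while a trailing slash after a path segment is preserved:

use url::Url;

fn main() {
    // The url crate normalizes an empty path to "/", so these two parse
    // to the same URL...
    assert_eq!(
        Url::parse("https://example.org").unwrap(),
        Url::parse("https://example.org/").unwrap()
    );

    // ...while a trailing slash after a path segment stays distinct.
    assert_ne!(
        Url::parse("https://example.org/survey").unwrap(),
        Url::parse("https://example.org/survey/").unwrap()
    );
}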

@MichaIng (Member) commented Sep 4, 2021

Yes, I think it is okay, and despite the redirect involved, I agree with the statements in the articles you linked, which match the behaviour of lychee:

Trailing slashes after the domain name don’t matter

domain.com = domain.com/

These URLs are treated exactly the same and it doesn’t matter which version you use.

Trailing slashes matter for other URLs

domain.com/page ≠ domain.com/page/

For every case besides the trailing slash directly after the root domain, a trailing slash will be treated as a separate URL.

So, while I do kind of miss the ability to check for the redirect from domain.com to domain.com/, it is by no means important enough to ask for a change; I'm just being too much of a perfectionist 😅.

This is just an internal optimization that speeds up the checking because already checked URLs get ignored.

As said, I don't believe a URL was ever checked a second time before; I believe the list of URLs to process already consisted of unique URLs in the first place. I don't have enough data yet to verify this, but on our main website, for example, lychee's execution time increased with v0.7.1 from a reliable 2 seconds to 7 seconds. So the pre-processing involved in either this PR or another commit of v0.7.1 seems to have added 5 seconds, even though we have a lot of duplicate URLs on our website. Not that I'd care about 5 seconds 😄.

On our documentation branch, where we currently need to check each directory individually (until local file checking is ready 🙂), it takes ~5 minutes (with high fluctuation, to be fair) both before and after the update, even though many URLs appear several times due to lots of internal navigation links, TOCs, etc.

There was a hash set (or something similar) created before; did that perhaps already mean that existing entries simply got overwritten by another URL with the same hash?

I'll count the number of actual HTTP(S) requests made by the two versions, to be sure 🙂.

@MichaIng (Member) commented Sep 4, 2021

Okay, to sum it up, I suspect the issue is in the following code block: https://github.com/lycheeverse/lychee/pull/262/files#diff-9974c1c6274d302fb56650636debf746561e468f46fccc30dec65f4345abe268R125-R128

fn create_uri_from_path(root: &Path, base: &Option<Base>, link: &str) -> Result<Url> {
    let link = url::remove_get_params(link);
    let path = path::resolve(root, &PathBuf::from(&link), base)?;
    Url::from_file_path(&path).map_err(|_e| ErrorKind::InvalidPath(path))
}

from_file_path likely URL-encodes the file path, which is what one would expect for any path that works in a regular file manager. But here the "path" is not only a path: it may contain a query string and a fragment, which should not be encoded but either stripped or interpreted as what they stand for in URLs.

One solution would indeed be to strip the fragment in the same code block, right after the params; since the fragment comes after the query string, stripping the params already removes it, so that step can be skipped when a query string is present 😉. But actually, I would prefer to have no URL encoding done on the link part at all, as we want to fail in cases where the input URL is missing required URL encoding. I'm just not sure how best to achieve this here — probably by using from_file_path on the root/base path only and then appending the link literally, with the query string and fragment still stripped, of course?
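
A rough sketch of the fragment-stripping variant, reusing the names from the snippet above (not a verified patch against the actual code), could look like this:

fn create_uri_from_path(root: &Path, base: &Option<Base>, link: &str) -> Result<Url> {
    let link = url::remove_get_params(link);
    // Drop the fragment: everything from the first '#' on is ignored, so
    // e.g. "/#features" and "/#download" both resolve to "/".
    let link = link.split('#').next().unwrap_or("");
    let path = path::resolve(root, &PathBuf::from(link), base)?;
    Url::from_file_path(&path).map_err(|_e| ErrorKind::InvalidPath(path))
}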

@mre (Member, Author) commented Sep 4, 2021

As said, I don't believe that an URL was ever checked a second time before, but I believe the list of URLs to process was one with unique URLs in the first place.

That is true for single files, but now it's also true across files thanks to the cache. That was the intention.

So the pre-processing involved with either this PR or another commit of v0.7.1 seems to have added 5 seconds

That's bad

Not that I'd care about 5 seconds 😄.

I do.

The question is where this regression comes from. I doubt that it's a code change, to be honest. While we do some more preprocessing now, I expect the impact to be in the microseconds, certainly not seconds. If this is reproducible across different machines and different network connections, then we have to find the root cause.

There was a hash array or so created before, did this probably already mean that already existing entries simply got overwritten by another URL with the same hash?

Yeah, it was a simple set. It was only filtering out duplicate links within a single file though, not across files.
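
To illustrate the difference, a simplified sketch with plain strings (not lychee's actual types):

use std::collections::HashSet;

// Per-file deduplication (roughly the old behaviour): duplicates are only
// collapsed within one file.
fn unique_per_file(files: &[Vec<String>]) -> Vec<HashSet<String>> {
    files.iter().map(|links| links.iter().cloned().collect()).collect()
}

// Cross-file cache (the goal of this PR): a URL seen in any earlier file is
// skipped in later files as well.
fn unique_across_files(files: &[Vec<String>]) -> Vec<Vec<String>> {
    let mut seen = HashSet::new();
    files
        .iter()
        .map(|links| {
            links
                .iter()
                // insert() returns false for URLs already seen anywhere.
                .filter(|link| seen.insert((*link).clone()))
                .cloned()
                .collect()
        })
        .collect()
}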

as we want to fail in cases where the input URL misses required URL encoding

I'm not an expert on that. How would URL encoding work for paths on the filesystem?
For example, how would slashes and spaces be encoded?
For a link like this

/foo/bar/this_is/a test

I would encode it like this for the web:

%2Ffoo%2Fbar%2Fthis_is%2Fa%20test

But like this for the filesystem:

/foo/bar/this_is/a\ test

Am I missing something here?

I think stripping the fragment would be enough. It was just an oversight on my end.
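
For reference, a minimal check of what the url crate's Url::from_file_path does with such a path (my understanding of the crate's behaviour, worth double-checking): slashes stay path separators, while characters like the space get percent-encoded:

use url::Url;

fn main() {
    // from_file_path requires an absolute path and rejects relative ones.
    let url = Url::from_file_path("/foo/bar/this_is/a test").unwrap();
    // Slashes remain separators; the space becomes %20.
    assert_eq!(url.as_str(), "file:///foo/bar/this_is/a%20test");
}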

@MichaIng (Member) commented Sep 4, 2021

It was only filtering out the duplicate links in a single file though, not across files.

Ahh, this is what I missed 👍. The 2 s vs. 7 s comparison was, by the way, based on a single workflow run (2 s repeatedly before, but 7 s only once after the lychee update; EDIT: repeated it, see below), so I did some more controlled local tests now:

10 identical input files with 5 URLs each, some of them duplicated, so that in the end 3 URLs are checked per file:

# cat test0.txt
https://dietpi.com/
https://dietpi.com
https://dietpi.com/
https://dietpi.com/survey
https://dietpi.com/survey/
# ./lychee-v0.7.0 -vX head test*.txt
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey/ [200 OK]
✔ https://dietpi.com/ [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/survey [200 OK]
✔ https://dietpi.com/survey [200 OK]

📝 Summary
---------------------
🔍 Total...........30
✅ Successful......30
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0

I haven't found a way to count the number of HTTPS requests, but used tcpdump to count the number of outgoing packets to this destination instead:

# tcpdump --count 'dst dietpi.com and tcp port 443'
380 packets captured

Quite some overhead due to HTTPS etc.; it varies a little from run to run, between ~373 and ~385.

Doing this with lychee v0.7.1 leads to the same number of packets.

Checking only 5 of those 10 files halves the number of captured packets with both versions.

So it doesn't seem to work yet in my case.

Processing time fluctuates too much, between 0.3 and 0.5 seconds in both cases; I cannot see any significant difference, at least.


I set up a branch to test lychee v0.7.0 again in a GitHub workflow, but since it repeatedly took only 2 seconds before, I'm not sure what to make of it.

I deployed the above-tested website locally and couldn't see any significant difference in execution time between the two lychee versions. So while skipping duplicate URLs doesn't seem to work yet in my cases, it didn't make things worse. No idea what happened to the GitHub runners to coincidentally show this significant difference on the same day lychee v0.7.1 was released.

What I noticed locally is that sometimes lychee has already printed the result table but then still "hangs" for a few seconds before actually exiting. The overall time was always 10–12 seconds, but sometimes the table was printed at the end of that time, and sometimes it was printed after ~4 seconds with another 6 seconds or more before the process actually exited. Is there some post-processing that sometimes runs in parallel with the last open connections and sometimes only after all checks have finished?

@mre (Member, Author) commented Sep 4, 2021

That's weird. Can you set the output to verbose and check if you find anything special about the URLs printed after the status?

@lebensterben (Member)

Maybe it's time to refactor the code again to make it faster.

@MichaIng (Member) commented Sep 5, 2021

Nothing is printed anymore; here is a little screen recording:

[screen recording]

It's exactly the same with both versions. After excluding my mail address, I repeatedly get 10.2xx seconds with both versions. Somehow the mail address check sometimes takes longer, and it does not seem to respect the timeout.

But step by step: I think the first issue is to make the URI cache do what it's supposed to do. According to the packet-count tests I did above, it does not seem to work with this simple reproducible example. It should indeed be an enhancement for the heaviest use cases, where a lot of internal navigation links across several pages are involved.

@MichaIng (Member)

I just wanted to ask whether a new release would be feasible, but then remembered this issue. Was anyone able to replicate this, or to figure out why the URL cache is currently not working? Shall I open a new issue about it?

The tests I did in this regard are above: #243 (comment)

@mre (Member, Author) commented Sep 30, 2021

Yeah, let's create a new issue so that this doesn't get forgotten. Also, if you manage to create a failing test somehow, that would be cool, but it's not mandatory. (I was thinking about mocking the cache and tracking the inserts.)

@MichaIng (Member)

Issue opened: #348
I would be able to extend the existing test to span two files, IF identical URIs in multiple files are meant to contribute only +1 to the total in the output. Otherwise, I'm not sure how best to do it, as we would need to inspect internals or somehow count the number of requests or packets, as I did above.
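
One option, just as a sketch (it assumes the wiremock and tokio crates and leaves the actual invocation of the checker as a placeholder, since I don't know the best entry point for that), would be to count the requests that actually reach a local mock server:

use wiremock::matchers::any;
use wiremock::{Mock, MockServer, ResponseTemplate};

#[tokio::test]
async fn identical_url_across_two_files_is_requested_once() {
    // Local server that records every incoming request.
    let server = MockServer::start().await;
    Mock::given(any())
        .respond_with(ResponseTemplate::new(200))
        .mount(&server)
        .await;

    // Hypothetical step: write server.uri() into two temporary input files
    // and run the checker against both of them here.

    // If the cross-file cache works, the duplicated URL should have reached
    // the server exactly once.
    let requests = server.received_requests().await.unwrap_or_default();
    assert_eq!(requests.len(), 1);
}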
