Simple URI cache #243
Conversation
I actually had the impression that repeated links were already only checked once before, i.e. that multiple identical URLs were already handled as a single one, also visible as a single entry in the summary. Now when testing the local files support branch, this assumption got reinforced: #262 (comment)

```html
<a class="nav-link" href="/#features">Features</a>
<a class="nav-link" href="/#download">Download</a>
<a class="nav-link" href="/#gettingstarted">Getting Started</a>
<a class="nav-link" href="/#testimonials">Testimonials</a>
```

got collected as a single URI (already with lychee v0.7.0 and before). Only on the local files branch, where
Also, the number of checked links shown in the summary has not changed from v0.7.0 to v0.7.1. Another point:
IMO
I did some tests to verify and investigate the v0.7.0 vs v0.7.1 behaviour:
Even with a trailing slash it wasn't treated as multiple URIs before:
Ah, I see now that with a trailing slash after a path element, it is treated as a different URL, which was the biggest concern I had above:
But after all, nothing seems to have changed. Probably the parser crate internally already merges matching URLs into one? Or did I misunderstand the goal of this PR?
The output should stay the same. This is just an internal optimization that speeds up checking, because already-checked URLs get skipped. About the trailing slash at the end: they are indeed technically different, as you pointed out. The way equivalence is currently implemented is more of a design decision of the url crate, however; I just follow whatever they consider equivalent. That said, at least Google and ahrefs say they are the same, so you could get some false positives no matter how you look at it.
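For illustration, a minimal sketch (not taken from the PR) of how the url crate handles trailing slashes, which matches the behaviour described above: a slash directly after the host is normalized away, while a slash after a path element makes the URL distinct.

```rust
use url::Url;

fn main() {
    // A trailing slash directly after the host is normalized by the url crate,
    // so both forms serialize (and compare) identically.
    let a = Url::parse("https://example.com").unwrap();
    let b = Url::parse("https://example.com/").unwrap();
    assert_eq!(a, b);

    // A trailing slash after a path element is preserved, so these stay distinct.
    let c = Url::parse("https://example.com/docs").unwrap();
    let d = Url::parse("https://example.com/docs/").unwrap();
    assert_ne!(c, d);
}
```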
Yes, I think it is okay and, despite the involved redirect, I agree with the statements you linked, which match the behaviour of lychee:
So, while I kinda miss the ability to check for the redirect of
As said, I don't believe that a URL was ever checked a second time before; rather, I believe the list of URLs to process already consisted of unique URLs in the first place. I don't have enough data yet to verify this, but e.g. on our main website, the lychee execution time increased with v0.7.1 from a reliable 2 seconds to 7 seconds, so the pre-processing involved with either this PR or another commit of v0.7.1 seems to have added 5 seconds, while we have a lot of duplicate URLs on our website. Not that I'd care about 5 seconds 😄. On our documentation branch, where we currently need to check each directory individually (until local files checking is ready 🙂), it takes ~5 minutes (with a high fluctuation, to be fair), before and after, since due to a lot of internal navigation URLs, TOC etc. many URLs appear several times. There was a hash array or similar created before; did this perhaps already mean that existing entries simply got overwritten by another URL with the same hash? I'll count the number of actual HTTP(S) requests made by the two versions, to be sure 🙂.
Okay, to sum it up, I suspect the issue is in the following code block: https://github.com/lycheeverse/lychee/pull/262/files#diff-9974c1c6274d302fb56650636debf746561e468f46fccc30dec65f4345abe268R125-R128

```rust
fn create_uri_from_path(root: &Path, base: &Option<Base>, link: &str) -> Result<Url> {
    let link = url::remove_get_params(link);
    let path = path::resolve(root, &PathBuf::from(&link), base)?;
    Url::from_file_path(&path).map_err(|_e| ErrorKind::InvalidPath(path))
}
```

from_file_path likely URL-encodes the file path, which is what one would expect for any path that works in a regular file manager. But here the "path" is not only a path: it may also contain a query string and a fragment, both of which should not be encoded but either stripped or interpreted as what they stand for in URLs. One solution would indeed be to strip the fragment in the same code block, right after the params; since the fragment comes after the params, that step can even be skipped when a query string is present 😉. But actually, I would prefer to have no URL encoding done at all on the
That is true for single files, but now it's also true across files thanks to the cache. That was the intention.
That's bad
I do. The question is where this regression comes from. I doubt that it's a code change, to be honest. While we do some more preprocessing now, I expect the impact to be in the microseconds, certainly not seconds. If this is reproducible across different machines and different network connections, then we have to find the root cause.
Yeah it was a simple set. It was only filtering out the duplicate links in a single file though, not across files.
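To illustrate the difference (hypothetical sketch, not lychee's actual code): a fresh set per file removes duplicates within that file only, while a set that outlives the per-file loop also skips URIs repeated across files.

```rust
use std::collections::HashSet;

fn main() {
    // Two "input files" sharing a duplicate link.
    let files = vec![
        vec!["https://example.com/#features", "https://example.com/#features"],
        vec!["https://example.com/#features", "https://example.com/#download"],
    ];

    // Per-file deduplication: the same URI is still checked once per file.
    let per_file: usize = files
        .iter()
        .map(|links| links.iter().collect::<HashSet<_>>().len())
        .sum();
    assert_eq!(per_file, 3); // 1 from the first file + 2 from the second

    // Cross-file cache: each unique URI is checked once for the whole run.
    let mut cache = HashSet::new();
    let cross_file = files.iter().flatten().filter(|uri| cache.insert(**uri)).count();
    assert_eq!(cross_file, 2);
}
```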
I'm not an expert on that. How would URL encoding work for paths on the filesystem?
I would encode it like this for the web:
But like this for the filesystem:
Am I missing something here? I think stripping the fragment would be enough. It was just an oversight on my end.
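As a side note on the encoding question above, a small sketch (illustrative paths, assuming a Unix filesystem) of what Url::from_file_path does when the "path" still carries a fragment: the '#' gets percent-encoded into the file URL instead of being treated as a fragment delimiter.

```rust
use url::Url;

fn main() {
    // A plain path with a space: percent-encoding is exactly what you want here.
    let plain = Url::from_file_path("/tmp/some dir/index.html").unwrap();
    assert_eq!(plain.as_str(), "file:///tmp/some%20dir/index.html");

    // A "path" that still contains a fragment: the '#' becomes %23, so the
    // fragment ends up as part of a (non-existent) file name.
    let with_fragment = Url::from_file_path("/tmp/index.html#features").unwrap();
    assert_eq!(with_fragment.as_str(), "file:///tmp/index.html%23features");
}
```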
Ahh, this is what I missed 👍. The 2s vs. 7s was btw based on a single workflow run (2s repeatedly before, but 7s only once (EDIT: repeated it, see below) after the lychee update), so I did some more controlled local tests now: 10 identical input files with 5 URLs each, some of them duplicated, so that finally 3 URLs are checked in each file:
I haven't found a way to count the number of HTTPS requests, but used tcpdump to count the number of outgoing packets to this destination instead:
Quite some overhead due to HTTPS etc., and it varies a little from run to run, between ~373 and ~385. Doing this with lychee v0.7.1 leads to the same number of packets. Doing this for only 5 of those 10 files halves the number of captured packets with both versions. So it doesn't seem to work yet in my case. Processing time fluctuates too much, between 0.3 and 0.5 seconds in both cases; I cannot see any significant difference at least. I set up a branch to test lychee v0.7.0 again in a GitHub workflow:
But before, it repeatedly took only 2 seconds, not sure:
I deployed the above-tested website locally and couldn't see any significant difference in execution time between the two lychee versions. So while skipping duplicate URLs doesn't seem to work yet in my cases, it didn't make things worse either. No idea what happened to the GitHub runners to coincidentally show this significant difference on the same day lychee v0.7.1 was released. What I noticed locally is that sometimes lychee has already printed the result table but then still "hangs" for some seconds before actually exiting. The overall time was always 10 - 12 seconds, but sometimes the table was printed at the end of that time, and sometimes it was printed already after ~4 seconds and it then still took 6 seconds or more before the process exited. Is there some post-processing that sometimes runs in parallel to the last open connections and sometimes only after all checks have finished?
That's weird. Can you set the output to verbose and check if you find anything special about the URLs printed after the status?
Maybe it's time to refactor the code again to make it faster.
Nothing is printed anymore; here is a little screen recording: It's exactly the same with both versions. After excluding my mail address, I repeatedly get 10.2xx seconds with both versions. Somehow the mail address check sometimes takes longer, and it does not seem to respect the timeout. But step by step: I think the first issue is to make the URI cache do what it's supposed to do. As of the packet count tests I did above, with this simple reproducible example it does not seem to work. That would indeed be an enhancement for the heaviest use cases, where a lot of internal navigation links across several pages are involved.
I just wanted to ask whether a new release would be feasible, but remembered this issue. Was anyone able to replicate or figure out why the URL cache is not working currently? Shall I open a new issue about this? Above are the tests I made in this regard: #243 (comment)
Yeah, let's create a new issue so that this won't get forgotten. Also, if you manage to create a failing test somehow, that would be cool, but it's not mandatory. (I was thinking about mocking the cache and tracking the inserts.)
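A rough sketch of what such a test could look like, with a mock cache counting insert attempts; all names here are hypothetical and not taken from the lychee test suite.

```rust
use std::collections::HashSet;

/// Hypothetical mock cache that records every insert attempt, so a test can
/// verify that a URI repeated across input files only needs one real request.
struct MockCache {
    seen: HashSet<String>,
    inserts: usize,
}

impl MockCache {
    fn new() -> Self {
        Self { seen: HashSet::new(), inserts: 0 }
    }

    /// Returns true if the URI was not cached yet, i.e. it still needs a request.
    fn insert(&mut self, uri: &str) -> bool {
        self.inserts += 1;
        self.seen.insert(uri.to_string())
    }
}

#[test]
fn duplicate_uris_hit_the_cache() {
    let mut cache = MockCache::new();
    // The same link appearing in two different input files.
    let links = ["https://example.com/#download", "https://example.com/#download"];
    let needs_request = links.iter().filter(|uri| cache.insert(uri)).count();

    assert_eq!(cache.inserts, 2); // both occurrences were looked up...
    assert_eq!(needs_request, 1); // ...but only one would trigger a request
}
```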
Issue opened: #348
No description provided.