Non-HTML files given .html file extension with HTMLFirst enabled #267
Comments
Some relevant lines from logs etc. for one of the affected files (
I've done some digging, and haven't determined whether the 416 was caused by HTTrack submitting a dodgy request or by the server doing something wrong. As the file has the right contents and the log mentions the right MIME type, I think it's plausible that the response body was still the correct file and the response header still had the right MIME type, and that HTTrack automatically changed the file extension because the status code represented an error.
Actually, that's not right - the line that mentions the 416 error mentions
It's definitely down to an intermittent fault, as a different set of files is affected when rerunning the mirroring process. I've still not managed to access the server logs, so I have no more information about whether it's down to malformed requests or incorrect handling of well-formed requests.
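For reference, HTTP 416 is "Range Not Satisfiable", which a server is expected to send when a request's Range header asks for bytes that lie entirely beyond the end of the resource. Nothing above confirms that HTTrack actually sent a Range header here, so the following is only a minimal sketch of the server-side check, assuming a single byte range; the function name `is_range_satisfiable` is made up for illustration and is not part of HTTrack or any server.

```python
import re


def is_range_satisfiable(range_header, resource_length):
    """Return True if a single-range 'bytes=' header could be satisfied.

    Simplified sketch: handles exactly one range, not comma-separated lists.
    A server would normally answer 416 when this returns False.
    """
    match = re.fullmatch(r"bytes=(\d*)-(\d*)", range_header.strip())
    if match is None:
        raise ValueError("unsupported or malformed Range header")
    first, last = match.groups()
    if first == "" and last == "":
        raise ValueError("Range header has neither a start nor a suffix length")
    if first == "":
        # Suffix form "bytes=-N": satisfiable whenever N is non-zero.
        return int(last) > 0
    if last != "" and int(last) < int(first):
        raise ValueError("last byte position is before first byte position")
    # Ordinary form: the first requested byte must exist in the resource.
    return int(first) < resource_length
```

For example, `is_range_satisfiable("bytes=100-", 100)` is false because a 100-byte file has no byte at offset 100. That is the kind of request a resumed download could produce if the client believed the file was already complete.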
I've looked a bit more, and it's apparently also affecting loads of gzips that didn't have the problem the first time I attempted the mirroring process. As far as I can tell (I committed to a Git repo after the first attempt, so it should be accurate), the only things that changed on HTTrack's end are the following
I should probably clarify that the Some of these tie in with lines in
In the
Looking in
I guess that debunks the theory that the 416 errors still led to the same file contents being served. It also makes it look like the presence of these files in the cache zip from the previous run poisons the next run, even if the file becomes available again.
I've just noticed that I've been using "Download web site(s) + questions" instead of "* Update existing download", which I imagine might not have been the best idea for the runs where I didn't delete everything first to start with a clean slate.
Doing a fresh run with the same settings generated no 416 errors at first, then towards the end of the process, the first number in the Links scanned: 12345/12345 (+1234) bit reached the same value as the second number, and started counting again from zero. During this phase, a significant percentage of the files fetched generated 416 errors. As it was a clean run, this can't have been caused by the cache from a previous run poisoning the next run. I don't think this was the setting to deal with HTML files first as there are plenty of non-HTML files before this point in |
Today I tried running this again with the option to fetch HTML files first disabled, as it had been for my initial, successful run. I hit no HTTP 416 errors, and all files were given the correct extension. I still don't know whether the 416 errors were caused by malformed requests from HTTrack or by the server misbehaving, but at least this is no longer giving me grief. I also found that it ran in about half the time with the option disabled, which isn't what the tooltip or documentation suggested.
I'm archiving a website and won't need it to work offline, so I got rid of the +*.jpg etc. rules that were causing externally-hosted images to be included. This caused a few images referenced by a relatively small number of pages to change their file extension from .ico, .jpg or .png to .html. I'm concerned about this as the archive will eventually be hosted on a webserver that uses file extensions to automatically determine MIME types, and I anticipate problems if there's a mismatch. The file extensions and MIME types provided by the original webserver are correct, so I don't know why HTTrack has decided to make up new ones.
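For anyone wanting to gauge how many files in a finished mirror are affected, here is a minimal sketch that walks a directory and flags .html files whose leading bytes look like one of the binary formats mentioned in this issue. It is not part of HTTrack; the function names and the choice of signatures (PNG, JPEG, ICO, gzip) are my own assumptions, and it only reports files rather than renaming anything.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the binary formats mentioned in this issue.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
    b"\x00\x00\x01\x00": ".ico",
    b"\x1f\x8b": ".gz",
}


def sniff_extension(path):
    """Return the extension implied by the file's leading bytes, or None."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, ext in MAGIC.items():
        if head.startswith(magic):
            return ext
    return None


def find_mislabelled(mirror_root):
    """Yield (path, likely_extension) for .html files that hold binary data."""
    for path in sorted(Path(mirror_root).rglob("*.html")):
        if path.is_file():
            ext = sniff_extension(path)
            if ext is not None:
                yield path, ext
```

Calling `list(find_mislabelled("path/to/mirror"))` (the path is a placeholder) gives a list that can be compared between runs. Files that really are HTML, or that contain a server error page, are not flagged, so this only catches cases where the correct binary content was saved under the wrong extension.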