Blacklist requests that are duplicates of existing resources or bound to fail #28
Comments
Can you move your comment to #25 and close this? This is the scraper's repo. |
@rgaudin Moved it but I'd keep it open as this ticket is a little bit different. |
This one's better; closing the other one, but the problem raised there remains: where do we point to for stuff that we know exists? |
Is your question "in case there are several versions of the same zim" (e.g., Wikipedia mini/nopic/maxi)? The basic assumption here is that zimit provides a copy of the real thing, so we should send them the |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
See also #33 |
I've started to document the blacklist entries I encounter during maintenance tasks at https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0 |
Should we add a link to it in the routine? Should we count them in some way? |
Added the link to the routine; it would indeed help to have it at hand. |
That's why I asked. The value would be to distinguish their relative importance, should the resulting actions need to be prioritized. |
I've added two more to the list. |
The weekly infra routine (manual checks we do every week to ensure infra is up and running) |
Here is a list of the 20 most requested sites over the August-December 2024 period. I would say nearly half of them we already have (e.g. Wikipedia) or cannot do (reddit, github, youtube: these are not requests for specific pages but really for the entire website). |
Please add them to the spreadsheet of #28 (comment) so that we have one single source of truth |
See #113 (comment) where we now have a more precise requirement, as well as #33. @Popolechien I need your help to figure out what we really want to show to the user, and how. Typically, for every website in the spreadsheet, I need the precise message that will be displayed to the user instead of the generic explanation we currently have. Given today's discussion, this probably means that we will have to split some lines. I do not need the list to be exhaustive yet (given that we already have 1k ZIMs, it is obvious anyway that the list is far from complete), but I need a more precise understanding of the breadth of possibilities. Maybe we should do it together to avoid back-and-forth discussions. |
up |
I've updated the spreadsheet https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0 and tried to adapt the message to each case. I basically see two scenarios for refusal: either we already have the zim file, or we can't generate it (whatever the reason). I'm not sure it is worthwhile trying to explain technical limitations to someone who wants a zim of google.com or reddit, so the shorter the better. The only issue I can see with files we already have is that people are actually looking for a more recent copy (e.g. Wikipedia), but then again I would see more detailed explanations as likely to fall on deaf ears. |
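To make the two refusal scenarios concrete, here is a minimal sketch, in Python and purely illustrative (the domains, reasons and messages are assumptions, not the actual spreadsheet content), of how each blacklisted site could carry its refusal reason and a short user-facing message:

```python
# Hypothetical modelling of the two refusal scenarios discussed above.
# Domains and messages below are examples, not the real spreadsheet rows.
from dataclasses import dataclass
from enum import Enum


class RefusalReason(Enum):
    ALREADY_AVAILABLE = "already_available"  # a ZIM of this site already exists
    CANNOT_GENERATE = "cannot_generate"      # too big, blocked, not scrapable, etc.


@dataclass
class BlacklistEntry:
    domain: str
    reason: RefusalReason
    message: str  # short user-facing text shown instead of the generic explanation


BLACKLIST = [
    BlacklistEntry("wikipedia.org", RefusalReason.ALREADY_AVAILABLE,
                   "Wikipedia is already available as a ZIM in the Kiwix library."),
    BlacklistEntry("reddit.com", RefusalReason.CANNOT_GENERATE,
                   "Reddit cannot be turned into a ZIM; the request targets the whole site."),
]
```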
No we don't; at least I don't know how to transform the URL someone gives into a nice URL to display in the message you propose, at least for now, hence the need to split the lines / add more / find another solution in the Excel sheet. For instance, how do I (programmatically) transform such a URL? I also find the |
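For what it's worth, a naive version of the transformation in question could look like the sketch below. This is an assumption about one possible approach, not existing zimit code, and the hard part the comment points at is precisely what it does not handle: subdomains such as en.wikipedia.org, per-path sites such as individual fandom wikis, or producing a library.kiwix.org download link from the result.

```python
# Reduce whatever URL the user submits to a bare host name that can be matched
# against the spreadsheet. Purely illustrative; real matching needs more than this.
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Return a lowercase host without scheme, 'www.' prefix, port or path."""
    if "://" not in url:
        url = "https://" + url  # urlsplit needs a scheme to locate the host
    host = urlsplit(url).hostname or ""
    return host.removeprefix("www.")


assert normalize("https://www.Wikipedia.org/wiki/Earth") == "wikipedia.org"
assert normalize("reddit.com/r/science") == "reddit.com"
```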
And I forgot to mention another case we have to handle: websites for which we already have a scraper and a pending ZIM request (e.g. all fandom websites, where we know that mwoffliner is more appropriate and which we are - hopefully - going to ZIM 'officially' soon). |
I am not sure where you got this impression from, but I disagree. The right thing to do next is to move on, or read the FAQ below that explains the limitations. Someone asking for google.com or the like is not someone to be reasoned with. As for the other use cases you mention:
At the end of the day it is a free service: we provide a best effort but should not go out of our way either. From our donation stats I see the service has brought zero revenue. People see this as a commodity and treat it as such. |
To me, all these last comments are not aligned at all with @kelson42's PoV, nor with what has been discussed so far in recent exchanges in issues, PRs and especially in live discussions meant to align us. You completely lost me in contradictory injunctions; let's discuss this live again. |
Following openzim/zimit#113, we should think about implementing a fairly easily editable list (hosted on drive.kiwix.org?) of blacklisted sites that cannot be requested on zimit, e.g.
It's probably a matter for a separate ticket, but requests for websites we already have a scraper for (wikipedia, stackoverflow, etc.) should also be soft-blocked and the user offered a direct link to the zim file.
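A rough sketch of how such a soft block could work, assuming the editable list is published as a CSV somewhere, is shown below; the URL, column names and messages are placeholders invented for illustration, not an existing endpoint or schema:

```python
# Fetch an easily editable CSV blacklist and, when a requested domain matches,
# answer with either a refusal message or a direct link to the existing ZIM.
# The URL and column names are hypothetical.
import csv
import io
import urllib.request

BLACKLIST_CSV = "https://drive.kiwix.org/blacklist.csv"  # placeholder location


def load_blacklist(url: str = BLACKLIST_CSV) -> dict[str, dict[str, str]]:
    """Map blocked domain -> row with 'reason', 'message' and optional 'zim_url'."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return {row["domain"]: row for row in csv.DictReader(io.StringIO(text))}


def check_request(domain: str, blacklist: dict[str, dict[str, str]]) -> str | None:
    """Return the message to show the user, or None if the request may proceed."""
    row = blacklist.get(domain)
    if row is None:
        return None
    if row.get("zim_url"):  # "we already have it": offer the direct download link
        return f"{row['message']} You can download it here: {row['zim_url']}"
    return row["message"]   # "we cannot generate it": keep the message short
```

Keeping the list in a plain CSV (or the existing spreadsheet exported as CSV) would keep it editable without a code change, which seems to match the "fairly easily editable" requirement above.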