Blacklist requests that are duplicates of existing resources or bound to fail #28

Open
Popolechien opened this issue Mar 2, 2022 · 21 comments · May be fixed by #124
Labels: enhancement (New feature or request), prio1

Comments

@Popolechien
Contributor

Following openzim/zimit#113, we should think about implementing a fairly easily editable list (hosted on drive.kiwix.org?) of blacklisted sites that can not be requested on zimit, e.g.

  • kiwix.org subdomains (download and library);
  • very large corporate websites (e.g. Facebook, Twitter, Reddit, Youtube, etc.)
  • websites that have been scraped in the past and failed.

It's probably a matter for a separate ticket, but requests for websites we already have a scraper for (wikipedia, stackoverflow, etc.) should also be soft-blocked and the user offered a direct link to the zim file.
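
For what it's worth, a minimal sketch of what such a check could look like, assuming a hypothetical blocklist.json hosted on drive.kiwix.org with one glob pattern and message per entry (file name, location, and format are all made up here, not an agreed design):

```python
# Minimal sketch of a blacklist check, not a definitive implementation.
# Assumes a hypothetical JSON file with entries like
# {"pattern": "*.kiwix.org", "message": "..."}.
import json
from fnmatch import fnmatch
from urllib.parse import urlparse
from urllib.request import urlopen

BLOCKLIST_URL = "https://drive.kiwix.org/blocklist.json"  # hypothetical location

def load_blocklist() -> list[dict]:
    # Fetch the easily editable, hosted blocklist.
    with urlopen(BLOCKLIST_URL) as resp:
        return json.load(resp)

def refusal_message(requested_url: str, blocklist: list[dict]) -> str | None:
    """Return the matching entry's message if the URL is blacklisted, else None."""
    host = urlparse(requested_url).hostname or ""
    for entry in blocklist:
        if fnmatch(host, entry["pattern"]):
            return entry["message"]
    return None
```

A request for download.kiwix.org would then match a `*.kiwix.org` entry and get that entry's message back, which could itself contain a direct link to the existing ZIM for the soft-block case.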

Popolechien added the enhancement label Mar 2, 2022
@rgaudin
Member

rgaudin commented Mar 2, 2022

Can you move your comment to #25 and close this? This is the scraper's repo.

Popolechien transferred this issue from openzim/zimit Mar 2, 2022
@Popolechien
Contributor Author

@rgaudin Moved it but I'd keep it open as this ticket is a little bit different.

@rgaudin
Member

rgaudin commented Mar 2, 2022

This one's better; closing the other one, but the problem raised there remains: where do we point to for stuff that we know exists?

@Popolechien
Contributor Author

Is your question "in case there are several versions of the same zim" (e.g., Wikipedia mini/nopic/maxi)?

The basic assumption here is that zimit provides a copy of the real thing, so we should send them the maxi zim file.

@stale

stale bot commented May 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42
Contributor

kelson42 commented Nov 4, 2023

See also #33

@benoit74
Collaborator

I've started to document blacklisted sites I encounter during maintenance tasks at https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0

@rgaudin
Member

rgaudin commented Oct 28, 2024

Should we add a link to it in the routine? Should we count them in some way?

@benoit74
Collaborator

I've added the link to the routine; it will indeed help to have it at hand.
About counting them, what would be the added value? (Nothing against it, but I don't see why we would want to do this, and it seems cumbersome/complex to implement.)

@rgaudin
Member

rgaudin commented Oct 28, 2024

That's why I asked. The value would be to distinguish their relative importance, should the eventual actions need to be prioritized.

@Popolechien
Contributor Author

I've added two more to the list.
Which routine are we talking about?

@benoit74
Collaborator

The weekly infra routine (manual checks we do every week to ensure the infra is up and running).

@Popolechien
Contributor Author

Here is a list of the most requested sites over the August–December 2024 period. I would say we already have nearly half of them (e.g. Wikipedia) or cannot do them (reddit, github, youtube: these are not requests for specific pages but really for the entire website).

| Website | # of requests |
| --- | ---: |
| shamela.ws | 158 |
| thegreatestbooks.org | 70 |
| en.wikipedia | 48 |
| youtube.com | 45 |
| w3schools.com | 36 |
| web.archive | 27 |
| accords-library.com | 24 |
| psdevwiki.com | 22 |
| minecraft.wiki | 22 |
| strategywiki.org | 17 |
| library.kiwix | 16 |
| geeksforgeeks.org | 16 |
| wikipedia.org | 15 |
| survivorlibrary.com | 15 |
| stardewvalleywiki.com | 15 |
| reddit.com | 15 |
| newadvent.org | 15 |
| google.com | 13 |
| developer.mozilla | 13 |
| vmayoclinic.org | 12 |
| github.com | 12 |

@benoit74
Collaborator

benoit74 commented Jan 7, 2025

Please add them to the spreadsheet from #28 (comment) so that we have a single source of truth.

@benoit74
Collaborator

See #113 (comment), where we now have a more precise requirement, as well as #33.

@Popolechien I need your help to figure out what we really want to show to the user, and how. Typically, for every website in the spreadsheet, I need the precise message that will be displayed to the user instead of the generic explanation we currently have. Given today's discussion, this probably means that we will have to split lines like **.wikipedia.org into multiple lines (one per supported language).

I do not need the list to be exhaustive yet (given that we already have 1k ZIMs, it is obvious the list is far from complete), but I need a more precise understanding of the breadth of possibilities. Maybe we should do it together to avoid back-and-forth discussions.

@benoit74
Collaborator

up

@Popolechien
Contributor Author

Popolechien commented Feb 20, 2025

I've updated the spreadsheet (https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0). I've tried to adapt the message to "Please check https://library.kiwix.org/#lang=&q=websitename", but I'm not sure we can generate a catch-all regexp that will correctly capture all requests.

I basically see two scenarios for refusal: either we already have the zim file, or we can't generate it (whatever the reason). I'm not sure it is worthwhile trying to explain technical limitations to someone who wants a zim of google.com or reddit, so the shorter the no, the better.

The only issue I can see with files we already have is that people may actually be looking for a more recent copy (e.g. Wikipedia), but then again I would see more detailed explanations as likely to fall on deaf ears.

@benoit74
Collaborator

> but I'm not sure we can generate a catch-all regexp that will correctly capture all requests

No, we can't; at least I don't know how to transform the URL someone gives into a nice URL to display in the message you propose, at least for now, hence the need to split the lines, add more, or find another solution in the spreadsheet. For instance, how do I (programmatically) transform https://en.wikipedia.org/wiki/Pompey into "This website is already available for download! Please check https://library.kiwix.org/#lang=eng&q=wikipedia", so that the user is not presented with 1235 ZIMs, which is not what I would consider useful?

I also find "It is not possible to ZIM this website with zimit." very broad. From what has been discussed so far, I understood we want to display as much explanation as possible to the user to give them hints on what to do next. Having the same message for download.kiwix.org (where the request purely makes no sense) and reddit.com or archive.org (where we might want to find funding to create a scraper) is quite different from the vision communicated so far.
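
To make this concrete, one way it could work without a catch-all regexp is a per-line lookup table, assuming the spreadsheet is split into one line per language, each carrying its own lang and q values (the rows below are illustrative stand-ins, not the real spreadsheet content):

```python
# Sketch: map a requested URL to a pre-filled library search link via an
# explicit per-line table rather than a single catch-all regexp.
from fnmatch import fnmatch
from urllib.parse import urlparse

ALREADY_AVAILABLE = [
    # (host pattern, library lang code, library query) — illustrative rows
    ("en.wikipedia.org", "eng", "wikipedia"),
    ("fr.wikipedia.org", "fra", "wikipedia"),
]

def library_message(requested_url: str) -> str | None:
    """Return the 'already available' message for a known host, else None."""
    host = urlparse(requested_url).hostname or ""
    for pattern, lang, query in ALREADY_AVAILABLE:
        if fnmatch(host, pattern):
            return ("This website is already available for download! "
                    f"Please check https://library.kiwix.org/#lang={lang}&q={query}")
    return None

# library_message("https://en.wikipedia.org/wiki/Pompey") points the user at
# https://library.kiwix.org/#lang=eng&q=wikipedia instead of 1235 ZIMs.
```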

@benoit74
Collaborator

And I forgot to mention another case we have to handle: websites for which we already have a scraper and a pending ZIM request (e.g. all fandom websites, where we know that mwoffliner is more appropriate and which we are - hopefully - going to ZIM 'officially' soon).

@Popolechien
Contributor Author

Popolechien commented Feb 21, 2025

> I understood we want to display as much explanation as possible to the user to give them hints on what to do next

I am not sure where you got this impression from, but I disagree. The right thing to do next is to move on, or read the FAQ below that explains the limitations. Someone asking for google.com or the like is not someone to be reasoned with.

As for the other use cases you mention:

  • Zimming a specific wiki page: indeed zimit cannot do this. We could direct them to WP1 though ("It looks like you want a specific page, have you considered looking into our WP1 tool?"). I have updated the sheet accordingly.
  • Websites for which we already have a pending zim request but no zim yet are fair requests. The question arises for people wanting a more recent version of an existing zim; in that case I would let the requested run proceed if the recipe is paused, and again direct them to the library if it is running on schedule (see the sketch below).

At the end of the day it is a free service: we provide best effort but should not go out of our way either. From our donation stats I see the service has brought in zero revenue. People see this as a commodity and treat it as such.
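
A rough sketch of the decision flow in that second bullet; the recipe states and the way they would be obtained are hypothetical stand-ins for real Zimfarm data:

```python
# Sketch of the proposed handling for sites we already have a scraper for.
# `recipe_state` would come from Zimfarm; the state names are made up.
def handle_known_website(host: str, recipe_state: str | None) -> str:
    """Return 'proceed' to let the zimit run happen, or a refusal message."""
    if recipe_state is None:
        # Pending ZIM request but no recipe yet: a zimit run is a fair request.
        return "proceed"
    if recipe_state == "paused":
        # An official recipe exists but is on pause: let the run proceed.
        return "proceed"
    # The recipe runs on schedule: direct the user to the library instead.
    return f"A recent ZIM of {host} already exists, see https://library.kiwix.org/"
```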

@benoit74
Collaborator

To me, all these last comments are not aligned at all with @kelson42's PoV, nor with what has been discussed so far in recent exchanges in issues, PRs, and especially in live discussions meant to align us. You have completely lost me with these contradictory injunctions; let's discuss this live again.
