Blacklist requests that are duplicates of existing resources or bound to fail #28

Open
Popolechien opened this issue Mar 2, 2022 · 21 comments · May be fixed by #124
Labels: enhancement (New feature or request), prio1

Comments

@Popolechien
Contributor

Following openzim/zimit#113, we should think about implementing a fairly easily editable list (hosted on drive.kiwix.org?) of blacklisted sites that can not be requested on zimit, e.g.

  • kiwix.org subdomains (download and library);
  • very large corporate websites (e.g. Facebook, Twitter, Reddit, Youtube, etc.)
  • websites that have been scraped in the past and failed.

It's probably a matter for a separate ticket, but requests for websites we already have a scraper for (wikipedia, stackoverflow, etc.) should also be soft-blocked and the user offered a direct link to the zim file.
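
For what it's worth, a minimal sketch of what such a check could look like, assuming a hypothetical blocklist.json hosted on drive.kiwix.org with one glob pattern and message per entry (file name, location, and format are all made up here, not an agreed design):

```python
# Minimal sketch of a blacklist check, not a definitive implementation.
# Assumes a hypothetical JSON file with entries like
# {"pattern": "*.kiwix.org", "message": "..."}.
import json
from fnmatch import fnmatch
from urllib.parse import urlparse
from urllib.request import urlopen

BLOCKLIST_URL = "https://drive.kiwix.org/blocklist.json"  # hypothetical location

def load_blocklist() -> list[dict]:
    # Fetch the easily editable, hosted blocklist.
    with urlopen(BLOCKLIST_URL) as resp:
        return json.load(resp)

def refusal_message(requested_url: str, blocklist: list[dict]) -> str | None:
    """Return the matching entry's message if the URL is blacklisted, else None."""
    host = urlparse(requested_url).hostname or ""
    for entry in blocklist:
        if fnmatch(host, entry["pattern"]):
            return entry["message"]
    return None
```

A request for download.kiwix.org would then match a `*.kiwix.org` entry and get that entry's message back, which could itself contain a direct link to the existing ZIM for the soft-block case.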

Popolechien added the enhancement label Mar 2, 2022
@rgaudin
Member

rgaudin commented Mar 2, 2022

Can you move your comment to #25 and close this? This is the scraper's repo.

Popolechien transferred this issue from openzim/zimit Mar 2, 2022
@Popolechien
Contributor Author

@rgaudin Moved it but I'd keep it open as this ticket is a little bit different.

@rgaudin
Member

rgaudin commented Mar 2, 2022

This one's better; closing the other one, but the problem raised there remains: where do we point to for stuff that we know exists?

@Popolechien
Contributor Author

Is your question "in case there are several versions of the same zim" (e.g., Wikipedia mini/nopic/maxi)?

The basic assumption here is that zimit provides a copy of the real thing, so we should send them the maxi zim file.

@stale

stale bot commented May 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42
Contributor

kelson42 commented Nov 4, 2023

See also #33

@benoit74
Collaborator

I've started to document blacklisted sites I encounter during maintenance tasks at https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0

@rgaudin
Member

rgaudin commented Oct 28, 2024

Should we add a link to it in the routine? Should we count them in some way?

@benoit74
Collaborator

I've added the link to the routine; it will indeed help to have it at hand.
About counting them, what would be the added value? (Nothing against it, but I don't see why we would want to do this, and it seems cumbersome/complex to implement.)

@rgaudin
Member

rgaudin commented Oct 28, 2024

That's why I asked. The value would be to distinguish their relative importance, should the eventual actions need to be prioritized.

@Popolechien
Contributor Author

I've added two more to the list.
Which routine are we talking about?

@benoit74
Collaborator

The weekly infra routine (manual checks we do every week to ensure the infra is up and running).

@Popolechien
Contributor Author

Here is a list of the most requested sites over the August–December 2024 period. I would say we already have nearly half of them (e.g. Wikipedia) or cannot do them (reddit, github, youtube: these are not requests for specific pages but really for the entire website).

| Website | # of requests |
| --- | ---: |
| shamela.ws | 158 |
| thegreatestbooks.org | 70 |
| en.wikipedia | 48 |
| youtube.com | 45 |
| w3schools.com | 36 |
| web.archive | 27 |
| accords-library.com | 24 |
| psdevwiki.com | 22 |
| minecraft.wiki | 22 |
| strategywiki.org | 17 |
| library.kiwix | 16 |
| geeksforgeeks.org | 16 |
| wikipedia.org | 15 |
| survivorlibrary.com | 15 |
| stardewvalleywiki.com | 15 |
| reddit.com | 15 |
| newadvent.org | 15 |
| google.com | 13 |
| developer.mozilla | 13 |
| vmayoclinic.org | 12 |
| github.com | 12 |

@benoit74
Collaborator

benoit74 commented Jan 7, 2025

Please add them to the spreadsheet from #28 (comment) so that we have a single source of truth.

@benoit74
Collaborator

See #113 (comment), where we now have a more precise requirement, as well as #33.

@Popolechien I need your help to figure out what we really want to show to the user, and how. Typically, for every website in the spreadsheet, I need the precise message that will be displayed to the user instead of the generic explanation we currently have. Given today's discussion, this probably means that we will have to split lines like **.wikipedia.org into multiple lines (one per supported language).

I do not need the list to be exhaustive yet (given that we already have 1k ZIMs, it is obvious the list is far from complete), but I need a more precise understanding of the breadth of possibilities. Maybe we should do it together to avoid back-and-forth discussions.

@benoit74
Collaborator

up

@Popolechien
Contributor Author

Popolechien commented Feb 20, 2025

I've updated the spreadsheet (https://docs.google.com/spreadsheets/d/1mBjWT0hLmeg6EqT4nNEfCzLU8hGSzYs4IgbWDInhPqA/edit?gid=0#gid=0). I've tried to adapt the message to "Please check https://library.kiwix.org/#lang=&q=websitename", but I'm not sure we can generate a catch-all regexp that will correctly capture all requests.

I basically see two scenarios for refusal: either we already have the zim file, or we can't generate it (whatever the reason). I'm not sure it is worthwhile trying to explain technical limitations to someone who wants a zim of google.com or reddit, so the shorter the no, the better.

The only issue I can see with files we already have is that people may actually be looking for a more recent copy (e.g. Wikipedia), but then again I would see more detailed explanations as likely to fall on deaf ears.

@benoit74
Collaborator

> but I'm not sure we can generate a catch-all regexp that will correctly capture all requests

No, we can't; at least I don't know how to transform the URL someone gives into a nice URL to display in the message you propose, at least for now, hence the need to split the lines, add more, or find another solution in the spreadsheet. For instance, how do I (programmatically) transform https://en.wikipedia.org/wiki/Pompey into "This website is already available for download! Please check https://library.kiwix.org/#lang=eng&q=wikipedia", so that the user is not presented with 1235 ZIMs, which is not what I would consider useful?

I also find "It is not possible to ZIM this website with zimit." very broad. From what has been discussed so far, I understood we want to display as much explanation as possible to the user to give them hints on what to do next. Having the same message for download.kiwix.org (where the request purely makes no sense) and reddit.com or archive.org (where we might want to find funding to create a scraper) is quite different from the vision communicated so far.
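
To make this concrete, one way it could work without a catch-all regexp is a per-line lookup table, assuming the spreadsheet is split into one line per language, each carrying its own lang and q values (the rows below are illustrative stand-ins, not the real spreadsheet content):

```python
# Sketch: map a requested URL to a pre-filled library search link via an
# explicit per-line table rather than a single catch-all regexp.
from fnmatch import fnmatch
from urllib.parse import urlparse

ALREADY_AVAILABLE = [
    # (host pattern, library lang code, library query) — illustrative rows
    ("en.wikipedia.org", "eng", "wikipedia"),
    ("fr.wikipedia.org", "fra", "wikipedia"),
]

def library_message(requested_url: str) -> str | None:
    """Return the 'already available' message for a known host, else None."""
    host = urlparse(requested_url).hostname or ""
    for pattern, lang, query in ALREADY_AVAILABLE:
        if fnmatch(host, pattern):
            return ("This website is already available for download! "
                    f"Please check https://library.kiwix.org/#lang={lang}&q={query}")
    return None

# library_message("https://en.wikipedia.org/wiki/Pompey") points the user at
# https://library.kiwix.org/#lang=eng&q=wikipedia instead of 1235 ZIMs.
```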

@benoit74
Collaborator

And I forgot to mention another case we have to handle: websites for which we already have a scraper and a pending ZIM request (e.g. all fandom websites, where we know that mwoffliner is more appropriate and which we are - hopefully - going to ZIM 'officially' soon).

@Popolechien
Contributor Author

Popolechien commented Feb 21, 2025

> I understood we want to display as much explanation as possible to the user to give them hints on what to do next

I am not sure where you got this impression from, but I disagree. The right thing to do next is to move on, or read the FAQ below that explains the limitations. Someone asking for google.com or the like is not someone to be reasoned with.

As for the other use cases you mention:

  • Zimming a specific wiki page: indeed zimit cannot do this. We could direct them to WP1 though ("It looks like you want a specific page, have you considered looking into our WP1 tool?"). I have updated the sheet accordingly.
  • Websites for which we already have a pending zim request but no zim yet are fair requests. The question arises for people wanting a more recent version of an existing zim; in that case I would let the requested run proceed if the recipe is paused, and again direct them to the library if it is running on schedule (see the sketch below).

At the end of the day it is a free service: we provide best effort but should not go out of our way either. From our donation stats I see the service has brought in zero revenue. People see this as a commodity and treat it as such.
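
A rough sketch of the decision flow in that second bullet; the recipe states and the way they would be obtained are hypothetical stand-ins for real Zimfarm data:

```python
# Sketch of the proposed handling for sites we already have a scraper for.
# `recipe_state` would come from Zimfarm; the state names are made up.
def handle_known_website(host: str, recipe_state: str | None) -> str:
    """Return 'proceed' to let the zimit run happen, or a refusal message."""
    if recipe_state is None:
        # Pending ZIM request but no recipe yet: a zimit run is a fair request.
        return "proceed"
    if recipe_state == "paused":
        # An official recipe exists but is on pause: let the run proceed.
        return "proceed"
    # The recipe runs on schedule: direct the user to the library instead.
    return f"A recent ZIM of {host} already exists, see https://library.kiwix.org/"
```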

@benoit74
Collaborator

To me, all these last comments are not aligned at all with @kelson42's PoV, nor with what has been discussed so far in recent exchanges in issues, PRs, and especially in live discussions meant to align us. You have completely lost me with these contradictory injunctions; let's discuss this live again.
