Integrate zimfarm dev setup #1027

elfkuzco · 2025-10-20T14:33:27Z

Rationale

As part of a cleanup in the zimfarm API (openzim/zimfarm#1391), requests to create recipes/tasks now require an offliner definition version. This PR sets the offliner definition version from an environment variable and sets up the zimfarm containers in a docker-compose graph. Previously, the API used "initial" as the definition version, but as scrapers evolve and their arguments change, the definitions change too.

Changes

use mwoffliner definition version from env (default to image tag)
set up a compose graph that includes the zimfarm containers. These are created with two profiles: zimfarm and zimfarm-worker. The former starts only the API and UI, while the latter additionally starts the worker and receiver.
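
A minimal sketch of the first change, assuming the image reference is a string such as ghcr.io/openzim/mwoffliner:1.17.2. The environment variable name MWOFFLINER_DEFINITION_VERSION and the helper name are hypothetical and not taken from this PR:

```python
import os


def definition_version(image: str, env=os.environ) -> str:
    """Return the offliner definition version to send to Zimfarm.

    MWOFFLINER_DEFINITION_VERSION is a hypothetical variable name. When it
    is unset or empty, fall back to the tag of the mwoffliner image.
    """
    override = env.get("MWOFFLINER_DEFINITION_VERSION")
    if override:
        return override
    # Look for the tag only in the last path component, so that a registry
    # port (e.g. localhost:5000/mwoffliner) is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        raise ValueError(f"image {image!r} has no tag to fall back on")
    return name.rsplit(":", 1)[1]
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.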

codecov · 2025-10-20T18:39:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.90%. Comparing base (63b7a74) to head (53c3c0e).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1027   +/-   ##
=======================================
  Coverage   92.90%   92.90%           
=======================================
  Files          73       73           
  Lines        4229     4230    +1     
=======================================
+ Hits         3929     3930    +1     
  Misses        300      300

audiodude

Not sure about this approach.

wp1/zimfarm.py

README.md

wp1/credentials.py.example

docker-compose-dev.yml

audiodude

Can we also update the name of this PR to something like "Integrate Zimfarm dev setup"?

docker-compose-dev.yml

docker/zimfarm/create-warehouse-paths.sh

benoit74

See comments.

I also feel like nobody has run this end-to-end, since the zimfarm worker resources were not adequate, or am I missing something?

We should really run this from end-to-end to ensure this setup works correctly.

And by end-to-end, I mean / propose the following scenario:

user logs in to the WP1 UI
user creates a simple selection (one or two WPEN articles for instance, this detail is not important)
user requests this selection to be ZIMed
ZIM file is correctly created
WP1 UI displays ZIM location
user can download the ZIM

I totally understand that @elfkuzco might need help from @audiodude and myself regarding details on how to run this end-to-end test, but we should not close this issue / PR until we are sure that everything works from end-to-end. Otherwise this is mostly just a waste.

README.md

docker-compose-dev.yml

README.md

audiodude · 2025-10-28T14:45:10Z

We should really run this from end-to-end to ensure this setup works correctly.

Yes I can definitely help with that. I'll patch this PR and try setting up/running the zimfarm locally and confirm that I can create and download ZIMs.

elfkuzco · 2025-10-30T05:39:19Z

Updated the files with the recent changes:

added separate buckets for artifacts, logs and zims
updated the README to detail the worker resources and reason for the offliner definition
updated worker resources to 3 CPU, 20G RAM, 20G disk

benoit74 · 2025-10-30T07:24:01Z

Code LGTM, waiting for e2e test from @audiodude (if I get it correctly) to give my formal approval

audiodude · 2025-10-30T17:01:46Z

I made some minor tweaks to the PR, but it's still not working. My Zimfarm is still reporting the following for requests to http://localhost:8004/v2/schedules:

{"success":false,"message":"Offliner definition for offliner mwoffliner with version 1.17.2 does not exist"}

EDIT: This is after following the directions in the README and updating my local credentials.py

benoit74 · 2025-10-30T19:53:24Z

Hum, this is indeed a problem. To unblock you, please set 'definition_version': 'dev' in your local credentials.py, it should do the trick.

However, this workaround is not the proper way to solve the situation and get this PR merged. We will continuously have new offliner definitions arriving, and all of them should be stored in the local Zimfarm DB so that devs can use almost any mwoffliner version / definition version. I feel like docker/zimfarm/create_offliners.sh should fetch all existing definitions from api.farm.openzim.org and populate the ones missing from the local dev DB. The documentation would then state that developers should rerun this script on a regular basis to fetch new offliner definitions if they want to use them in their credentials.py.
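
The core of such a sync is a set difference between the two sides. A minimal sketch, assuming the definition versions have already been fetched from both APIs as lists of strings (the fetching itself is left out because the endpoint shapes are not shown in this thread, and `missing_definitions` is a hypothetical name):

```python
def missing_definitions(remote_versions, local_versions):
    """Return remote definition versions that the local dev DB lacks.

    Remote order is preserved and duplicates are dropped, so the result
    can be fed directly to whatever creates the local definitions.
    """
    known = set(local_versions)
    missing = []
    for version in remote_versions:
        if version not in known:
            missing.append(version)
            known.add(version)
    return missing
```

Because only absent versions are returned, rerunning the script on a regular basis stays cheap and does not touch definitions that already exist locally.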

audiodude · 2025-11-03T20:36:58Z

This is the command I used: docker compose -f docker-compose-dev.yml --profile zimfarm --profile zimfarm-worker up --pull always --build -d

Here are the logs:

tmoney@tmoney-linux:~/code/wp1/wp1-frontend$ docker logs zimfarm-worker-manager 
[2025-11-03 19:46:36,061: INFO] starting zimfarm worker-manager.
[2025-11-03 19:46:36,061: INFO] configuration:
	username=test_worker
	webapi_uris=['http://zimfarm-api:80/v2']
	workdir=/data
	worker_name=test_worker
	OFFLINERS=['mwoffliner', 'youtube', 'phet', 'gutenberg', 'sotoki', 'nautilus', 'ted', 'openedx', 'zimit', 'kolibri', 'wikihow', 'ifixit', 'freecodecamp', 'devdocs', 'mindtouch']
	PLATFORMS_TASKS={}
	poll_interval=10
	sleep_interval=5
	selfish=False
[2025-11-03 19:46:36,061: INFO] testing workdir at /data…
[2025-11-03 19:46:36,061: INFO] 	workdir is available and writable
[2025-11-03 19:46:36,061: INFO] testing private key at /etc/ssh/keys/zimfarm…
[2025-11-03 19:46:36,061: CRITICAL] 	private key is not a readable path

audiodude · 2025-11-03T20:43:53Z

Okay I think I know the problem. In the first step in the README, when I initially create the Docker graph, this path doesn't exist: ./docker/zimfarm/id_ed25519.

I've encountered this before, but at that point Docker creates that path as a directory. Then, when we run the create_worker script, it can't overwrite the directory with the private key.

elfkuzco · 2025-11-03T20:46:22Z

Yes. Oddly enough, it happened to me too. I will update the docs to prevent this from happening to anyone else.

audiodude · 2025-11-03T21:00:26Z

Just want to make sure. Is this line in the docker-compose file supposed to map a file to a file, or a directory to a directory?

volumes:
  - ./docker/zimfarm/id_ed25519:/etc/ssh/keys/zimfarm

If it's meant to map a file, we should simply do a touch docker/zimfarm/id_ed25519 before we start the first docker graph, so that it is initially mapped as an (empty) file that can then be overwritten. Also, I didn't even notice the line in the script that said "now copy the key blah blah". Can we just mv the key ourselves to that location within the script?
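
A minimal sketch of that pre-flight step, assuming it runs from the repository root before the first docker compose up. `ensure_key_placeholder` is a hypothetical helper, not something in this PR:

```python
from pathlib import Path


def ensure_key_placeholder(path: str) -> Path:
    """Make sure the bind-mount source exists as a regular file.

    If Docker already created the path as a directory, refuse to continue
    so the developer can remove it by hand.
    """
    key = Path(path)
    if key.is_dir():
        raise IsADirectoryError(
            f"{key} is a directory; remove it before starting the stack"
        )
    key.parent.mkdir(parents=True, exist_ok=True)
    # touch() creates an empty file, or leaves an existing key untouched.
    key.touch(exist_ok=True)
    return key
```

Calling `ensure_key_placeholder("docker/zimfarm/id_ed25519")` is equivalent to the touch above, with an explicit error for the case where the directory was already created.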

elfkuzco · 2025-11-03T21:03:57Z

It's supposed to map to a file. I will revise the shell script to mv the key to that path.

audiodude · 2025-11-03T21:18:55Z

Okay my tasks are being picked up by the worker now! But they are failing. I see this in "Scraper stderr":

[error] [2025-11-03T21:16:58.480Z] Failed to run mwoffliner after [0s]:
 Error: Unknown S3 region set
    at S3.setRegion (/tmp/mwoffliner/src/S3.ts:37:13)
    at new S3 (/tmp/mwoffliner/src/S3.ts:26:10)
    at Module.execute (/tmp/mwoffliner/src/mwoffliner.lib.ts:149:13)
    at <anonymous> (/tmp/mwoffliner/src/cli.ts:66:8)

I assume it's because the optimization cache URL I'm sending in is https://localhost:9000/?keyId=minio_key&secretAccessKey=minio_secret&bucketName=org-kiwix-dev-cache and it's trying to parse a region from the hostname?

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

elfkuzco · 2025-11-03T22:27:53Z

Can you use one similar to the minio one configured for the uploader?

elfkuzco · 2025-11-03T22:29:14Z

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

The thing is the container can't access localhost. You can use https://minio..... because the container can resolve the hostname minio since they all share the same network
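
A minimal sketch of that substitution, assuming the compose service is named minio as above. `for_compose_network` is a hypothetical helper and not part of WP1:

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts that point back at the container itself when used from inside it.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def for_compose_network(url: str, service: str) -> str:
    """Swap a localhost host for a compose service name.

    Scheme, port, path, query and any userinfo are kept as they are.
    URLs that do not point at localhost are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname not in LOCAL_HOSTS:
        return url
    host = service if parts.port is None else f"{service}:{parts.port}"
    # Keep "user:password@" if the original netloc carried it.
    userinfo, sep, _ = parts.netloc.rpartition("@")
    return urlunsplit(parts._replace(netloc=f"{userinfo}{sep}{host}"))
```

This only rewrites the hostname. Whether the target is actually reachable still depends on both containers being attached to the same network.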

elfkuzco · 2025-11-03T22:29:40Z

Or if you want, you can omit the optimization URL from your task.

audiodude · 2025-11-04T01:48:05Z

Okay I definitely think we can skip the S3 cache for dev scraping. After I got rid of that, I got a new error from mwoffliner, which was:

 Failed to read articleList from [http://localhost:5000/v1/builders/0b76807e-c1e3-44c0-a815-b0e8405a51e8/selection/latest.tsv]

This makes sense, since the worker is running inside of the docker compose network, while my WP1 web/api/backend is running on the host machine. In fact, this is the exact reason we need to have a zimfarm in dev anyways, because we've changed the logic for the ZIM creation to use a dynamic URL from WP1 itself rather than a static file list on S3.

I think at this point, I'm going to start working on putting the dev backend server into the docker compose graph as well, with all the updates to configuration and README that are required for that. I'd like to use this same PR and then just merge the whole thing once we have a working, consistent dev environment.

@benoit74 @elfkuzco WDYT?

elfkuzco · 2025-11-04T09:25:31Z

I agree with you.

benoit74 · 2025-11-04T21:13:34Z

Yes for dev we should skip the S3 cache, we will not gain much besides pain. And this is more an internal detail to mwoffliner operation, not really needed.

I like the idea of adding the backend to the docker graph in the same PR. This is a great opportunity to nail down these dev setup issues and have a reproducible setup devs can use e2e. No more excuses for not testing stuff e2e once in a while. Also a great asset in terms of documentation / learning base.

I would even suggest also adding web and api to the docker graph. With proper mount points and configuration it should be possible to have hot reload whenever a dev changes something in the codebase; at least this is what we achieved in the zimfarm, zimit-frontend and cms repos, and it is (mostly?) totally transparent in terms of performance. It frees developers from having to install anything on their dev machine besides Docker, and ensures there are no headaches due to bad versions and stuff like that. Quite important for everyone who is not a core maintainer and / or a bit lazy to set up stuff correctly on their machine (which includes myself ^^)

audiodude · 2025-11-10T00:22:46Z

Okay I've got the following in my docker:

^Ctmoney@tmoney-linux:~/code/wp1$ docker ps
CONTAINER ID   IMAGE                                           COMMAND                  CREATED         STATUS                   PORTS                                                             NAMES
e5a3e2025bc8   wp1-dev-dev-web                                 "flask --app wp1.web…"   5 days ago      Up 6 minutes             0.0.0.0:5000->5000/tcp, [::]:5000->5000/tcp                       wp1bot-web-dev
a133aa726bb8   ghcr.io/openzim/zimfarm-worker-manager:latest   "worker-manager --we…"   6 days ago      Up 6 days                                                                                  zimfarm-worker-manager
978375c302b7   ghcr.io/openzim/zimfarm-ui:latest               "/docker-entrypoint.…"   6 days ago      Up 6 days                127.0.0.1:8003->80/tcp                                            zimfarm-ui
055a203f63ff   wp1-dev-dev-workers                             "/bin/sh -c 'supervi…"   6 days ago      Up 5 minutes                                                                               wp1bot-workers-dev
5294b71f64f6   ghcr.io/openzim/zimfarm-backend:latest          "uvicorn zimfarm_bac…"   6 days ago      Up 6 days (healthy)      127.0.0.1:8004->80/tcp                                            zimfarm-api
590f8488d6f7   minio/minio                                     "/usr/bin/docker-ent…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9000-9001->9000-9001/tcp, [::]:9000-9001->9000-9001/tcp   wp1bot-minio-dev
343148f4b8bc   postgres:17.3-bookworm                          "docker-entrypoint.s…"   6 days ago      Up 6 days (healthy)      127.0.0.1:2345->5432/tcp                                          zimfarm-postgresdb
92261129c194   redis                                           "docker-entrypoint.s…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9736->6379/tcp, [::]:9736->6379/tcp                       wp1bot-redis-dev
1f0cd8e54a2f   wp1-dev-dev-database                            "docker-entrypoint.s…"   6 days ago      Up 6 minutes             0.0.0.0:6300->3306/tcp, [::]:6300->3306/tcp                       wp1bot-db-dev
1d466e07260f   mariadb:10.4                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:6600->3306/tcp, [::]:6600->3306/tcp                       wp1bot-test-db
7f06c4c77a50   5b0542ad1e77                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:9777->6379/tcp, [::]:9777->6379/tcp                       wp1bot-test-redis

I've changed the URL for the article list we send to Zimfarm to try and use the WP1 API that's running in docker, so I'm using http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv. But I get the following error:

[error] [2025-11-10T00:17:35.346Z] Failed to read articleList from [http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv] Error: Failed to read articleList from URL: http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv

I understand that this is a network connectivity issue, and I need to use the right domain for the WP1 API. However, the part I don't understand is the network topology for worker/worker-manager/mwoffliner/etc and where mwoffliner is actually running on the network. What should I put for http://web-dev:5000? Thanks!

audiodude · 2025-11-10T02:42:06Z

Also tried with wp1bot-web-dev:

[error] [2025-11-10T02:41:00.706Z] Failed to read articleList from [http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv] Error: Failed to read articleList from URL: http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv

It's reachable from zimfarm-api:

tmoney@tmoney-linux:~/code/wp1$ docker exec -it zimfarm-api bash
root@5294b71f64f6:/# curl http://wp1bot-web-dev:5000
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>WP 1.0 API</title>
   ....<SNIP>

elfkuzco · 2025-11-10T08:38:42Z

Could you push your recent changes with the wp1 api in the docker graph?

audiodude · 2025-11-10T15:46:15Z

@elfkuzco Done.

elfkuzco · 2025-11-11T02:14:57Z

@benoit74 , is it possible that my IP is blocked for mwoffliner jobs? I keep getting

[error] [2025-11-11T02:05:39.982Z] Failed to run mwoffliner after [0s]:
 Error: mwUrl [https://en.wikipedia.org/] is not valid.
    at <anonymous> (/tmp/mwoffliner/src/sanitize-argument.ts:189:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async sanitize_mwUrl (/tmp/mwoffliner/src/sanitize-argument.ts:188:3)
    at async sanitize_all (/tmp/mwoffliner/src/sanitize-argument.ts:75:3)

benoit74 · 2025-11-11T05:56:42Z

It is possible, but quite unlikely. I would first start by updating the mwoffliner version you use by pulling the ghcr.io/openzim/mwoffliner:latest image again; the message you get is from an old mwoffliner version, so this could help.

elfkuzco · 2025-11-12T10:17:09Z

@audiodude , don't worry about this anymore so we don't duplicate efforts. I will try and create a reproducible workflow and publish by the end of today and push an updated commit.

elfkuzco · 2025-11-12T13:41:39Z

So, it's working as expected

Here are some of the things that made it problematic:

[error] [2025-11-10T02:41:00.706Z] Failed to read articleList from [http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv] Error: Failed to read articleList from URL: http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv

The issue seems to come from upstream, in that the dnscache image isn't connected to the same network. I made some small fixes to my local worker and pointed it to use that image, and that's how I got the results. I will make a PR upstream to get the dnscache connected to the same network as the rest of the containers when in development mode. Here's what I used to test as the s3 part of the CLIENT_URL: http://minio:9000/org-kiwix-dev-wp1, which worked for me.

If you look at the latest commit, I removed the directive to overwrite the command. Zimfarm is changing fast these days and it's best to not have that directive there since it might be obsolete. Best to rely on the latest image to do the right thing. I set the command to --pull always during build but it seems it's been removed. Do you think I should bring it back?

I will make additional PRs here to point the script to fetch the latest offliner versions from farm.openzim.org instead of the scraper repository itself too.

audiodude · 2025-11-12T15:26:19Z

Awesome, thanks so much for all of your work on this!

--pull always during build but it seems it's been removed.

I'm not sure I like pull always for starting the entire docker graph, because it means I have to download things like Redis and MariaDB every time I start the server, which makes things slow and wastes resources. Is there a way we can set that option in the docker compose file for just the zimfarm components?

I will make additional PRs here to point the script to fetch the latest offliner versions from farm.openzim.org instead of the scraper repository itself too.

I will wait for your changes and then approve and merge.

elfkuzco · 2025-11-12T18:09:17Z

I'm not sure I like pull always for starting the entire docker graph, because it means I have to download things like Redis and MariaDB every time I start the server, which makes things slow and wastes resources. Is there a way we can set that option in the docker compose file for just the zimfarm components?

I think this is probably because they aren't pinned to any specific version. Ideally, only zimfarm components should have latest

benoit74 · 2025-11-13T10:42:31Z

Regarding --pull always, I think it is indeed not such a good idea to do it unconditionally. Especially because it means WP1 developers could suddenly have lots of moving parts every time they restart the stack, which is usually the cause of many headaches (you are fixing a bug in WP1, you restart the stack, you get another totally unrelated error because the Zimfarm image has been updated ... hard to realize it is unrelated in general).

I would recommend to not add this flag in commands but add a very clear message saying that developers should regularly pull latest Docker images from the Zimfarm and other components with --pull always (with full command sample). It will also make other commands shorter which helps to understand them.

elfkuzco · 2025-11-14T00:36:13Z

@audiodude, see the latest commit. Everything works as expected. Here are my settings for the ZIMFARM part of credentials.py:

"ZIMFARM": {
    "url": "http://zimfarm-api/v2",
    "s3_url": "http://localhost:9000/org-kiwix-dev-zims",
    "user": "admin",
    "password": "admin",
    "hook_token": None,
    "definition_version": "1.17.2",
    "image": "ghcr.io/openzim/mwoffliner:1.17.2"
}

I also updated the shell scripts to fetch offliner definitions from farm.openzim.org and populate your local zimfarm API. This gives you many more versions to test against.

Review and let me know if there's anything else to fix.

benoit74 · 2025-11-14T10:15:55Z

I recommend that this ZIMFARM block (or any updated version of it) be committed in the DEV credentials.

My goal for this PR is that:

any new developer arriving in WP1 can test ZIM generation from end-to-end without making any modification to the repo, just by running through instructions as clear and as minimal as possible
any "experienced" developer working on WP1 knows how to update their (Zimfarm) dev stack as things are moving in the zimfarm repository / mwoffliner releases

pass offliner definition version while creating tasks

1993b08

elfkuzco requested review from audiodude and benoit74 October 20, 2025 14:33

elfkuzco added 2 commits October 20, 2025 15:53

document compose setup

fae7176

remove keys

de7afb9

elfkuzco force-pushed the pass-offliner-definition-version branch from 62d06b6 to de7afb9 Compare October 20, 2025 14:58

add version in zimfarm test params

fc088ff

elfkuzco force-pushed the pass-offliner-definition-version branch from 8176974 to fc088ff Compare October 20, 2025 18:36

audiodude requested changes Oct 21, 2025

wp1/zimfarm.py

README.md

add definition version in example credentials

001366e

benoit74 requested changes Oct 23, 2025

wp1/credentials.py.example (outdated)

docker-compose-dev.yml (outdated)

This was referenced Oct 23, 2025

Define a convention on TCP/UDP ports used by development stacks openzim/overview#24

Open

omit blank settings in requests view openzim/zimit-frontend#177

Open

update definition version in examples

812f745

audiodude approved these changes Oct 24, 2025

docker-compose-dev.yml (outdated)

docker-compose-dev.yml (outdated)

docker-compose-dev.yml

docker/zimfarm/create-warehouse-paths.sh (outdated)

audiodude changed the title ~~pass offliner definition version while creating tasks~~ Integrate zimfarm dev setup Oct 27, 2025

use minio bucket to receive files/artifacts/logs

9b97249

elfkuzco requested a review from benoit74 October 28, 2025 00:07

benoit74 requested changes Oct 28, 2025

update README and worker configuration

55f06fc

update worker resources in compose environment

45bf765

elfkuzco requested a review from benoit74 October 30, 2025 05:41

Minor tweaks

20f1f6d

Mention that jq is required

a3b3b35

Add WP1 api ("web") to dev docker compose graph. Also provide separate credentials config variable for URLs sent to zimfarm

466e2e9

update zimfarm-api startup command

5277375

elfkuzco mentioned this pull request Nov 13, 2025

Make dnscache container optional openzim/zimfarm#1515

Closed

fetch offliner definitions from farm.openzim.org

b1eb250

elfkuzco requested a review from audiodude November 14, 2025 00:37
