Conversation

@elfkuzco
Collaborator

Rationale

As part of the cleanup in the zimfarm API (openzim/zimfarm#1391), requests to create recipes/tasks now require an offliner definition version. This PR sets the version of the offliner definition from an environment variable and sets up the zimfarm containers in a docker-compose graph. Previously, the API used "initial" as the definition version, but as scrapers evolve and their arguments change, the definitions change too.

Changes

  • use mwoffliner definition version from env (default to image tag)
  • set up compose graph that includes zimfarm-containers. These are created with profiles: zimfarm and zimfarm-worker. The former starts up only the API and UI while the latter starts up the worker and receiver in addition.
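
As an illustration of the first change, here is a minimal sketch of reading the definition version from the environment and falling back to the image tag. The variable name `MWOFFLINER_DEFINITION_VERSION` and both helper functions are hypothetical and not taken from this PR; the actual code may be structured differently.

```python
import os


def image_tag(image: str, default: str = "latest") -> str:
    """Return the tag of a Docker image reference, or `default` if it has none."""
    # Drop a digest suffix such as "@sha256:..." if present.
    image = image.split("@", 1)[0]
    # Only inspect the part after the last "/", so a registry port
    # (e.g. "localhost:5000/mwoffliner") is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" in name:
        return name.rsplit(":", 1)[1]
    return default


def definition_version(image: str, env=os.environ) -> str:
    """Prefer the (hypothetical) env variable, else fall back to the image tag."""
    return env.get("MWOFFLINER_DEFINITION_VERSION") or image_tag(image)
```

With this sketch, an image of `ghcr.io/openzim/mwoffliner:1.17.2` and no override yields `1.17.2`, while setting the variable to `dev` yields `dev`.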

@elfkuzco elfkuzco force-pushed the pass-offliner-definition-version branch from 62d06b6 to de7afb9 Compare October 20, 2025 14:58
@elfkuzco elfkuzco force-pushed the pass-offliner-definition-version branch from 8176974 to fc088ff Compare October 20, 2025 18:36
@codecov

codecov bot commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.90%. Comparing base (63b7a74) to head (53c3c0e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1027   +/-   ##
=======================================
  Coverage   92.90%   92.90%           
=======================================
  Files          73       73           
  Lines        4229     4230    +1     
=======================================
+ Hits         3929     3930    +1     
  Misses        300      300           

Member

@audiodude audiodude left a comment

Not sure about this approach.

Member

@audiodude audiodude left a comment

Can we also update the name of this PR to something like "Integrate Zimfarm dev setup"?

@audiodude audiodude changed the title pass offliner definition version while creating tasks Integrate zimfarm dev setup Oct 27, 2025
@elfkuzco elfkuzco requested a review from benoit74 October 28, 2025 00:07
Contributor

@benoit74 benoit74 left a comment

See comments.

I also feel like nobody ran this end-to-end, since the zimfarm worker resources were not adequate. Or am I missing something?

We should really run this from end-to-end to ensure this setup works correctly.

And by end-to-end, I mean the following scenario:

  • user logs in to the WP1 UI
  • user creates a simple selection (one or two WPEN articles, for instance; this detail is not important)
  • user requests this selection to be ZIMed
  • ZIM file is correctly created
  • WP1 UI displays ZIM location
  • user can download the ZIM

I totally understand that @elfkuzco might need help from @audiodude and myself regarding details on how to run this end-to-end test, but we should not close this issue / PR until we are sure that everything works from end-to-end. Otherwise this is mostly just a waste.

@audiodude
Member

We should really run this from end-to-end to ensure this setup works correctly.

Yes I can definitely help with that. I'll patch this PR and try setting up/running the zimfarm locally and confirm that I can create and download ZIMs.

@elfkuzco
Collaborator Author

Updated the files with the recent changes:

  • added separate buckets for artifacts, logs and zims
  • updated the README to detail the worker resources and reason for the offliner definition
  • updated worker resources to 3 CPU, 20G RAM, 20G disk

@elfkuzco elfkuzco requested a review from benoit74 October 30, 2025 05:41
@benoit74
Contributor

Code LGTM. Waiting for the e2e test from @audiodude (if I understand correctly) before giving my formal approval.

@audiodude
Member

audiodude commented Oct 30, 2025

I made some minor tweaks to the PR, but it's still not working. My Zimfarm is still reporting the following for requests to http://localhost:8004/v2/schedules:

{"success":false,"message":"Offliner definition for offliner mwoffliner with version 1.17.2 does not exist"}

EDIT: This is after following the directions in the README and updating my local credentials.py

@benoit74
Contributor

Hum, this is indeed a problem. To unblock you, please set 'definition_version': 'dev' in your local credentials.py, it should do the trick.

This is, however, not the proper way to resolve the situation before merging this PR. New offliner definitions will keep arriving, and all of them should be stored in the local Zimfarm DB so that developers can use almost any mwoffliner version / definition version. I feel that docker/zimfarm/create_offliners.sh should fetch all existing definitions from api.farm.openzim.org and populate the ones missing from the local dev DB. The documentation would then state that developers should rerun this script regularly to fetch new offliner definitions if they want to use them in their credentials.py.
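
The "populate the ones missing" step could be sketched as a set difference. This is only an illustration in Python rather than in the shell script itself: the function name and the `(offliner, version)` pair representation are assumptions, and fetching the two lists from the upstream and local APIs is deliberately left out because the endpoints are not described in this thread.

```python
def missing_definitions(remote, local):
    """Return (offliner, version) pairs known upstream but absent locally, sorted."""
    # Sets make the comparison independent of ordering and duplicates.
    return sorted(set(remote) - set(local))


# Illustrative data only; real lists would come from the two Zimfarm APIs.
upstream = [("mwoffliner", "1.17.2"), ("mwoffliner", "dev")]
local_db = [("mwoffliner", "dev")]
to_create = missing_definitions(upstream, local_db)
```

Because only missing pairs are returned, rerunning such a sync regularly would not recreate definitions that already exist locally.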

@audiodude
Member

This is the command I used: docker compose -f docker-compose-dev.yml --profile zimfarm --profile zimfarm-worker up --pull always --build -d

Here are the logs:

tmoney@tmoney-linux:~/code/wp1/wp1-frontend$ docker logs zimfarm-worker-manager 
[2025-11-03 19:46:36,061: INFO] starting zimfarm worker-manager.
[2025-11-03 19:46:36,061: INFO] configuration:
	username=test_worker
	webapi_uris=['http://zimfarm-api:80/v2']
	workdir=/data
	worker_name=test_worker
	OFFLINERS=['mwoffliner', 'youtube', 'phet', 'gutenberg', 'sotoki', 'nautilus', 'ted', 'openedx', 'zimit', 'kolibri', 'wikihow', 'ifixit', 'freecodecamp', 'devdocs', 'mindtouch']
	PLATFORMS_TASKS={}
	poll_interval=10
	sleep_interval=5
	selfish=False
[2025-11-03 19:46:36,061: INFO] testing workdir at /data…
[2025-11-03 19:46:36,061: INFO] 	workdir is available and writable
[2025-11-03 19:46:36,061: INFO] testing private key at /etc/ssh/keys/zimfarm…
[2025-11-03 19:46:36,061: CRITICAL] 	private key is not a readable path

@audiodude
Member

Okay I think I know the problem. In the first step in the README, when I initially create the Docker graph, this path doesn't exist: ./docker/zimfarm/id_ed25519.

I've encountered this before, but at that point Docker creates that path as a directory. Then, when we run the create_worker script, it can't overwrite the directory with the private key.

@elfkuzco
Collaborator Author

elfkuzco commented Nov 3, 2025

Yes. Oddly enough, it happened to me too. I will update the docs to prevent this from happening to anyone else.

@audiodude
Member

Just want to make sure. Is this line in the docker-compose file supposed to map a file to a file, or a directory to a directory?

volumes:
  - ./docker/zimfarm/id_ed25519:/etc/ssh/keys/zimfarm

If it's meant to map a file, we should simply run touch docker/zimfarm/id_ed25519 before we start the docker graph for the first time, so that it is initially mapped as an (empty) file that can then be overwritten. Also, I didn't even notice the line in the script that said "now copy the key blah blah". Can we just mv the key to that location ourselves within the script?
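
A minimal sketch of that pre-flight step, written in Python for illustration (the thread itself proposes a plain `touch` in the setup instructions). The helper name is hypothetical. It assumes that a directory left behind at the key path by an earlier `docker compose up` is empty and removable by the current user, which may not hold if Docker created it as root.

```python
from pathlib import Path


def ensure_key_placeholder(path):
    """Make sure `path` exists as a regular file before it is bind-mounted."""
    key = Path(path)
    if key.is_dir():
        # Left over from a previous run where Docker created the missing
        # mount source as a directory. rmdir() only succeeds if it is empty.
        key.rmdir()
    key.parent.mkdir(parents=True, exist_ok=True)
    # Equivalent of `touch`: creates an empty file, keeps an existing one.
    key.touch(exist_ok=True)
    return key
```

For example, `ensure_key_placeholder("docker/zimfarm/id_ed25519")` would be run once before the first `docker compose ... up`.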

@elfkuzco
Collaborator Author

elfkuzco commented Nov 3, 2025

It's supposed to map to a file. I will revise the shell script to mv the key to that path.

@audiodude
Member

audiodude commented Nov 3, 2025

Okay my tasks are being picked up by the worker now! But they are failing. I see this in "Scraper stderr":

[error] [2025-11-03T21:16:58.480Z] Failed to run mwoffliner after [0s]:
 Error: Unknown S3 region set
    at S3.setRegion (/tmp/mwoffliner/src/S3.ts:37:13)
    at new S3 (/tmp/mwoffliner/src/S3.ts:26:10)
    at Module.execute (/tmp/mwoffliner/src/mwoffliner.lib.ts:149:13)
    at <anonymous> (/tmp/mwoffliner/src/cli.ts:66:8)

I assume it's because the optimization cache URL I'm sending in is https://localhost:9000/?keyId=minio_key&secretAccessKey=minio_secret&bucketName=org-kiwix-dev-cache and it's trying to parse a region from the hostname?

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

@elfkuzco
Collaborator Author

elfkuzco commented Nov 3, 2025

Can you use one similar to the minio one configured for the uploader?

@elfkuzco
Collaborator Author

elfkuzco commented Nov 3, 2025

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

The thing is the container can't access localhost. You can use https://minio..... because the container can resolve the hostname minio since they all share the same network

@elfkuzco
Collaborator Author

elfkuzco commented Nov 3, 2025

Or if you want, you can omit the optimization URL from your task.

@audiodude
Member

Okay I definitely think we can skip the S3 cache for dev scraping. After I got rid of that, I got a new error from mwoffliner, which was:

 Failed to read articleList from [http://localhost:5000/v1/builders/0b76807e-c1e3-44c0-a815-b0e8405a51e8/selection/latest.tsv]

This makes sense, since the worker is running inside of the docker compose network, while my WP1 web/api/backend is running on the host machine. In fact, this is the exact reason we need to have a zimfarm in dev anyways, because we've changed the logic for the ZIM creation to use a dynamic URL from WP1 itself rather than a static file list on S3.
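
To illustrate the host-versus-container mismatch: a URL that points at `localhost` on the host has to name a service reachable on the compose network before it is handed to the scraper. The sketch below is only an assumption about how such a rewrite could look; the function name is hypothetical, and it presumes the scraper container can actually resolve the given service name.

```python
from urllib.parse import urlsplit, urlunsplit


def for_container(url, service_host):
    """Swap a localhost host for a compose service name, keeping port and path."""
    parts = urlsplit(url)
    if parts.hostname not in ("localhost", "127.0.0.1"):
        return url  # already points somewhere other than the host loopback
    # Note: any user:password@ prefix is dropped by this simple rewrite.
    netloc = service_host if parts.port is None else f"{service_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

Calling `for_container("http://localhost:5000/v1/builders/<id>/selection/latest.tsv", "web-dev")` keeps the port and path and only replaces the host.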

I think at this point, I'm going to start working on putting the dev backend server into the docker compose graph as well, with all the updates to configuration and README that are required for that. I'd like to use this same PR and then just merge the whole thing once we have a working, consistent dev environment.

@benoit74 @elfkuzco WDYT?

@elfkuzco
Collaborator Author

elfkuzco commented Nov 4, 2025

I agree with you.

@benoit74
Contributor

benoit74 commented Nov 4, 2025

Yes for dev we should skip the S3 cache, we will not gain much besides pain. And this is more an internal detail to mwoffliner operation, not really needed.

I like the idea of adding the backend to the docker graph in the same PR. This is a great opportunity to nail down these dev setup issues and have a reproducible setup that devs can use end-to-end. No more excuses for not testing things end-to-end once in a while. It is also a great asset in terms of documentation and as a learning base.

I would even suggest also adding web and api to the docker graph. With proper mount points and configuration, it should be possible to have hot reload whenever a dev changes something in the codebase. At least this is what we achieved in the zimfarm, zimit-frontend and cms repos, and it is (mostly?) transparent in terms of performance. It frees developers from having to install anything on their dev machine besides Docker, and ensures there are no headaches due to bad versions and the like. This is quite important for everyone who is not a core maintainer and / or is a bit lazy about setting things up correctly on their machine (which includes myself ^^).

@audiodude
Member

Okay I've got the following in my docker:

^Ctmoney@tmoney-linux:~/code/wp1$ docker ps
CONTAINER ID   IMAGE                                           COMMAND                  CREATED         STATUS                   PORTS                                                             NAMES
e5a3e2025bc8   wp1-dev-dev-web                                 "flask --app wp1.web…"   5 days ago      Up 6 minutes             0.0.0.0:5000->5000/tcp, [::]:5000->5000/tcp                       wp1bot-web-dev
a133aa726bb8   ghcr.io/openzim/zimfarm-worker-manager:latest   "worker-manager --we…"   6 days ago      Up 6 days                                                                                  zimfarm-worker-manager
978375c302b7   ghcr.io/openzim/zimfarm-ui:latest               "/docker-entrypoint.…"   6 days ago      Up 6 days                127.0.0.1:8003->80/tcp                                            zimfarm-ui
055a203f63ff   wp1-dev-dev-workers                             "/bin/sh -c 'supervi…"   6 days ago      Up 5 minutes                                                                               wp1bot-workers-dev
5294b71f64f6   ghcr.io/openzim/zimfarm-backend:latest          "uvicorn zimfarm_bac…"   6 days ago      Up 6 days (healthy)      127.0.0.1:8004->80/tcp                                            zimfarm-api
590f8488d6f7   minio/minio                                     "/usr/bin/docker-ent…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9000-9001->9000-9001/tcp, [::]:9000-9001->9000-9001/tcp   wp1bot-minio-dev
343148f4b8bc   postgres:17.3-bookworm                          "docker-entrypoint.s…"   6 days ago      Up 6 days (healthy)      127.0.0.1:2345->5432/tcp                                          zimfarm-postgresdb
92261129c194   redis                                           "docker-entrypoint.s…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9736->6379/tcp, [::]:9736->6379/tcp                       wp1bot-redis-dev
1f0cd8e54a2f   wp1-dev-dev-database                            "docker-entrypoint.s…"   6 days ago      Up 6 minutes             0.0.0.0:6300->3306/tcp, [::]:6300->3306/tcp                       wp1bot-db-dev
1d466e07260f   mariadb:10.4                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:6600->3306/tcp, [::]:6600->3306/tcp                       wp1bot-test-db
7f06c4c77a50   5b0542ad1e77                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:9777->6379/tcp, [::]:9777->6379/tcp                       wp1bot-test-redis

I've changed the URL for the article list we send to Zimfarm to try and use the WP1 API that's running in docker, so I'm using http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv. But I get the following error:

[error] [2025-11-10T00:17:35.346Z] Failed to read articleList from [http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv] Error: Failed to read articleList from URL: http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv

I understand that this is a network connectivity issue, and I need to use the right domain for the WP1 API. However, the part I don't understand is the network topology for worker/worker-manager/mwoffliner/etc. and where mwoffliner is actually running on the network. What should I put in place of http://web-dev:5000? Thanks!

@audiodude
Member

audiodude commented Nov 10, 2025

Also tried with wp1bot-web-dev:

[error] [2025-11-10T02:41:00.706Z] Failed to read articleList from [http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv] Error: Failed to read articleList from URL: http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv

It's reachable from zimfarm-api:

tmoney@tmoney-linux:~/code/wp1$ docker exec -it zimfarm-api bash
root@5294b71f64f6:/# curl http://wp1bot-web-dev:5000
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>WP 1.0 API</title>
   ....<SNIP>

@elfkuzco
Collaborator Author

Could you push your recent changes with the wp1 api in the docker graph?

…e credentials config variable for URLs sent to zimfarm
@audiodude
Member

@elfkuzco Done.

@elfkuzco
Collaborator Author

elfkuzco commented Nov 11, 2025

@benoit74 , is it possible that my IP is blocked for mwoffliner jobs? I keep getting

[error] [2025-11-11T02:05:39.982Z] Failed to run mwoffliner after [0s]:
 Error: mwUrl [https://en.wikipedia.org/] is not valid.
    at <anonymous> (/tmp/mwoffliner/src/sanitize-argument.ts:189:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async sanitize_mwUrl (/tmp/mwoffliner/src/sanitize-argument.ts:188:3)
    at async sanitize_all (/tmp/mwoffliner/src/sanitize-argument.ts:75:3)

@benoit74
Copy link
Contributor

It is possible, but quite unlikely. I would first update the mwoffliner version you use by pulling the ghcr.io/openzim/mwoffliner:latest image again. The message you get is from an old mwoffliner version, so this could help.

@elfkuzco
Collaborator Author

@audiodude , don't worry about this anymore so we don't duplicate efforts. I will try and create a reproducible workflow and publish by the end of today and push an updated commit.

@elfkuzco
Collaborator Author

So, it's working as expected

[Screenshots: Screenshot_20251112_141545, Screenshot_20251112_141340]

Here are some of the things that made it problematic:

The issue seems to come from upstream: the dnscache image isn't connected to the same network. I made some small fixes to my local worker and pointed it at that image, and that's how I got the results. I will make a PR upstream to get dnscache connected to the same network as the rest of the containers when in development mode. Here's what I used to test as the s3 part of the CLIENT_URL, which worked for me: http://minio:9000/org-kiwix-dev-wp1

If you look at the latest commit, I removed the directive to overwrite the command. Zimfarm is changing fast these days and it's best to not have that directive there since it might be obsolete. Best to rely on the latest image to do the right thing. I set the command to --pull always during build but it seems it's been removed. Do you think I should bring it back?

I will make additional PRs here to point the script to fetch the latest offliner versions from farm.openzim.org instead of the scraper repository itself too.

@audiodude
Member

Awesome, thanks so much for all of your work on this!

--pull always during build but it seems it's been removed.

I'm not sure I like pull always for starting the entire docker graph, because it means I have to download things like Redis and MariaDB every time I start the server, which makes things slow and wastes resources. Is there a way we can set that option in the docker compose file for just the zimfarm components?

I will make additional PRs here to point the script to fetch the latest offliner versions from farm.openzim.org instead of the scraper repository itself too.

I will wait for your changes and then approve and merge.

@elfkuzco
Collaborator Author

I'm not sure I like pull always for starting the entire docker graph, because it means I have to download things like Redis and MariaDB every time I start the server, which makes things slow and wastes resources. Is there a way we can set that option in the docker compose file for just the zimfarm components?

I think this is probably because they aren't pinned to any specific version. Ideally, only the zimfarm components should use the latest tag.

@benoit74
Contributor

Regarding --pull always, I think it is indeed not such a good idea to apply it unconditionally, especially because it means WP1 developers could suddenly have lots of moving parts every time they restart the stack. That is usually the cause of many headaches (you are fixing a bug in WP1, you restart the stack, and you get another totally unrelated error because the Zimfarm image has been updated ... it is generally hard to realize that it is unrelated).

I would recommend to not add this flag in commands but add a very clear message saying that developers should regularly pull latest Docker images from the Zimfarm and other components with --pull always (with full command sample). It will also make other commands shorter which helps to understand them.
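
For the earlier question about refreshing only the Zimfarm images, one option is `docker compose pull` with explicit service names instead of `--pull always` on `up`. The sketch below only assembles such a command line. The service names are assumptions borrowed from the container names shown in the `docker ps` output above and may not match the service keys in docker-compose-dev.yml.

```python
import shlex

# Assumed service names; check them against the compose file before use.
ZIMFARM_SERVICES = ["zimfarm-api", "zimfarm-ui", "zimfarm-worker-manager"]


def pull_command(services, compose_file="docker-compose-dev.yml",
                 profiles=("zimfarm", "zimfarm-worker")):
    """Build a `docker compose pull` command limited to the given services."""
    cmd = ["docker", "compose", "-f", compose_file]
    for profile in profiles:
        cmd += ["--profile", profile]
    return cmd + ["pull", *services]


# Print the command so it can be copied into the README or run via subprocess.
print(shlex.join(pull_command(ZIMFARM_SERVICES)))
```

Keeping the pull as a separate, explicit command leaves Redis, MariaDB and the other images untouched on a normal `up`.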

@elfkuzco
Collaborator Author

@audiodude, see the latest commit. Everything works as expected. Here are my settings for the ZIMFARM part of credentials.py:

"ZIMFARM": {
    "url": "http://zimfarm-api/v2",
    "s3_url": "http://localhost:9000/org-kiwix-dev-zims",
    "user": "admin",
    "password": "admin",
    "hook_token": None,
    "definition_version": "1.17.2",
    "image": "ghcr.io/openzim/mwoffliner:1.17.2"
}

I also updated the shell scripts to fetch offliner definitions from farm.openzim.org and populate your local zimfarm API. This gives you many more versions to test against.

Please review and let me know if there's anything else to fix.

@elfkuzco elfkuzco requested a review from audiodude November 14, 2025 00:37
@benoit74
Contributor

I recommend that this ZIMFARM block (or an updated version of it) be committed in the DEV credentials.

My goal for this PR is that:

  • any new developer arriving in WP1 can test ZIM generation from end-to-end without making any modification to the repo, just by running through instructions as clear and as minimal as possible
  • any "experienced" developer working on WP1 knows how to update their (Zimfarm) dev stack as things move in the zimfarm repository / mwoffliner releases
