
Conversation

@tmi (Contributor) commented on Jul 28, 2025:

Just adding a record plus extra content on top of the original architecture document in the project's repo.

@jameshawkes (Contributor) commented:
@tlmquintino @corentincarton @HCookie
Please review so we can close this ADR.

@jameshawkes (Contributor) commented:

ADR002 would need to be given a new number.

@tlmquintino (Member) left a comment:

I think we need to rename the target file so that this ADR gets a sequential number.
You should also try to follow the ADR template, in particular the Analysis section:

Analysis

- [ ] Compared options against relevant criteria (cost, performance, maintainability, risk, etc.)
- [ ] Explained trade-offs and their implications
- [ ] Referenced any prototypes, benchmarks, or research conducted (if applicable)
- [ ] Considered both immediate and long-term impacts

I think this is missing. The ADR should be somewhat self-contained. You can point to a detailed outside analysis, but I should be able to understand the why by reading this ADR, and it is not clear at the moment.

@tmi force-pushed the adr002/cascadeFeatures branch from 772db00 to 63bc491 on November 10, 2025 at 09:50
@tmi requested a review from tlmquintino on November 10, 2025 at 09:51
@HCookie changed the title from 'ADR002: Cascade Features' to 'ADR004: Cascade Features' on Nov 10, 2025
@tlmquintino (Member) left a comment:

The current title, 'Cascade Features', is somewhat misleading. This ADR isn't about the feature list but about why we built the Cascade workflow engine.
We should either describe the features in more detail, or change the title to something like 'Build Strategy for Cascade Workflow Engine'.

2. utilize selected internals while plugging in our own frontends/backends where needed,
3. extend Dask itself with what we were missing.

Option 1 was deemed unacceptable -- at the very least, we need efficient memory sharing between concurrent computing entities (processes/threads) due to the structure of our jobs, as well as generator-like semantics for tasks due to how forecast models output individual steps.
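As a minimal, purely illustrative sketch of what "generator-like semantics for tasks" means here (the `run_forecast` and `postprocess` names and the step payload are hypothetical, not Cascade's actual API): a forecast task yields each step as soon as it is computed, so a downstream consumer can process step 1 while step 2 is still being produced, instead of waiting for one monolithic return value.

```python
from typing import Iterator


def run_forecast(n_steps: int) -> Iterator[dict]:
    """Hypothetical forecast task: emits each step as soon as it is computed."""
    for step in range(1, n_steps + 1):
        field = [0.0] * 4  # placeholder for a model output field
        yield {"step": step, "field": field}


def postprocess(steps: Iterator[dict]) -> None:
    """Downstream consumer handles steps incrementally, keeping memory bounded."""
    for s in steps:
        print(f"processing step {s['step']}")


postprocess(run_forecast(n_steps=3))
```

A task model that only hands over a single return value per task forces either buffering all steps in memory or splitting the model run into artificial per-step tasks, which is roughly the constraint described above.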
A Member commented:

Can you briefly explain why the generator-like semantics are a dealbreaker?

@tmi (Contributor, Author) replied:

updated


## Consequences

It required a few man-months of investment -- though not more than initially expected, and not an order of magnitude more than an adapter to Dask would have required.
A Member commented:

Can we give an approximate estimate of the man-months of investment it took? I think it would make it less vague and support the reasoning.

@tmi (Contributor, Author) replied:

expanded

### Options Considered

First, we considered using Dask as a complete, off-the-shelf solution, and analyzed it in greater detail.
We were aware of other solutions like Ray, Daft, and Spark, but didn't thoroughly analyze them due to resource constraints as well as their assumed similarity to Dask.
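For context, a minimal sketch (assuming Dask is installed; task names and arguments are hypothetical) of what option 1, using Dask as-is, looks like: each task exposes its single result to downstream tasks only once it has finished, so all forecast steps are buffered inside the task.

```python
from dask import delayed


@delayed
def run_model(n_steps: int) -> list:
    # In Dask's default model a task exposes its (single) result to downstream
    # tasks only when it returns, so every step is buffered here first.
    return [{"step": i} for i in range(1, n_steps + 1)]


@delayed
def postprocess(steps: list) -> list:
    return [s["step"] for s in steps]


graph = postprocess(run_model(3))
print(graph.compute())  # [1, 2, 3]
```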
A Member commented:

I think we need some stronger reasoning for why Ray, Daft, and Spark were not taken.
For example (please check the accuracy of the arguments below):

- Ray is a compute engine that uses a flexible Task/Actor model with an efficient shared-memory object store. It has better memory handling than Dask, but is unsuitable because its core primitives would require significant refactoring to accommodate the stateful, generator-like (streaming) semantics of the AIFS forecast models (see the sketch after this list).
- Spark is a mature, heavy-duty distributed processing system aimed primarily at large-scale batch processing and data analytics. Its "streaming" would be based on micro-batches, with possibly significant operational overhead. Moreover, it adds a JVM dependency.
- Daft is a specialised, high-performance distributed DataFrame library built for multimodal data (images, video, etc.). It was not considered a viable alternative because it operates at the data-processing abstraction layer and is not designed as a low-level workflow scheduling and execution engine like Cascade. It uses Ray for large clusters.
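Purely as an illustration of the first bullet (a sketch, assuming Ray is installed; the `ForecastActor` class and its per-step payload are hypothetical), streaming forecast output would typically be expressed as state held in an actor and pulled step by step via remote calls, rather than written as a plain generator task:

```python
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class ForecastActor:
    """Holds model state between calls; each step is pulled with a separate remote call."""

    def __init__(self, n_steps: int):
        self.n_steps = n_steps
        self.current = 0

    def next_step(self):
        if self.current >= self.n_steps:
            return None  # run exhausted
        self.current += 1
        return {"step": self.current, "field": [0.0] * 4}  # placeholder output


actor = ForecastActor.remote(n_steps=3)
while (step := ray.get(actor.next_step.remote())) is not None:
    print("received step", step["step"])
```

Restructuring the forecast loop into this call-per-step pattern is the kind of refactoring the bullet refers to.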

@tmi (Contributor, Author) replied:

> I think we need some stronger reasoning for why Ray, Daft, and Spark were not taken.

Are we revisiting the decision, or improving the decision-recording process? If the latter, I should not be putting in results of an analysis we hadn't made at that time :) I chose not to explore any of the mentioned options deeply because of a "nah, won't work" gut feeling, not because of strong reasoning supported by numbers. I have expanded the text with the overall architecture experience that the gut feeling was based on at the time, which makes it somewhat justified, but I shouldn't be adding things I didn't know at the time, or researching things I don't know even now.

Or we can of course say, for the sake of being sure we made the right decision, let's do the analysis thoroughly now -- which could result either in an objective confirmation of the gut feeling, or in an amendment along the lines of "it later turned out that Ray would have performed acceptably and would not have been hard to tweak, so our decision at the time was sub-optimal". I am currently not doing this, but would happily do so if you believe it is worth it. Note that Ray was adopted by the PyTorch Foundation a few weeks ago, so presumably we'll see more open-source development there, compared to the situation last year, when Anyscale, the company behind Ray, was actually deprecating originally open-sourced components! So this is not necessarily an unstrategic move. Similarly, torch monarch didn't exist back then and Python 3.14 wasn't released -- all of those are potential game changers. We will therefore want to repeat this analysis; the question is whether we do it now or in a year or more, i.e., within this ADR or within the next one.

@tmi force-pushed the adr002/cascadeFeatures branch from 63bc491 to f295244 on December 4, 2025 at 14:57