
Conversation

@tmi (Contributor) commented on Jul 28, 2025:

Just adding a record plus extra content on top of the original architecture document in the project's repo.

@jameshawkes (Contributor) commented:
@tlmquintino @corentincarton @HCookie
Please review so we can close this ADR.

@jameshawkes (Contributor) commented:

ADR002 would need to be given a new number.

@tlmquintino (Member) left a comment:

I think we need to rename the target file so that this ADR gets a sequential number.
You should also try to follow the ADR template, in particular the Analysis section:

Analysis

- [ ] Compared options against relevant criteria (cost, performance, maintainability, risk, etc.)
- [ ] Explained trade-offs and their implications
- [ ] Referenced any prototypes, benchmarks, or research conducted (if applicable)
- [ ] Considered both immediate and long-term impacts

I think this is missing. The ADR should be somewhat self-contained. You can point to a detailed outside analysis, but I should be able to understand the why by reading this ADR, and it is not clear at the moment.

@tmi force-pushed the adr002/cascadeFeatures branch from 772db00 to 63bc491 on November 10, 2025 at 09:50
@tmi requested a review from tlmquintino on November 10, 2025 at 09:51
@HCookie changed the title from 'ADR002: Cascade Features' to 'ADR004: Cascade Features' on Nov 10, 2025
@tlmquintino (Member) left a comment:

The current title, 'Cascade Features', is somewhat misleading. This ADR isn't about the feature list but about why we built the Cascade workflow engine.
We should either describe the features in more detail, or change the title to something like 'Build Strategy for Cascade Workflow Engine'.

2. utilize selected internals while plugging in our own frontends/backends where needed,
3. extend Dask itself with what we were missing.

Option 1 was deemed unacceptable -- at the very least, we need efficient memory sharing between concurrent computing entities (processes/threads) due to the structure of our jobs, as well as generator-like semantics for tasks due to how forecast models output individual steps.
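As a minimal, purely illustrative sketch of what "generator-like semantics for tasks" means here (the `run_forecast` and `postprocess` names and the step payload are hypothetical, not Cascade's actual API): a forecast task yields each step as soon as it is computed, so a downstream consumer can process step 1 while step 2 is still being produced, instead of waiting for one monolithic return value.

```python
from typing import Iterator


def run_forecast(n_steps: int) -> Iterator[dict]:
    """Hypothetical forecast task: emits each step as soon as it is computed."""
    for step in range(1, n_steps + 1):
        field = [0.0] * 4  # placeholder for a model output field
        yield {"step": step, "field": field}


def postprocess(steps: Iterator[dict]) -> None:
    """Downstream consumer handles steps incrementally, keeping memory bounded."""
    for s in steps:
        print(f"processing step {s['step']}")


postprocess(run_forecast(n_steps=3))
```

A task model that only hands over a single return value per task forces either buffering all steps in memory or splitting the model run into artificial per-step tasks, which is roughly the constraint described above.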
A Member commented:

Can you briefly explain why the generator-like semantics are a dealbreaker?

@tmi (Contributor, Author) replied:

updated


## Consequences

It required a few man-months of investment -- though not more than initially expected, and not an order of magnitude more than an adapter to Dask would have required.
A Member commented:

Can we give an approximate estimate of the man-months of investment it took? I think it would make it less vague and support the reasoning.

@tmi (Contributor, Author) replied:

expanded

### Options Considered

First, we considered using Dask as a complete, off-the-shelf solution, and analyzed it in greater detail.
We were aware of other solutions like Ray, Daft, and Spark, but didn't thoroughly analyze them due to resource constraints as well as their assumed similarity to Dask.
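For context, a minimal sketch (assuming Dask is installed; task names and arguments are hypothetical) of what option 1, using Dask as-is, looks like: each task exposes its single result to downstream tasks only once it has finished, so all forecast steps are buffered inside the task.

```python
from dask import delayed


@delayed
def run_model(n_steps: int) -> list:
    # In Dask's default model a task exposes its (single) result to downstream
    # tasks only when it returns, so every step is buffered here first.
    return [{"step": i} for i in range(1, n_steps + 1)]


@delayed
def postprocess(steps: list) -> list:
    return [s["step"] for s in steps]


graph = postprocess(run_model(3))
print(graph.compute())  # [1, 2, 3]
```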
A Member commented:

I think we need some stronger reasoning for why Ray, Daft, and Spark were not taken.
For example (please check the accuracy of the arguments below):

- Ray is a compute engine that uses a flexible Task/Actor model with an efficient shared-memory object store. It has better memory handling than Dask, but is unsuitable because its core primitives would require significant refactoring to accommodate the stateful, generator-like (streaming) semantics of the AIFS forecast models (see the sketch after this list).
- Spark is a mature, heavy-duty distributed processing system aimed primarily at large-scale batch processing and data analytics. Its "streaming" would be based on micro-batches, with possibly significant operational overhead. Moreover, it adds a JVM dependency.
- Daft is a specialised, high-performance distributed DataFrame library built for multimodal data (images, video, etc.). It was not considered a viable alternative because it operates at the data-processing abstraction layer and is not designed as a low-level workflow scheduling and execution engine like Cascade. It uses Ray for large clusters.
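Purely as an illustration of the first bullet (a sketch, assuming Ray is installed; the `ForecastActor` class and its per-step payload are hypothetical), streaming forecast output would typically be expressed as state held in an actor and pulled step by step via remote calls, rather than written as a plain generator task:

```python
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class ForecastActor:
    """Holds model state between calls; each step is pulled with a separate remote call."""

    def __init__(self, n_steps: int):
        self.n_steps = n_steps
        self.current = 0

    def next_step(self):
        if self.current >= self.n_steps:
            return None  # run exhausted
        self.current += 1
        return {"step": self.current, "field": [0.0] * 4}  # placeholder output


actor = ForecastActor.remote(n_steps=3)
while (step := ray.get(actor.next_step.remote())) is not None:
    print("received step", step["step"])
```

Restructuring the forecast loop into this call-per-step pattern is the kind of refactoring the bullet refers to.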

@tmi (Contributor, Author) replied:

> I think we need some stronger reasoning for why Ray, Daft, and Spark were not taken.

Are we revisiting the decision, or improving the decision-recording process? If the latter, I should not be putting in results of an analysis we hadn't made at that time :) I chose not to explore any of the mentioned options deeply because of a "nah, won't work" gut feeling, not because of strong reasoning supported by numbers. I have expanded the text with the overall architecture experience that the gut feeling was based on at the time, which makes it somewhat justified, but I shouldn't be adding things I didn't know at the time, or researching things I don't know even now.

Or we can of course say, for the sake of being sure we made the right decision, let's do the analysis thoroughly now -- which could result either in an objective confirmation of the gut feeling, or in an amendment along the lines of "it later turned out that Ray would have performed acceptably and would not have been hard to tweak, so our decision at the time was sub-optimal". I am currently not doing this, but would happily do so if you believe it is worth it. Note that Ray was adopted by the PyTorch Foundation a few weeks ago, so presumably we'll see more open-source development there, compared to the situation last year, when Anyscale, the company behind Ray, was actually deprecating originally open-sourced components! So this is not necessarily an unstrategic move. Similarly, torch monarch didn't exist back then and Python 3.14 wasn't released -- all of those are potential game changers. We will therefore want to repeat this analysis; the question is whether we do it now or in a year or more, i.e., within this ADR or within the next one.

@tmi force-pushed the adr002/cascadeFeatures branch from 63bc491 to f295244 on December 4, 2025 at 14:57