ADR004: Cascade Features #23
base: main
Conversation
@tlmquintino @corentincarton @HCookie
ADR002 would need to be given a new number
tlmquintino
left a comment
I think we need to change the target file so that the ADR gets a sequential number.
You should also try to follow the ADR template, in particular the Analysis section:
Analysis
[ ] Compared options against relevant criteria (cost, performance, maintainability, risk, etc.)
[ ] Explained trade-offs and their implications
[ ] Referenced any prototypes, benchmarks, or research conducted (if applicable)
[ ] Considered both immediate and long-term impacts
I think this is missing. The ADR should be somewhat self-contained. You can point to a detailed outside analysis, but I should be able to understand the why by reading this ADR, and it is not clear at the moment.
772db00 to 63bc491
tlmquintino
left a comment
The current title, 'Cascade Features', is somewhat misleading. This ADR isn't about the feature list but about why we built the Cascade workflow engine.
We should either describe the features in more detail, or change the title to something like 'Build Strategy for Cascade Workflow Engine'.
ADR/ADR-004-Cascade-Features.md
Outdated
2. utilize selected internals while plugging in our own frontends/backends where needed,
3. extending Dask itself with what we were missing.

Option 1 was deemed unacceptable -- at the least, we need efficient memory sharing between concurrent computing entities (processes/threads) due to the structure of our jobs, as well as generator-like semantics for tasks due to how forecast models output individual steps.
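To make the memory-sharing requirement concrete, here is a minimal sketch of zero-copy data exchange between processes via a named shared-memory block, using only the Python standard library. The names `make_shared` and `read_shared` are hypothetical, for illustration only; this is not Cascade's actual mechanism:

```python
from multiprocessing import shared_memory
import struct

def make_shared(values):
    """Pack floats into a named shared-memory block; any process can attach by name."""
    payload = struct.pack(f"{len(values)}d", *values)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm  # keep the handle alive; shm.name identifies the block

def read_shared(name, n):
    """Attach to an existing block by name and read n floats without serialising."""
    shm = shared_memory.SharedMemory(name=name)
    values = list(struct.unpack_from(f"{n}d", shm.buf))
    shm.close()
    return values
```

A scheduler without such a mechanism must serialise intermediate results between workers, which is costly for the large arrays forecast jobs produce.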
Can you briefly explain why the generator-like semantics are a dealbreaker?
updated
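To illustrate the generator-like task semantics under discussion: a forecast model emits steps one at a time, and downstream tasks should be able to consume each step as soon as it exists, rather than waiting for the whole run to finish. A minimal sketch with hypothetical names (not Cascade's actual API):

```python
def forecast_model(n_steps):
    """Stand-in for a model run that yields one forecast step at a time."""
    state = 0.0
    for step in range(n_steps):
        state += 1.0  # stand-in for one model integration step
        yield step, state  # downstream work can start before later steps exist

def postprocess(model_run):
    """Consume steps as they arrive, e.g. to write products incrementally."""
    return [(step, value * 2) for step, value in model_run]
```

With batch-only task semantics, `postprocess` could not begin until all `n_steps` outputs were materialised, increasing peak memory use and delaying delivery of the early products.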
ADR/ADR-004-Cascade-Features.md
Outdated
## Consequences

It required a few man-months of investment -- though not more than initially expected, and not an order of magnitude more than an adapter to Dask would have required.
Can we give an approximate estimate of the man-months of investment it took? I think it would make it less vague and support the reasoning.
expanded
ADR/ADR-004-Cascade-Features.md
Outdated
### Options Considered

Firstly, we considered using Dask as a wholesale offering, and analyzed it in greater detail.
We were aware of other solutions like Ray, Daft, and Spark, but didn't thoroughly analyse them due to resource constraints as well as assumed similarity to Dask.
I think we need stronger reasoning for why Ray, Daft and Spark were not chosen.
For example (please check the accuracy of the arguments below):
- Ray is a compute engine that uses a flexible Task/Actor model with an efficient shared-memory object store. It has better memory handling than Dask, but is unsuitable because its core primitives would require significant refactoring to accommodate the stateful, generator-like (streaming) semantics of the AIFS forecast models.
- Spark is a mature, heavy-duty distributed processing system, primarily for large-scale batch processing and data analytics. Its "streaming" would be based on micro-batches, with possibly significant operational overhead. Moreover, it adds a JVM dependency.
- Daft is a specialised, high-performance distributed DataFrame library built for multimodal data (images, video, etc.). It was not considered a viable alternative because it operates at the data-processing abstraction layer and is not designed as a low-level workflow scheduling and execution engine like Cascade. It uses Ray for large clusters.
I think we need stronger reasoning for why Ray, Daft and Spark were not chosen.
Are we revisiting the decision, or improving the decision-recording process? If the latter, I should not be putting here the results of an analysis we hadn't made at that time :) I chose not to explore any of the mentioned ones deeply because of a "nah, won't work" gut feeling, not because of strong reasoning supported by numbers. I have expanded the text with the overall architecture experience on which that gut feeling was based at the time, and which makes it somewhat justified, but I shouldn't be putting here things I didn't know at that time, or researching things I don't know even now.
Or we can of course say, for the sake of being sure we made the right decision, let's do the analysis thoroughly now -- which could result either in an objective confirmation of the gut feeling, or in an amendment along the lines of "it later turned out that Ray would have performed acceptably and would not have been hard to tweak, and our decision at the time was thus sub-optimal". I am currently not doing this, but would happily do so if you believe it worth it. Note that Ray was adopted by the PyTorch Foundation a few weeks ago, so presumably we'll see more open-source development there, compared to the situation last year, when Anyscale, the company behind Ray, was actually deprecating originally-open-sourced components! So this is not necessarily an unstrategic move. Similarly, torch monarch didn't exist back then, and Python 3.14 wasn't released -- all of these are potential game changers. Thus we will want to repeat this analysis -- the question is whether to do it now or in a year+, i.e., within this ADR or within the next one.
63bc491 to f295244
Just adding a record + extra content on top of the original architecture document in the project's repo