Distributed processing and scaling up #550
Replies: 3 comments
-
Hi @rom1504, thanks for starting this discussion!
Thanks for your work on those! We've used some of your tools and included them in some of our components. We opted to copy in parts of the tools instead of using them as dependencies because of dependency conflicts. I'll open an issue on one of your repos with more info.
There are indeed some similarities with Jina, but I think the angle is a bit different. Jina focuses on hosting your models and allows you to chain them into pipelines. Fondant focuses on reusable data processing, with the possibility to host models inside components. It will be interesting to see how much those angles converge. Our focus on reusable data processing is also the reason for Docker and our well-defined component structure, which help with the interoperability and reusability of components.
There are a couple of aspects to this:
When running on Linux, there is virtually no overhead. On Mac or Windows there might be, but that's not really relevant for scaling. The only place where there might be overhead is on the networking side. We've noticed that we don't achieve the same performance downloading images as img2dataset, but we haven't had the time to properly investigate where the issue lies. We think it might be due to Docker, but if it is, it might be resolved by proper network configuration (although that might not be possible on every orchestrator). If you want to get a feel yourself for the points mentioned above, I recommend giving our local runner a spin (see our Getting started docs).
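If Docker's default bridge networking does turn out to be the bottleneck, host networking is one configuration worth testing on Linux. This is a sketch, not a confirmed fix; the image name is a placeholder, and `--network host` is only honored on Linux hosts:

```shell
# The default bridge network NATs every outbound connection, which can
# limit throughput for download-heavy components.
# Host networking shares the host's network stack and skips the NAT hop.
docker run --network host my-download-component:latest
```

On Kubernetes-based orchestrators the equivalent knob would be pod-level host networking, which may not be allowed by every cluster's policy.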
**Design**

Proper design will still be important to be able to scale pipelines and components. We've seen this, for instance, when extracting URLs from Common Crawl, where we had to combine a lot of steps into a single component to prevent large data movement, while smaller parts of the flow could have been reusable for other use cases. Or when doing global deduplication, where we had to cluster the data in a first component and then do local deduplication per cluster in a second component, splitting a single logical step into two components. Supporting nested pipelines (including a pipeline as a component in a larger pipeline) might offer some benefits here in the future. But mainly the reusability of components will: since a component can be implemented once and reused many times, it only needs to be designed properly once.

**Dask**

We currently get two things from Dask:

You can see both of these combined in our

Concurrency on a single core is not handled by Dask and currently needs to be implemented manually, but we might offer an abstraction for this in Fondant in the future. Dask might not always be the best choice for every data type or transformation, so we made sure to encapsulate the Dask-related code in specific dataIO classes. This allows us to support additional frameworks in the future.

**Kubeflow Pipelines**

Kubeflow Pipelines is only one of the orchestrators we support, but we can use it as an example, as the other orchestrators we (will) support are very similar. The defining features are:
This is also currently the limit to our scaling.
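To make the two-phase deduplication example above concrete, here is a minimal sketch in plain Python. The function names and the hash-based clustering are illustrative, not Fondant's actual implementation; real pipelines would cluster on embeddings, with each phase running as its own component:

```python
import hashlib

def cluster_key(item, n_clusters=4):
    # Phase 1: assign each item to a cluster. A hash of the item stands
    # in for real clustering on features or embeddings; identical items
    # always land in the same cluster.
    digest = hashlib.md5(item.encode()).hexdigest()
    return int(digest, 16) % n_clusters

def deduplicate(items, n_clusters=4):
    # First component: bucket items into clusters.
    clusters = {}
    for item in items:
        clusters.setdefault(cluster_key(item, n_clusters), []).append(item)
    # Second component: deduplicate locally within each cluster.
    # Because duplicates share a cluster, per-cluster deduplication
    # is sufficient for exact global deduplication, and each cluster
    # can be processed independently (and in parallel).
    result = []
    for bucket in clusters.values():
        seen = set()
        for item in bucket:
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result
```

The payoff of the split is that phase 2 never needs to see the whole dataset at once, only one cluster at a time.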
I can see paths to move beyond this single-machine limit though:
This became quite long 😅, but it was helpful for me as well to write this down in a (hopefully) structured way. Looking forward to your feedback.
-
Turning this into a discussion so we can track the progress towards distributed execution in #549
-
Hey, sorry for the delay, and thanks for the long answer!
That makes sense. Is that true as well if you run, say, 100 different Fondant components? How do they get instantiated? Are they all just functions in a single Python process? I'm not sure how this works in that case.
That's the part I think would be most interesting to figure out sooner.
That would mean multiple jobs (for example Dask jobs) for one Fondant pipeline. That may make sense in some cases (if some components require very different resources, for example), but I bet in many cases running one job for all components would make the most sense.
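To illustrate the single-job idea, here is a minimal sketch in plain Python. The component functions and names are hypothetical, not Fondant's or Dask's actual API; the point is that one job for the whole pipeline keeps intermediate data in memory instead of materializing it between separate jobs:

```python
# Hypothetical "components": each is a function from a list of records
# to a list of records.
def download(records):
    # Pretend to fetch each URL and attach the bytes.
    return [dict(r, image=f"bytes-of-{r['url']}") for r in records]

def resize(records):
    return [dict(r, image=r["image"] + "-resized") for r in records]

def drop_small(records):
    return [r for r in records if "small" not in r["url"]]

def run_single_job(records, components):
    # One job for the whole pipeline: data flows component to component
    # in memory, with no per-component job startup or storage round-trip.
    for component in components:
        records = component(records)
    return records

out = run_single_job(
    [{"url": "a.jpg"}, {"url": "small-b.jpg"}],
    [download, resize, drop_small],
)
```

The multi-job alternative would write the output of each component to storage and start a fresh job for the next one, which buys isolation and per-component resources at the cost of startup and I/O overhead per step.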
Do you mean Fondant would re-implement its own Dask/Spark equivalent? That means handling failure and retry, for example (Spark does that by storing intermediate data, being able to recompute only what's necessary, etc.).

Overall I think two points are really important for this kind of general data processing tooling. I think they are interesting to figure out and keep improving:
I'll keep following what you're doing here, and also keep thinking myself about ways to achieve the above. Good luck with this, and happy to give feedback if that's useful!
-
Hello!
I like the goals you have with Fondant. I also believe data processing at scale is quite important for ML.
I have similar goals with img2dataset, clip-retrieval, video2dataset and cc2dataset and these tools worked pretty well at scale.
As you've seen, a lot of different filtering and transformation steps are possible, and making them modular and reusable is nice. That's true for text and images, and it's even more the case for bigger modalities such as video, 3D, and bio data.
You made the choice here to package everything with Docker, use Kubeflow and Dask, and have a well-defined component structure. I find the architecture quite similar to https://github.com/jina-ai/jina
I am wondering what you found in terms of
Thank you for any insights!
PS: maybe this should go to the discussion tab; feel free to move it there if it makes more sense