File exists: '/000000_epoch_shape' when using the ddp strategy from pytorch lightning #767
Hello @elbamos, are you able to loop through the dataloader by itself (meaning a pure for loop, no trainer involved)? If so, does this shared-memory problem show up consistently? And does it show up with another trainer/launcher?
Thanks, @XiaohanZhangCMU. I'm actually able to train fine as long as I'm training on one GPU. The problem arises when I try to train on multiple GPUs using the ddp strategy.
I'm not sure how to do that, because the env vars are only set once we're inside the call to fit(). They appear at the beginning of the call to fit(), which makes me think lightning may be setting them itself.
Yes, they're setting the env vars inside fit().
@XiaohanZhangCMU just tagging you to make sure you saw the messages above... Thank you in advance for your help with this.
I've never used lightning before; I am asking a few folks on the team who may have and can share their experience. On the other hand, if you cannot change anything on the lightning end, maybe try monkeypatching this file to derive the missing env vars from Lightning? For example:
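Something in this spirit (a rough sketch only; the exact env var names streaming reads and the Lightning trainer attributes used here are assumptions you should verify against your versions):

```python
# Sketch: derive the torchrun-style env vars streaming expects from
# Lightning's trainer state, before the StreamingDataset is created.
import os

def set_streaming_env_from_lightning(trainer):
    os.environ["WORLD_SIZE"] = str(trainer.world_size)         # total ranks
    os.environ["RANK"] = str(trainer.global_rank)              # this process's rank
    os.environ["LOCAL_RANK"] = str(trainer.local_rank)         # rank within the node
    os.environ["LOCAL_WORLD_SIZE"] = str(trainer.num_devices)  # ranks per node
```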
Yeah, that explains the file-exists error. Streaming relies on rank to detect workers, nodes, etc.
Actually, I think I solved this.
@elbamos Great. Before closing the issue, can you elaborate a bit more on what the root cause was and the resolution you arrived at? I'm sure it will be valuable learning for other users as well. Thank you!
The root cause of the issue is that pytorch lightning doesn't properly set the env vars that streaming relies on. I have a partial solution with two parts: setting the missing env vars myself from a callback, and creating a pytorch lightning DataModule that instantiates the StreamingDataset.
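The DataModule half looks roughly like this (a sketch of the shape, not the exact code; the class name and constructor arguments are placeholders):

```python
# Sketch: a LightningDataModule that instantiates the StreamingDataset
# inside setup(), i.e. after the DDP worker processes exist, so the rank
# env vars are already in place when streaming inspects them.
import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset

class StreamingDataModule(L.LightningDataModule):
    def __init__(self, remote: str, local: str, batch_size: int):
        super().__init__()
        self.remote = remote
        self.local = local
        self.batch_size = batch_size

    def setup(self, stage=None):
        self.train_ds = StreamingDataset(
            remote=self.remote,
            local=self.local,
            batch_size=self.batch_size,
            shuffle=True,
        )

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size)
```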
While this runs on hardware with 4 GPUs, performance is seriously degraded: I get 3-4 it/s on one GPU, but .8 it/s on 4 GPUs. It isn't clear to me whether this is caused by a misconfiguration of mosaic streaming, or whether it's to be expected from the ddp strategy. On 8 GPUs, however, the call to instantiate the StreamingDataset fails with a FileExistsError on a shared-memory name like '/000000_locals', where the number preceding "locals" changes each run.
For those reasons, I'm leaving this open, and tagging @XiaohanZhangCMU one more time to see if he has any advice.
One amendment: adding a further tweak to the callback enabled it to launch on 8 GPUs, but performance fell to .26 it/s.
@elbamos sorry, not many of us have hands-on experience with lightning, so there's not much insight we can offer here (do you want to consider switching to composer?). Streaming uses SharedMemory and resource_tracker to orchestrate processes and manipulate shared arrays/scalars, etc. I am not sure whether "create a pytorch lightning DataModule that instantiates the StreamingDataset" complies with that design, which may be the main source of the performance degradation.
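For context on why the error in the title reads the way it does, here is a minimal stdlib-only illustration (the shared-memory name is taken from the title of this issue):

```python
# Creating a named shared memory block whose name already exists raises
# FileExistsError -- the same error reported above.
from multiprocessing import shared_memory

a = shared_memory.SharedMemory(name="000000_epoch_shape", create=True, size=8)
try:
    shared_memory.SharedMemory(name="000000_epoch_shape", create=True, size=8)
except FileExistsError as e:
    print(e)  # [Errno 17] File exists: '/000000_epoch_shape'
finally:
    a.close()
    a.unlink()  # clean up the block so reruns start fresh
```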
I am considering switching to composer; I'm not sure if I can run composer on multiple GPUs from a notebook, though?
Running multi-GPU from inside a notebook messes with streaming's initialization. If you are running a notebook, have you tried TorchDistributor + lightning? E.g.:
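Roughly like this (a sketch only; MyLightningModule and MyDataModule are hypothetical placeholders, and the TorchDistributor arguments should be adapted to your cluster):

```python
# Sketch: launch a Lightning training function across GPUs from a
# Databricks notebook with TorchDistributor (pyspark >= 3.4).
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    import lightning as L
    # MyLightningModule / MyDataModule are hypothetical placeholders.
    trainer = L.Trainer(accelerator="gpu", devices=1, strategy="ddp")
    trainer.fit(MyLightningModule(), datamodule=MyDataModule())

# One process per GPU; TorchDistributor sets RANK, WORLD_SIZE, etc. in each.
TorchDistributor(num_processes=4, local_mode=True, use_gpu=True).run(train_fn)
```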
Thank you for the torch distributor suggestion. That looks like a potentially promising approach, and I was able to get it running with some work. But a problem remains when I create the StreamingDataset.
Hi, I use lightning with Mosaic Streaming. The trick is to launch your training script with torchrun.
@elbamos Can you try torchrun as @jbohnslav suggested? Let us know if it works.
I've been trying that this morning; thank you to both of you. Executing torchrun isn't really possible from my Databricks notebook environment, though. @jbohnslav, can you share any more details of your configuration? Are you building the dataset in a DataModule?
I think you're seeing two separate issues: getting the streaming dataset to work at all with pytorch lightning, and the degraded performance once it does.
I can't help with a databricks notebook environment. If you can't call torchrun at a command line, you can just import it like so:
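For instance (a sketch of the usual pattern; the argument list is illustrative):

```python
# torchrun's entry point can be invoked directly from Python; this is
# equivalent to running `torchrun --nproc_per_node=8 train.py` in a shell.
from torch.distributed.run import main as torchrun

torchrun(["--nproc_per_node=8", "train.py"])
```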
We are building the dataset in a DataModule.
Also getting something similar.
@elbamos As mentioned, torchrun or torch distributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, torch distributor should make launching your job easy.
@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with ps.
Environment
To reproduce
Steps to reproduce the behavior:
Expected behavior
I'd expect training to begin.
Additional context