File exists: '/000000_epoch_shape' when using the ddp strategy from pytorch lightning #767
Hello @elbamos, are you able to loop through the dataloader by itself (meaning a pure for loop, no trainer involved)? If so, does this shared-memory problem show up consistently? And does it show up with another trainer/launcher?
Thanks, @XiaohanZhangCMU. I'm actually able to train fine as long as I'm training on one GPU. The problem arises when I try to train on multiple GPUs using the ddp strategy.
I'm not sure how to do that, because the env vars are only set once we're inside the call to fit(). They appear at the beginning of the call to fit(), which makes me think lightning may be setting them itself.
Yes, they're setting the env vars inside fit().
@XiaohanZhangCMU just tagging you to make sure you saw the messages above... Thank you in advance for your help with this.
I've never used lightning before; I am asking a few folks on the team who may have and can share their experience. On the other hand, if you cannot change anything on the lightning end, maybe try monkeypatching this file to derive the missing env vars from Lightning? For example:
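Something in this spirit (a rough sketch only; the exact env var names streaming reads and the Lightning trainer attributes used here are assumptions you should verify against your versions):

```python
# Sketch: derive the torchrun-style env vars streaming expects from
# Lightning's trainer state, before the StreamingDataset is created.
import os

def set_streaming_env_from_lightning(trainer):
    os.environ["WORLD_SIZE"] = str(trainer.world_size)         # total ranks
    os.environ["RANK"] = str(trainer.global_rank)              # this process's rank
    os.environ["LOCAL_RANK"] = str(trainer.local_rank)         # rank within the node
    os.environ["LOCAL_WORLD_SIZE"] = str(trainer.num_devices)  # ranks per node
```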
Yeah, that explains the file-exists error. Streaming relies on rank to detect workers, nodes, etc.
Actually, I think I solved this.
@elbamos Great. Before closing the issue, can you elaborate a bit more on what the root cause was and the resolution you arrived at? I'm sure it will be valuable learning for other users as well. Thank you!
The root cause of the issue is that pytorch lightning doesn't properly set the env vars that streaming relies on. I have a partial solution with two parts: setting the missing env vars myself from a callback, and creating a pytorch lightning DataModule that instantiates the StreamingDataset.
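The DataModule half looks roughly like this (a sketch of the shape, not the exact code; the class name and constructor arguments are placeholders):

```python
# Sketch: a LightningDataModule that instantiates the StreamingDataset
# inside setup(), i.e. after the DDP worker processes exist, so the rank
# env vars are already in place when streaming inspects them.
import lightning as L
from torch.utils.data import DataLoader
from streaming import StreamingDataset

class StreamingDataModule(L.LightningDataModule):
    def __init__(self, remote: str, local: str, batch_size: int):
        super().__init__()
        self.remote = remote
        self.local = local
        self.batch_size = batch_size

    def setup(self, stage=None):
        self.train_ds = StreamingDataset(
            remote=self.remote,
            local=self.local,
            batch_size=self.batch_size,
            shuffle=True,
        )

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size)
```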
While this runs on hardware with 4 GPUs, performance is seriously degraded: I get 3-4 it/s on one GPU, but .8 it/s on 4 GPUs. It isn't clear to me whether this is caused by a misconfiguration of mosaic streaming, or whether it's to be expected from the ddp strategy. On 8 GPUs, however, the call to instantiate the StreamingDataset fails with a FileExistsError on a shared-memory name like '/000000_locals', where the number preceding "locals" changes each run.
For those reasons, I'm leaving this open, and tagging @XiaohanZhangCMU one more time to see if he has any advice.
One amendment: adding a further tweak to the callback enabled it to launch on 8 GPUs, but performance fell to .26 it/s.
@elbamos sorry, not many of us have hands-on experience with lightning, so there's not much insight we can offer here (do you want to consider switching to composer?). Streaming uses SharedMemory and resource_tracker to orchestrate processes and manipulate shared arrays/scalars, etc. I am not sure whether "create a pytorch lightning DataModule that instantiates the StreamingDataset" complies with that design, which may be the main source of the performance degradation.
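For context on why the error in the title reads the way it does, here is a minimal stdlib-only illustration (the shared-memory name is taken from the title of this issue):

```python
# Creating a named shared memory block whose name already exists raises
# FileExistsError -- the same error reported above.
from multiprocessing import shared_memory

a = shared_memory.SharedMemory(name="000000_epoch_shape", create=True, size=8)
try:
    shared_memory.SharedMemory(name="000000_epoch_shape", create=True, size=8)
except FileExistsError as e:
    print(e)  # [Errno 17] File exists: '/000000_epoch_shape'
finally:
    a.close()
    a.unlink()  # clean up the block so reruns start fresh
```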
I am considering switching to composer; I'm not sure if I can run composer on multiple GPUs from a notebook, though?
Running multi-GPU from inside a notebook messes with streaming's initialization. If you are running a notebook, have you tried TorchDistributor + lightning? E.g.:
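Roughly like this (a sketch only; MyLightningModule and MyDataModule are hypothetical placeholders, and the TorchDistributor arguments should be adapted to your cluster):

```python
# Sketch: launch a Lightning training function across GPUs from a
# Databricks notebook with TorchDistributor (pyspark >= 3.4).
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    import lightning as L
    # MyLightningModule / MyDataModule are hypothetical placeholders.
    trainer = L.Trainer(accelerator="gpu", devices=1, strategy="ddp")
    trainer.fit(MyLightningModule(), datamodule=MyDataModule())

# One process per GPU; TorchDistributor sets RANK, WORLD_SIZE, etc. in each.
TorchDistributor(num_processes=4, local_mode=True, use_gpu=True).run(train_fn)
```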
Thank you for the torch distributor suggestion. That looks like a potentially promising approach, and I was able to get it running with some work. But a problem remains when I create the StreamingDataset.
Hi, I use lightning with Mosaic Streaming. The trick is to launch your training script with torchrun.
@elbamos Can you try torchrun as @jbohnslav suggested? Let us know if it works.
I've been trying that this morning; thank you to both of you. Executing torchrun isn't really possible from my Databricks notebook environment, though. @jbohnslav, can you share any more details of your configuration? Are you building the dataset in a DataModule?
I think you're seeing two separate issues: getting the streaming dataset to work at all with pytorch lightning, and the degraded performance once it does.
I can't help with a databricks notebook environment. If you can't call torchrun at a command line, you can just import it like so:
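For instance (a sketch of the usual pattern; the argument list is illustrative):

```python
# torchrun's entry point can be invoked directly from Python; this is
# equivalent to running `torchrun --nproc_per_node=8 train.py` in a shell.
from torch.distributed.run import main as torchrun

torchrun(["--nproc_per_node=8", "train.py"])
```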
We are building the dataset in a DataModule.
Also getting something similar.
@elbamos As mentioned, torchrun or torch distributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, torch distributor should make launching your job easy.
@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with ps.
Environment
To reproduce
Steps to reproduce the behavior:
Expected behavior
I'd expect training to begin.
Additional context