Added in the revised DL components#127

Open
choosehappy wants to merge 22 commits into v2.0 from v2.0-dlfeat

Conversation

@choosehappy
Owner

Looks to be done - haven't actually run it though since i don't have the environment set up on my computer

hoping you can do a full review, make whatever modifications, and then we take it from there!

While there is a lot of added functionality - the main core remained essentially the same

Collaborator

@jacksonjacobs1 jacksonjacobs1 left a comment

Looks mostly good to me, but as it stands the code causes several errors on my side.

I left a few questions - please let me know what you think, and I can resolve the bugs on my side.

# Only yield patches with sufficient mask coverage
if patch_mask.sum() > 0:
# Compute HV map for this patch
hv_map = compute_hv_map(patch_mask)
Collaborator

Is there a reason why this is done on 512x512 patches rather than on the full tile? The maximum size of a tile is 2048x2048, and it seems like this function grows linearly wrt. mask image area.

either way, I would also recommend caching the hv_map. Do you agree?

Owner Author

512 x 512 has a number of benefits - primarily it enriches the dataset for regions which actually have annotations. if there are a few 512 x 512 patches which have annotations, then you also get a stronger, less noisy gradient/derivative. many of the introduced losses are more valuable when there is at least 1 positive region present. as well, by breaking it into smaller patches, we have better control over memory consumption (batch size here is now 4 instead of 1, but looks like it could be increased further). i don't really expect tile size to be much larger than 2k x 2k.

i am unsure about caching the hv_map - here it is being computed on only the positive patches, but we don't cache positive patches, we cache tiles. so the options would be to either cache patches (not attractive) or cache tile + hv_map at the tile level - however, if most of the tile is not annotated anyway (i.e., there are only a few positive objects), then we end up doing a lot of hv_map computation for a small fraction of the data. note that if there is no annotation present, hv_map == 0, so it is typically quite sparse. i suspect when the system is just starting, computing it dynamically on the small subset of patches where it is valid is more computationally efficient than computing it on the entire tile. i haven't thought about how that pattern changes when a large % of the data is annotated - but at the same time, the system should be working well at that point, so the impact on the user is limited. thoughts?
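For illustration, the compute-on-positive-patches pattern could be sketched like this; `compute_hv_map` here is a simplified HoVer-Net-style stand-in, not the repo's actual implementation:

```python
import numpy as np

def compute_hv_map(mask):
    # Simplified stand-in: horizontal/vertical offsets from the mask
    # centroid, normalized to [-1, 1]; zero wherever the mask is empty.
    hv = np.zeros((2,) + mask.shape, dtype=np.float32)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return hv  # no annotation -> hv_map == 0, as noted above
    cy, cx = ys.mean(), xs.mean()
    hv[0, ys, xs] = (xs - cx) / max(np.abs(xs - cx).max(), 1.0)
    hv[1, ys, xs] = (ys - cy) / max(np.abs(ys - cy).max(), 1.0)
    return hv

def iter_positive_patches(tile_mask, patch=512):
    # hv_map is only computed for patches containing any annotation,
    # so mostly-empty tiles cost almost nothing.
    h, w = tile_mask.shape
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            pm = tile_mask[y:y + patch, x:x + patch]
            if pm.sum() > 0:
                yield (y, x), pm, compute_hv_map(pm)
```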

Collaborator

The generation of the hv_map doesn't seem to be substantially slowing down training, so I'm happy to leave caching as an optimization todo.

Owner Author

: ) great, a problem for future us : )

masks=[patch_mask, hv_map]
)
patch_img = augmented['image']
patch_mask, hv_map = augmented['masks']
Collaborator

After transforms, is patch_mask still a binary ndarray? I'm getting errors like this:

(DLActor pid=135147)   File "/opt/QuickAnnotator/quickannotator/dl/loss.py", line 679, in _hierarchical_prototype_loss
(DLActor pid=135147)     neg_idx = (~mask_flat).nonzero(as_tuple=True)[0]
(DLActor pid=135147) TypeError: ~ (operator.invert) is only implemented on integer and Boolean-type tensors

Owner Author

indeed, this is strange - it should be either an integer [0,1] or a boolean - what type is it actually?
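If the transforms are interpolating the mask into floats, a defensive cast before the invert would sidestep the TypeError - a minimal sketch (function name hypothetical):

```python
import torch

def negative_indices(mask_flat):
    # `~` is only defined for bool/int tensors; interpolation-based
    # transforms can silently turn a binary mask into floats, so
    # threshold floats back to bool before inverting.
    if mask_flat.is_floating_point():
        mask_bool = mask_flat > 0.5
    else:
        mask_bool = mask_flat.bool()
    return (~mask_bool).nonzero(as_tuple=True)[0]
```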

hv_maps = batch_data[2]

# Move to device and normalize images to [0, 1]
images = images.half().to(device) / 255.0
Collaborator

@jacksonjacobs1 jacksonjacobs1 Feb 2, 2026

I kept getting the following error on my side:

(DLActor pid=219235)   File "/opt/QuickAnnotator/quickannotator/dl/loss.py", line 628, in forward
(DLActor pid=219235)     raise ValueError(f"Loss is NaN/Inf! Check individual loss components for issues.")
(DLActor pid=219235) ValueError: Loss is NaN/Inf! Check individual loss components for issues.
(DLActor pid=219235) Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2026-01-29_21-46-44' in 0.0051s.
(DLActor pid=219235) Trials did not complete: [TorchTrainer_0f0af_00000]
(DLActor pid=219235) 
(DLActor pid=219235) Training errored after 0 iterations at 2026-01-29 21:49:26. Total running time: 2min 41s

I narrowed down the issue and saw that the model's pred output sometimes included tensors filled with NaN values, causing downstream problems when computing loss.

I believe that this problem was due to incorrect usage of the autocast method. According to the pytorch documentation:

When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.

I then removed .half(), and am no longer receiving NaN errors.
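For reference, the pattern the docs describe - float32 model and inputs, with autocast choosing per-op dtypes - looks roughly like this sketch:

```python
import torch

model = torch.nn.Conv2d(3, 8, 3, padding=1)  # model stays float32
raw = torch.randint(0, 256, (2, 3, 64, 64), dtype=torch.uint8)

device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device_type)
# normalize to float32 in [0, 1]; notably, no .half() on model or inputs
images = raw.to(device_type).float() / 255.0

with torch.autocast(device_type=device_type):
    pred = model(images)  # autocast downcasts per-op where it is safe
```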

Owner Author

hmmm okay - according to chatgpt it should be possible: "For training, usually: float32 model + half images + autocast() is safe and gives memory/throughput gains", but i suppose given our system we should aim for robustness if possible. would be interesting to know which of the losses is causing the NaN - this is another reason why i separated them in the tensorboard; when a NaN is encountered, it puts a triangle - just need to spot the first triangle and that tells you where the error is. could be an edge case that needs some if/edge-catch statements or whatnot

Collaborator

I am occasionally still seeing NaN predictions even after removing the .half() call.

Looking at the tensorboard, I don't see any NAN loss values being shown, meaning that the total loss becomes NaN all at once, triggering the ValueError and shutting down the DLActor.

This happens because the model prediction itself (model_output) is nan for some select tiles. All of the loss functions operate on some model output, so a NAN input causes failure in all of them.

Owner Author

at least from my experience, i've not seen the sort of behavior being described here.

it's always a NaN somewhere in an isolated loss, which when added causes the total loss to NaN, then triggers a NaN back-gradient, which then causes NaNs on the subsequent forward pass.

a "sane" model should never be able to produce a NaN on a forward pass - the only way this is possible is if a weight is a NaN, which implies something must have made it invalid in a previous backward pass. i.e., the forward pass should essentially be "read-only"

i would add assertion checks on each loss individually to identify which one is triggering this cascade, and save the exact forward tiles which are causing the NaN. debugging should be fairly straightforward after that, i hope?
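Such per-loss assertion checks might look like this sketch (the dict of named sub-losses is an assumption about the loss module's structure):

```python
import torch

def checked_total_loss(loss_dict, batch_id=None):
    # Fail fast on the first non-finite sub-loss, so the offending
    # term (and the batch that produced it) is named before the NaN
    # can reach the total loss and poison the backward pass.
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise ValueError(f"{name} is NaN/Inf (batch={batch_id})")
    return sum(loss_dict.values())
```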

Collaborator

I tried monitoring individual losses and am now unable to reproduce NAN values for any individual loss function, let alone the total loss.

If this issue is rare, it may be okay to leave alone. I realize that once something catastrophic like this has happened, the system won't recover on its own. The only way to recover is to unload the model from memory (something we don't explicitly allow) and delete all available checkpoint files. This is pretty disruptive.

Owner Author

very strange - how many times did you try? i.e., was this a 50% failure rate, or did you try many times and only one of the first runs failed? ultimately - this clearly needs to be rock solid, and there shouldn't be any case in which we end up in this situation. that is to say, if it does happen it's a failure of the equations and not e.g. a software "bug" - there should be no NaNs possible overall. if we can't figure out when it happens - perhaps at minimum we should add some catching statements -- which essentially say "if this sub-loss is NaN -- set it to zero". this may give some robustness, if you see what i mean?
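The "if this sub-loss is NaN, set it to zero" guard could be as small as this sketch (a last-resort robustness measure, not a fix for the underlying numerics):

```python
import torch

def nan_safe(sub_loss):
    # Replace a non-finite sub-loss with a zero of the same shape so
    # the remaining finite losses still produce a usable gradient.
    if torch.isfinite(sub_loss).all():
        return sub_loss
    return torch.zeros_like(sub_loss)
```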


if return_obj_emb: # Object-level embeddings (low-resolution, coarse)
obj_emb = self.obj_proj(features[-1])
obj_emb = F.normalize(obj_emb, dim=1)
Collaborator

ChatGPT pointed out that a vector with zero magnitude will cause the normalization operation to produce NaN values.

However, I observed that the NaN values are being produced earlier, by the decoder.

Owner Author

we could add a check here if you'd like, and just set it to zero if obj_emb becomes undefined. however, this should be a very, very rare edge case - since if it is undefined, the system has experienced catastrophic collapse

looking at the obj_proj:

  self.obj_proj = nn.Conv2d(self.model.encoder.out_channels[-1], embedding_dim, 1)

the only way this could be all zeros is if (a) an input image is embedded to all zeros, and/or (b) the weights of this single convolution layer are all 0, in which case it will be impossible to "learn" one's way out of that slump since the weights would all be correlated and receive the same weight adjustments

do you see what i mean?
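One detail worth checking: with its default eps, F.normalize clamps the denominator, so an all-zero embedding normalizes to zeros rather than NaN - consistent with the observation that the NaNs originate earlier, in the decoder. A quick check:

```python
import torch
import torch.nn.functional as F

# F.normalize divides by max(norm, eps), so a collapsed (all-zero)
# embedding maps to zeros, not NaN -- any NaNs seen here must already
# be present in the input tensor.
obj_emb = torch.zeros(2, 16, 4, 4)
out = F.normalize(obj_emb, dim=1)
```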

Collaborator

yes i see what you mean. If the model ends up in this state it's not worth continuing to train it.

max_patches_per_image: int = 200

# Dataset
num_workers: int = -1 #AJ: was 0 but -1 will use all available cores, which is generally what we want for data loading
Collaborator

I don't think this is correct? Getting the following error on my side.

ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute.get_next() (pid=8107, ip=172.21.0.4, actor_id=16aa885b9818cb0a16d9237102000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x757438643940>)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
 raise skipped from exception_cause(skipped)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
 train_func(*args, **kwargs)
 File "/opt/QuickAnnotator/quickannotator/dl/training.py", line 107, in train_pred_loop
 dataloader = DataLoader(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 239, in __init__
 raise ValueError('num_workers option should be non-negative; '
ValueError: num_workers option should be non-negative; use num_workers=0 to disable multiprocessing.
Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2026-02-18_07-39-14' in 0.0040s.
Trials did not complete: [TorchTrainer_fa867_00000]

Owner Author

@choosehappy choosehappy Feb 18, 2026

at least on my machine, setting num_workers to -1 causes it to default to creating 1 worker per CPU core. been using this for years (should be able to even see it in the original blog post DL code) - but perhaps ray etc. doesn't support this? can be replaced with something like os.cpu_count()

Collaborator

From what I'm looking at, pytorch's dataloader does not support a num_workers value of -1:
https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

https://chatgpt.com/share/6995e55f-fdd8-8003-b9dc-9bdf2f9bce03

Am I missing something?

Owner Author

i'm now wondering if i hallucinated that......

ah!

https://github.com/choosehappy/PatchSorter/blob/77456834f76414e7dc4b4e9a49bf7f4e368672f8/patchsorter/approaches/simclr/train_dl_simclr.py#L165

it's actually a feature we added into patchsorter! wow... okay.

in practice, it's rare to want to use the exact same # of cores for the dataloaders - since you likely want at least 1 core to orchestrate and one to, like... do the DL, which is likely why i haven't updated my mental model. weird.
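A small shim translating the "-1 = all cores" convention into something the DataLoader accepts might look like this (the 2-core headroom is an assumption):

```python
import os

def resolve_num_workers(requested: int) -> int:
    # PyTorch's DataLoader rejects negative num_workers, so map the
    # PatchSorter-style -1 sentinel to "all cores minus a couple"
    # (one to orchestrate, one for the training loop itself).
    if requested < 0:
        return max((os.cpu_count() or 1) - 2, 0)
    return requested
```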

# Gamma
if config.random_gamma:
transforms.append(
A.RandomGamma(p=config.random_gamma_prob, gamma_limit=config.gamma_limit, eps=1e-7)
Collaborator

Invalid args for transforms

/opt/QuickAnnotator/quickannotator/dl/dl_config.py:258: UserWarning: Argument(s) 'var_limit' are not valid for transform GaussNoise
 A.GaussNoise(p=config.gauss_noise_prob, var_limit=config.gauss_var_limit)
/opt/QuickAnnotator/quickannotator/dl/dl_config.py:284: UserWarning: Argument(s) 'eps' are not valid for transform RandomGamma
 A.RandomGamma(p=config.random_gamma_prob, gamma_limit=config.gamma_limit, eps=1e-7)

https://albumentations.ai/docs/api-reference/albumentations/augmentations/pixel/transforms/#RandomGamma

Note that the docstring for RandomGamma documents an incorrect argument signature. Seems to be an artifact of an API change that fooled the LLM?

Collaborator

Ah, didn't see that these are just warnings - disregard.

Owner Author

these are augmentations that i pulled forward from other projects + previous versions -- if possible it'd be good to remove what causes the warnings. i'm sure our users would get confused as well

max_samples=dl_config.loss.max_samples,
pos_thresh=dl_config.loss.pos_thresh,
post_process_pseudo=dl_config.loss.post_process_pseudo,
min_size=dl_config.loss.min_size,
Collaborator

(DLActor pid=3369) ray.exceptions.RayTaskError(AttributeError): ray::_RayTrainWorker__execute.get_next() (pid=19885, ip=172.21.0.4, actor_id=b4d169a64d9a3c4aedf902c706000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x796df388b940>)
(DLActor pid=3369)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
(DLActor pid=3369)     raise skipped from exception_cause(skipped)
(DLActor pid=3369)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
(DLActor pid=3369)     train_func(*args, **kwargs)
(DLActor pid=3369)   File "/opt/QuickAnnotator/quickannotator/dl/training.py", line 149, in train_pred_loop
(DLActor pid=3369)     min_size=dl_config.loss.min_size,
(DLActor pid=3369) AttributeError: 'LossConfig' object has no attribute 'min_size'

seems to be another DL hallucination, parameter mismatch

Owner Author

it was, but i suspect you may be looking at an old version - this has already been resolved - have you pulled recently? bb5e64c#diff-a21a6238159dba58bb186094cdf14bcc4aa14517db518d6f062738c3ff75f8b4R82

Collaborator

As far as I can tell I posted this comment after the commit was made. I'm also still getting the same error with the updated code.

In the commit you sent, min_size is still being used in training.py despite being removed from dl_config.py

min_size=dl_config.loss.min_size,

Owner Author

ah i see - missed that one - just pushed the change

max_patches_per_image: int = 200

# Dataset
num_workers: int = 4
Collaborator

Using multiple workers here eventually caused an error on my end.

(DLActor pid=3393)     train_func(*args, **kwargs)
(DLActor pid=3393)   File "/opt/QuickAnnotator/quickannotator/dl/training.py", line 205, in train_pred_loop
(DLActor pid=3393)     batch_data = next(iter(dataloader))
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
(DLActor pid=3393)     data = self._next_data()
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
(DLActor pid=3393)     return self._process_data(data)
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
(DLActor pid=3393)     data.reraise()
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/_utils.py", line 704, in reraise
(DLActor pid=3393)     raise RuntimeError(msg) from None
(DLActor pid=3393) RuntimeError: Caught DatabaseError in DataLoader worker process 0.
(DLActor pid=3393) Original Traceback (most recent call last):
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
(DLActor pid=3393)     self.dialect.do_execute(
(DLActor pid=3393)   File "/home/ray/anaconda3/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 952, in do_execute
(DLActor pid=3393)     cursor.execute(statement, parameters)
(DLActor pid=3393) psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq

This seems to be due to an unsafe practice with our sqlalchemy engine - when processes are forked, the engine is copied across processes, including the existing connections.

The sqlalchemy docs provide some guidance on how to resolve this:
https://docs.sqlalchemy.org/en/20/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork

The first solution is the lightest lift, but means that connections will no longer be reused within a process. This is not very efficient.

The recommended solution is good, though it will require engine.dispose(close=False) to be called in the init function of the Dataset class and in the init functions of all ray actors. Thoughts?

Owner Author

ah that's right - i can understand why this happens, and something essentially the same happens in e.g. patchsorter even when using pytables. the dataloader init function is called once per worker, so perhaps within that init function one can create + save an engine in self.engine, and then use that in the iter? in this way each worker indeed has its own engine reference and no "sharing" is taking place. if we have to modify the init functions anyway, perhaps it's straightforward to do it more correctly so we don't lose the multi-worker performance and have the .dispose always going off? the 3rd option is to document a link to this thread next to num_workers, set num_workers = 0, and call it a problem for another day. thoughts?

Collaborator

Sure, makes sense to me, though I think .dispose(close=False) is pretty cheap as it only dereferences the previous connections rather than shutting down the engine

https://docs.sqlalchemy.org/en/21/core/connections.html#sqlalchemy.engine.Engine.dispose.params.close

Owner Author

i like cheap and easy - if you think it's easiest to implement and has minimal impact - it's okay for me

Owner Author

indeed - in reading this doc more closely - this seems a lot lighter than i had assumed it would be. if this is a 1-line fix - i'm all for it : )

for i in range(batch_size):
m = mask_np[i, 0]
# Remove small objects
m = remove_small_objects(m, max_size=max_size)
Collaborator

TypeError: remove_small_objects() got an unexpected keyword argument 'max_size'

Should this be remove_small_objects(m, min_size=max_size)?

Collaborator

Same for remove_small_holes
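Under the scikit-image 0.x API the calls would be shaped like this sketch (parameter values illustrative):

```python
import numpy as np
from skimage.morphology import remove_small_objects, remove_small_holes

def clean_mask(m, min_obj=64, max_hole=64):
    # skimage has no `max_size` keyword: small objects are dropped via
    # `min_size`, and small holes are filled via `area_threshold`.
    m = m.astype(bool)
    m = remove_small_objects(m, min_size=min_obj)
    m = remove_small_holes(m, area_threshold=max_hole)
    return m
```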

Owner Author

Collaborator

Got it. Currently 0.24 is installed in the docker container, presumably as a dependency of another package since we don't list it in our requirements.

We can't use 0.26 with the version of python that ships with the ray docker container. Will have to use the 0.24 scikit-image API

(base) root@de9cc2c11c9b:/opt/QuickAnnotator# uv pip install .
Using Python 3.10.18 environment at: /home/ray/anaconda3
  × No solution found when resolving dependencies:
  ╰─▶ Because the current Python version (3.10.18) does not satisfy Python>=3.11 and
      scikit-image==0.26.0 depends on Python>=3.11, we can conclude that scikit-image==0.26.0
      cannot be used.

Owner Author

okay - since you have the environment, can you modify those to be in line with that version?

the larger issue here: python 3.10 has EOL in 2026-10 :-\ so it seems there is some refactoring work for v2.1. i'd suggest for PS we simply start with the later version

fyi you can use johnnydep to check dependencies: https://pypi.org/project/johnnydep/

Collaborator

Yes, agreed. 3.10 happens to be the latest python version that ships with the ray-ml docker container.

I did not realize that ray stopped updating the ray-ml containers over a year ago:
ray-project/ray#46378

I plan to use the following base image for patchsorter development:
https://hub.docker.com/layers/rayproject/ray/2.54.0.1ea498-py313-gpu/images/sha256-0fed2b3ba739e4a6715407f02d3d3b6b8ac206167589d0af4d416aa4c75f4531

Owner Author

anyway - i have removed skimage entirely in my latest pull request in favor of cupy. i tagged you in a note with an explanation. this problem is no longer relevant

import segmentation_models_pytorch as smp
from skimage.morphology import disk, opening, closing, remove_small_objects, remove_small_holes

import cupy as cp
Owner Author

@jacksonjacobs1 sorry - need to add these dependencies to the container - i did so locally using:

pip install --user cucim
pip install --user cupy-cuda13x

note that the cuda version needs to be matched (12 is also available), there are precompiled wheels so not a big deal

this essentially moves all processing onto the GPU instead of pulling to cpu to use the skimage equivalent - this also addresses the other comment above.

this makes things ~2x faster

Collaborator

Seems from my side (using the cuda 12 wheel) that the added cucim logic is increasing GPU memory utilization. On server 04 it's running out of GPU memory.

[screenshot: GPU memory/utilization plot]

I'd also like to note from nvtop that the GPU is very underutilized. The above plot shows training and prediction for an image stored on NAS. Here's another plot for an image stored directly on the server:
[screenshot: GPU utilization plot, image on local disk]

I checked with a smaller delay to make sure the apparent underutilization wasn't a sampling issue.

Owner Author

my reading of these images is that the memory is not full? the memory is the yellow line in both images, is that correct? so in the top one it was at most ~75% full? these should fit very comfortably within the memory - they were developed on a similar GPU

Owner Author

we have very different performance profiles here - this is using the train.py in the improved-v2 branch

[screenshot: GPU memory/utilization plot from train.py run]

i'm seeing ~3gb of VRAM usage, and a >60% average GPU utilization - this is probably pretty close to the ideal situation. i think we'd probably need to use a bigger model to get higher utilization - it's simply crushing through the data too quickly.

what do you think is causing your witnessed performance difference? perhaps try running the train.py as a sanity check?

Collaborator

@jacksonjacobs1 jacksonjacobs1 Feb 25, 2026

Note that in my screenshots, the moment mem% goes from 75% to 0 is when I observed a CUDA memory error that caused the ray actor to crash. The program tried to allocate more than 25% of the total GPU memory, resulting in the error.

I forgot to save the stack trace, but it did seem related to the cupy logic.

Not sure what's causing the discrepancy you're seeing. I'll test the train.py script on my end.
