Fix Transformers + Cuda Context bug in NeMo Curator #591

Merged

VibhuJawa
Collaborator

Description

This PR fixes a bug we saw in the release container that causes a CUDA context to be created on GPU 0. So far, I have not been able to reproduce it outside this container.

docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:25.02

NeMo Curator Repro:

import time
import dask
import dask_cudf
from nemo_curator import get_client


def main():
    client = get_client(cluster_type="gpu")
    print(f"Client obtained: {client}")

    # Load a sample dataset from dask.datasets
    ddf = dask.datasets.timeseries()

    # Convert Dask DataFrame to Dask-cuDF DataFrame
    cudf_ddf = dask_cudf.from_dask_dataframe(ddf)
    print(cudf_ddf.map_partitions(len).compute())
    time.sleep(100)

if __name__ == "__main__":
    main()

General Repro:

import time
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask
import dask_cudf
from peft import PeftModel  # not used below; the import itself is part of the repro

def get_client():
    cluster = LocalCUDACluster()
    return Client(cluster)

def main():
    client = get_client()
    print(f"Client obtained: {client}")

    # Load a sample dataset from dask.datasets
    ddf = dask.datasets.timeseries()

    # Convert Dask DataFrame to Dask-cuDF DataFrame
    cudf_ddf = dask_cudf.from_dask_dataframe(ddf)
    print(cudf_ddf.map_partitions(len).compute())
    time.sleep(100)

if __name__ == "__main__":
    main()
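
Neither repro prints which GPU ends up holding the stray context; the `time.sleep(100)` presumably leaves time to inspect the devices by hand. As a rough sketch of how that check could be scripted, the helpers below list the PIDs with a compute context on each GPU through NVML and flag any PID that appears on more than one device. The function names are hypothetical and not part of this PR, and the sketch assumes `pynvml` is installed. Inside a container, NVML may not report per-process information, depending on how PID namespaces are configured.

```python
from collections import defaultdict


def compute_pids_by_gpu():
    """Map GPU index -> sorted PIDs that hold a compute context (queried via NVML)."""
    import pynvml  # imported lazily so the pure helper below works without NVML

    pynvml.nvmlInit()
    try:
        result = {}
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            result[index] = sorted(p.pid for p in procs)
        return result
    finally:
        pynvml.nvmlShutdown()


def pids_on_multiple_gpus(pids_by_gpu):
    """Return {pid: [gpu indices]} for PIDs that hold a context on more than one GPU."""
    gpus_by_pid = defaultdict(list)
    for gpu, pids in sorted(pids_by_gpu.items()):
        for pid in pids:
            gpus_by_pid[pid].append(gpu)
    return {pid: gpus for pid, gpus in gpus_by_pid.items() if len(gpus) > 1}


if __name__ == "__main__":
    # Run this from a second shell while the repro script is sleeping.
    snapshot = compute_pids_by_gpu()
    print("PIDs per GPU:", snapshot)
    print("PIDs on more than one GPU:", pids_on_multiple_gpus(snapshot))
```

With one worker per GPU, each worker PID would normally be expected on a single device, so a PID listed on GPU 0 as well as its own device is a candidate for the behaviour described above.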

VibhuJawa Vibhu Jawa
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa added the bugfix (Fixes a bug in the codebase) and r0.7.0 labels Mar 14, 2025
@VibhuJawa requested a review from ryantwolf March 14, 2025 20:14
@ayushdg added the gpuci (Run GPU CI/CD on PR) label Mar 14, 2025
Collaborator

@ayushdg left a comment

Thanks for catching this. Let's also follow up with some internal CI tests to catch these cases.
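
One possible shape for such a test, offered only as a sketch: import the module under test in a fresh subprocess and check whether PyTorch reports CUDA as initialized afterwards. This assumes the stray context comes from an import-time side effect that `torch.cuda.is_initialized()` would report; a context created by another library would not be caught. The helper name and exit code are hypothetical and not part of this PR.

```python
import subprocess
import sys

# Exit code the probe uses to signal "CUDA was initialized"; chosen arbitrarily
# so that it differs from the code 1 produced by an uncaught exception.
CUDA_INITIALIZED_EXIT_CODE = 3

_PROBE = """
import importlib
import sys

importlib.import_module(sys.argv[1])
torch = sys.modules.get("torch")
# If torch was never imported, it cannot have initialized CUDA.
if torch is not None and torch.cuda.is_initialized():
    sys.exit(3)
"""


def import_initializes_cuda(module_name):
    """Return True if importing `module_name` in a fresh interpreter initializes CUDA in PyTorch."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, module_name],
        capture_output=True,
        text=True,
    )
    if result.returncode == CUDA_INITIALIZED_EXIT_CODE:
        return True
    if result.returncode != 0:
        raise RuntimeError(
            f"probe failed for {module_name!r}: {result.stderr.strip()}"
        )
    return False
```

A CI test could then assert `not import_initializes_cuda("nemo_curator")`, run inside the same container image in which the bug was observed.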

@VibhuJawa merged commit d6fcbdb into NVIDIA:main Mar 14, 2025
10 checks passed
ryantwolf pushed a commit that referenced this pull request Mar 14, 2025
Signed-off-by: Vibhu Jawa <[email protected]>
VibhuJawa added a commit that referenced this pull request Mar 15, 2025

Signed-off-by: Vibhu Jawa <[email protected]>
Co-authored-by: Vibhu Jawa <[email protected]>