
Commit a2c886a

Authored by pt247 (Prashant Tiwari), with kcpevey and ericdatakelly
How to use GPUs (#471)
Squashed commit history:

- Initial draft for gpu.
- Minor cleanup.
- Update after review.
- Update output of cuda check.
- Update output from aws server.
- Replace console output with text.
- Remove todo.
- Clarify CONDA_OVERRIDE_CUDA version selection.
- Rename "also see" to "related links".
- Further clarification around version selection for CONDA_OVERRIDE_CUDA.
- Readme refactor.
- Apply suggestions from code review, applying review comments so I can view the result and provide additional feedback.
- Rewrite of "Build a GPU-compatible environment" section.
- Update docs/docs/how-tos/use-gpus.mdx: validation text refactor. (Co-authored-by: Eric Kelly <[email protected]>)
- "Build a GPU-compatible environment in conda-store using CONDA_OVERRIDE_CUDA" refactor. (Co-authored-by: Eric Kelly <[email protected]>)
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Swap approaches.
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Refactor heading.
- Minor changes to GPU page.
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Eric Kelly <[email protected]>)
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Eric Kelly <[email protected]>)
- Remove redundant pytorch-cuda-override.png.
- Minor changes to GPU page.
- Update docs/docs/how-tos/use-gpus.mdx. (Co-authored-by: Kim Pevey <[email protected]>)
- Add a note about order of channels.
- Remove duplicate comment about max version.
- Remove pytorch-best-practices page as it's no longer needed.
- Minor docs cleanup.
- Minor docs cleanup.
- Fix typo.
- Minor docs cleanup.
- Replace references for PyTorch best practices in docs.
- Format docs.
- Format docs.
- Update server options image to latest.
- Format page.
- Apply suggestions from code review.
- Fix note syntax.
- One more change to note syntax/formatting.
- Apply suggestions from code review. (Co-authored-by: Eric Kelly <[email protected]>)
- Remove unused conda-store-yaml-toggle.png.
- Remove pytorch from channels.
- Add pytorch again.
- Remove nvidia from channels for first example.

Co-authored-by: Prashant Tiwari <[email protected]>
Co-authored-by: Kim Pevey <[email protected]>
Co-authored-by: Eric Kelly <[email protected]>
1 parent 11ee316 commit a2c886a

8 files changed: +121 −46 lines

docs/docs/faq.md (+3 −3)

```diff
@@ -130,7 +130,7 @@ Digital Ocean doesn't support these type of instances.
 
 ## Why doesn't my code recognize the GPU(s) on Nebari?
 
-First be sure you chose a [GPU-enabled server when you selected a profile][selecting a profile]. Next, if you're using PyTorch, see [PyTorch best practices][pytorch best practices]. If it's still not working for you, be sure your environment includes a GPU-specific version of either PyTorch or TensorFlow, i.e. `pytorch-gpu` or `tensorflow-gpu`. Also note that `tensorflow>=2` includes both CPU and GPU capabilities, but if the GPU is still not recognized by the library, try removing `tensorflow` from your environment and adding `tensorflow-gpu` instead.
+First be sure you chose a [GPU-enabled server when you selected a profile][selecting a profile]. Next, if you're using PyTorch, see [Using GPUs on Nebari][using gpus]. If it's still not working for you, be sure your environment includes a GPU-specific version of either PyTorch or TensorFlow, i.e. `pytorch-gpu` or `tensorflow-gpu`. Also note that `tensorflow>=2` includes both CPU and GPU capabilities, but if the GPU is still not recognized by the library, try removing `tensorflow` from your environment and adding `tensorflow-gpu` instead.
 
 ## How do I migrate from Qhub to Nebari?
```

```diff
@@ -159,7 +159,7 @@ If you have potential solutions or can help us move forward with updates to the
 
 ## Why does my VS Code server continue to run even after I've been idle for a long time?
 
-Nebari automatically shuts down servers when users are idle, as described in Nebari's documentation for the [idle culler settings][idle-culler-settings]. This functionality currently applies only to JupyterLab servers. A VS Code instance, however, runs on Code Server, which isn't managed by the idle culler. VS Code, and other non-JupyterLab services, will not be automatically shut down.
+Nebari automatically shuts down servers when users are idle, as described in Nebari's documentation for the [idle culler settings][idle-culler-settings]. This functionality currently applies only to JupyterLab servers. A VS Code instance, however, runs on Code Server, which isn't managed by the idle culler. VS Code, and other non-JupyterLab services, will not be automatically shut down.
 :::note
 Until this issue is addressed, we recommend manually shutting down your VS Code server when it is not in use.
 :::
```

```diff
@@ -168,4 +168,4 @@ Until this issue is addressed, we recommend manually shutting down your VS Code
 [dask-tutorial]: tutorials/using_dask.md
 [idle-culler-settings]: https://www.nebari.dev/docs/how-tos/idle-culling
 [selecting a profile]: tutorials/login-keycloak#4-select-a-profile
-[pytorch best practices]: how-tos/pytorch-best-practices
+[using gpus]: how-tos/use-gpus
```

docs/docs/how-tos/pytorch-best-practices.md (−35 lines)

This file was deleted.

docs/docs/how-tos/use-gpus.mdx (+110 lines, new file)
---
id: use-gpus
title: Use GPUs on Nebari
description: Overview of using GPUs on Nebari including server setup, environment setup, and validation.
---

# Using GPUs on Nebari

## Introduction

This guide covers using GPUs on Nebari: starting a GPU server, building a GPU-compatible environment, and validating the setup.
## 1. Starting a GPU server

Follow Steps 1 to 3 in the [Authenticate and launch JupyterLab][login-with-keycloak] tutorial. The UI shows a list of profiles (also known as instances, servers, or machines).

![Nebari select profile](/img/how-tos/nebari_select_profile.png)

Your administrator pre-configures these options, as described in the [profile configuration documentation][profile-configuration].

Select an appropriate GPU instance and click "Start".
### Understanding the GPU setup on the server

The following steps describe how to get CUDA-related information from the server.

1. Once your server starts, it redirects you to a JupyterLab home page.
2. Click on the **"Terminal"** icon.
3. Run the command `nvidia-smi`. The top right corner of the command's output shows the highest CUDA version the driver supports.

![nvidia-smi-output](/img/how-tos/nvidia-smi-output.png)

If you get the error `nvidia-smi: command not found`, you are most likely on a non-GPU server. Shut down your server and start a GPU-enabled server instead.

**Compatible environments for this server must use CUDA versions at or below the server's CUDA version. For example, the server in this case is on CUDA 12.4, so all environments used on this server must contain packages built with CUDA <= 12.4.**
## 2. Creating environments

### Build a GPU-compatible environment

By default, `conda-store` builds CPU-compatible packages. To build GPU-compatible packages, you have two options:

1. **Create the environment specification using `CONDA_OVERRIDE_CUDA` (recommended approach)**:

   Conda-store provides a mechanism to enable GPU environments by setting an environment variable, as explained in the [conda-store docs](https://conda.store/conda-store-ui/tutorials/create-envs#set-environment-variables).
   While creating a new config, click on the **GUI <-> YAML** toggle to edit the YAML config:

   ```yaml
   channels:
     - pytorch
     - conda-forge
   dependencies:
     - pytorch
     - ipykernel
   variables:
     CONDA_OVERRIDE_CUDA: "12.1"
   ```

   Alternatively, you can create the same config using the UI: add the `CONDA_OVERRIDE_CUDA` entry to the variables section to tell conda-store to build a GPU-compatible environment.

   :::note
   At the time of writing, the latest CUDA version was `12.1`. Follow the steps in the second option below to determine the current value for the `CONDA_OVERRIDE_CUDA` environment variable, and ensure that the version you choose from the PyTorch documentation is not greater than the highest supported version in the `nvidia-smi` output (captured above).
   :::
2. **Create the environment specification based on recommendations from the PyTorch documentation**:

   Check the [PyTorch documentation](https://pytorch.org/get-started/locally/) for a quick list of the necessary CUDA-specific packages.
   Select the following options to get the latest CUDA version:

   - PyTorch Build = Stable
   - Your OS = Linux
   - Package = Conda
   - Language = Python
   - Compute Platform = 12.1 (select the version that is less than or equal to the `nvidia-smi` output on your server, see above)

   ![pytorch-linux-conda-version](/img/how-tos/pytorch-linux-conda-version.png)

   The resulting `conda install` command is:

   ```shell
   conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
   ```

   The corresponding YAML config would be:

   ```yaml
   channels:
     - pytorch
     - nvidia
     - conda-forge
   dependencies:
     - pytorch
     - pytorch-cuda==12.1
     - torchvision
     - torchaudio
     - ipykernel
   variables: {}
   ```

   :::note
   The order of the channels is respected by conda, so keep `pytorch` at the top, then `nvidia`, then `conda-forge`.
   :::

   You can use the **GUI <-> YAML** toggle to edit the config.
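The channel-ordering note above can be illustrated with a short check. This is a hypothetical helper, not part of conda; conda itself applies this top-to-bottom priority when solving:

```python
# Hypothetical helper illustrating the note above: conda respects channel
# order top to bottom, so pytorch should come before nvidia and conda-forge.
EXPECTED_PRIORITY = ["pytorch", "nvidia", "conda-forge"]

def channels_ordered(channels):
    """True if the known channels appear in the expected priority order."""
    ranks = [EXPECTED_PRIORITY.index(c) for c in channels if c in EXPECTED_PRIORITY]
    return ranks == sorted(ranks)

print(channels_ordered(["pytorch", "nvidia", "conda-forge"]))  # True
print(channels_ordered(["conda-forge", "pytorch", "nvidia"]))  # False
```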
## 3. Validating the setup

You can check that your GPU server is compatible with your conda environment by opening a Jupyter Notebook, loading the environment, and running the following code:

```python
import torch

print(f"GPU available: {torch.cuda.is_available()}")
print(f"Number of GPUs available: {torch.cuda.device_count()}")
print(f"ID of current GPU: {torch.cuda.current_device()}")
print(f"Name of first GPU: {torch.cuda.get_device_name(0)}")
```
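Beyond the check above, a common PyTorch pattern (not specific to Nebari, shown here as an assumption-labeled sketch) is to fall back to the CPU when no GPU is visible, so the same notebook runs on both server types:

```python
# Common fallback pattern (not from the Nebari docs): pick a device
# string based on what is actually available at runtime.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed in this environment
    device = "cpu"

print(f"Using device: {device}")
```

Tensors and models can then be moved to that device with `.to(device)`.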
Your output should look something like this:

![jupyter-notebook-command-output](/img/how-tos/pytorch-cuda-check.png)

<!-- Internal links -->

[profile-configuration]: /docs/explanations/profile-configuration
[login-with-keycloak]: /docs/tutorials/login-keycloak

docs/sidebars.js (+8 −8)

```diff
@@ -70,14 +70,14 @@ module.exports = {
         "how-tos/manual-backup",
         "how-tos/nebari-upgrade",
         "how-tos/kubernetes-version-upgrade",
-        "how-tos/pytorch-best-practices",
         "how-tos/setup-argo",
         "how-tos/using-argo",
         "how-tos/jhub-app-launcher",
         "how-tos/idle-culling",
         "how-tos/nebari-extension-system",
         "how-tos/telemetry",
         "how-tos/monitoring",
+        "how-tos/use-gpus",
       ],
     },
     {
```

```diff
@@ -100,15 +100,14 @@ module.exports = {
       type: "category",
       label: "Reference",
       link: { type: "doc", id: "references/index" },
-      items: [
-        "references/RELEASE",
-      ],
+      items: ["references/RELEASE"],
     },
     {
       type: "category",
       label: "Community",
       link: {
-        type: "doc", id: "community/index"
+        type: "doc",
+        id: "community/index",
       },
       items: [
         "community/file-issues",
```

```diff
@@ -121,13 +120,14 @@ module.exports = {
     {
       type: "category",
       label: "Maintainers",
-      items: ["community/maintainers/github-conventions",
+      items: [
+        "community/maintainers/github-conventions",
         "community/maintainers/triage-guidelines",
         "community/maintainers/reviewer-guidelines",
         "community/maintainers/saved-replies",
         "community/maintainers/release-process-branching-strategy",
-      ]
-    }
+      ],
+    },
     ],
   },
   {
```