Add instructions for Scylla (#733)

ashao · web-flow · commit 5c2de473a718 · 2024-10-08T15:47:48.000-07:00
Scylla is in a preliminary state and so needs some specific instructions to help install SmartSim with CUDA support. The directions included here are preliminary and will be updated as needed. [ committed by @ashao ] [ reviewed by @MattToast @amandarichardsonn ]
diff --git a/doc/changelog.md b/doc/changelog.md
@@ -9,6 +9,21 @@ Jump to:
 
 ## SmartSim
 
+### Develop
+
+To be released at some point in the future
+
+Description
+
+- Add instructions for installing SmartSim on PML's Scylla
+
+Detailed Notes
+- PML's Scylla is still under development. The usual SmartSim
+  build instructions do not apply because the GPU dependencies
+  have yet to be installed at a system-wide level. Scylla has
+  its own entry in the documentation.
+  ([SmartSim-PR733](https://github.com/CrayLabs/SmartSim/pull/733))
+
 ### 0.8.0
 
 Released on 27 September, 2024
diff --git a/doc/installation_instructions/platform.rst b/doc/installation_instructions/platform.rst
@@ -20,6 +20,8 @@ that SmartSim may be used on.
 
 .. include:: platform/olcf-summit.rst
 
+.. include:: platform/pml-scylla.rst
+
 .. _site_installation:
 
 .. include:: site-install.rst
diff --git a/doc/installation_instructions/platform/pml-scylla.rst b/doc/installation_instructions/platform/pml-scylla.rst
@@ -0,0 +1,84 @@
+PML Scylla
+==========
+
+.. warning::
+    As of September 2024, the software stack on Scylla is still being finalized,
+    so please treat these instructions as preliminary.
+
+One-time Setup
+--------------
+
+To install SmartSim on Scylla, follow these steps:
+
+**Step 1:** Create and activate a Python virtual environment for SmartSim:
+
+.. code:: bash
+
+    module use /scyllapfs/hpe/ashao/smartsim_dependencies/modulefiles
+    module load cudatoolkit cudnn git
+    python -m venv /scyllafps/scratch/$USER/venvs/smartsim
+    source /scyllafps/scratch/$USER/venvs/smartsim/bin/activate
+
+**Step 2:** Build the SmartRedis C++ and Fortran libraries:
+
+.. code:: bash
+
+    git clone https://github.com/CrayLabs/SmartRedis.git
+    cd SmartRedis
+    make lib-with-fortran
+    pip install .
+    cd ..
+
+**Step 3:** Install SmartSim in the virtual environment:
+
+.. code:: bash
+
+    pip install git+https://github.com/CrayLabs/SmartSim.git
+
+**Step 4:** Build Redis, RedisAI, the backends, and all the Python packages:
+
+.. code:: bash
+
+    export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0" # Workaround for a PyTorch problem
+    smart build --device=cuda-12
+    module unload cudnn # Workaround for a PyTorch problem
+
+
+.. note::
+    The first workaround is needed because the autodetected list of CUDA
+    architectures is internally inconsistent with one of PyTorch's
+    dependencies. This appears to be unique to Scylla; we have not seen
+    it on other platforms.
+
+    The second workaround is needed because PyTorch 2.3 (and possibly 2.2)
+    attempts to load the version of cuDNN found in ``LD_LIBRARY_PATH``
+    instead of the version shipped with PyTorch itself, which results in
+    unresolved symbols.
+
+**Step 5:** Check that SmartSim has been installed and built correctly:
+
+.. code:: bash
+
+    srun -n 1 -p gpu --gpus=1 --pty smart validate --device gpu
+
+The following output indicates a successful install:
+
+.. code:: text
+
+    [SmartSim] INFO Verifying Tensor Transfer
+    [SmartSim] INFO Verifying Torch Backend
+    [SmartSim] INFO Verifying ONNX Backend
+    [SmartSim] INFO Verifying TensorFlow Backend
+    16:26:35 login SmartSim[557020:MainThread] INFO Success!
+
+Post-installation
+-----------------
+
+After completing the above steps to install SmartSim in a virtual environment, you
+can reactivate the environment by running the following commands:
+
+.. code:: bash
+
+    module load cudatoolkit/12.4.1 git # cudnn should NOT be loaded
+    source /scyllafps/scratch/$USER/venvs/smartsim/bin/activate
+