diff --git a/triton/ref/slurm.rst b/triton/ref/slurm.rst
index 5ce8451d8..d8004b3cf 100644
--- a/triton/ref/slurm.rst
+++ b/triton/ref/slurm.rst
@@ -34,8 +34,7 @@
 ! ``-e ERRORFILE`` ! print errors into file *error*
 ! ``--exclusive`` ! allocate exclusive access to nodes. For large parallel jobs.
 ! ``--constraint=FEATURE`` ! request *feature* (see ``slurm features`` for the current list of configured features, or Arch under the :ref:`hardware list `). Multiple with ``--constraint="hsw|skl"``.
-! ``--constraint=localdisk`` ! request nodes that have local disks
-! ``--tmp=nnnG`` ! Request ``nnn`` GB of :doc:`local disk storage space `
+! ``--tmp=nnnG`` ! request a node with :doc:`local disk storage space ` of at least ``nnn`` GB.
 ! ``--array=0-5,7,10-15`` ! Run job multiple times, use variable ``$SLURM_ARRAY_TASK_ID`` to adjust parameters.
 ! ``--mail-type=TYPE`` ! notify of events: ``BEGIN``, ``END``, ``FAIL``, ``ALL``, ``REQUEUE`` (not on triton) or ``ALL.`` MUST BE used with ``--mail-user=`` only
 ! ``--mail-user=first.last@aalto.fi`` ! Aalto email to send the notification about the job. External email addresses doesn't work.
diff --git a/triton/ref/storage.rst b/triton/ref/storage.rst
index aab8aa27f..5e31bcfbd 100644
--- a/triton/ref/storage.rst
+++ b/triton/ref/storage.rst
@@ -6,5 +6,5 @@
 Home | ``$HOME`` or ``/home/USERNAME/`` | hard quota 10GB | Nightly | all nodes | Small user specific files, no calculation data.
 Work | ``$WRKDIR`` or ``/scratch/work/USERNAME/`` | 200GB and 1 million files | x | all nodes | Personal working space for every user. Calculation data etc. Quota can be increased on request.
 Scratch | ``/scratch/DEPT/PROJECT/`` | on request | x | all nodes | Department/group specific project directories.
-:doc:`Local temp (disk) ` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG`` or ``--constraint=localdisk``.
+:doc:`Local temp (disk) ` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG``.
 :doc:`Local temp (ramfs) ` | ``/dev/shm/`` (and ``/tmp/`` on diskless nodes) | limited by memory | x | single-node | Very fast but small in-memory filesystem
diff --git a/triton/tut/storage.rst b/triton/tut/storage.rst
index 06cb4b3f6..0fda27243 100644
--- a/triton/tut/storage.rst
+++ b/triton/tut/storage.rst
@@ -32,8 +32,8 @@ choose between them. The
 (recommended for most work)
 * ``/tmp``: temporary local disk space, pre-user mounted in jobs and
-  automatically cleaned up. Only on nodes with disks
-  (``--constraint=localdisk``), otherwise it's ramfs
+  automatically cleaned up. Use ``--tmp=nnnG`` to request at
+  least ``nnn`` GB of space; otherwise it's ramfs
 * ``/dev/shm``: ramfs, in-memory file storage
 * See :doc:`remotedata` for how to transfer and access the data
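+
+For example, a minimal sketch of requesting local disk space for an
+interactive session (the 100 GB is a placeholder; pick a size that
+fits your data):
+
+.. code-block:: console
+
+   $ sinteractive --time=1:00:00 --tmp=100G
+   (node)$ df -h /tmp    # check how much space is actually available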
diff --git a/triton/usage/localstorage.rst b/triton/usage/localstorage.rst
index 4d5edab57..8384cca95 100644
--- a/triton/usage/localstorage.rst
+++ b/triton/usage/localstorage.rst
@@ -2,71 +2,132 @@
 Storage: local drives
 =====================
 
-.. seealso::
+.. admonition:: Abstract
+
+   - Path is ``/tmp/``
+   - Local drives are useful for large temporary data or for unpacking
+     many small files before analysis. They are most important for
+     GPU training data but are useful at other times, too.
+   - Local storage can be either SSD drives (big and reasonably fast),
+     spinning hard disks (HDDs; older nodes), or ramdisk (using your
+     job's memory; extremely fast).
+   - Request local storage with ``--tmp=NNg`` (the space you think you
+     need; but the space isn't reserved just for you).
+   - For ramdisk, the space comes out of your ``--mem=`` allocation.
 
-   :doc:`the storage tutorial <../tut/storage>`.
+.. seealso::
 
-Local disks on computing nodes are the preferred place for doing your
-IO. The general idea is use network storage as a backend and local disk
-for actual data processing. **Some nodes have no disks** (local
-storage comes out of the job memory, **some older nodes have HDDs**
-(spinning disks), and some **SSDs**.
+   :doc:`The storage tutorial <../tut/storage>`.
+
+Local disks on computing nodes are the preferred place for doing
+extensive input/output (IO; reading/writing files). The general idea
+is to use network storage as a backend and local disk for the actual
+data processing when it requires many reads or writes. **Different
+nodes have different types of disks; Triton is very heterogeneous**:
+
+.. list-table::
+   :header-rows: 1
+
+   - - Type
+     - Description
+     - Requesting
+     - Path
+   - - Solid-state drives (SSDs)
+     - Much faster than HDDs but much slower than ramdisk. Generally,
+       GPU nodes have SSDs these days.
+     - ``--tmp=NNg``. The space is not guaranteed just for you.
+     - ``/tmp/``
+   - - Spinning hard disks (HDDs)
+     - Generally only older CPU nodes have HDDs.
+     - ``--tmp=NNg`` to specify the size you need. The space is not
+       guaranteed just for you.
+     - ``/tmp/``
+   - - Ramdisk
+     - Uses your job's memory allocation. Limited space but lightning
+       fast.
+     - ``--mem=NNg`` to request enough memory for your job and your
+       storage.
+     - ``/tmp/`` on diskless nodes and ``/dev/shm/`` on every node.
+
+See :doc:`../overview` for details on each node's local storage.
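+
+For example, a sketch of requesting each kind of local storage
+(``job.sh`` is a placeholder batch script; pick sizes that fit your
+data):
+
+.. code-block:: console
+
+   $ sbatch --tmp=100G job.sh   # at least 100 GB of local disk in /tmp/
+   $ sbatch --mem=16G job.sh    # ramdisk only: space comes out of the job's memory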
+
+The reason that local storage matters is that :doc:`lustre` (scratch)
+is not good for many :doc:`smallfiles`. Read those articles for
+background.
+
+
+Background
+----------
 
 A general use pattern:
 
-- In the beginning of the job, copy needed input from WRKDIR to ``/tmp``.
+- In the beginning of the job, copy needed input from Scratch to ``/tmp``.
 - Run your calculation normally reading input from or writing output to
   to ``/tmp``.
-- In the end copy relevant output to WRKDIR for analysis and further
+- In the end copy relevant output to Scratch for analysis and further
   usage.
 
-Pros
+Pros:
 
-- You get better and steadier IO performance. WRKDIR is shared over all
-  users making per-user performance actually rather poor.
-- You save performance for WRKDIR to those who cannot use local disks.
+- You get better and steadier IO performance. Scratch is shared over
+  all users, so per-user performance can be poor at times, especially
+  for many small files.
+- You leave Scratch's performance for those who cannot use local disks.
 - You get much better performance when using many small files (Lustre
   works poorly here) or random access.
-- Saves your quota if your code generate lots of data but finally you
-  need only part of it
+- Saves your quota if your code generates lots of data but you only
+  need to save part of it.
 - In general, it is an excellent choice for single-node runs (that is
   all job's task run on the same node).
 
-Cons
+Cons:
 
 - NOT for the long-term data. Cleaned every time your job is finished.
 - Space is more limited (but still can be TBs on some nodes)
 - Need some awareness of what is on each node, since they are different
 - Small learning curve (must copy files before and after the job).
 - Not feasible for cross-node IO (MPI jobs where different tasks
-  write to the same files). Use WRKDIR instead.
+  write to the same files). Use Scratch instead.
+
+Usage
+-----
 
-How to use local drives on compute nodes
-----------------------------------------
+``/tmp`` is the temporary directory. It is ramdisk on diskless nodes.
 
-``/tmp`` is the temporary directory. It is per-user (not per-job), if
-you get two jobs running on the same node, you get the same ``/tmp``.
-It is automatically removed once the last job on a node finishes.
+It is per-user (not per-job): if you get two jobs running on the same
+node, they share the same ``/tmp``. Thus, it is wise to ``mkdir
+/tmp/$SLURM_JOB_ID/``, use that directory, and delete it once the
+job is done.
+
+Everything is automatically removed once your last job on the node
+finishes.
 
 
 Nodes with local disks
 ~~~~~~~~~~~~~~~~~~~~~~
 
-You can see the nodes with local disks on :doc:`../overview`. (To
-double check from within the cluster, you can verify node info with
-``sinfo show node NODENAME`` and see the ``localdisk`` tag in
-``slurm features``). Disk sizes greatly vary from hundreds of GB to
-tens of TB.
+You can see the nodes with local disks on :doc:`../overview`. Disk
+sizes vary greatly, from hundreds of GB (older nodes, from when
+everything had spinning disks) to tens of TB (new GPU nodes designed
+for ML training).
+
+.. admonition:: Verifying node details directly through Slurm
+
+   You don't usually need to do this. You can verify node info with
+   ``scontrol show node NODENAME`` and look for ``TmpDisk=`` or
+   ``AvailableFeatures=localdisk``. ``slurm features`` will list all
+   nodes (look for ``localdisk`` in the features).
+
+You can use ``--tmp=nnnG`` (for example ``--tmp=100G``). You can use
+``--constraint=localdisk`` to require a physical disk of either type,
+but you may as well just specify how much space you need.
 
-You have to use ``--constraint=localdisk`` to ensure that you get a
-hard disk. You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to
-request a node with at least that much temporary space.
 But, ``--tmp`` doesn't allocate this space just for you: it's shared
 among all users, including those which didn't request storage space. So,
-you *might* not have as much as you think. Beware and handle out of
-memory gracefully.
+you *might* not have as much as you think. Beware and handle "out of
+space" errors gracefully.
 
 
 Nodes without local disks
@@ -75,7 +136,7 @@
 You can still use ``/tmp``, but it is an in-memory ramdisk. This
 means it is *very* fast, but is using the actual main memory that is
 used by the programs. It comes out of your job's memory allocation,
-so use a ``--mem`` amount with enough space for your job and any
+so use a ``--mem=nnG`` amount with enough space for your job and any
 temporary storage.
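+
+For example, a minimal sketch of using ramdisk interactively (the
+``--mem=8G`` is a placeholder that must cover both the program and
+its files; ``input.dat`` is hypothetical):
+
+.. code-block:: console
+
+   $ sinteractive --time=1:00:00 --mem=8G
+   (node)$ mkdir /dev/shm/$SLURM_JOB_ID     # ramdisk, works on every node
+   (node)$ cp $WRKDIR/input.dat /dev/shm/$SLURM_JOB_ID/
+   ... do what you wanted ...
+   (node)$ rm -rf /dev/shm/$SLURM_JOB_ID    # free the memory when done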
@@ -86,51 +147,70 @@ Examples
 
 Interactively
 ~~~~~~~~~~~~~
 
-How to use /tmp when you login interactively
+How to use ``/tmp`` when you log in interactively, for example as
+space to decompress a big file.
 
 .. code-block:: console
 
-   $ sinteractive --time=1:00:00                  # request a node for one hour
-   (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use
+   $ sinteractive --time=1:00:00 --tmp=500G       # request a node for one hour
+   (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use the job ID
    (node)$ cd /tmp/$SLURM_JOB_ID
    ... do what you wanted ...
-   (node)$ cp your_files $WRKDIR/my/valuable/data # copy what you need
-   (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
+   (node)$ cp YOUR_FILES $WRKDIR/my/valuable/data # copy what you need
+   (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
   (node)$ exit
 
-In batch script
-~~~~~~~~~~~~~~~
 
-This batch job example that prevents data loss in case program gets
-terminated (either because of ``scancel`` or due to time limit).
+
+In batch script - save data if job ends prematurely
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This batch job example has a trigger (``trap``) that prevents data
+loss in case the program gets terminated early (either because of
+``scancel``, the time limit, or some other error). On an abnormal
+exit, it copies the data to a separate location
+(``$WRKDIR/$SLURM_JOB_ID``), unlike a normal exit.
 
 .. code-block:: slurm
+   :emphasize-lines: 15-17,26-27
 
-    #!/bin/bash
+    #!/bin/bash
+    #SBATCH --time=12:00:00
+    #SBATCH --mem-per-cpu=2500M   # time and memory requirements
+    #SBATCH --output=test-local.out
+    #SBATCH --tmp=50G
 
-    #SBATCH --time=12:00:00
-    #SBATCH --mem-per-cpu=2500M   # time and memory requirements
+    # The below, if uncommented, will cause the script to abort (and
+    # the trap to run) if there are any unhandled errors.
+    #set -euo pipefail
 
-    mkdir /tmp/$SLURM_JOB_ID      # get a directory where you will send all output from your program
-    cd /tmp/$SLURM_JOB_ID
+    # get a directory where you will send all output from your program
+    mkdir /tmp/$SLURM_JOB_ID
+    cd /tmp/$SLURM_JOB_ID
+
+    ## set the trap: when killed or exiting abnormally you get the
+    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
+    trap "rsync -a /tmp/$SLURM_JOB_ID/ $WRKDIR/$SLURM_JOB_ID/ ; exit" TERM EXIT
 
-    ## set the trap: when killed or exits abnormally you get the
-    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
-    trap "mkdir $WRKDIR/$SLURM_JOB_ID; mv -f /tmp/$SLURM_JOB_ID $WRKDIR/$SLURM_JOB_ID; exit" TERM EXIT
+    ## run the program and redirect all IO to a local drive
+    ## assuming that you have your program and input at $WRKDIR
+    srun $WRKDIR/my_program $WRKDIR/input > output
 
-    ## run the program and redirect all IO to a local drive
-    ## assuming that you have your program and input at $WRKDIR
-    srun $WRKDIR/my_program $WRKDIR/input > output
+    # move your output fully or partially
+    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR
 
-    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR  # move your output fully or partially
+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT
 
 
 Batch script for thousands input/output files
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-If your job requires a large amount of files as input/output using tar
-utility can greatly reduce the load on the ``$WRKDIR``-filesystem.
+If your job requires a large number of files as input/output, you can
+store the files in a single archive format (``.tar``, ``.zip``, etc.)
+and unpack them to local storage when needed. This can greatly reduce
+the load on the scratch filesystem.
 
 Using methods like this is recommended if you're working with
 thousands of files.
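+
+For example, a sketch of packing a directory of many small files once,
+on a login node, before submitting any jobs (``input/`` is a
+placeholder):
+
+.. code-block:: console
+
+   $ cd $WRKDIR
+   $ tar cf input.tar input/    # thousands of files become one file
+   $ tar tf input.tar | head    # quick sanity check of the contents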
@@ -139,10 +219,10 @@ Working with tar balls is done in a following fashion:
 
 #. Determine if your input data can be collected into analysis-sized
    chunks that can be (if possible) re-used
-#. Make a tar ball out of the input data (``tar cf .tar
-   ``)
+#. Make a tar ball out of the input data (``tar cf ARCHIVE_FILENAME.tar
+   INPUT_FILES ...``)
 #. At the beginning of job copy the tar ball into ``/tmp`` and untar it
-   there (``tar xf .tar``)
+   there (``tar xf ARCHIVE_FILENAME.tar``)
 #. Do the analysis here, in the local disk
 #. If output is a large amount of files, tar them and copy them out.
    Otherwise write output to ``$WRKDIR``
@@ -150,19 +230,32 @@ A sample code is below:
 
 .. code-block:: slurm
+   :emphasize-lines: 10-11,19-24
 
     #!/bin/bash
-
     #SBATCH --time=12:00:00
     #SBATCH --mem-per-cpu=2000M              # time and memory requirements
-    mkdir /tmp/$SLURM_JOB_ID                 # get a directory where you will put your data
-    cp $WRKDIR/input.tar /tmp/$SLURM_JOB_ID  # copy tarred input files
+    #SBATCH --tmp=50G
+
+    # get a directory where you will put your data and change to it
+    mkdir /tmp/$SLURM_JOB_ID
    cd /tmp/$SLURM_JOB_ID
 
-    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT  # set the trap: when killed or exits abnormally you clean up your stuff
+    # set the trap: when killed or exits abnormally you clean up your stuff
+    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT
+
+    # untar the files. If we only unpack once, there is no point in
+    # making an initial copy to local disks.
+    tar xf $WRKDIR/input.tar
+
+    srun MY_PROGRAM input/*       # do the analysis, or whatever else, on the input files
 
-    tar xf input.tar              # untar the files
-    srun input/*                  # do the analysis, or what ever else
-    tar cf output.tar output/*    # tar output
+    # If you generate many output files, tar them before copying them
+    # back.
+    # If it's just a few files of output, you can copy back directly
+    # (or even output them straight to scratch)
+    tar cf output.tar output/     # tar output (if needed)
     mv output.tar $WRKDIR/SOMEDIR # copy results back
+
+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT
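+
+Since the space is shared with other jobs on the node, the free space
+can be less than the totals suggest. A quick check from inside a job
+(a sketch):
+
+.. code-block:: console
+
+   (node)$ df -h /tmp       # total, used, and available space on /tmp
+   (node)$ df -h /dev/shm   # the same for ramdisk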