3 changes: 1 addition & 2 deletions triton/ref/slurm.rst
@@ -34,8 +34,7 @@
! ``-e ERRORFILE`` ! write errors to file *ERRORFILE*
! ``--exclusive`` ! allocate exclusive access to nodes. For large parallel jobs.
! ``--constraint=FEATURE`` ! request *feature* (see ``slurm features`` for the current list of configured features, or Arch under the :ref:`hardware list <hardware-list>`). Multiple with ``--constraint="hsw|skl"``.
! ``--constraint=localdisk`` ! request nodes that have local disks
! ``--tmp=nnnG`` ! Request ``nnn`` GB of :doc:`local disk storage space </triton/usage/localstorage>`
! ``--tmp=nnnG`` ! request a node with a :doc:`local disk storage space </triton/usage/localstorage>` and ``nnn`` GB of space on it.
! ``--array=0-5,7,10-15`` ! run the job multiple times; use the variable ``$SLURM_ARRAY_TASK_ID`` to adjust parameters.
! ``--mail-type=TYPE`` ! notify of events: ``BEGIN``, ``END``, ``FAIL``, ``REQUEUE`` (not on triton), or ``ALL``. Must be used together with ``--mail-user=``
! ``[email protected]`` ! Aalto email to send the notification about the job to. External email addresses don't work.
2 changes: 1 addition & 1 deletion triton/ref/storage.rst
@@ -6,5 +6,5 @@
Home | ``$HOME`` or ``/home/USERNAME/`` | hard quota 10GB | Nightly | all nodes | Small user-specific files, no calculation data.
Work | ``$WRKDIR`` or ``/scratch/work/USERNAME/`` | 200GB and 1 million files | x | all nodes | Personal working space for every user. Calculation data etc. Quota can be increased on request.
Scratch | ``/scratch/DEPT/PROJECT/`` | on request | x | all nodes | Department/group specific project directories.
:doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG`` or ``--constraint=localdisk``.
:doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG``.
:doc:`Local temp (ramfs) </triton/usage/localstorage>` | ``/dev/shm/`` (and ``/tmp/`` on diskless nodes) | limited by memory | x | single-node | Very fast but small in-memory filesystem
4 changes: 2 additions & 2 deletions triton/tut/storage.rst
@@ -32,8 +32,8 @@ choose between them. The
(recommended for most work)

* ``/tmp``: temporary local disk space, per-user mounted in jobs and
automatically cleaned up. Only on nodes with disks
(``--constraint=localdisk``), otherwise it's ramfs
automatically cleaned up. Use ``--tmp=nnnG`` to request at
least ``nnn`` GB of space, otherwise it's ramfs
* ``/dev/shm``: ramfs, in-memory file storage

* See :doc:`remotedata` for how to transfer and access the data
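
As a minimal sketch of the two options (the size and file names here
are arbitrary placeholders): request local disk with ``--tmp`` and
work under ``/tmp``, or use ``/dev/shm`` if everything fits in your
memory request:

.. code-block:: console

   $ sinteractive --time=0:30:00 --tmp=20G   # node with at least 20GB local disk
   (node)$ mkdir /tmp/$SLURM_JOB_ID && cd /tmp/$SLURM_JOB_ID
   (node)$ tar xf $WRKDIR/dataset.tar        # unpack input to the local disk
   (node)$ cd /; rm -rf /tmp/$SLURM_JOB_ID   # clean up when done
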
225 changes: 159 additions & 66 deletions triton/usage/localstorage.rst
@@ -2,71 +2,132 @@
Storage: local drives
=====================

.. seealso::
.. admonition:: Abstract

- Path is ``/tmp/``
- Local drives are useful for large temporary data or unpacking
many small files before analysis. They are most important for
GPU training data but are useful at other times, too.
- Local storage can be either SSD drives (big and reasonably fast),
spinning hard disks (HDDs; older nodes), or ramdisk (using your
job's memory; extremely fast).
- Request local storage with ``--tmp=NNg`` (the space you think you
need; but the space isn't reserved just for you).
- For ramdisk, the space comes out of your ``--mem=`` allocation.

:doc:`the storage tutorial <../tut/storage>`.
.. seealso::

Local disks on computing nodes are the preferred place for doing your
IO. The general idea is to use network storage as a backend and local disk
for actual data processing. **Some nodes have no disks** (local
storage comes out of the job memory), **some older nodes have HDDs**
(spinning disks), and some have **SSDs**.
:doc:`The storage tutorial <../tut/storage>`.

Local disks on computing nodes are the preferred place for doing
extensive input/output (IO; reading/writing files). The general idea
is to use network storage as a backend and local disk for actual data
processing when it requires many reads or writes. **Different nodes
have different types of disks; Triton is very heterogeneous**:

.. list-table::
:header-rows: 1

- - Type
- Description
- Requesting
- Path
- - Solid-state drives (SSDs)
- Much faster than HDDs but much slower than ramdisk. Generally
GPU nodes have SSDs these days.
- ``--tmp=NNg``. The space is not guaranteed just for you.
- ``/tmp/``
- - Spinning hard disks (HDDs)
- Generally only older CPU nodes have HDDs.
- ``--tmp=NNg`` to specify the size you need. The space is not
guaranteed just for you.
- ``/tmp/``
- - Ramdisk
- Uses your job's memory allocation. Limited space but lightning
fast.
- ``--mem=NNg`` to request enough memory for your job and your
storage.
- ``/tmp/`` on diskless nodes and ``/dev/shm/`` on every node.

See :doc:`../overview` for details on each node's local storage.
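
As a sketch, the requests from the table might look like this in a
batch script (the sizes are arbitrary examples):

.. code-block:: slurm

   #SBATCH --tmp=100G   # SSD or HDD: a node with ~100GB of local disk

   # or, on diskless nodes where /tmp is ramdisk:
   #SBATCH --mem=24G    # e.g. 4GB for the program + 20GB of files in /tmp
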

The reason that local storage matters is that :doc:`lustre` (scratch)
is not good for many :doc:`smallfiles`. Read those articles for
background.


Background
----------

A general use pattern:

- In the beginning of the job, copy needed input from WRKDIR to ``/tmp``.
- In the beginning of the job, copy needed input from Scratch to ``/tmp``.
- Run your calculation normally reading input from or writing output
to ``/tmp``.
- In the end copy relevant output to WRKDIR for analysis and further
- In the end copy relevant output to Scratch for analysis and further
usage.
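
A minimal sketch of this pattern (program and file names are
placeholders; complete examples are at the end of this page):

.. code-block:: console

   (node)$ mkdir /tmp/$SLURM_JOB_ID && cd /tmp/$SLURM_JOB_ID
   (node)$ cp $WRKDIR/input.dat .              # copy input in
   (node)$ my_program input.dat > output.dat   # compute against local disk
   (node)$ cp output.dat $WRKDIR/results/      # copy results out
   (node)$ cd /; rm -rf /tmp/$SLURM_JOB_ID     # clean up
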

Pros
Pros:

- You get better and steadier IO performance. WRKDIR is shared over all
users, making per-user performance actually rather poor.
- You save performance for WRKDIR to those who cannot use local disks.
- You get better and steadier IO performance. Scratch is shared over all
users, so per-user performance can be poor at times, especially
for many small files.
- You leave Scratch performance for those who cannot use local disks.
- You get much better performance when using many small files (Lustre
works poorly here) or random access.
- Saves your quota if your code generate lots of data but finally you
need only part of it
- Saves your quota if your code generates lots of data but you only
need to save part of it.
- In general, it is an excellent choice for single-node runs (that is,
all the job's tasks run on the same node).

Cons
Cons:

- NOT for long-term data. Cleaned once your jobs on the node finish.
- Space is more limited (but still can be TBs on some nodes)
- Need some awareness of what is on each node, since they are different
- Small learning curve (must copy files before and after the job).
- Not feasible for cross-node IO (MPI jobs where different tasks
write to the same files). Use WRKDIR instead.
write to the same files). Use Scratch instead.



Usage
-----

How to use local drives on compute nodes
----------------------------------------
``/tmp`` is the temporary directory. It is ramdisk on diskless nodes.

``/tmp`` is the temporary directory. It is per-user (not per-job), if
you get two jobs running on the same node, you get the same ``/tmp``.
It is automatically removed once the last job on a node finishes.
It is per-user (not per-job): if you get two jobs running on the same
node, they see the same ``/tmp``. Thus, it is wise to ``mkdir
/tmp/$SLURM_JOB_ID/`` and use that directory, and delete it once the
job is done.

Everything is automatically removed once your last job on the
node finishes.


Nodes with local disks
~~~~~~~~~~~~~~~~~~~~~~

You can see the nodes with local disks on :doc:`../overview`. (To
double check from within the cluster, you can verify node info with
``sinfo show node NODENAME`` and see the ``localdisk`` tag in
``slurm features``). Disk sizes greatly vary from hundreds of GB to
tens of TB.
You can see the nodes with local disks on :doc:`../overview`. Disk
sizes vary greatly, from hundreds of GB (older nodes, when everything
had spinning disks) to tens of TB (new GPU nodes designed for ML
training).

.. admonition:: Verifying node details directly through Slurm

You don't usually need to do this. You can verify node info with
``scontrol show node NODENAME`` and look for ``TmpDisk=`` or
``AvailableFeatures=localdisk``. ``slurm features`` will list all
nodes (look for ``localdisk`` in features).

Request space with ``--tmp=nnnG`` (for example ``--tmp=100G``). You
can use ``--constraint=localdisk`` to ensure you get a physical disk
of any type, but you may as well just specify how much space you need.

You have to use ``--constraint=localdisk`` to ensure that you get a
hard disk. You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to
request a node with at least that much temporary space. But,
``--tmp`` doesn't allocate this space just for you: it's shared among
all users, including those who didn't request storage space. So,
you *might* not have as much as you think. Beware and handle out of
memory gracefully.
you *might* not have as much as you think. Beware and handle "out of
space" errors gracefully.


Nodes without local disks
@@ -75,7 +75,7 @@
You can still use ``/tmp``, but it is an in-memory ramdisk. This
means it is *very* fast, but it uses the node's actual main memory,
which your programs also need. It comes out of your job's memory allocation,
so use a ``--mem`` amount with enough space for your job and any
so use a ``--mem=nnG`` amount with enough space for your job and any
temporary storage.
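
For example (the numbers are illustrative only), if the program itself
needs about 4GB and you expect about 20GB of temporary files:

.. code-block:: slurm

   #SBATCH --mem=24G   # ~4GB for the program + ~20GB of files in /tmp (ramdisk)
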


@@ -86,51 +86,70 @@ Examples
Interactively
~~~~~~~~~~~~~

How to use /tmp when you login interactively
How to use ``/tmp`` when you log in interactively, for example as
space to decompress a big file:

.. code-block:: console

$ sinteractive --time=1:00:00 # request a node for one hour
(node)$ mkdir /tmp/$SLURM_JOB_ID # create a unique directory, here we use
$ sinteractive --time=1:00:00 --tmp=500G # request a node with 500GB of local disk for one hour
(node)$ mkdir /tmp/$SLURM_JOB_ID # create a unique directory, named after the job ID
(node)$ cd /tmp/$SLURM_JOB_ID
... do what you wanted ...
(node)$ cp your_files $WRKDIR/my/valuable/data # copy what you need
(node)$ cd; rm -rf /tmp/$SLURM_JOB_ID # clean up after yourself
(node)$ cp YOUR_FILES $WRKDIR/my/valuable/data # copy what you need
(node)$ cd; rm -rf /tmp/$SLURM_JOB_ID # clean up after yourself
(node)$ exit

In batch script
~~~~~~~~~~~~~~~

This batch job example prevents data loss in case the program gets
terminated (either because of ``scancel`` or due to the time limit).

In batch script - save data if job ends prematurely
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This batch job example has a trigger (``trap``) that prevents
data loss in case the program gets terminated early (either because of
``scancel``, the time limit, or some other error). On an abnormal
exit, the data is copied to ``$WRKDIR/$SLURM_JOB_ID``; on a normal
exit, the trap is unset and only the desired output is moved.

.. code-block:: slurm
:emphasize-lines: 15-17,26-27

#!/bin/bash
#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2500M # time and memory requirements
#SBATCH --output=test-local.out
#SBATCH --tmp=50G

#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2500M # time and memory requirements
# The below, if uncommented, will cause the script to abort (and trap
# to run) if there are any unhandled errors.
#set -euo pipefail

mkdir /tmp/$SLURM_JOB_ID # get a directory where you will send all output from your program
cd /tmp/$SLURM_JOB_ID
# get a directory where you will send all output from your program
mkdir /tmp/$SLURM_JOB_ID
cd /tmp/$SLURM_JOB_ID

## set the trap: when killed or exits abnormally you get the
## output copied to $WRKDIR/$SLURM_JOB_ID anyway
trap "rsync -a /tmp/$SLURM_JOB_ID/ $WRKDIR/$SLURM_JOB_ID/ ; exit" TERM EXIT

## set the trap: when killed or exits abnormally you get the
## output copied to $WRKDIR/$SLURM_JOB_ID anyway
trap "mkdir $WRKDIR/$SLURM_JOB_ID; mv -f /tmp/$SLURM_JOB_ID $WRKDIR/$SLURM_JOB_ID; exit" TERM EXIT
## run the program and redirect all IO to a local drive
## assuming that you have your program and input at $WRKDIR
srun $WRKDIR/my_program $WRKDIR/input > output

## run the program and redirect all IO to a local drive
## assuming that you have your program and input at $WRKDIR
srun $WRKDIR/my_program $WRKDIR/input > output
# move your output fully or partially
mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR

mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR # move your output fully or partially
# Un-set the trap since we ended successfully
trap - TERM EXIT
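
To convince yourself that the trap works, you can cancel a test run
and check that the files were copied anyway (the script name and job
ID are illustrative):

.. code-block:: console

   $ sbatch test-local.sh
   Submitted batch job 123456
   $ scancel 123456
   $ ls $WRKDIR/123456/
   output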



Batch script for thousands of input/output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your job requires a large amount of files as input/output using tar
utility can greatly reduce the load on the ``$WRKDIR``-filesystem.
If your job requires a large number of files as input/output, you can
store the files in a single archive format (``.tar``, ``.zip``, etc.)
and unpack them to local storage when needed. This can greatly reduce
the load on the scratch filesystem.

Using methods like this is recommended if you're working with thousands
of files.
@@ -139,30 +139,43 @@ Working with tarballs is done in the following fashion:

#. Determine if your input data can be collected into analysis-sized
chunks that can be (if possible) re-used
#. Make a tar ball out of the input data (``tar cf <tar filename>.tar
<input files>``)
#. Make a tar ball out of the input data (``tar cf ARCHIVE_FILENAME.tar
INPUT_FILES ...``)
#. At the beginning of job copy the tar ball into ``/tmp`` and untar it
there (``tar xf <tar filename>.tar``)
there (``tar xf ARCHIVE_FILENAME.tar``)
#. Do the analysis here, on the local disk
#. If the output is a large number of files, tar them and copy them out.
Otherwise write output to ``$WRKDIR``

Sample code is below:

.. code-block:: slurm
:emphasize-lines: 10-11,19-24

#!/bin/bash

#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2000M # time and memory requirements
mkdir /tmp/$SLURM_JOB_ID # get a directory where you will put your data
cp $WRKDIR/input.tar /tmp/$SLURM_JOB_ID # copy tarred input files
#SBATCH --tmp=50G

# get a directory where you will put your data and change to it
mkdir /tmp/$SLURM_JOB_ID
cd /tmp/$SLURM_JOB_ID

trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT # set the trap: when killed or exits abnormally you clean up your stuff
# set the trap: when killed or exits abnormally you clean up your stuff
trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT

# untar the files. If we only unpack once, there is no point in
# making an initial copy to local disks.
tar xf $WRKDIR/input.tar

srun MY_PROGRAM input/* # do the analysis, or whatever else, on the input files

tar xf input.tar # untar the files
srun input/* # do the analysis, or what ever else
tar cf output.tar output/* # tar output
# If you generate many output files, tar them before copying them
# back.
# If it's just a few files of output, you can copy back directly
# (or even output them straight to scratch)
tar cf output.tar output/ # tar output (if needed)
mv output.tar $WRKDIR/SOMEDIR # copy results back

# Un-set the trap since we ended successfully
trap - TERM EXIT