POSSUM pipelines

AusSRC contribution to the POSSUM data pre-processing pipelines. Pre-processing of POSSUM data involves convolving the observed image cubes and tiling/reprojecting them onto a HEALPix (HPX) grid.

Then, complete HPX tiles are mosaicked together and uploaded to CADC in a final step. The workflow can be applied to MFS images or full spectral cubes. In this repository there are pipelines for:

  • Pre-processing of MFS images (mfs.nf)
  • Pre-processing of spectral cube images (main.nf)
  • Mosaicking to complete tile images (mosaic.nf)

Running Pipelines

To run a pipeline you need to specify a main script, a parameter file (or a list of parameters as command-line arguments), and a deployment profile. Currently setonix is the only supported deployment.
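Parameters can be collected in a YAML file and passed to Nextflow with -params-file. The sketch below is illustrative only: the parameter names are those referenced elsewhere in this README and the values are placeholders, so check the pipeline scripts for the full parameter set.

# params.yaml (example values only)
WORKDIR: /scratch/<project>/possum
SBID: <SBID>
CASDA_CREDENTIALS: /path/to/casda.ini
NSPLIT: 8
NAN_TO_ZERO_NSPLIT: 8

nextflow run main.nf -profile setonix -params-file params.yaml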

The pipeline needs access to a CASDA credentials file casda.ini:

[CASDA]
username =
password =

Spectral cube images (main.nf)

#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00

# Load the container runtime and the workflow engine
module load singularity/4.1.0-slurm
module load nextflow/23.10.0

# Cray MPICH OFI settings
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1

# Pick a random default VNI for the Slingshot (CXI) interconnect
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

nextflow run main.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>

Deploy

sbatch script.sh
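Once the job is submitted, it can be monitored with standard Slurm commands, for example:

squeue --me
tail -f slurm-<jobid>.out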

MFS images (mfs.nf)

#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00

# Load the container runtime and the workflow engine
module load singularity/4.1.0-slurm
module load nextflow/23.10.0

# Cray MPICH OFI settings
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1

# Pick a random default VNI for the Slingshot (CXI) interconnect
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

nextflow run mfs.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>

Deploy

sbatch script.sh

File structure

This section describes how the output files are organised. All outputs are stored under the location specified by the WORKDIR parameter. The structure beneath that directory is:

.
├── ...
└── WORKDIR                             # Parent directory specified in params.WORKDIR
    ├── <SBID_1>
    ├── <SBID_2>
    ├── ...
    ├── <SBID_N>                        # A sub-folder for each SBID containing observation metadata
    │   ├── evaluation_files            # Downloaded evaluation files
    │   └── hpx_tile_map.csv            # Generated map for HPX pixels covered by image cube (map file)
    └── TILE_COMPONENT_OUTPUT_DIR       # HPX tile components for each SBID are stored here
        ├── <OBS_ID_1>
        ├── ...
        └── <OBS_ID_N>                  # Tiled images are separated by observation ID
            ├── i                       # Subdirectory for each Stokes parameter
            ├── ...
            └── q

Splitting

We use the CASA imregrid task for tiling and reprojection onto an HPX grid. CASA does not allow the tiling and reprojection to be parallelised across multiple nodes, and our worker nodes do not have enough memory to hold an entire cube (160 GB for band 1 images). We therefore split the cubes in frequency, run the tiling on each part, then join the results at the end.

We do this twice in the full pre-processing pipeline: once for convolution, to allow use of the robust method (which requires setting NaNs to zero), and once for imregrid, to produce tiles as described above. The number of frequency splits is specified by the NAN_TO_ZERO_NSPLIT and NSPLIT parameters respectively. Depending on the size of the cube and of the worker nodes, users will need to tune these parameters to make optimal use of the computing resources.
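Both parameters can be overridden on the command line like any other pipeline parameter. A minimal sketch (the split values are illustrative only, not recommendations):

nextflow run main.nf -profile setonix \
    --CASDA_CREDENTIALS <path to CASDA credentials> \
    --SBID <SBID> \
    --NSPLIT 16 \
    --NAN_TO_ZERO_NSPLIT 16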

CASAdata

Docs: https://casadocs.readthedocs.io/en/stable/notebooks/external-data.html

To run this pipeline you will need a custom path for the casadata files. The required data files should be downloaded automatically when CASA runs. The default path is ~/.casa/data, but you can change it by setting the environment variable CASASITECONFIG to a custom configuration file, e.g. CASASITECONFIG=/software/projects/ja3/ashen/.casa/config.py. The config.py file should include the following:

rundata = '/software/projects/ja3/ashen/.casadata/data'
measurespath = '/software/projects/ja3/ashen/.casadata/data'

The above example points to a software directory on Setonix where the downloaded CASA data files will not be purged. NOTE: it is best to avoid using the home directory on HPC systems, since Nextflow has added the --no_home default option in some of its newest releases.
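For example, the variable can be exported in the Slurm submission script before calling Nextflow (the path below matches the example above and should be replaced with your own):

export CASASITECONFIG=/software/projects/ja3/ashen/.casa/config.py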

Download NASA CDDIS data

The FRion predict step of the pipeline (main.nf only) requires data to be downloaded from NASA CDDIS. To do this you will need to create an EarthData account. Then create a .netrc file containing those credentials, with the following content:

machine urs.earthdata.nasa.gov login <username> password <password>

Then change the file permissions and move it to the home directory on the cluster where you intend to deploy the pipeline:

chmod 600 .netrc
mv .netrc ~/

For more info: https://urs.earthdata.nasa.gov/documentation/for_users
