AusSRC contribution to the POSSUM data pre-processing pipelines. The pre-processing of POSSUM data involves:
- Convolution to a common beam (20 arcseconds) [https://github.com/AlecThomson/RACS-tools]
- Ionospheric Faraday rotation correction (for Stokes Q and U) [https://github.com/CIRADA-Tools/FRion]
- Tiling [https://github.com/Sebokolodi/SkyTiles]
Then, complete HPX tiles are mosaicked together and uploaded to CADC in a final step. The workflow can be applied to MFS images or full spectral cubes. In this repository there are pipelines for:
- Pre-processing of MFS images (mfs.nf)
- Pre-processing of spectral cube images (main.nf)
- Mosaicking to complete tile images (mosaic.nf)
To run the pipeline you need to specify a main script, a parameter file (or a list of parameters passed as arguments) and a deployment. Currently, setonix is the only supported deployment.
The pipeline needs access to a CASDA credentials file, casda.ini:
[CASDA]
username =
password =
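Before submitting a job it can help to confirm that the credentials file parses and that both fields are filled in. The following is a minimal sketch, not the pipeline's own reader: the function name `read_casda_credentials` is hypothetical, and it only assumes the `[CASDA]` section with `username` and `password` keys shown above.

```python
import configparser


def read_casda_credentials(path):
    """Return (username, password) from a casda.ini-style file.

    Hypothetical helper for a pre-flight check; the pipeline itself
    may read the file differently.
    """
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path) as handle:
        parser.read_file(handle)

    section = parser["CASDA"]  # raises KeyError if the section is missing
    username = section.get("username", "").strip()
    password = section.get("password", "").strip()
    if not username or not password:
        raise ValueError(f"{path}: username and password must both be set")
    return username, password
```

Calling `read_casda_credentials("casda.ini")` on the empty template above raises `ValueError`, which is the intended signal that the file still needs to be filled in.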
#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00
module load singularity/4.1.0-slurm
module load nextflow/23.10.0
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# Run the spectral cube pre-processing pipeline (main.nf) with the setonix profile
nextflow run main.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>
Deploy
sbatch script.sh
#!/bin/bash
#SBATCH --account=<Pawsey account>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=24:00:00
module load singularity/4.1.0-slurm
module load nextflow/23.10.0
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# Run the MFS image pre-processing pipeline (mfs.nf) with the setonix profile
nextflow run mfs.nf -profile setonix --CASDA_CREDENTIALS=<path to CASDA credentials> --SBID <SBID>
Deploy
sbatch script.sh
This section describes how the output files are organised. All outputs are stored under the location specified by the WORKDIR parameter. The structure beneath it is:
.
├── ...
└── WORKDIR                          # Parent directory specified in params.WORKDIR
    ├── <SBID_1>
    ├── <SBID_2>
    ├── ...
    ├── <SBID_N>                     # A sub-folder for each SBID containing observation metadata
    │   ├── evaluation_files         # Downloaded evaluation files
    │   └── hpx_tile_map.csv         # Generated map of HPX pixels covered by the image cube (map file)
    └── TILE_COMPONENT_OUTPUT_DIR    # HPX tile components for each SBID are stored here
        ├── <OBS_ID_1>
        ├── ...
        └── <OBS_ID_N>               # All tiled images are separated by observation ID
            ├── i                    # Subdirectory for each Stokes parameter
            ├── ...
            └── q
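The layout above can be expressed as a pair of small path helpers, which may be handy when inspecting outputs after a run. This is only a sketch: the function names are hypothetical, and `tile_dir_name` stands for whatever directory name TILE_COMPONENT_OUTPUT_DIR resolves to in your configuration, which is an assumption here.

```python
from pathlib import Path


def sbid_metadata_paths(workdir, sbid):
    """Paths to the per-SBID metadata shown in the tree above."""
    root = Path(workdir) / str(sbid)
    return {
        "evaluation_files": root / "evaluation_files",
        "hpx_tile_map": root / "hpx_tile_map.csv",
    }


def stokes_tile_dir(workdir, tile_dir_name, obs_id, stokes):
    """Directory holding tile components for one observation and Stokes parameter.

    tile_dir_name is an assumed stand-in for TILE_COMPONENT_OUTPUT_DIR.
    """
    return Path(workdir) / tile_dir_name / str(obs_id) / stokes.lower()
```

For example, `stokes_tile_dir(workdir, tile_dir_name, obs_id, "Q")` points at the `q` subdirectory for that observation.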
We use the CASA imregrid method to do tiling and reprojection onto an HPX grid. CASA does not allow the tiling and reprojection to be parallelised over multiple nodes, and our worker nodes do not have enough memory to hold entire cubes (160 GB for band 1 images). We therefore split the cubes by frequency, run our program on each part, then join the results at the end.
We do this twice in our full pre-processing pipeline code: for convolution, to allow use of the robust method (which requires setting NaN to zero), and for imregrid, to produce tiles as described earlier. The number of frequency splits is specified by the NAN_TO_ZERO_NSPLIT and NSPLIT parameters respectively. Depending on the size of the cube and the size of the worker nodes, users will have to set these parameters to make the best use of computing resources.
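The split, process, and join pattern described above can be illustrated on a small in-memory array. This is a sketch under stated assumptions, not the pipeline's implementation: the pipeline works on FITS cubes on disk, whereas this example uses a NumPy array with frequency on axis 0, and the helper names `channel_ranges` and `process_in_chunks` are hypothetical.

```python
import numpy as np


def channel_ranges(n_channels, nsplit):
    """Return (start, stop) index pairs covering n_channels in nsplit contiguous chunks."""
    if nsplit < 1 or nsplit > n_channels:
        raise ValueError("nsplit must be between 1 and n_channels")
    # Integer arithmetic avoids floating-point rounding at chunk boundaries.
    bounds = [i * n_channels // nsplit for i in range(nsplit + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def process_in_chunks(cube, nsplit, func):
    """Apply func to each frequency chunk (axis 0) and join the results."""
    parts = [func(cube[start:stop]) for start, stop in channel_ranges(cube.shape[0], nsplit)]
    return np.concatenate(parts, axis=0)


if __name__ == "__main__":
    # Toy cube: 6 channels of 2x2 pixels, with one NaN to replace.
    cube = np.arange(24, dtype=float).reshape(6, 2, 2)
    cube[1, 0, 0] = np.nan
    cleaned = process_in_chunks(cube, nsplit=4, func=np.nan_to_num)
    print(channel_ranges(6, 4))
    print(cleaned.shape)
```

As a rough guide to choosing the split count, a 160 GB cube divided into 8 equal frequency chunks gives about 20 GB per chunk, before any working memory the processing step itself needs.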
Docs: https://casadocs.readthedocs.io/en/stable/notebooks/external-data.html
To run this pipeline you will need to create a custom path for casadata. The required data files should be downloaded automatically when you run CASA. The default path is ~/.casa/data, but you can change it by setting the CASASITECONFIG environment variable to the path of a custom config file, for example CASASITECONFIG="/software/projects/ja3/ashen/.casa/config.py". The config.py file should include the following:
rundata = '/software/projects/ja3/ashen/.casadata/data'
measurespath = '/software/projects/ja3/ashen/.casadata/data'
The above example points to a software directory on Setonix where the downloaded CASA data files will not be purged. NOTE: it is best to avoid using the home directory on HPC systems, since some of the newest Nextflow releases add the --no-home option by default.
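The configuration steps above can be scripted. The sketch below writes a config.py with the two settings shown and creates the data directory; the helper name `write_casa_config` is hypothetical, and the paths you pass in are your own choice (the Setonix paths above are just one example).

```python
import os
from pathlib import Path


def write_casa_config(config_path, data_dir):
    """Write a CASA site config pointing rundata and measurespath at data_dir."""
    config_path = Path(config_path)
    data_dir = Path(data_dir)

    # Create the data directory and the config file's parent if needed.
    data_dir.mkdir(parents=True, exist_ok=True)
    config_path.parent.mkdir(parents=True, exist_ok=True)

    # repr() produces a quoted Python string literal for the path.
    config_path.write_text(
        f"rundata = {str(data_dir)!r}\n"
        f"measurespath = {str(data_dir)!r}\n"
    )
    return config_path


if __name__ == "__main__":
    # Example locations only; substitute a non-purged project directory.
    cfg = write_casa_config("casa_site/config.py", "casa_site/data")
    # Child processes started from this script will see the variable.
    os.environ["CASASITECONFIG"] = str(cfg.resolve())
    print(cfg.read_text())
```

In a batch script the same effect is achieved by exporting CASASITECONFIG before the `nextflow run` line.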
The FRion predict step of the pipeline (only for main.nf) requires you to download data from NASA CDDIS. To do this you will need to create an EarthData account, then create a .netrc file containing those credentials with the following content:
machine urs.earthdata.nasa.gov login <username> password <password>
Then restrict the file's access permissions and move it to your home directory on the cluster where you intend to deploy the pipeline:
chmod 600 .netrc
mv .netrc ~/
For more info: https://urs.earthdata.nasa.gov/documentation/for_users
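To confirm that the .netrc file is readable only by its owner and contains an EarthData entry, a check along these lines can be run on the cluster. It is a sketch using Python's standard `netrc` module; the function name `check_netrc` is hypothetical and the check is not part of the pipeline.

```python
import netrc
import os
import stat


def check_netrc(path, host="urs.earthdata.nasa.gov"):
    """Return (login, password) for host, after verifying the file mode is owner-only."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group/other permission bits mean the file is too open (expected 600).
    if mode & 0o077:
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")

    entry = netrc.netrc(path).authenticators(host)
    if entry is None:
        raise LookupError(f"no entry for {host} in {path}")
    login, _account, password = entry
    return login, password


if __name__ == "__main__":
    login, _ = check_netrc(os.path.expanduser("~/.netrc"))
    print(f"Found EarthData entry for {login}")
```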