diff --git a/docs/data/sensitive-data/sd-connect-and-a-commands.md b/docs/data/sensitive-data/sd-connect-and-a-commands.md new file mode 100644 index 0000000000..ec28aeabc1 --- /dev/null +++ b/docs/data/sensitive-data/sd-connect-and-a-commands.md @@ -0,0 +1,109 @@
+# Using SD Connect service with a-commands

+SD Connect is part of the CSC sensitive data services, which provide a free-of-charge sensitive data processing environment for academic research projects at Finnish universities and research institutes. SD Connect adds an automatic encryption layer to CSC's Allas object storage system so that it can be used for securely storing sensitive data. Data stored in SD Connect can also be accessed from SD Desktop secure virtual desktops.

+In most cases SD Connect is used through the [SD Connect web interface](https://sd-connect.csc.fi), but in some cases command line tools provide a more efficient way to manage data in SD Connect.

+This document describes how you can use the a-commands provided by [allas-cli-utils](https://github.com/CSCfi/allas-cli-utils) to upload data to and download data from SD Connect. These tools are available on the CSC supercomputers (Puhti, Mahti and Lumi) and they can be installed on local Linux and Mac machines too.

+Note that Allas itself does not separate data stored with SD Connect from other data stored in Allas. Buckets can contain a mixture of SD Connect data, other encrypted data and normal data, and it is up to the user to know the type of the data. However, it is a good idea to keep SD Connect data in buckets and folders that don't contain other types of data.

+## Opening a connection to SD Connect

+To open an SD Connect compatible Allas connection, you must add the option *--sdc* to the configuration command. On the CSC supercomputers the connection is opened with the commands:

```text
module load allas
allas-conf --sdc
```

+In local installations the connection is typically opened with commands like:

```text
export PATH=/some-local-path/allas-cli-utils:$PATH
source /some-local-path/allas-cli-utils/allas_conf -u your-csc-account --sdc
```

+The setup process first asks for your CSC password (Haka or Virtu passwords can't be used here). After that you select the CSC project to be used. This is the normal login process for Allas. However, when SD Connect is enabled, the process also asks for an *SD Connect API token*. This token must be retrieved from the [SD Connect web interface](https://sd-connect.csc.fi). Note that the tokens are project specific: make sure you have selected the same SD Connect project both on the command line and in the web interface.

+In the web interface, the token can be created in the dialog that opens when you select *Create API tokens* from the *Support* menu.

+Copy the token, paste it to the command line and press enter.

+The SD Connect compatible Allas connection is now valid for the next eight hours, and you can use commands like *a-list* and *a-delete* to manage both normal Allas objects and SD Connect objects.

+## Data upload

+Data can be uploaded to SD Connect by using the command *a-put* with the option *--sdc*. For example, to upload the file *my-secret-table.csv* to the location *2000123-sens/dataset2* in Allas, use the command:

```text
a-put --sdc my-secret-table.csv -b 2000123-sens/dataset2
```

+This will produce the SD Connect object *2000123-sens/dataset2/my-secret-table.csv.c4gh*.

+All other a-put options and features can be used too. For example, directories are stored as tar files if the *--asis* option is not used (see the sketch below).
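+With *--asis*, the contents of a directory are instead uploaded as individual objects. A minimal sketch, assuming the same example bucket (the exact object names produced would depend on the a-put version):

```text
# sketch: upload the directory contents as individual .c4gh objects
a-put --sdc --asis my-secret-directory -b 2000123-sens/dataset2
```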
+Without *--asis*, the command:

```text
a-put --sdc my-secret-directory -b 2000123-sens/dataset2
```

+will produce the SD Connect object *2000123-sens/dataset2/my-secret-directory.tar.c4gh*.

+For massive data uploads, you can use *allas-dir-to-bucket* in combination with the option *--sdc*:

```text
allas-dir-to-bucket --sdc my-secret-directory 2000123-new-sens
```

+The command above will copy all the files from the directory my-secret-directory to the bucket 2000123-new-sens in an SD Connect compatible format.

+## Data download

+Data can be downloaded from Allas with the command *a-get*. If an SD Connect connection is enabled, a-get will automatically try to decrypt objects with the suffix *.c4gh*.

+For example, the command:

```text
a-get 2000123-sens/dataset2/my-secret-table.csv.c4gh
```

+will produce the local file *my-secret-table.csv*. Similarly, the command:

```text
a-get 2000123-sens/dataset2/my-secret-directory.tar.c4gh
```

+will produce the local directory *my-secret-directory*.

+Note that this automatic decryption works only for files that have been stored using the new SD Connect service that was taken into use in October 2024.

+For older SD Connect files and other Crypt4gh encrypted files, you must still provide the matching secret key with the option *--sk*:

```text
a-get --sk my-key.sec 2000123-sens/old-data/sample1.txt.c4gh
```

+Unfortunately, there is no easy way to know which encryption method has been used for a *.c4gh* file stored in Allas.
\ No newline at end of file
diff --git a/docs/data/sensitive-data/sd-connect-sharing-for-import.md b/docs/data/sensitive-data/sd-connect-sharing-for-import.md new file mode 100644 index 0000000000..8154f8c162 --- /dev/null +++ b/docs/data/sensitive-data/sd-connect-sharing-for-import.md @@ -0,0 +1,111 @@
+# Using SD Connect to receive sensitive research data

+This document provides instructions on how a research group can use SD Connect to receive **sensitive data** from an external data provider, such as a sequencing center. The procedure presented here is applicable in cases where the data will be analyzed in SD Desktop or on a computer that has an internet connection.

+In some sensitive data environments an internet connection is not available. In those cases, please check the alternative approach described in:

+* [Using Allas to receive sensitive research data](./sequencing_center_tutorial.md)

+## SD Connect

+SD Connect is part of the CSC sensitive data services, which provide a free-of-charge sensitive data processing environment for academic research projects at Finnish universities and research institutes. SD Connect adds an automatic encryption layer to CSC's Allas object storage system so that it can be used for securely storing sensitive data. SD Connect can be used for storing any kind of sensitive research data during the active working phase of a research project. SD Connect is, however, not intended for data archiving: you must remove your data from SD Connect when the research project ends.

+There are no automatic backup processes in SD Connect. On a technical level SD Connect is very reliable and fault-tolerant, but if you, or one of your project members, removes or overwrites data in SD Connect, it is permanently lost. Thus, you might consider making a backup copy of your data to some other location (one option is sketched below).

+Please check the [SD Connect documentation](./sd_connect.md) for more details about SD Connect.
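+One way to take a backup copy is with the a-commands described in [Using SD Connect data with a-commands](./sd-connect-and-a-commands.md): *a-get* downloads and decrypts an object, after which you can store the copy in whatever backup location you have available. A minimal sketch (the object name here is hypothetical):

```text
# hypothetical object name; a-get stores a decrypted copy locally
a-get 2000123-import-data/dataset1.tar.c4gh
```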
+## 1. Obtaining a storage space in SD Connect

+If you are already using the SD Connect service, you can skip this chapter and start from chapter 2. Otherwise, follow these steps to get access to SD Connect.

+### 1.1. Create a user account

+If you are not yet a CSC customer, register yourself as one. You can do this in CSC's customer portal [MyCSC](https://my.csc.fi).

+Create a CSC account by logging in to MyCSC with Haka or Virtu. Remember to activate multi-factor authentication for your CSC account in order to be able to use SD Connect.

+### 1.2. Create or join a project

+In addition to a CSC user account, you must either join an existing CSC computing project or set up a new one. You can use the same project to access other CSC services too, like SD Desktop, Puhti, or Allas.

+If you are eligible to act as a [project manager](https://research.csc.fi/prerequisites-for-a-project-manager), you can create a new CSC project in MyCSC and apply for access to SD Connect. Select 'Academic' as the project type. As a project manager, you can invite other users as members to your project.

+If you wish to join an existing project, please ask the project manager to add your CSC user account to the project member list.

+### 1.3. Add SD Connect access for your project

+Add the _SD Connect_ service to your project in MyCSC. Only the project manager can add services. After SD Connect has been added to the project, the other project members need to log in to MyCSC and approve the terms of use of the service before they get access to SD Connect.

+After these steps, your project has 10 TB of storage space available in SD Connect. Please [contact CSC Service Desk](../../support/contact.md) if you need more storage space.

+## 2. Creating a shared folder

+### 2.1. Creating a new root folder in SD Connect

+Once the service is enabled, you can log in to the [SD Connect interface](https://sd-connect.csc.fi). After connecting, check that the **Current project** setting refers to the CSC project that you want to use. After that you can click the **Create folder** button to create a new folder to be shared with the data provider.

+Avoid spaces (use _ instead) and special characters in folder names, as they may cause problems in some cases. Further, add some project-specific element, like a project acronym, to the name, as the root folder needs to have a unique name among all root folders of all SD Connect and Allas projects.

+### 2.2. Sharing the folder

+For sharing, you need to know the _Sharing ID_ of the data producer. You should request this 32-character random string from the data producer by email.

+To do the sharing, go to the folder list in SD Connect and press the share icon of the folder you wish to share. Then copy the Sharing ID to the first field of the sharing tool and select **Collaborate** as the sharing permission type.

+The sharing is now done, and you can send the name of the shared folder to the data producer by email.

+### 2.3. Revoking sharing after the data transfer

+Moving large datasets (several terabytes) to SD Connect can take a long time. Once the producer tells you that all the data has been imported to the shared folder, remove the external access rights in the SD Connect interface: click the _share_ icon of the shared folder and press **Delete** next to the project ID of the data producer. Before doing so, you may want to verify the transferred contents, as sketched below.
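+If you prefer the command line, you can check the contents of the shared folder with the a-commands described in [Using SD Connect data with a-commands](./sd-connect-and-a-commands.md). A minimal sketch, assuming an open SD Connect compatible connection (the folder name is hypothetical):

```text
# hypothetical folder name
a-list 2000123-import-data
```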
+## 3. Using encrypted data

+By default, data stored in SD Connect is accessible only to the members of the CSC project. However, project members can share the folder with other CSC projects.

+The project members can download the data to their own computers using the SD Connect web interface, which automatically decrypts the data when it is downloaded.

+The data can also be accessed in [SD Desktop](https://sd-desktop.csc.fi), using the _Data Gateway_ tool.

+On Linux and Mac computers, you can install a local copy of the _allas-cli-utils_ package, which provides command line tools to download (_a-get_) and upload (_a-put --sdc_) data from and to SD Connect:

+* [Using SD Connect data with a-commands](sd-connect-and-a-commands.md)

diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md index 53d79bca22..45e5a93c21 100644 --- a/docs/data/sensitive-data/sd-desktop-working.md +++ b/docs/data/sensitive-data/sd-desktop-working.md @@ -133,3 +133,7 @@ Read next: - [How to import data for analysis in your desktop](./sd-desktop-access.md) - [Customisation: adding software](./sd-desktop-software.md) - [How to manage your virtual desktop (delete, pause, detach volume etc.)](./sd-desktop-manage.md)
+
+## Submitting jobs from SD Desktop to HPC environments
+
+- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./tutorials/sdsi.md)

diff --git a/docs/data/sensitive-data/sequencing_center_tutorial.md b/docs/data/sensitive-data/sequencing_center_tutorial.md index 1e957a67d0..24fbd9dbab 100644 --- a/docs/data/sensitive-data/sequencing_center_tutorial.md +++ b/docs/data/sensitive-data/sequencing_center_tutorial.md @@ -1,5 +1,9 @@ # Using Allas storage service to receive sensitive research data
+This document provides an example of how a research group can use the Allas service to receive **sensitive data** from an external data provider, such as a sequencing center. In many cases [SD Connect](sd-connect-sharing-for-import.md) provides an easier way to receive sensitive data, but in some cases SD Connect can't be used. For example, SD Connect cannot provide an encrypted file that you could later decrypt in an environment that does not have an internet connection.
+
+## Allas
 Allas storage service is a general purpose data storage service maintained by CSC. It provides free-of-charge storage space for academic research projects at Finnish universities and research institutes. @@ -10,9 +14,6 @@ There is no automatic backup processes in Allas. In technical level Allas is very reliable and fault-tolerant, but if you, or some of your project members, remove or overwrite some data in Allas, it is permanently lost. Thus, you might consider making a backup copy of your data to some other location.
-This document provides an example of how a research group can use Allas service to receive **sensitive data** from external
-data provider like a sequencing center.
-
 The steps 1 (Obtaining storage space in Allas), and 2 (Generating encryption keys) require some work, but they need to be done only once. Once you have the keys in place you can move directly to step 3 when you need to prepare a new shared bucket. @@ -34,17 +35,18 @@ Create a CSC account by logging in to MyCSC with Haka or Virtu. ### Step 1.2. Create or join a project
-In addition to CSC user account, new users must either join a CSC computing project
+
+In addition to a CSC user account, users must either join an existing CSC computing project
or set up a new computing project.
You can use the same project to access other
-CSC services too like Puhti, cPouta, or SD desktop.
+CSC services too, like SD Desktop, SD Connect, or Puhti.

-Create a CSC project in MyCSC and apply access to Allas. See if you are eligible to act as a project manager.
-If your work belongs to any of the free-of-charge use cases, select 'Academic' as the project type.
-As a project manager, you can invite other users as members to your project.
+If you are eligible to act as a [project manager](https://research.csc.fi/prerequisites-for-a-project-manager), you can create a new CSC project in MyCSC and apply for access to Allas.
+Select 'Academic' as the project type. As a project manager, you can invite other users as members to your project.

 If you wish to be joined to an existing project, please ask the project manager to add your CSC user account to the project member list.
+
 ### Step 1.3. Add Allas access for your project Add _Allas_ service to your project in MyCSC. Only the project manager can add services.

diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md new file mode 100644 index 0000000000..0f4cc33db3 --- /dev/null +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -0,0 +1,275 @@
+# Submitting jobs from SD Desktop to the HPC environment of CSC

+The limited computing capacity of SD Desktop virtual machines can prevent running heavy analysis tasks on sensitive data. This document describes how heavy computing tasks can be submitted from SD Desktop to the Puhti HPC cluster.

+Please note the following details that limit the usage of this procedure:

+* You have to contact servicedesk@csc.fi to enable the job submission tools for your project. By default, the job submission tools don't work.
+* Each job always reserves one, and only one, full Puhti node for your task. Try to construct your batch job so that it effectively uses all 40 computing cores of one Puhti node.
+* The input files that the job uses must be uploaded to SD Connect before the job submission. Even though the job is submitted from SD Desktop, you can't use any files from the SD Desktop VM in the batch job.
+* The jobs submitted from SD Desktop to Puhti have a higher security level than normal Puhti jobs, but a lower one than SD Desktop itself.

+## Getting started

+Add the Puhti service to your project, then contact CSC (servicedesk@csc.fi) and request that Puhti access be created for your SD Desktop environment. In this process, a robot account is created for your project and CSC launches a project-specific server process for your project.

+Job submission is done with the command `sdsi-client`. This command can be added to your SD Desktop machine by installing `CSC Tools` with the [SD tool installer](../sd-desktop-software.md).

+## Submitting jobs

+### Data upload

+The batch jobs submitted by sdsi-client read their input data from the SD Connect service. Thus, all input data must be uploaded to SD Connect before the job is submitted. Note that you can't use data on the local disks of your SD Desktop virtual machine, or unencrypted files, as input files for your batch job. However, local files in Puhti can be used if the access permissions allow all group members to use the data.

+The first step in constructing a sensitive data batch job is therefore to upload the input data to SD Connect, for example as sketched below.
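+On your own computer or on Puhti, the upload can be done with the a-commands described in [Using SD Connect service with a-commands](../sd-connect-and-a-commands.md). A minimal sketch, using the same example bucket name as the jobs below (adapt it to your own project):

```text
# upload the input files to the example input bucket
a-put --sdc data1.txt -b 2008749-sdsi-input
a-put --sdc data2.txt -b 2008749-sdsi-input
```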
+### Constructing a batch job file

+When you submit a batch job from SD Desktop, you must define the following information:

+1. What files need to be downloaded from SD Connect to Puhti to be used as input files (`data:`)
+2. What commands will be executed (`run:`)
+3. What data will be exported from Puhti to SD Connect when the job ends (by default, the files in _$RESULTS_; see the Output section below)
+4. What resources (time, memory, temporary disk space) the job needs (`sbatch:`)

+You can define these on the command line as _sdsi-client_ command options, but normally it is more convenient to give this information in a batch job definition file. Below is a sample of a simple sdsi job definition file, named _job1.sdsi_:

```text
data:
  recv:
  - 2008749-sdsi-input/data1.txt.c4gh
  - 2008749-sdsi-input/data2.txt.c4gh
run: |
  md5sum 2008749-sdsi-input/data1.txt
  md5sum 2008749-sdsi-input/data2.txt
sbatch:
- --time=00:15:00
- --partition=test
```

+More sdsi batch job examples can be found below.

+### Submitting the job

+The batch job defined in the file can be submitted with the command:

```text
sdsi-client new -input job1.sdsi
```

+The submission command asks for your CSC password, after which it submits the task and prints the ID number of the job. You can use this ID number to check the status of your job. For example, for job 123456 you can check the status in *SD Desktop* with the command:

```text
sdsi-client status 123456
```

+Alternatively, you can use this ID on *Puhti* with the `sacct` command:

```text
sacct -j 123456
```

+### Steps of processing

+The task submitted with sdsi-client is transported to the batch job system of Puhti, where it is processed among the other batch jobs. The resource requirements of the batch job (computing time, memory, local disk size, GPUs) are set according to the values defined in the _sbatch:_ section of the job description file.

+The actual computing starts only when a suitable Puhti node is available. Queueing times may be long, as the job always reserves one full node with sufficient local disk and memory.

+The execution of the actual computing includes the following steps:

+1. The input files, defined in the job description file, are downloaded and decrypted to the local temporary disk space of the computing node.
+2. The commands defined in the _run:_ section are executed.
+3. The output files are encrypted and uploaded to SD Connect.
+4. The local temporary disk space is cleaned.

+### Output

+By default, the exported files include the standard output and standard error of the batch job (the text that, in interactive work, would be written to the terminal screen) and the files in the directory _$RESULTS_.

+The results are uploaded from Puhti to SD Connect into a bucket named *sdhpc-results-*_project-number_, in a subfolder named after the batch job ID. In the example above, the project used was 2008749 and the job ID was 123456. Thus, the job would produce two new files in SD Connect:

```text
 sdhpc-results-2008749/123456/slurm.err.tar.c4gh
 sdhpc-results-2008749/123456/slurm.out.tar.c4gh
```

+You can change the output bucket with the sdsi-client option `-bucket bucket-name`. Note that the bucket name must be unique in this case too.
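+The result objects are ordinary SD Connect objects, so besides the SD Connect web interface you can fetch them, for example, with the a-commands (a sketch, assuming an open SD Connect compatible connection):

```text
# job 123456 of project 2008749, as in the example above
a-get sdhpc-results-2008749/123456/slurm.out.tar.c4gh
```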
+### Running serial jobs effectively

+The jobs that sdsi-client submits always reserve one full Puhti node. These nodes have 40 computing cores, so you should use these batch jobs for tasks that can utilize multiple computing cores, preferably all 40.

+In the previous example, the actual computing task consisted of calculating md5 checksums for two files. The command used, `md5sum`, can use only one computing core, so the job wasted resources: 40 cores were reserved but only one was used.

+However, if you need to run a large number of unrelated tasks that can each use only one or a few computing cores, you can use tools like _GNU Parallel_, _Nextflow_ or _Snakemake_ to run several computing tasks at the same time.

+In the examples below, we have a tar archive that has been stored in SD Connect: `2008749-sdsi-input/data_1000.tar.c4gh`. The tar file contains 1000 text files (_.txt_) for which we want to compute md5 checksums. Below are three alternative ways to run the tasks so that all 40 cores are used effectively.

+#### GNU Parallel

+With GNU Parallel based parallelization, the workflow could look like the following:

```text
data:
  recv:
  - 2008749-sdsi-input/data_1000.tar.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load parallel
  tar xf 2008749-sdsi-input/data_1000.tar
  cd data_1000
  ls *.txt | parallel -j 40 md5sum {} ">" {.}.md5
  tar -cvf md5sums.tar *.md5
  mv md5sums.tar $RESULTS/
sbatch:
- --time=04:00:00
- --partition=small
```

+In the sample job above, the first command, `source /appl/profile/zz-csc-env.sh`, adds the _module_ command and other Puhti settings to the execution environment. GNU Parallel is enabled with the command `module load parallel`. Next, the tar file containing the 1000 files is extracted to the temporary local disk area. Finally, the listing of the .txt files in the extracted directory is piped to the `parallel` command, which runs the given command, `md5sum`, for each file (_{}_) using 40 parallel processes (`-j 40`).

+#### Nextflow

+If you want to use Nextflow, you must first upload a Nextflow workflow file (_md5sums.nf_ in this case) to SD Connect. This file defines the input files to be processed, the commands to be executed and the outputs to be created. Note that you can't upload this file to SD Connect from SD Desktop; you must upload it, for example, from your own computer or from Puhti.

+Content of the Nextflow file _md5sums.nf_:

```text
nextflow.enable.dsl=2

process md5sum {
    tag "$txt_file"

    // Copy each .md5 result back to the launch directory,
    // so that the run section can collect them with tar
    publishDir '.', mode: 'copy'

    input:
    path txt_file

    output:
    path "${txt_file}.md5"

    script:
    """
    md5sum $txt_file > ${txt_file}.md5
    """
}

workflow {
    // One task per .txt file in the working directory
    txt_files = Channel.fromPath("*.txt")
    md5sum(txt_files)
}
```

+The actual sdsi job file could look like this:

```text
data:
  recv:
  - 2008749-sdsi-input/md5sums.nf.c4gh
  - 2008749-sdsi-input/data_1000.tar.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load nextflow
  tar xf 2008749-sdsi-input/data_1000.tar
  cp 2008749-sdsi-input/md5sums.nf data_1000
  cd data_1000
  nextflow run md5sums.nf -process.executor local -process.maxForks 40
  tar -cvf md5sums.tar *.md5
  mv md5sums.tar $RESULTS/
sbatch:
- --time=04:00:00
- --partition=small
```

+#### Snakemake

+If you want to use Snakemake, you must first upload a Snakemake workflow file (_md5sums.snakefile_ in this case) to SD Connect. This file defines the input files to be processed, the commands to be executed and the outputs to be created. Note that you can't upload this file to SD Connect from SD Desktop; you must upload it, for example, from your own computer or from Puhti, as sketched below.
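+On Puhti, for example, the upload might look like this (a sketch, assuming the allas module and the same input bucket as in the examples above):

```text
# upload the workflow file to the input bucket
module load allas
allas-conf --sdc
a-put --sdc md5sums.snakefile -b 2008749-sdsi-input
```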
+Content of the Snakemake file _md5sums.snakefile_:

```text
import os

txt_files = [f for f in os.listdir(".") if f.endswith(".txt")]

rule all:
    input:
        expand("{file}.md5", file=txt_files)

rule md5sum:
    input:
        "{file}"
    output:
        "{file}.md5"
    shell:
        "md5sum {input} > {output}"
```

+The actual sdsi job file could look like this:

```text
data:
  recv:
  - 2008749-sdsi-input/md5sums.snakefile.c4gh
  - 2008749-sdsi-input/data_1000.tar.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load snakemake
  mkdir snakemake_cache
  export SNAKEMAKE_OUTPUT_CACHE=$(pwd)"/snakemake_cache"
  tar xf 2008749-sdsi-input/data_1000.tar
  cp 2008749-sdsi-input/md5sums.snakefile data_1000
  cd data_1000
  snakemake --cores 40 --snakefile md5sums.snakefile
  tar -cvf md5sums.tar *.md5
  mv md5sums.tar $RESULTS/
sbatch:
- --time=04:00:00
- --partition=small
```

+#### GPU computing

+sdsi-client can also be used to submit jobs that utilize the GPU capacity of Puhti. In the example below, GPU computing is used to speed up the Whisper speech recognition tool. Whisper is installed on Puhti and activated there with the command `module load whisper`.

```text
data:
  recv:
  - 2008749-sdsi-input/interview-52.mp4.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load whisper
  whisper --model large -f all -o $RESULTS --language Italian 2008749-sdsi-input/interview-52.mp4
sbatch:
- --time=01:00:00
- --gres=gpu:v100:1
```
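+Assuming the definition above is saved as _whisper-job.sdsi_ (a hypothetical file name), it is submitted like any other sdsi job, and the transcriptions written to _$RESULTS_ then appear in the *sdhpc-results* bucket described above:

```text
# whisper-job.sdsi is a hypothetical file name
sdsi-client new -input whisper-job.sdsi
```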