Software components

The following contains detailed information on the installed cluster components, i.e. Rucio, Reana, Jupyterhub, Dask, CVMFS.

Rucio

Rucio is installed via its helm chart through IaC (Terraform). The CERN VRE tf and values YAML files can be used as a template for the initial setup. The helm charts should be installed one after the other: first the Rucio server, then the daemons, and then the UI and probes if required. Note that some secrets containing host certificates for the servers need to be applied to the cluster BEFORE installing the helm charts.

1. Certificates and secrets

Some secrets need to be created before applying the Rucio Helm charts via Terraform. A script with instructions can be found here. In order to generate the host certificates, head to the CERN CA website and create new grid host certificates for the main server, the auth server and the webui, specifying your desired DNS/SAN names (ours are vre-rucio.cern.ch, vre-rucio-auth.cern.ch, vre-rucio-ui.cern.ch). The secrets will be encrypted with sealed-secrets. The CERN OpenStack nginx-ingress-controller pod in the kube-system namespace has a validatingwebhookconfiguration named ingress-nginx-admission that needs to be deleted in order for the nginx ingress controller to be able to reach the K8s API.
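
For reference, the webhook removal and the sealing of one of the host-certificate secrets look roughly as follows. This is a minimal sketch: the secret and namespace names are illustrative, the actual ones are defined in the script linked above.

# allow the nginx ingress controller to reach the K8s API
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

# wrap a host certificate in a Secret and seal it before committing it to the repo
kubectl create secret tls rucio-server-hostcert \
  --cert=hostcert.pem --key=hostkey.pem \
  --namespace rucio-vre --dry-run=client -o yaml \
  | kubeseal --format yaml > rucio-server-hostcert-sealed.yaml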

CERN CA certificates

The way to install the CA certificates in a persistent and up-to-date way is the following:

CentOS-based OS

> vi /etc/yum.repos.d/linuxsupport7s-stable.repo
> yum install -y CERN-CA-certs

with

> cat linuxsupport7s-stable.repo
# Example modified for cc7 taken from https://gitlab.cern.ch/linuxsupport/rpmci/-/blob/master/kojicli/linuxsupport8s-stable.repo 
[linuxsupport7s-stable]
name=linuxsupport [stable]
baseurl=https://linuxsoft.cern.ch/cern/centos/7/cern/$basearch
enabled=1
gpgcheck=False
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-koji file:///etc/pki/rpm-gpg/RPM-GPG-KEY-kojiv2
priority=1
protect=1

which adds a CERN-bundle.pem file (among others) into the /etc/pki/tls/certs/ directory.

Ubuntu-based OS

RPMs cannot be installed on apt-based systems, so the bundle files have to be added manually.

> curl -fsSL 'https://cafiles.cern.ch/cafiles/certificates/CERN%20Root%20Certification%20Authority%202.crt' | openssl x509 -inform DER -out /tmp/cernrootca2.crt
> curl -fsSL 'https://cafiles.cern.ch/cafiles/certificates/CERN%20Grid%20Certification%20Authority(1).crt' -o /tmp/cerngridca.crt
> curl -fsSL 'https://cafiles.cern.ch/cafiles/certificates/CERN%20Certification%20Authority(2).crt' -o /tmp/cernca.crt
> mv /tmp/cernrootca2.crt /tmp/cerngridca.crt /tmp/cernca.crt /usr/local/share/ca-certificates/ 
> update-ca-certificates
# Move the files anywhere or merge them into a single file. For example;
# cat cernrootca2.crt >> /certs/rucio_ca.crt
# cat cerngridca.crt >> /certs/rucio_ca.crt
# cat cernca.crt >> /certs/rucio_ca.crt

For reference, the update-ca-certificates command updates /etc/ssl/certs (command description).

Bundle CA files

Both methods provide the same output file. However, the CERN-ca-bundle.pem contains a fourth, extra certificate.

2. Database

  • Request a DBOD instance at CERN (Postgres is preferred over Oracle).
  • Configure psql to connect to and manage it.
  • In order to pass the database connection string to Rucio, it needs to be exported as a variable. Therefore, run this command locally:
$ export DB_CONNECT_STRING="<set manually>"

Create a secret named ${helm_release}db-secret in the cluster.
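
For example, a minimal sketch of the secret creation; the namespace and the key name inside the secret are assumptions, check the Rucio helm values for the exact key expected:

kubectl create secret generic ${helm_release}db-secret \
  --from-literal=dbconnectstring="$DB_CONNECT_STRING" \
  --namespace rucio-vre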

  • Bootstrap the database with the Rucio DB init container.

    Keep in mind that when installing the daemons chart, the DB will be connected to by several services. In the case of the Database On Demand at CERN, the maximum number of DB connections is limited to 100, so you have to set the limit via database.pool_size.

Database major version changes

In order to perform a major upgrade (e.g. 1.30.0 --> 1.31.0), you will need to manually run an alembic upgrade of the DB. See section 9 of this wiki for more details.

3. FTS - File Transfer Service

Create either a long-lived FTS proxy secret or the certificate and key cluster secrets needed by the FTS renewal cronjob to run and create the required x509 proxy.

Look at this issue, which automates the process: you provide the certificate, key and password, and the proxy creation and injection into the cluster is handled automatically.

It will be necessary to request a ROBOT certificate associated with a service account to delegate the proxy to FTS. This is much better than delegating FTS transfers to a single user account. In order to request a Grid Robot Certificate, follow the instructions at the bottom of https://ca.cern.ch/ca/Help/?kbid=021003, separate the certificate into hostcert.pem and hostkey.pem and create the <release>-fts-cert and <release>-fts-key secrets.

They will be used by the fts-cron container to generate the proxy.
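
A sketch of preparing the robot certificate and creating the two secrets, assuming the certificate was exported as a PKCS#12 bundle (file names, namespace and release name are illustrative):

# split the .p12 bundle into certificate and (unencrypted) key
openssl pkcs12 -in robot-cert.p12 -clcerts -nokeys -out hostcert.pem
openssl pkcs12 -in robot-cert.p12 -nocerts -nodes -out hostkey.pem
chmod 600 hostkey.pem

# secrets consumed by the fts-cron container
kubectl create secret generic <release>-fts-cert --from-file=hostcert.pem --namespace rucio-vre
kubectl create secret generic <release>-fts-key  --from-file=hostkey.pem  --namespace rucio-vre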

4. Apply Helm releases via flux

Apply the Rucio helm charts by providing your specific values.yaml files stored in the /rucio repo.

5. Rucio Servers

Once the Rucio helm charts are applied, the rucio-server and rucio-server-auth services will be created. They are services of type LoadBalancer, accessible from outside the cluster. You can inspect them with:

kubectl get service servers-vre-rucio-server -n rucio-vre 

The external IP address is created once the chart gets applied to the cluster. Once the IP is created, you can inspect it on CERN's aiadm with the command openstack loadbalancer list. You will then have to add a description and a tag to the loadbalancer in order for it to be reachable via a DNS name (e.g. vre-rucio.cern.ch).

# backlog: this option uses CERN's loadbalancer as a service 

# set a description
openstack loadbalancer set --description "vre-rucio.cern.ch" $LB_ID_MAIN
openstack loadbalancer set --description "vre-rucio-auth.cern.ch" $LB_ID_AUTH
openstack loadbalancer set --description "vre-rucio-ui.cern.ch" $LB_ID_UI

# set a tag
openstack loadbalancer set --tag landb-alias=vre-rucio $LB_ID_MAIN
openstack loadbalancer set --tag landb-alias=vre-rucio-auth $LB_ID_AUTH
openstack loadbalancer set --tag landb-alias=vre-rucio-ui $LB_ID_UI

Afterwards, open the firewall for that loadbalancer service to be accessible from the outside world, and not only from CERN. Use this link: https://landb.cern.ch/portal/firewall.

6. Authentication

Rucio can connect to any third-party authentication service, in this case the ESCAPE IAM instance. Authentication is managed through OIDC tokens. To authenticate users against the VRE Rucio instance, a connection between IAM and Rucio needs to be established. We manage the Rucio instance, while the IAM instance is managed by the CNAF admins. We suggest getting familiar with OAuth2 and OIDC tokens by going through this presentation on tokens in the Rucio framework.

The setup is straightforward. Before starting, you need to have:

  • The Rucio servers (main + auth) running, as described in the Rucio server chart. Apply these to your K8s cluster via Terraform and check that the services get created correctly.
  • Your Rucio DB already initialised correctly.

Once you have this ready, register two new clients from the MitreID IAM dashboard, as described in the Rucio documentation.

You can name them however you want, ours are:

  1. cern-vre-rucio-admin is the **ADMIN** client.
  2. cern-vre-rucio-auth is the **AUTH** client.

For each of them, you have a client-id and a client-secret, and you can generate a 'registration access token'. Create a new idpsecret.json file and populate it as follows:
{
  "escape": {
    "issuer": "https://iam-escape.cloud.cnaf.infn.it/",
    "redirect_uris": [
      "https://vre-rucio-auth.cern.ch/auth/oidc_code",
      "https://vre-rucio-auth.cern.ch/auth/oidc_token"
    ],
    "client_id": "<AUTH-client-id>",
    "registration_access_token": "<AUTH-client-token>",
    "client_secret": "<AUTH-client-secret>",
    "SCIM": {
      "client_id": "<ADMIN-client-id>",
      "grant_type": "client_credentials",
      "registration_access_token": "<ADMIN-client-token>",
      "client_secret": "<ADMIN-client-secret>"
    }
  }
}

After having injected this secret as <helm_release_name>-idpsecrets as stated here, you need to add the config.oidc and additionalSecrets.idpsecrets sections to the values.yaml of the server and the daemons, if you haven't already done so.
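
For reference, a minimal sketch of injecting the file as a secret; the key name inside the secret is an assumption, check the additionalSecrets.idpsecrets definition for the exact one:

kubectl create secret generic <helm_release_name>-idpsecrets \
  --from-file=idpsecrets.json=idpsecret.json \
  --namespace rucio-vre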

The last step is to run the iam-rucio-sync.py script, ideally as a cronjob, in a container that has both the server and client modules installed. This will populate the accounts table of the DB with all the IAM ESCAPE accounts. You can test-run it from the Rucio server pod by entering the shell and executing the code as described in the containers repo.

If the synchronisation causes problems, execute the mapping manually:

rucio-admin identity add --account <account_name> --type OIDC --id "SUB=<look-in-IAM-user-info>, ISS=https://iam-escape.cloud.cnaf.infn.it/" --email "<user_email>"

You need root permissions, so you need to authenticate with Rucio from the root account, which has the following config:

[client]
rucio_host = https://vre-rucio.cern.ch:443
auth_host = https://vre-rucio-auth.cern.ch:443
auth_type = userpass
username = ddmlab
password = <password_tbag>
ca_cert = /etc/pki/tls/certs/CERN-bundle.pem
account = root
request_retries = 3
protocol_stat_retries = 6
oidc_issuer = escape
oidc_polling = true

[policy]
permission = escape
schema = escape
lfn2pfn_algorithm_default = hash

Now run the command 'rucio whoami' with a rucio.cfg similar to this:

[client]
rucio_host = https://vre-rucio.cern.ch:443
auth_host = https://vre-rucio-auth.cern.ch:443
ca_cert = /etc/pki/tls/certs/CERN-bundle.pem
auth_type = oidc
account = <your_IAM_account>
oidc_audience = rucio
oidc_scope = openid profile wlcg wlcg.groups fts:submit-transfer offline_access 
request_retries = 3
oidc_issuer = escape
oidc_polling = true
auth_oidc_refresh_activate = true

[policy]
permission = escape
schema = escape  
lfn2pfn_algorithm_default = hash 

The server should prompt you with a link to generate the token to authenticate to the instance.

Debugging x509 authentication

In order to be authenticated by the Rucio auth server with your x509 certificate, you need to periodically update the CA certificates that verify that your personal certificate is valid. You can do that by running the command fetch-crl (yum install fetch-crl && fetch-crl), or you can safely copy the contents of /etc/grid-security/certificates/* from the official institutional computers (in the case of CERN, lxplus.cern.ch).
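
For example, a minimal sketch; the lxplus path is the standard grid-security location, the local destination is an assumption:

# refresh the CA certificates and CRLs locally
yum install -y fetch-crl && fetch-crl

# or copy them from lxplus
rsync -av <user>@lxplus.cern.ch:/etc/grid-security/certificates/ /etc/grid-security/certificates/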

To see whether you are correctly authenticated to the instance, always run the command rucio -vvv whoami first (-vvv stands for verbose). You can also query the auth server directly:

curl -vvv --cacert /etc/pki/tls/certs/CERN-bundle.pem --cert <x509_cert_path> --key <x509_key_path> https://<rucio_auth_server>/auth/x509

Bearer token

When you authenticate with tokens, the OAuth method is described in the Bearer token documentation. The token gets stored in the directory /tmp/root/.rucio_root/, and you can export it with:

export tokenescape=<content_of_your_auth_token_file>

Inspect it with:

curl -s -H "Authorization: Bearer $tokenescape" https://iam-escape.cloud.cnaf.infn.it/userinfo | jq .

7. RSEs - Rucio Storage Elements

If you want, you can use CRIC to better manage your RSEs, but for now we are setting them up manually.

Before adding any RSE to the RUCIO instance:

  • Be sure that you can communicate with the endpoint: download the corresponding client to communicate with the storage and test it (explore the storage, for example).
  • Then test whether you can interact with the endpoint using Gfal (local-SE); see the sketch below.
  • The last check is testing the connection/communication between SEs using FTS.
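
A minimal sketch of the Gfal checks, assuming an XRootD endpoint (adapt the scheme, hostname and prefix to your storage):

# explore the remote storage
gfal-ls -l root://<hostname>:<port>/<prefix>/

# round-trip a small test file
gfal-copy /tmp/testfile root://<hostname>:<port>/<prefix>/testfile
gfal-copy root://<hostname>:<port>/<prefix>/testfile /tmp/testfile.back
gfal-rm root://<hostname>:<port>/<prefix>/testfile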

You can either execute the following commands manually, or use this script to make your life easier. Remember to insert your own variables at the start of the file!

Using rucio CLI

First of all, you'll need to authenticate with Rucio, for example using a rucio.cfg file.

> export RUCIO_CONFIG=/path/to/the/file/rucio.cfg
> rucio whoami

The following commands fully set up an EULAKE RSE from the root account:

> rucio-admin rse add <RSE_NAME>
Added new deterministic RSE: RSE_NAME

> rucio-admin rse add-protocol --hostname <HOSTNAME> \
  --scheme <SCHEME> \
  --prefix <PREFIX> \
  --port <PORT> \
  --impl <IMPL>  \
  --domain-json '{"wan": {"read": X, "write": X, "delete": X, "third_party_copy_read": X, "third_party_copy_write": X}, "lan": {"read": X, "write": X, "delete": X}}' <RSE_NAME>

> rucio-admin rse set-attribute --rse <RSE_NAME> --key <KEY> --value <VALUE>
Added new RSE attribute for RSE_NAME: KEY-VALUE

# Do the same for the following attributes:

Attributes:
===========
  QOS: X
  SITE: X
  city: Missing from CRIC
  country_name: NULL
  fts: X
  greedyDeletion: X
  latitude: X
  lfn2pfn_algorithm: hash
  longitude: X
  oidc_support: x
  region_code: X
  source_for_used_space: X
  verify_checksum: X

# Defining storage for the RSE
> rucio-admin account set-limits <ACCOUNT> <RSE_NAME> <LIMIT>
> rucio-admin rse set-limit <RSE_NAME> MinFreeSpace <LIMIT>

# Once you have added at least 2 SEs, you can set up the distances between the SEs. This can be done in a single direction, if intended.
> rucio-admin rse add-distance --distance <distance> --ranking <ranking> <RSE-1> <RSE-2>

A few remarks

  • The lfn2pfn_algorithm attribute needs to be set to hash.
  • (Optional attribute, no need to set it) the country code needs to be a 2-letter code, not more.
  • oidc_support does not affect user authentication, upload or download, but only transfers and rules. Here is the documentation in more detail. The root account needs to be well configured with it, otherwise the reaper daemon will throw OIDC errors when deleting.
  • greedyDeletion needs to be studied further, but for the moment it is better to keep it set to True, so that the reaper actually deletes the files.
  • In general, avoid adding NULL values to any key attribute.

Deletion

File deletion

Files in Rucio can only be deleted by the reaper daemon.

The usual way to trigger this is by using the Rucio CLI:

> rucio erase <SCOPE>:<DID>

This operation is asynchronous (it should take effect within the next 24h): it first deletes all the replicas associated with the DID until the file completely disappears from the DB. To delete all the replicas associated with one rule, use:

rucio delete-rule --purge-replicas --all <rule_id>

RSEs deletion

Sometimes, it will be convenient to delete an RSE from the Data Lake. In order to do so, the RSE needs to be completely empty. To check if it is, run

rucio list-rse-usage <rse_name>

To see which datasets are on the RSE:

rucio list-datasets-rse <rse_name>

If you know the account that uploaded data on the RSE:

rucio list-rules --account=<account_name>

Access the DB, for example with psql, and execute the query:

SELECT * FROM replicas WHERE rse_id ='<rse_id>';

Check whether there are still some files there. In general, if lock_cnt is 0 and the tombstone is in the past, the reaper daemon should delete the file. To trigger this, run the rucio erase command as explained above.
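
For example, a sketch of that check with psql, reusing the DB_CONNECT_STRING variable from the Database section (column names as in the Rucio replicas table):

psql "$DB_CONNECT_STRING" -c \
  "SELECT scope, name, state, lock_cnt, tombstone FROM replicas WHERE rse_id = '<rse_id>' LIMIT 20;"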

Sometimes you might want to delete an RSE from your Rucio DB. This can happen if:

  • the site is already gone (a partner institution has decommissioned it)
  • you lost access to it

To do so:

  1. add a mock protocol (--scheme mock) to the RSE with the command shown below
  2. set greedyDeletion to True
  3. (if the state of the file in the DB is A (available), you can set the tombstone time to the past so that the reaper actions take place without the 24h delay)
  4. update the updated_at time in the DB to the past
  5. (use at your own risk; this step worked in our case. If there are no rules associated with a replica but the replica still shows, for example, a C (copying) state, you can update the state of the replica to U (unavailable) and set the updated_at time to the past as explained in step 4.)
# Step 1. command
> rucio-admin rse add-protocol --hostname example.com --scheme mock --prefix / --port 0 --impl rucio.rse.protocols.mock.Default <rse_name>
# Step 2. command
> rucio-admin rse set-attribute --key greedyDeletion --value True --rse <rse_name>
# Step 3. command
> rucio-admin replicas set-tombstone --rse <rse_name> <scope>:<did>
# Step 4 sql (DB) command
UPDATE replicas SET updated_at = TO_DATE('2023-01-01', 'YYYY-MM-DD') WHERE rse_id = '<rse_id>';
# Step 5 sql (DB) command
UPDATE replicas SET state = 'U' WHERE rse_id = '<rse_id>';  -- command not tested, though.

8. Monitoring

The hermes Rucio daemon takes care of the messages about the VRE operations for file transfer, upload and download. These messages (metrics) are useful to pass to monitoring dashboards such as Grafana. Here are some initial useful links:

Grafana - Kibana connection.

Elasticsearch (ES) data sources can be added to Grafana in the configuration tab (if you have the corresponding rights on the Grafana instance). Follow this tutorial/doc to set up the ES data source you are interested in within Grafana. The username and password for both the timber and monit-opensearch-lt data sources can be found in tbag.

Examples of containers/scripts set up as cronjobs that populate the dashboards can be found here.

9. Rucio version upgrades

The VRE pulls Helm charts directly from the Rucio main repository: https://github.com/rucio/helm-charts. When a minor version upgrade (e.g. 1.30.0 --> 1.30.1) is performed, the Helm chart can be edited directly on GitHub (flux will take care of applying the edits to the K8s cluster) without worries.

In the case of a major version upgrade (e.g. 1.30.0 --> 1.31.0), the layout of the database (DB) tables will change, and the upgrade needs to be performed carefully. Here is an outline of the steps to be taken to migrate the Rucio instance from v1 to v2 (where, for example, v1=1.30.0 and v2=1.31.0); a command sketch follows the list:

  • create a clone C1 of the C0 DB (if on CERN's DBOD, this is very straightforward); this will leave you with databases C0 and C1.
  • go inside a v2 rucio-server Docker container (in a development cluster, for example) and edit the /opt/rucio/etc/rucio.cfg to connect to the C1 clone of the DB.
[database]
default = postgresql://<user>:<pwd>@<hostname>.cern.ch:<port>/<db_name>
  • run the alembic upgrade with alembic upgrade head --sql; the alembic.ini file should be picked up automatically. Clone C1 is now in v2.
  • check that the upgrade has worked well by looking at the DB Tables and noticing if anything fishy happened. If all is good, you are good to go!
  • create a clone C2 of C0, both still in v1.
  • on the production cluster, in v1, stop the connections to the DB by deleting the K8s db-secret.
  • from the v2 rucio server in the dev cluster, change the DB config to connect to clone C0 of the database and run the alembic upgrade as before.
  • once the upgrade of C0 is performed, apply the v2 Helm charts to the production cluster, and re-create the db-secret in K8s to connect to C0 database, which is now in v2.
  • if all is well, you now have an upgraded version of the DB and of the K8s cluster!
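
A minimal sketch of the alembic step run from inside the v2 rucio-server container; the pod name is an assumption, and the paths assume the alembic.ini lives next to rucio.cfg as in the default Rucio server image:

kubectl exec -it <v2-rucio-server-pod> -- /bin/bash
# inside the container:
vi /opt/rucio/etc/rucio.cfg          # point the [database] default string to the clone
cd /opt/rucio/etc
alembic upgrade head                 # add --sql to only print the SQL instead of applying it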

CronJobs

Most of the VRE operation cronjobs share a base image containing the common software used by all the cron containers. All the VRE containers can be found in the VRE repository.

vre-base-ops

Starting from the BASEIMAGE=rucio/rucio-server:release-1.30.0 base image, the base-ops container has the following software installed: vre-base-ops Dockerfile

This image interacts with different Rucio and grid components. Please check the Certificates and secrets section to see how to install certain certificates in a persistent way.
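
For reference, a sketch of building the image locally, assuming the Dockerfile exposes BASEIMAGE as a build argument as the variable name above suggests (the tag is illustrative):

docker build \
  --build-arg BASEIMAGE=rucio/rucio-server:release-1.30.0 \
  -t vre-base-ops:local .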

Rucio related containers

rucio-noise

Uses the vre-base-ops latest stable version. TBC.

iam-rucio-sync

Uses the vre-base-ops latest stable version. TBC.

rucio-client

Uses BASEIMAGE=rucio/rucio-client:release-1.30.0 as a base image. TBC.

CVMFS - CERN-VM File System

To enable mounting CVMFS on the cluster, you will first need to check which CSI driver version is installed on the cluster. The driver is installed automatically when using the CERN OpenStack resource provider, unless disabled during the creation of the cluster (either by un-checking the CVMFS CSI Enabled box, or through the argument --labels cvmfs_enabled=false when using the CLI).

# To check the CVMFS CSI version installed on the cluster

(k9s) [root@kike-dev example]# kubectl describe pod csi-cvmfsplugin-xfnqz -n kube-system | grep Image
    Image:         registry.cern.ch/magnum/csi-node-driver-registrar:v2.2.0
    Image ID:      registry.cern.ch/magnum/csi-node-driver-registrar@sha256:2dee3fe5fe861bb66c3a4ac51114f3447a4cd35870e0f2e2b558c7a400d89589
    Image:         registry.cern.ch/magnum/cvmfsplugin:v1.0.0
    Image ID:      registry.cern.ch/magnum/cvmfsplugin@sha256:409e1e2a4b3a0a6c259d129355392d918890325acfb4afeba2f21407652d58a5

If the cvmfsplugin is set to v1.0.0, you will need to upgrade the plugin to v>=2. Follow this tutorial.

In our case we use helm to perform the upgrade:

$ kubectl patch daemonset csi-cvmfsplugin -p '{"spec": {"updateStrategy": {"type": "OnDelete"}}}' -n kube-system 
$ vi values-v1-v2-upgrade.yaml
# contents of values-v1-v2-upgrade.yaml:
nodeplugin:

  # Override DaemonSet name to be the same as the one used in v1 deployment.
  fullnameOverride: "csi-cvmfsplugin"

  # DaemonSet matchLabels must be the same too.
  matchLabelsOverride:
    app: csi-cvmfsplugin

  # Create a dummy ServiceAccount for compatibility with v1 DaemonSet.
  serviceAccount:
    create: true
    use: true
$ helm upgrade cern-magnum cern/cvmfs-csi --version 2.0.0 --values values-v1-v2-upgrade.yaml --namespace kube-system

Once the upgrade has happened, you will need to set up a storage class and a PVC to mount CVMFS. Following the above tutorial plus the examples found in the same repo, the K8s manifest to be applied should look as shown below. On the ESCAPE cluster this was the manifest applied (to be updated to the VRE one):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cvmfs
provisioner: cvmfs.csi.cern.ch
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cvmfs
  namespace: jupyterhub
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      # Volume size value has no effect and is ignored
      # by the driver, but must be non-zero.
      storage: 1
  storageClassName: cvmfs

--> The csi-cvmfsplugin DaemonSet pods do not get scheduled on the JupyterHub user nodes because of the jupyter-role=singleuser:NoSchedule taint (using-a-dedicated-node-pool-for-users and Assigning Pods to Nodes). This was solved thanks to ticket INC3384022.

  • how to solve it: Add the following toleration to the DaemonSet deployed by the CSI-cvmfs helm chart (v2.0.0)
      tolerations:
      - key: "jupyter-role"
        operator: "Equal"
        value: "singleuser"
        effect: "NoSchedule"

Alternatively, kubectl patch daemonset csi-cvmfsplugin -n kube-system -p '{"spec": {"template": {"spec": {"tolerations": [{"key": "jupyter-role", "operator": "Equal", "value": "singleuser", "effect": "NoSchedule"}]}}}}' (or, to tolerate any value of the key, use {"key": "jupyter-role", "operator": "Exists", "effect": "NoSchedule"}) should also work.

Reana

  1. Apply the reana-release.yaml helm chart via flux, keeping the ingress disabled: the default Reana ingress is Traefik, while CERN OpenStack already deploys nginx as the ingress controller.
  2. If you are using your own DB instance, change the configuration with the DB name, host and port in the helm chart, delete the secret <your-reana-helm-release>-db-secrets, which contains the username and password, and re-apply your own as in infrastructure/scripts/create-reana-secrets.sh. NOTE THAT THE SECRET GETS UPDATED EVERY TIME YOU APPLY A RELEASE, so make sure the db secret is your own and not the default one whenever development requires you to re-apply changes.
    Initialise the DB as described in the helm chart. If you are using k9s, type :helm and press enter on the <your-reana-helm-release> name for instructions.
  3. Once the helm chart is applied correctly, add the DNS name (reana-vre.cern.ch) as a landb-alias to the ingress nodes, alongside the already existing jhub alias:
openstack server set --property landb-alias=jhub-vre--load-1-,reana-vre--load-1- cern-vre-bl53fcf4f77h-node-0 
openstack server set --property landb-alias=jhub-vre--load-2-,reana-vre--load-2- cern-vre-bl53fcf4f77h-node-1
openstack server set --property landb-alias=jhub-vre--load-3-,reana-vre--load-3- cern-vre-bl53fcf4f77h-node-2
  4. Apply the reana-ingress.yaml manually: the letsencrypt annotation should create the secret cert-manager-tls-ingress-secret-reana automatically.
  5. Configure your identity provider. For this, follow the initial instructions on https://github.com/reanahub/docs.reana.io/pull/151/files. For the IAM ESCAPE IdP the OpenID configuration is the following: https://iam-escape.cloud.cnaf.infn.it/.well-known/openid-configuration. The secrets of the IAM client acting on behalf of the application are stored in reana-vre-iam-client. You can then see that the users get created in the DB, and the release has a way to specify email notifications whenever a new user requests a token. A couple of useful commands to deal with users are:
$ export REANA_ACCESS_TOKEN=$(kubectl get secret reana-admin-access-token -n reana -o json | jq -r '.data | map_values(@base64d) | .ADMIN_ACCESS_TOKEN')

$ echo $REANA_ACCESS_TOKEN

# LIST USERS

$ kubectl exec -i -t deployment/reana-server -n reana -- flask reana-admin user-list --admin-access-token $REANA_ACCESS_TOKEN

# CREATING USER
kubectl exec -i -t deployment/reana-server -n reana -- flask reana-admin user-create --email <user-email> --admin-access-token $REANA_ACCESS_TOKEN

# GRANTING TOKEN TO NEW USER 
kubectl exec -i -t deployment/reana-server -n reana -- flask reana-admin token-grant -e <user-email> --admin-access-token $REANA_ACCESS_TOKEN
  6. Navigate to reana-vre.cern.ch and log in with your IAM credentials.

JupyterHub

JupyterHub is installed through the Z2JH Helm Chart. The domain is https://nb-vre.cern.ch/, which uses a Sectigo certificate.

The chart values are adjusted to use:

  • LoadBalancer as a service
  • IAM ESCAPE OAuth authentication
  • SSL/HTTPS with the domain and Sectigo certificate
  • DBOD postgres database

Secrets for IAM and Sectigo are stored in the same namespace.
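
In the VRE the chart is applied via flux; for reference, a direct installation would look roughly like this (the release name and values file name are assumptions):

helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jupyterhub --create-namespace \
  --values values.yaml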

Daskhub: scalable Dask deployment

This chart combines a JupyterHub deployment, a Dask deployment and a Dask Gateway to distribute workloads across the nodes of the cloud cluster. It can be accessed via nb-vre.cern.ch. Here are the instructions to set it up via the Helm chart in the repository.

  1. Apply namespace and chart: https://github.com/vre-hub/vre/tree/main/infrastructure/cluster/flux-v2/dask.

  2. Label the ingress nodes with the correct URL (nb-vre, as well as the already existing ones for jhub and reana):
     openstack server set --property landb-alias=jhub-vre--load-1-,reana-vre--load-1-,nb-vre--load-1- cern-vre-bl53fcf4f77h-node-0
     openstack server set --property landb-alias=jhub-vre--load-2-,reana-vre--load-2-,nb-vre--load-2- cern-vre-bl53fcf4f77h-node-1
     openstack server set --property landb-alias=jhub-vre--load-3-,reana-vre--load-3-,nb-vre--load-3- cern-vre-bl53fcf4f77h-node-2

  3. Create an IAM client (nb-vre-iam-client) with the redirect URL https://nb-vre.cern.ch/hub/oauth_callback.

  4. Apply daskhub secrets: https://github.com/vre-hub/vre/tree/main/infrastructure/secrets/dask.

  5. For the DB connection, create a new DB and user in your DB instance:

CREATE DATABASE dask;
CREATE USER dask WITH ENCRYPTED PASSWORD '<password>';
GRANT ALL PRIVILEGES ON DATABASE dask TO dask;
  6. Apply the release, which will request a certificate (with the letsencrypt service at CERN) for the service name nb-vre.cern.ch. If the certificate does not get issued, look into the errors with kubectl describe -n daskhub on, in order, all of these resources, which depend on one another: certificate < certificaterequest < issuer/clusterissuer < orders/challenges (a command sketch is given after the notebook example below).

  7. Navigate to the URL nb-vre.cern.ch and test in a notebook:

import dask
from dask_gateway import Gateway
from dask_gateway import GatewayCluster
import dask.array as da

# create cluster 

cluster = GatewayCluster()  # see the pod dask-scheduler being created in other nodes of the cluster
cluster.scale(4)            # see the pods dask-worker being created in other nodes of the cluster
cluster.adapt(minimum=2, maximum=20)

# inspect the active cluster 

gateway=Gateway()
gateway.list_clusters()

# execute a Dask computation 

x = da.random.random((10000, 10000, 10), chunks=(1000, 1000, 5))
y = da.random.random((10000, 10000, 10), chunks=(1000, 1000, 5))
z = (da.arcsin(x) + da.arccos(y)).sum(axis = (1,2))
z.compute()

# shut down the Dask cluster

cluster.shutdown()
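
For the certificate troubleshooting mentioned in step 6 above, a minimal sketch of the inspection commands (these are the standard cert-manager resource kinds; clusterissuer is cluster-scoped, so the namespace flag is ignored for it):

kubectl describe certificate -n daskhub
kubectl describe certificaterequest -n daskhub
kubectl describe issuer,clusterissuer -n daskhub
kubectl get orders,challenges -n daskhub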

Abbreviations

  • CSI: Container Storage Interface
  • PV: Persistent Volume
  • PVC: Persistent Volume Claim
  • SE: Storage Element