Learn How to Deploy R Calculations on HTC

This is a mini tutorial for how to deploy and run R calculations on CHTC's High Throughput Computing (HTC) system.

Using data from the NOAA Global Historical Climatology Network, this tutorial generates histograms showing the distribution of daily high and low temperatures across the four meteorological seasons.

Example Calculation Using RStudio (Demo)

This tutorial provides some example scripts for analyzing data from the NOAA Global Historical Climatology Network.

Transitioning from RStudio to CHTC

What is the list of data we want to iterate through?

If we run this script locally, there are a few things that we want to change or note in order to run it in CHTC:

  • Remove the for-loop
  • Add arguments
  • Note what software/dependencies we're using
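
As a rough sketch of the first two changes, assume the local script loops over a list of station names. On the HTC system the loop moves out of the R script: each job receives a single station as a command-line argument. The function below only prints the command that each job would run and does not call R; job_commands and the station name station_b are hypothetical, chosen for illustration.

```shell
# Print the command line that each job would run, one line per station.
# Station names are passed as arguments; nothing is executed.
job_commands() {
    for station in "$@"; do
        printf 'Rscript example.R %s\n' "$station"
    done
}

# Example: job_commands madison station_b
```

The loop here stands in for the job list; inside the R script itself there would be no loop left, only the handling of one station.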

Find Dependencies

Now that you've successfully run the example calculation using RStudio on your computer, let's consider how to migrate the calculation to work on the High Throughput Computing (HTC) system.

To execute the example calculation, you needed

  • The R script example.R, which contained the commands you wanted to execute.
  • The additional my_functions.R and .csv station data files, which were read in by the example.R script.
  • The R console version 4.4.2 with the tidyverse package installed, which was the software environment for executing example.R.

To run the example calculation on the HTC system, you will need the same set of items:

  • The "input files": additional files needed in order for the shell command to function.
  • The "software environment": programs that are used to run, or are required by, the shell command.

For this tutorial, we'll clone the git repository to the system. But you can also upload files directly from your computer, as described in our guide Transfer Files between CHTC and your Computer.

The traditional way of handling the "software environment" can be rather complicated. For most new users, we recommend instead that they use something called a "container". For now, think of containers as a plug-and-play method for deploying software.

Tip

Your R project may be structured differently or may be more extensive than the project in this tutorial. For guidance on how to identify the necessary information for migrating your project to CHTC, see the notes below.

Logging into CHTC

Before proceeding, we need to make sure that you can log in to the HTC system. Access to CHTC systems is currently only via the command line (aka terminal) using the SSH protocol.

First, open the "Terminal" application on your computer.

Note

Mac and Linux operating systems come with a unix "Terminal" application by default. Windows 11 comes with a powershell "Terminal" application, but older Windows machines may need to install it manually: Windows Terminal. If you are unable to install software on your machine, then you should still be able to use the "PowerShell" or "Cmd" applications instead.

Warning

Technically, RStudio comes with a built-in "Terminal" that can be accessed via a tab in the bottom left "Console" pane. While you can use that to log in to CHTC, we do not recommend it, as it becomes too easy to run commands on the login server that you meant to run on your computer!

You should see something like this (colors, font, and size will likely differ):

Terminal window, blank

Next, you'll enter the following command (with the information unique to your account):

ssh yourNetID@hostname

where yourNetID should be replaced with your actual NetID, and hostname should be replaced with the address provided in your account confirmation email. For example, if your NetID is bbadger and your account is on hostname ap2002.chtc.wisc.edu (where most new user accounts are located), the command would be ssh bbadger@ap2002.chtc.wisc.edu.

Note

You will need to be on the university internet for the command to work! That means you either need to be physically on campus, or else connected to the GlobalProtect VPN (WiscVPN).

The first time you connect to a server via SSH, you will be prompted to confirm that you trust the server. Most of the time, it is okay to enter "yes".

If you are concerned about the security of your connection to CHTC, please contact a facilitator for more information.

When prompted for your password, enter the same password you use to log in with your NetID to other university services, such as MyUW (my.wisc.edu).

Finally, you will be prompted to complete the two-factor authentication using DUO.

After the two-factor authentication has been confirmed, you should be logged in to the HTC system. A welcome message containing the following should be displayed in your terminal:

Terminal window, with SSH command, DUO prompt, and start of welcome message

There will be some additional information in the welcome message, but at the bottom of the screen your terminal prompt should look like

[yourNetID@ap2002 ~]$ 

Tip

For more information on logging in to the system, see Log in to CHTC.

Copying Files to CHTC

Once you are logged in, duplicate the tutorial materials to your directory on the HTC system with the following command:

git clone https://github.com/CHTC/tutorial-rstudio-to-chtc

Now run the command ls (lowercase "L" and lowercase "S") to list the contents of your directory on the HTC system:

ls

You should see a directory called tutorial-rstudio-to-chtc, corresponding to the GitHub repository that you just cloned.

Next, run the command cd followed by the directory name (tutorial-rstudio-to-chtc) to change directory:

cd tutorial-rstudio-to-chtc

By running this command, you've changed your location on the server to be inside of the GitHub repository. Your command prompt typically shows the name of the directory you are located inside; in this case, you should see

[yourNetID@ap2002 ~]$ cd tutorial-rstudio-to-chtc
[yourNetID@ap2002 tutorial-rstudio-to-chtc]$

Running the ls command here should show you the contents of the GitHub repository for this tutorial.

Remote terminal, showing git clone command and navigating to git repo directory

Tip

For more information on how to use the command line, see our guide Basic shell commands.

Run Example Calculation as a Test Job

All that we need to do is create a "submit file" that will describe to HTCondor how to run our calculation on one of the input files.

About the submit file

The submit file describes to HTCondor the calculation (or "job") that we want to submit. Just like how the example.R file describes to R the commands that you want to execute within the R language, the submit file describes to HTCondor how it should execute the example.R file on a computer within the HTC system.

To start with, the submit file will need to detail the items discussed in Transitioning from RStudio:

  • The shell line contains the commands you want to execute (as if running them locally).

    shell = Rscript example.R $(station)
    
  • The "input files" (including scripts!) needed for the command to run.

    transfer_input_files = example.R, my_functions.R, input/$(station).csv
    
  • The "software environment" with the programs that are used to run, or are required by, the shell command.

    container_image = docker://rocker/tidyverse:4.4.2
    

Since you'll be asking HTCondor to execute the calculation on a remote machine, there are a few more items that need to be declared as well:

  • A job management "log" for keeping track of HTCondor's actions.

    log = example.$(station).log
    
  • Standard "output" and "error" files to record the messages that would normally be printed to your screen.

    output = example.$(station).out
    error = example.$(station).err
    
  • A set of resource "requests" for the amount of computing power that should be used.

    request_cpus = 1
    request_memory = 2GB
    request_disk = 5GB
    

Finally, since HTCondor is designed for high throughput computing, you can define the number of calculations (or jobs) that you want it to run on your behalf. This is ALSO where we will define the $(station) variable shown above.

  • The "queue" statement

    queue station from ( madison )
    

To create a submit file for our example, we just need to combine all of these lines into one file. The order of the lines is a matter of preference, with the exception of the queue statement - that must always come last.
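
If you want to confirm the ordering rule before submitting, a small check along these lines may help. It is only a sketch: queue_is_last is a hypothetical helper name, and the check simply looks at the last line that is neither blank nor a comment.

```shell
# Succeed only if the last meaningful line of a submit file is a queue statement.
# Blank lines and comment lines (starting with #) are ignored.
queue_is_last() {
    last_line=$(grep -v -E '^[[:space:]]*(#|$)' "$1" | tail -n 1)
    printf '%s\n' "$last_line" | grep -q -i -E '^[[:space:]]*queue([[:space:]]|$)'
}

# Example: queue_is_last example.sub && echo "queue statement is last"
```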

Create the submit file

We'll use a command-line text editor to create the submit file. Run the following command to open a new file called "example.sub":

nano example.sub

Your terminal will open a blank file into which you can type the contents of a file. You move the cursor in the file using the arrow keys. Keyboard shortcuts for other operations are listed at the bottom of the screen, where the ^ represents the Ctrl (or Control on Mac) key and the M- represents the Alt (or Option on Mac) key.

nano, empty file

Copy and paste the following contents into the terminal. If you are having trouble pasting into the terminal, take a few minutes to type the contents in manually.

container_image = docker://rocker/tidyverse:4.4.2

shell = Rscript example.R $(station)

transfer_input_files = example.R, my_functions.R, input/$(station).csv
# transfer_output_files = 
transfer_output_remaps = "$(station).png=results/$(station).png"

log = logs/example.$(station).log
output = logs/example.$(station).out
error = logs/example.$(station).err

request_cpus = 1
request_memory = 2GB
request_disk = 5GB

queue station from ( madison )

To tell nano to save the contents of the file, use the ^O shortcut (Ctrl key and the letter O key together). You'll be asked to confirm the file name - make sure that it is example.sub before hitting the Enter key to confirm.

nano, with example.sub showing write-out command prompt

Finally, close the text editor using the ^X shortcut (Ctrl key and the letter X key together). Your command prompt will return, and entering the command ls will show a new file called example.sub.

You can check the contents of the submit file by running this command:

cat example.sub

Submit the job

Now that you've described to HTCondor how to run your calculation, all that's left to do is ask HTCondor to actually run your calculation.

To do so, run the following command:

condor_submit example.sub

This tells HTCondor to use the information in the example.sub file to create the corresponding job(s) in your queue. The output of this command will be the number of jobs in the submission as well as a unique ID. This ID (referred to as the batch or cluster ID) can be used to identify and select jobs that correspond to this submission.

Remote terminal, showing output of "condor_submit example.sub" command

Monitor the job

For a snapshot of the jobs in your queue, use the command

condor_q

For live updates of the jobs in your queue, use the command

condor_watch_q

This will give a live update of the status of your job(s) in the queue, with progress bars and with colors to indicate the different job states. To exit the live view, use the ^C shortcut (Ctrl key and the letter C key together).

Note that completed jobs will not show up in condor_q output, and will only show up in condor_watch_q if the jobs were in the queue when the command was initially run.

Remote terminal, showing output of "condor_q", "condor_watch_q" commands for monitoring single job

Tip

For more information on monitoring jobs, see our guide Learn About Your Jobs Using condor_q.

The job lifecycle

When you run the condor_submit command, you are asking HTCondor to manage the execution of the corresponding job(s) on your behalf. That is, HTCondor will handle everything for you without you needing to intervene (assuming nothing goes wrong). This also means that you do not need to be logged in once you've submitted the jobs.

So what is happening behind the scenes?

  1. The job is submitted. HTCondor parses what the job needs to function based on the contents of the submit file.

  2. The job is "idle". HTCondor is trying to find a machine (an execution point or "EP") capable of running your job. This is commonly referred to as "matchmaking".

  3. The job is matched. HTCondor finds an available execution point and claims it for your job, then begins preparations for running the job.

  4. Input files are transferred. HTCondor then transfers the files needed for the job to function, as declared in your submit file.

    The list of files transferred includes the items defined in the transfer_input_files and container_image options of the submit file, plus the executable if your submit file defines one. All the files will be located in a temporary directory unique to the job. This step is important because the execution point running the job does NOT have access to your files on the server where you submitted the job!

  5. The job is "running". In the temporary directory unique to the job, HTCondor executes the command from the shell line of your submit file (or the script listed as the executable, if you use that option instead).

    If a container_image is specified, the command will have access to the software installed inside of the container. Messages that would normally be printed to the screen when the command is executed will instead be saved to the output and error files you specified in the submit file.

  6. Output files are transferred. When the command stops running (regardless of whether it failed or succeeded), HTCondor tries to transfer back output files.

    If transfer_output_files is not defined in the submit file, the default is to transfer back any new or changed file in the top level of the job's temporary directory. (Files in sub-directories will be ignored.) The files will be returned to the same directory on the server where you ran the condor_submit command.

  7. The job is "done". If the output files are transferred successfully, the job is marked as "done". HTCondor then removes the job from your queue and creates a record in its history.
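
To get a feel for the default output rule in step 6 (new or changed files in the top level only), you can imitate it locally. The sketch below is an illustration and not how HTCondor implements the transfer; snapshot and changed_files are hypothetical helper names. It compares checksums of top-level files taken before and after a run.

```shell
# Record a checksum line for every regular file in the top level of a directory.
snapshot() {
    ( cd "$1" && for f in *; do [ -f "$f" ] && cksum "$f"; done | sort )
}

# Print the names of files that are new or changed between two snapshots.
# $1 = snapshot taken before the run, $2 = snapshot taken after the run
changed_files() {
    comm -13 "$1" "$2" | cut -d ' ' -f 3-
}

# Example:
#   snapshot . > /tmp/before.txt
#   Rscript example.R madison
#   snapshot . > /tmp/after.txt
#   changed_files /tmp/before.txt /tmp/after.txt
```

Files written into subdirectories never appear in either snapshot, which mirrors the note that sub-directories are ignored by default.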

Note

What if something goes wrong?

If the problem is something that HTCondor knows how to handle, the job is typically reset to the "idle" state so as to try again. If HTCondor doesn't know how to handle the problem, the job is reset to the "idle" state and then placed into the "hold" state with a message about the problem.

Note that HTCondor doesn't care if your script has an error. A job may still end up being marked as "done", even though it didn't do what you wanted! It's up to you to check the output, error, and any other files to confirm that your script executed as you intended.

Checking the results

Once the job you submitted is marked as done in the condor_watch_q output, or the job no longer appears in the output of condor_q or condor_watch_q, it has completed.

Run the ls -R command to list the files in your directory and its subdirectories. Once the job is completed, you should see the following new files: example.madison.log, example.madison.out, and example.madison.err in the logs/ directory, and madison.png in the results/ directory.

(You may also see a file called docker_stderror, which you can ignore.)

The contents of example.madison.out should have the "normal" output messages for the script. You can use the command

head logs/example.madison.out

to print the first 10 lines of the file, or use

cat logs/example.madison.out

to print all the lines in the file.

Remote terminal, showing the new files and the first 10 lines of the output file

Next, make sure that there are no error messages by running

cat logs/example.madison.err

In this case, we see a bunch of messages that we would normally see in the console in RStudio, but none of these messages are breaking errors. That is because a lot of software programs will use the "error" message channel to report additional information that is not considered "output". But if something goes wrong with your job, there will likely be a proper error message in this file.

Remote terminal, showing the top part of the error file
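
When there are many error files, reading each one by hand gets tedious. As a heuristic sketch, the function below lists the .err files that contain a line starting with "Error" or the "Execution halted" message that Rscript prints when it stops on an error. The name find_failed_logs is hypothetical, and a clean result does not prove that the script did what you intended.

```shell
# List the .err files in a directory that look like they contain an R error.
# Matches lines starting with "Error" or "Execution halted".
find_failed_logs() {
    grep -l -E '^(Error|Execution halted)' "$1"/*.err 2>/dev/null
}

# Example: find_failed_logs logs
```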

Tip

If you want to view the .png image files that were created, you'll need to download them to your computer. For instructions on how to do so, see our guide Transfer Files between CHTC and your Computer.

Run Example Calculation as Multiple Jobs

This last step is the easiest! Now we want to run three jobs, one for each of our data files.

To understand why you might want to do this, consider a more realistic example: instead of 3 datasets that only take seconds to analyze, what if you had 1,000 datasets where each one took 10 hours to analyze? A single for-loop to analyze all 1,000 datasets would take 10,000 hours (more than a year) to run! By having a separate job for the analysis of each dataset, the time to completion becomes however long it takes to run 1,000 such jobs. If there were enough computers to run all 1,000 jobs at roughly the same time, the time to completion would only be 10 hours! (In practice, the time to completion would probably be closer to a week or two, but that is still much faster than the single for-loop.)
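
The arithmetic can be sketched as follows, under the simplifying assumptions that every job takes the same time and each machine runs one job at a time. The function name hours_to_finish and the 50-machine case are illustrative choices, not figures for the HTC system.

```shell
# Hours needed to finish n_jobs jobs of hours_each hours on n_machines machines,
# assuming one job per machine at a time and equal job lengths.
hours_to_finish() {
    n_jobs=$1; hours_each=$2; n_machines=$3
    rounds=$(( (n_jobs + n_machines - 1) / n_machines ))   # ceiling division
    echo $(( rounds * hours_each ))
}

hours_to_finish 1000 10 1      # single for-loop: 10000
hours_to_finish 1000 10 1000   # one machine per job: 10
hours_to_finish 1000 10 50     # 50 machines: 200
```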

Modify the submit file

All we need to change is the final queue statement in example.sub.

Instead of the queue statement that lists a single station in parentheses, we will use one that reads the station names from a file:

queue station from station_list.txt
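
For your own data you may need to build such a list yourself. The sketch below prints one name per line for each .csv file in a directory. Whether the names should keep the .csv extension depends on how your transfer_input_files line uses $(station), so the function supports both; list_stations is a hypothetical helper name.

```shell
# Print one station name per line for every .csv file in a directory.
# Pass "yes" as the second argument to keep the .csv extension.
list_stations() {
    dir=$1; keep_ext=${2:-no}
    for f in "$dir"/*.csv; do
        [ -e "$f" ] || continue          # no .csv files: print nothing
        name=$(basename "$f")
        [ "$keep_ext" = yes ] || name=${name%.csv}
        printf '%s\n' "$name"
    done
}

# Example: list_stations input > station_list.txt
```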

Tip

For more information about setting up a submit file for multiple jobs, see our guide Submitting Multiple Jobs Using HTCondor.

Submit multiple jobs

You'll use a similar command as before to submit the jobs to HTCondor:

condor_submit example.sub

You'll see a message that 3 jobs have been submitted, as well as the unique ID for the submission. Each job will be managed independently of the others.

Use condor_q and condor_watch_q to monitor the progress of your jobs.

Remote terminal, showing output of "condor_submit", "condor_q", "condor_watch_q" commands for multiple jobs

Once completed, there should be a .log, .out, and .err file in the logs directory and a .png file in the results directory for each of the datasets. Take a look at the files to make sure that everything ran as expected.

Remote terminal, showing directory and file contents after multiple jobs completed
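
With more than a handful of jobs, a scripted check is easier than reading a directory listing. The sketch below assumes the file layout from the submit file in this tutorial (logs/example.<station>.out and results/<station>.png) and reports anything that is missing; missing_outputs is a hypothetical helper name. Run it from the directory where you ran condor_submit.

```shell
# For each station in a list file, report any expected output file that is missing.
# Assumed layout: logs/example.<station>.out and results/<station>.png
missing_outputs() {
    while IFS= read -r station || [ -n "$station" ]; do
        [ -n "$station" ] || continue    # skip blank lines
        for f in "logs/example.$station.out" "results/$station.png"; do
            [ -e "$f" ] || printf 'missing: %s\n' "$f"
        done
    done < "$1"
}

# Example: missing_outputs station_list.txt
```

No output means every station in the list has both files.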

Next Steps

Now that you've finished this tutorial, you are ready to start transitioning your own R project to be run on the HTC system. But unless your R project is fairly simple, there are a few more things you'll need to work on to get up and running.

For a full walk-through of how to get started on the HTC system, see our guide Roadmap to getting started.

Software

This tutorial used a pre-existing container that came with R 4.4.2 and tidyverse packages already installed. If that is all you need, then you're in luck! Just use the same container_image line in your submit file.

If you're like most users, however, then you have additional R packages that you want to use in your scripts. To make those packages available for use in your HTC job, we recommend that you build your own container. While that may sound like a daunting task, we have a lot of documentation and examples to help you get started, and the facilitation team is happy to help with any questions or issues.

Our recommendation for most users is to use "Apptainer" containers for deploying their software. For instructions on how to build an Apptainer container, see our guide Use Apptainer Containers. If you are familiar with Docker, or want to learn how to use Docker, see our guide Running HTC Jobs Using Docker Containers.

For examples of containers that you can use or modify, see the R section of our Recipes GitHub repository.

This information can also be found in our guide Overview: How to Use Software.

Data

The ecosystem for moving data to, from, and within the HTC system can be complex, especially if trying to work with large data (> gigabytes). For guides on how data movement works on the HTC system, see the "Manage data" section of our HTC guides page.

GPUs

If your R project is capable of using GPUs, and you would like to use the GPUs available on the HTC system, see our guide Use GPUs.

Getting Help

CHTC employs a team of Research Computing Facilitators to help researchers use CHTC computing for their research.

  • Web guides: HTC Computing Guides - instructions and how-tos for using the HTC system.
  • Email support: get help within 1-2 business days by emailing [email protected].
  • Virtual office hours: live discussions with facilitators - see the "Get Help" page for current schedule.
  • One-on-one meetings: dedicated meetings to help new users and groups get started on the system; email [email protected] to request a meeting.

This information, and more, is provided in our Get Help page.

Appendix: Preparing to transition an R project from your computer to CHTC

Locate your R scripts

Identify the R scripts that you use to run your calculation. Typically you'll have one main R script that is the entry point to your program, and for simple programs this will be the only script. You can use the "Files" pane to navigate the files in your R project.

In this tutorial, the main script was example.R. But we also need the script my_functions.R, since it is loaded by example.R.

For your project, you may have other scripts. If you are not sure which or if any of the scripts are needed, take a look at your main R script and see if it references any of the other scripts. Ideally all of the scripts you use in your calculation are in the same folder (or a subfolder thereof). If not, you should consider reorganizing your scripts into the project directory.

Locate your other input files

Identify the input files besides your R scripts that your calculation needs to function. To start with, consider what is needed to run a single example calculation.

In this tutorial, we needed the dataset .csv files as input for calculations, and the files were located in the same directory as the R scripts.

If your input files are not in the same directory as your R scripts, you may want to consider consolidating them into the project directory, at least for one example calculation.

Check your R scripts for "absolute paths"

If your R scripts reference or load other files, or write output to files, you should check whether they use "absolute paths". If so, you'll want to rewrite your program to use "relative" paths. (This is another reason you'll want to consolidate your files into the project directory.)

You will likely need to test that your program still functions as expected.
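
One way to look for absolute paths is a text search over your scripts. The sketch below is a heuristic and may produce false positives or miss paths built up in code: it flags quoted strings that begin with /, ~/, or a drive letter such as C:/. The name find_absolute_paths is hypothetical.

```shell
# Print lines (with file name and line number) in the given R scripts that contain
# a quoted string starting like an absolute path: "/...", "~/...", or "C:/...".
find_absolute_paths() {
    grep -n -H -E "[\"'](/|~/|[A-Za-z]:[/\\\\])" "$@"
}

# Example: find_absolute_paths example.R my_functions.R
```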

Tip

For more information about "absolute" and "relative" paths, see the note below (About paths).

Find your version of R

There are several ways of finding the version of R that you are using in your project. Use one or more of them to identify the version, which will be in the pattern X.Y.Z.

In the examples below, the version number is 4.4.2. Make sure you note your specific version number.

To minimize the chance of discrepancies, you'll want to use the same version to run your calculations on the HTC system.

Console

When you open the console, the very first line contains the version of R, which looks like this:

R version 4.4.2 (2024-10-31 ucrt) -- "Pile of Leaves"

Packages pane

In a box on the right side should be a "Packages" tab that you can click on to open the Packages pane. This pane lists packages that are installed (checked box) or that are available to be installed (unchecked box) in your R environment.

Scroll down to the "System Library" section and look for the "base" package, and note the number in its "Version" column. This corresponds to the version of R you are using in your environment.

Command

You can programmatically identify the version of R that you are using by entering the following command in the R console:

R.version.string

This will print something like the following:

[1] "R version 4.4.2 (2024-10-31 ucrt)"

This command can be used wherever you are using R, which makes it useful in scenarios that don't involve RStudio.
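
Once you have the version string, the X.Y.Z number is what goes into the container tag. As a sketch, the shell function below pulls the number out of the string; r_version is a hypothetical helper name, and whether a rocker/tidyverse tag is published for your particular version is an assumption you should verify.

```shell
# Pull the X.Y.Z version number out of a string like the output of R.version.string.
r_version() {
    printf '%s\n' "$1" | sed -n -E 's/.*R version ([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
}

r_version 'R version 4.4.2 (2024-10-31 ucrt)'     # prints 4.4.2

# Build the matching container_image line for a submit file.
echo "container_image = docker://rocker/tidyverse:$(r_version 'R version 4.4.2 (2024-10-31 ucrt)')"
```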

Identify your R packages

Identify the R packages that your project uses, so that later you can reproduce the environment on CHTC.

To start, make a list of the packages that you load in your R scripts, which is generally done using library('<package_name>') commands. Then, look in the "Packages" pane to identify the corresponding versions of the packages. Usually the package names alone are enough, but sometimes the versions of the packages can matter as well (see About versions).

If you'd rather not do this manually, you can install and use a package called renv to not only automatically detect the packages you are using, but to also create files that can be used to replicate the environment automatically when building a container. For more information, see the renv recipe in the Recipes repository: https://github.com/CHTC/recipes/tree/main/software/R/renv.
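
For the manual route, a text search can produce a first draft of the package list. The sketch below collects the names inside library(...) and require(...) calls; list_loaded_packages is a hypothetical helper name. It will miss packages used only through the pkg::function form and will also pick up calls that appear in comments, so treat the result as a starting point.

```shell
# List the unique package names loaded with library(...) or require(...)
# in the given R scripts.
list_loaded_packages() {
    grep -h -o -E "(library|require)\([\"']?[A-Za-z][A-Za-z0-9.]*" "$@" |
        sed -E "s/^(library|require)\([\"']?//" |
        sort -u
}

# Example: list_loaded_packages example.R my_functions.R
```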

About paths

An "absolute path" is used to reference the location of a file in relation to the "root" directory of your computer. This is fine when your program is running on your computer, but can break the program if you try to run it on a different computer whose files are organized differently from yours.

A "relative path" is used to reference the location of a file in relation to where the current script is running. This is useful when you need to run your program on different computers.

The absolute path to the dataset file on a Windows machine may look like C:/Users/bbadger/Documents/REPONAME/madison.csv, while on a Mac machine the path may look like /Users/bbadger/Documents/REPONAME/madison.csv.

A relative path starts from the current working directory, and defines the location of the file in relation to that. Such a path may look like ./madison.csv or ../data/madison.csv. Here, the . represents the current directory, while .. represents the parent directory. You can chain together several .. to go several directories upwards in the file system.

Consider for example the following folder structure:

project/
├── data/
│   ├── 2023/
│   │   └── raw_data.csv
│   └── 2024/
│       └── raw_data.csv
└── scripts/
    └── v1/
        └── program.R

The script program.R can reference the 2024 raw_data.csv file using this relative path: ../../data/2024/raw_data.csv.
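
You can try this out in a terminal without touching your real project. The sketch below recreates the example layout in a temporary directory and resolves the relative path, assuming the working directory is the folder that contains program.R (a relative path is resolved from the working directory, which is not necessarily where the script file lives).

```shell
# Recreate the example layout in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/project/data/2023" "$demo/project/data/2024" "$demo/project/scripts/v1"
echo "2024 data" > "$demo/project/data/2024/raw_data.csv"
touch "$demo/project/data/2023/raw_data.csv" "$demo/project/scripts/v1/program.R"

# From scripts/v1, go up two directories and then down into data/2024.
( cd "$demo/project/scripts/v1" && cat ../../data/2024/raw_data.csv )   # prints: 2024 data
```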

About versions

Most software uses the Major.Minor.Patch versioning syntax.

  • Major version number - A change in this number signals major changes in the software, and commands that worked in the previous version may not work in the new version.
  • Minor version number - A change in this number signals additional features or enhancements. Commands in previous versions should work fine in later versions, though there may be superficial changes.
  • Patch version number - A change in this number signals that bugs have been fixed. There should be no change, superficially or functionally, other than those resulting from correcting the bugs.

Does the version number matter?

To a certain extent, yes. If your code was written for Major version X, there's no guarantee it will function for a different Major version, so you should continue to use Major version X. Code written for Minor version Y should function for Minor versions >= Y, but there may be superficial changes you might want to avoid, so it's up to you whether or not to be consistent. You should always use the latest Patch Version; if there is a discrepancy in your results between two Patch versions, that is (hopefully) because a bug that affected the results has been fixed. (It is also possible another bug has been introduced - either way, you should investigate the nature of the bug fixes.)
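
If you want to compare the version on your computer with the version in a container, the components can be split apart in the shell. This is a minimal sketch that assumes plain X.Y.Z strings; same_major and minor_of are hypothetical helper names.

```shell
# Succeed if two X.Y.Z version strings share the same Major version.
same_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Print the Minor component of an X.Y.Z version string.
minor_of() {
    rest=${1#*.}          # drop the leading "X."
    echo "${rest%%.*}"    # keep "Y"
}

same_major 4.4.2 4.3.1 && echo "same Major version"   # prints: same Major version
minor_of 4.4.2                                        # prints: 4
```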
