tensorflow

Latest commit: "Update checkpointing/tensorflow/README.md" by ChristinaLK and agitter, 4aa25c0, Mar 30, 2022


Checkpointing in TensorFlow with a conda environment and GPU support

This example uses a pre-configured conda environment to demonstrate one way to implement model checkpointing in TensorFlow.

The TensorFlow example downloads the MNIST dataset and trains a neural network on it for 20 epochs, saving a "checkpoint" every fifth epoch. If the job is interrupted at runtime, these checkpoints are preserved on the submit server and are used to resume training, so already-completed work is not lost.
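The resume logic described above can be sketched in plain Python. This is a minimal illustration and not the code in tf_checkpointing.py: the function names, the `train_one_epoch`, `save_model`, and `load_model` callbacks, and the assumption that checkpoint.txt holds a single integer (the next epoch to run) are all hypothetical.

```python
import os

TOTAL_EPOCHS = 20      # total training length, as described in this README
CHECKPOINT_EVERY = 5   # checkpoint after every fifth completed epoch


def read_start_epoch(path="checkpoint.txt"):
    """Return the epoch to resume from, or 0 when no checkpoint exists.

    Assumption: the file contains one integer, the next epoch to run.
    """
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip())


def write_checkpoint(next_epoch, path="checkpoint.txt"):
    """Record the next epoch to run.

    Writes to a temporary file and renames it, so an interruption
    during the write cannot leave a half-written checkpoint.txt.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_epoch))
    os.replace(tmp, path)


def train(train_one_epoch, save_model, load_model=None, path="checkpoint.txt"):
    """Run the training loop, resuming from the last checkpoint if present.

    train_one_epoch, save_model, and load_model are hypothetical callbacks;
    in a Keras program they might wrap model.fit, model.save, and
    keras.models.load_model respectively.
    """
    start = read_start_epoch(path)
    if start > 0 and load_model is not None:
        load_model()  # restore the weights saved at the last checkpoint
    for epoch in range(start, TOTAL_EPOCHS):
        train_one_epoch(epoch)
        completed = epoch + 1
        if completed % CHECKPOINT_EVERY == 0:
            save_model()                      # e.g. write checkpoint.h5
            write_checkpoint(completed, path)
    return start
```

With this layout, a job interrupted during epoch 13 restarts from epoch 10 on its next run, repeating at most four epochs of work.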

Checkpointing can be useful for training jobs that take multiple days or that run on resources that may be interrupted, like GPUs on CHTC backfill servers, campus pools outside of CHTC, or OSG. These additional GPU resources can be accessed by adding one or more of the following options to your submit file:

Is_resumable = true: access CHTC backfill servers
wantFlocking = true: access campus pools outside CHTC and CHTC backfill servers
wantGlideIn = true: access OSG, campus pools outside CHTC, and CHTC backfill servers

The example involves five files:

checkpoint.sub ## used to submit the job on HTCondor.
run.sh ## the executable called by checkpoint.sub.
tf_checkpointing.py ## the Python program with a checkpointing implementation.
tf_checkpointing.tar.gz ## the conda environment used by HTCondor for TensorFlow dependencies. It is hosted on a Squid web server and is not included in the repo.
environment.yml ## the environment file used to create the conda environment for this example. This file is not used by the job; the environment itself is packaged in tf_checkpointing.tar.gz.

Usage

Log into the HTC system.
Clone this repository: git clone https://github.com/CHTC/templates-GPUs.
cd into this folder: cd templates-GPUs/checkpointing/tensorflow
Because conda environments tend to be large, a Squid caching server is used to host the environment. You can view the environment .tar.gz file from the submit node at /squid/gpu-examples. For more information about using Squid, please review the CHTC guide: Large File Availability Via Squid.
Submit the sample job: condor_submit checkpoint.sub.
Upon completion, HTCondor will return a zipped model file, model.tar.gz, along with checkpoint files checkpoint.h5 (checkpointed model) and checkpoint.txt (which epoch to resume training on).
In the event of a runtime error, HTCondor will return only checkpoint.h5 and checkpoint.txt, assuming it has reached at least the first checkpoint.
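After a job returns, the two outcomes above can be told apart by checking which files are present. The helper below is a hypothetical sketch, not part of this repository; it assumes the returned files land in one directory and that checkpoint.txt contains the resume epoch as plain text.

```python
from pathlib import Path


def describe_job_output(directory="."):
    """Classify a returned job by the files HTCondor brought back.

    Hypothetical helper: file names follow this README, and the
    contents of checkpoint.txt are assumed to be a bare epoch number.
    """
    d = Path(directory)
    if (d / "model.tar.gz").exists():
        return "complete"
    has_checkpoint = (d / "checkpoint.h5").exists() and (d / "checkpoint.txt").exists()
    if has_checkpoint:
        epoch = (d / "checkpoint.txt").read_text().strip()
        return f"interrupted; resume from epoch {epoch}"
    return "no checkpoint yet"
```

Calling `describe_job_output()` from the submit directory gives a quick summary before deciding whether to resubmit.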
