Managing Data
Some of the data sets interfaced by skdata are hundreds of megabytes or several gigabytes in size on disk. In order to use them, it is necessary first to either download or locate a local copy of the data set. This page describes how that happens and where those files go.
Data sets that require downloading external data (i.e. most of them) use the
mechanism implemented in skdata.data_home. This module exports a function called
get_data_home that identifies a root directory, within which data set modules
will create subdirectories to store downloaded files.
This directory defaults to "~/.skdata" on Unix-like machines, but it can be
configured by a "$SKDATA_ROOT" environment variable.
(See the docstring in data_home.py for details.)
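The lookup described above can be sketched in a few lines. This is a minimal illustration and not the code in data_home.py: the names resolve_data_home and dataset_dir are hypothetical, and the assumption that an empty "$SKDATA_ROOT" falls back to the default is mine. Consult the docstring in data_home.py for the actual behaviour.

```python
import os


def resolve_data_home(env=None):
    """Return the root directory for downloaded data sets (illustrative only)."""
    # Hypothetical helper: accepts a mapping so it can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    # Assumption: a non-empty SKDATA_ROOT overrides the default location.
    root = env.get("SKDATA_ROOT") or os.path.join("~", ".skdata")
    # Expand "~" to the current user's home directory and normalise the path.
    return os.path.abspath(os.path.expanduser(root))


def dataset_dir(name, env=None):
    """Subdirectory in which a data set module called `name` would keep its files."""
    return os.path.join(resolve_data_home(env), name)
```

With this sketch, dataset_dir("imagenet") would point at the "imagenet" subdirectory of whichever root is in effect.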
If you want to split data sets across different filesystems, computers, etc.,
then you should address that at the filesystem level. skdata's file-locating
logic does not currently support such arrangements.
Your main mechanism for supporting such file layouts is symlinks. If some data
set (e.g. imagenet or hollywood2) is too large to fit on your "/home"
filesystem, or you want to share a copy with other users via a networked
filesystem, then consider either (a) replacing your own "~/.skdata/imagenet"
folder with a symlink or else (b) configuring skdata to look for a data root
directory at a different location than your "~/.skdata".
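Option (a) can be done by hand with mv and ln -s, or scripted. The following is a minimal sketch using only the Python standard library; relocate_dataset is a hypothetical helper, not part of skdata, and it assumes you have write access to both locations.

```python
import os
import shutil


def relocate_dataset(data_home, name, new_root):
    """Move data_home/name to new_root/name and leave a symlink in its place."""
    link = os.path.join(data_home, name)
    target = os.path.join(new_root, name)
    os.makedirs(data_home, exist_ok=True)
    os.makedirs(new_root, exist_ok=True)
    if os.path.islink(link):
        # Already a symlink: leave it alone and report where it points.
        return os.path.realpath(link)
    if os.path.isdir(link):
        if os.path.exists(target):
            # Refuse to merge two copies of the same data set.
            raise FileExistsError(target)
        shutil.move(link, target)
    else:
        # Nothing downloaded yet: create (or reuse) the target directory.
        os.makedirs(target, exist_ok=True)
    os.symlink(target, link, target_is_directory=True)
    return target
```

For example, relocate_dataset(os.path.expanduser("~/.skdata"), "imagenet", "/mnt/big/skdata") would leave "~/.skdata/imagenet" as a link into the larger filesystem; the "/mnt/big/skdata" path is only a placeholder.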
Generally, the way to get rid of all files created by a data set module is to delete the directory with the same name as the module from the "~/.skdata" directory (or wherever you configured it to be with "$SKDATA_ROOT").
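As a rough sketch of that cleanup, the hypothetical helper below (not part of skdata) removes one data set's directory from a given data root. It assumes that when the directory is a symlink you only want to drop the link and keep the files it points at, which may be shared with other users.

```python
import os
import shutil


def remove_dataset_files(data_home, name):
    """Delete data_home/name; return True if something was removed."""
    path = os.path.join(data_home, name)
    if os.path.islink(path):
        # Remove only the link, not the (possibly shared) files behind it.
        os.unlink(path)
        return True
    if os.path.isdir(path):
        # Recursively delete everything the data set module stored here.
        shutil.rmtree(path)
        return True
    return False
```

Deleting the directory this way means the data set will be downloaded again the next time it is needed.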
Some data sets may offer scripts in their "main.py" files for deleting temporary files to free up space without erasing the files that were downloaded. Check the data set in question if you want to free up space but avoid a future re-download.