This repository contains a demo of running LensKit experiments on public data sets, following current best practices for moderately-sized experiments.
It uses DVC to script the pipeline, and is laid out in several subcomponents:
- `lkdemo` is a Python package containing support code (e.g. log configurations) and algorithm definitions. Two files are of particular interest:
  - `lkdemo/algorithms.py` defines the different algorithms we can train, with sensible default configurations.
  - `lkdemo/datasets.py` defines the different data sets, so that any supported data set can be loaded into the format LensKit expects in a uniform fashion.
- `data` contains data files and controls.
- `data-split` contains cross-validation splits, produced by `split-data.py`. These splits only contain the test files, to save disk space; the train files can be obtained with `lkdemo.datasets.ds_diff`, as seen in `run-algo.py`.
- `runs` contains the results of running LensKit train/test runs.
- Various Python scripts to run individual pieces of the analysis. They use `docopt` for parsing their arguments and thus have comprehensive usage docs in their docstrings.
- Jupyter notebooks to analyze results. These are parameterized and run with Papermill to analyze different data sets with the same notebook.
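
The `data-split` item above says that only test files are stored and the train files are derived from them. As a rough illustration of that idea (not the actual code of `lkdemo.datasets.ds_diff`, whose signature and data format are not shown here), the sketch below removes the held-out rows from the full rating list. The function name `train_from_test` and the `user`/`item`/`rating` record layout are assumptions made for the example.

```python
# Hypothetical sketch: recover training rows by removing test rows
# from the full data set. This is NOT lkdemo.datasets.ds_diff; the
# function name and record layout are assumed for illustration.


def train_from_test(ratings, test):
    """Return the rows of `ratings` whose (user, item) pair is not in `test`."""
    # Collect the held-out (user, item) pairs for fast membership checks.
    held_out = {(r["user"], r["item"]) for r in test}
    # Keep every rating that was not held out, preserving the original order.
    return [r for r in ratings if (r["user"], r["item"]) not in held_out]


ratings = [
    {"user": 1, "item": 10, "rating": 4.0},
    {"user": 1, "item": 11, "rating": 3.5},
    {"user": 2, "item": 10, "rating": 5.0},
    {"user": 2, "item": 12, "rating": 2.0},
]
test = [
    {"user": 1, "item": 11, "rating": 3.5},
    {"user": 2, "item": 12, "rating": 2.0},
]

train = train_from_test(ratings, test)
print(train)
```

Storing one side of each split and deriving the other keeps every rating on disk only once per fold, at the cost of a small amount of work at load time.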
This experiment comes with dependencies specified in `pyproject.toml`, and locked with `uv.lock` for use with [uv][].
To set up, run:
$ uv sync
This will create a virtual environment in `.venv/`, which you can activate with:
$ . ./.venv/bin/activate
The `dvc` program controls runs of individual steps, including downloading data.
For example, to download the ML-20M data set and recommend with ALS, run:
dvc repro runs/dvc.yaml:ml20m@ALS
To re-run the whole experiment:
dvc repro
To reproduce results on one data set:
dvc repro eval-report-ml100k
The various `dvc.yaml` files control the run. Look at them to modify and extend!
You will probably want to consult the DVC user guide.
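
The `dvc repro` commands above take three forms: no target, a bare stage name, and a `file:stage@item` target. The sketch below composes those argument lists in Python, which may help when scripting many runs. It assumes that `ml20m` is a `foreach` stage in `runs/dvc.yaml` and `ALS` is one of its items, which is a reading of the command above rather than something confirmed here. The helper names `repro_command` and `run_repro` are hypothetical.

```python
import subprocess


def repro_command(stage=None, item=None, dvc_file=None):
    """Build a `dvc repro` argument list for an optional target.

    Hypothetical helper: target layout follows DVC's
    `<dvc.yaml path>:<stage>@<item>` convention.
    """
    cmd = ["dvc", "repro"]
    if stage is None:
        # No target: reproduce the whole pipeline.
        return cmd
    # A foreach stage is addressed as `<stage>@<item>`.
    target = stage if item is None else f"{stage}@{item}"
    # A stage in another dvc.yaml is addressed as `<path>:<stage>`.
    if dvc_file is not None:
        target = f"{dvc_file}:{target}"
    return cmd + [target]


def run_repro(*args, **kwargs):
    """Run the command; requires `dvc` on PATH, inside the repository."""
    return subprocess.run(repro_command(*args, **kwargs), check=True)


print(" ".join(repro_command("ml20m", "ALS", "runs/dvc.yaml")))
print(" ".join(repro_command("eval-report-ml100k")))
print(" ".join(repro_command()))
```

Nothing is executed unless `run_repro` is called, so the command lists can be inspected before starting a long run.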