Add scripts and configs for hyperparameter tuning #675
Conversation
This change just made the error messages about the missing imitation.algorithms.dagger.ExponentialBetaSchedule go away, but it did not fix the root cause.
Codecov Report
@@            Coverage Diff             @@
##           master     #675      +/-   ##
==========================================
- Coverage   96.33%   95.75%   -0.58%
==========================================
  Files          93       93
  Lines        8789     8884      +95
==========================================
+ Hits         8467     8507      +40
- Misses        322      377      +55
First round of review, mostly with suggestions for the parallel script.
Not part of this PR, but @taufeeque9 could you elaborate on the ...
Are the scripts in benchmarking/ being type checked and linted? I don't see them in the output at https://app.circleci.com/pipelines/github/HumanCompatibleAI/imitation/3897/workflows/02ec5434-d883-4edf-9fa5-99faeb0645f2/jobs/15772. The directory benchmarking/ is not specified in the inputs config for pytype in setup.cfg.
I think we should either move the scripts into src/imitation/scripts/ and make them part of the package, or update the CI config to test things in benchmarking/ (worth checking we're not missing them for flake8 etc. as well).
benchmarking/tuning.py
Outdated
from pandas.api import types as pd_types
from ray.tune.search import optuna
from sacred.observers import FileStorageObserver
from tuning_config import parallel_ex, tuning_ex
PEP8: standard library, then third-party, then first-party imports, with a blank line separating each group.
Suggested change:
from tuning_config import parallel_ex, tuning_ex
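For concreteness, a sketch of the grouping being asked for (there are no standard-library imports in this snippet, so only two groups appear; tuning_config is treated as the first-party module):

```python
# Third-party imports grouped together.
from pandas.api import types as pd_types
from ray.tune.search import optuna
from sacred.observers import FileStorageObserver

# First-party / local imports in their own group, separated by a blank line.
from tuning_config import parallel_ex, tuning_ex
```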
Right now it seems like there's a lot of duplication in specifying the pre-tuned hyperparameters: in the *.json files in benchmarking/, and in named_configs in each algorithm. If possible it seems preferable to make this more DRY (Don't Repeat Yourself) so that we have a single source of truth: otherwise I could imagine them drifting apart as we do future hyperparameter sweeps.
It is convenient to be able to write e.g. with seals_ant rather than with benchmarking/airl_seals_ant_best_hp_eval.json. But Sacred does support ex.add_named_config: https://github.com/IDSIA/sacred/blob/17c530660d5b405af0f5c286b1a93f3d8911d026/sacred/ingredient.py#L251 So we could replace some of these with that.
It does seem like the named configs have been edited to be more minimal. That's great, but as a user, which one should I use? If the answer is "always use the named_config" then we should really consider editing the JSONs (manually or automatically). If the answer is "it depends" then it would help to document the pros and cons and justify why we need both.
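As a rough illustration of the add_named_config suggestion (the experiment object and names below are placeholders, not the PR's actual code), Sacred lets you register either a dict or a config file under a short name:

```python
from sacred import Experiment

ex = Experiment("train_adversarial")  # placeholder experiment object

# Register the tuned hyperparameters under a short alias so users can write
# `with seals_ant` instead of `with benchmarking/airl_seals_ant_best_hp_eval.json`.
ex.add_named_config("seals_ant", "benchmarking/airl_seals_ant_best_hp_eval.json")
```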
benchmarking/tuning.py
Outdated
][0]

if print_return:
    all_returns = df[df["mean_return"] == row["mean_return"]][return_key]
This is a bit fragile: you could in theory have multiple distinct hyperparameter groups that led to the same mean returns across seeds, but in practice it's probably OK.
In case we get multiple distinct hyperparameter groups that achieve the same mean returns across seeds, we pick the first hyperparameter group from the df. And the ordering of hyperparameter groups in the df should be arbitrary, so it should be fine to do this, right?
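A minimal, self-contained sketch of the tie-breaking behaviour under discussion (the column names are illustrative, not the actual analysis schema):

```python
import pandas as pd

# Two hyperparameter groups tie on the best mean return.
df = pd.DataFrame(
    {
        "mean_return": [90.0, 95.0, 95.0],
        "lr": [1e-3, 3e-4, 1e-4],
    },
)

# idxmax returns the index of the first maximal row, so ties are broken by
# whatever (arbitrary) order the rows happen to have in the dataframe.
best_row = df.loc[df["mean_return"].idxmax()]
print(best_row["lr"])  # -> 0.0003
```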
I have moved the benchmarking scripts into src/imitation/scripts/.
This was a great point! Thanks for noting it. I have added all of the JSON files as named configs in their respective scripts.
Note that there's a redundancy in the command, where the algorithm name (e.g. airl) ends up appearing both as the command and in the named config name.
LGTM, some minor suggestions.
CI is failing right now -- we should get it green before merging.
The lint error is just whitespace, so it should be easy to fix.
The unit test error is in test_scripts for eval_policy. Not sure why this is being triggered here as we've not changed the code; it may just be a Gymnasium upgrade triggering it. If so, happy for you to do a hotfix for this in a separate PR; we can then get that hotfix merged and finally merge this PR. Can probably be resolved by changing gym.make to include the appropriate render_mode="rgb_array".
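For reference, a hedged sketch of that kind of fix under Gymnasium's API (the environment ID here is illustrative):

```python
import gymnasium as gym

# Gymnasium requires the render mode to be chosen at construction time;
# without it, env.render() cannot return RGB frames for video recording.
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()
frame = env.render()  # an RGB array, thanks to the render_mode above
```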
With these changes the benchmarking directory seems redundant now, as it just contains a util.py file to clean up the JSON files. We can move the util.py file somewhere else and delete the benchmarking directory.
Agreed, probably best to move util.py somewhere else. There's not an obvious alternative location for benchmarking/README.md, and we might want to include benchmarking table output etc. in that directory, so I'm OK keeping it around as just a stub for now; but you could also move it to e.g. BENCHMARK.md at the top level.
I've tried to think of ways to remove this redundancy but couldn't figure out a way that preserves the command name airl while removing airl from the named config.
I think this is fine, people can type the extra 5 keystrokes :)
Merged -- well done @taufeeque9 for pushing this over the finish line, and thanks @ernestum for the detailed review!
Description
This PR adds the scripts and configs for tuning hyperparameters of imitation algorithms. The results in the first draft of the paper were computed using the configs in the PR. Note that the preference comparisons algorithm still needs to be tuned on all of the environments.
Changes:
- imitation/scripts/parallel.py: Added features useful for tuning hyperparameters (HPs) of the algorithms. Specifically, added options to 1) repeat trials across configs to get a better (mean) estimate of the return of HP configs, 2) evaluate the best config from an HP tune run, and 3) restart failed trials of a tune run.
- imitation/scripts/analyze.py: Added table verbosity level 3, which generates a CSV with all the config HPs, including RL HPs and the specific algorithm's arguments, along with the expert returns and the returns of the imitation algorithm.
- imitation/scripts/config/parallel.py: Added HP search spaces for algorithms like BC, DAgger, GAIL, and AIRL; the search space for Preference Comparisons may not be good enough. (A hypothetical sketch of such a search space follows this list.)
- imitation/scripts/config/train_*.py: The added configs contain the HPs for the tuned base RL algorithm.
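As referenced in the list above, here is a hypothetical sketch of what an HP search space declared with Ray Tune might look like; every parameter name and range below is a placeholder rather than the tuned configuration shipped in this PR:

```python
from ray import tune

# Illustrative search space for a BC-style algorithm.
bc_search_space = {
    "bc_kwargs": {
        "batch_size": tune.choice([8, 16, 32, 64]),
        "l2_weight": tune.loguniform(1e-6, 1e-2),
    },
    "optimizer_kwargs": {
        "lr": tune.loguniform(1e-5, 1e-3),
    },
}
```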
Testing
Updated test cases for parallel and tested HP tuning manually.