* Initial docs commit
* Add extensive docstrings for data types
* Add documentation on configs and data
* Add documentation for models
* Add documentation for orchestrators
* Add documentation for pipelines
* Resolve missed merge conflict
* Add docs for ilql
* Add some brief documentation on examples present
* update readme with link to docs
* Add rtd yml config
* Remove unneeded/ugly undoc-members
* Update docs for configs to account for method specific configs
* Add docstrings for method configs
* Move docstring into ModelBranch class
* Update docs with pipeline and model refactors
* Resolve erroneous merge (use updated dataclass attributes from master)
* Remove old file from before merge
* Add spacing after docstrings
* Update README.md
* removed duplicated class method
* Removed unneeded whitespace
* Add whitespace after docstrings where appropriate
* Update readthedocs version to py39
* precommit fixes
* Change save_interval to checkpoint_interval in docstring
* Remove redundant docs links from readme
+    :param num_rollouts: Number of experiences to observe before learning
+    :type num_rollouts: int
+
+    :param init_kl_coef: Initial value for KL coefficient
+    :type init_kl_coef: float
+
+    :param target: Target value for KL coefficient
+    :type target: float
+
+    :param horizon: Number of steps for KL coefficient to reach target
+    :type horizon: int
+
+    :param gamma: Discount factor
+    :type gamma: float
+
+    :param lam: GAE lambda
+    :type lam: float
+
+    :param cliprange: Clipping range for PPO policy loss (1 - cliprange, 1 + cliprange)
+    :type cliprange: float
+
+    :param cliprange_value: Clipping range for predicted values (observed values - cliprange_value, observed values + cliprange_value)
+    :type cliprange_value: float
+
+    :param vf_coef: Value loss scale w.r.t policy loss
+    :type vf_coef: float
+
+    :param gen_kwargs: Additional kwargs for the generation
+    :type gen_kwargs: Dict[str, Any]
+    """
+
     ppo_epochs: int
     num_rollouts: int
     chunk_size: int
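The PPO-specific fields documented above map onto the usual clipped-surrogate machinery. As a rough illustration (a minimal sketch, not trlx's actual implementation — `PPOHyperparams`, the default values, and all function names below are hypothetical), this is how `cliprange`, `cliprange_value`, `vf_coef`, `gamma`/`lam`, and the adaptive KL fields (`init_kl_coef`, `target`, `horizon`) typically enter a PPO update:

```python
from dataclasses import dataclass

import torch


@dataclass
class PPOHyperparams:  # hypothetical stand-in for the PPOConfig fields above
    init_kl_coef: float = 0.05  # starting KL penalty coefficient
    target: float = 6.0         # KL value the controller steers toward
    horizon: int = 10000        # steps over which the coefficient adapts
    gamma: float = 0.99         # discount factor
    lam: float = 0.95           # GAE lambda
    cliprange: float = 0.2      # ratio clipped to (1 - cliprange, 1 + cliprange)
    cliprange_value: float = 0.2
    vf_coef: float = 1.0        # value loss scale w.r.t policy loss


def gae_advantages(rewards, values, hp):
    """Generalized Advantage Estimation using gamma and lam."""
    advantages, last = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + hp.gamma * next_value - values[t]
        last = delta + hp.gamma * hp.lam * last
        advantages.append(last)
    return advantages[::-1]


def ppo_loss(hp, logprobs, old_logprobs, advantages, values, old_values, returns):
    """Clipped policy loss plus clipped value loss, scaled by vf_coef."""
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss = torch.max(
        -advantages * ratio,
        -advantages * torch.clamp(ratio, 1.0 - hp.cliprange, 1.0 + hp.cliprange),
    ).mean()
    # Keep new value predictions within cliprange_value of the old ones.
    clipped = old_values + torch.clamp(values - old_values,
                                       -hp.cliprange_value, hp.cliprange_value)
    vf_loss = torch.max((values - returns) ** 2, (clipped - returns) ** 2).mean()
    return pg_loss + hp.vf_coef * vf_loss


def update_kl_coef(kl_coef, observed_kl, hp, n_steps):
    """Adaptive KL controller: nudge kl_coef so observed KL tracks `target`."""
    error = min(max((observed_kl - hp.target) / hp.target, -0.2), 0.2)
    return kl_coef * (1.0 + error * n_steps / hp.horizon)
```

The controller multiplicatively grows or shrinks the KL coefficient depending on whether the observed KL overshoots or undershoots `target`, with `horizon` damping the step size.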
@@ -76,6 +113,28 @@ class PPOConfig(MethodConfig):
 @dataclass
 @register_method
 class ILQLConfig(MethodConfig):
+    """
+    Config for ILQL method
+
+    :param tau: Control tradeoff in value loss between punishing value network for underestimating the target Q (i.e. Q value corresponding to the action taken) (high tau) and overestimating the target Q (low tau)
+    :type tau: float
+
+    :param gamma: Discount factor for future rewards
+    :type gamma: float
+
+    :param cql_scale: Weight for CQL loss term
+    :type cql_scale: float
+
+    :param awac_scale: Weight for AWAC loss term
+    :type awac_scale: float
+
+    :param steps_for_target_q_sync: Number of steps to wait before syncing target Q network with Q network
+    :type steps_for_target_q_sync: int
+
+    :param two_qs: Use minimum of two Q-value estimates