@@ -13,14 +13,14 @@ for any other extension.
For a new learner you need to implement the functions
```
update!(learner, buffer)                          # returns nothing
-selectaction(learner, policy, state)              # returns an action
+defaultpolicy(learner, actionspace, buffer)       # returns a policy
defaultbuffer(learner, environment, preprocessor) # returns a buffer
```

Let's assume you want to implement plain, simple Q-learning (you don't need to
do this; it is already implemented). Your file `qlearning.jl` could contain
```julia
-import ReinforcementLearning: update!, selectaction, defaultbuffer, Buffer
+import ReinforcementLearning: update!, defaultpolicy, defaultbuffer, Buffer

struct MyQLearning
    Q::Array{Float64, 2} # number of actions x number of states
@@ -36,8 +36,8 @@ function update!(learner::MyQLearning, buffer)
    Q[a, s] += learner.alpha * (r + maximum(Q[:, snext]) - Q[a, s])
end

-function selectaction(learner::MyQLearning, policy, state)
-    selectaction(policy, learner.Q[:, state])
+function defaultpolicy(learner::MyQLearning, actionspace, buffer)
+    EpsilonGreedyPolicy(.1, actionspace, s -> getvalue(learner.params, s))
end

function defaultbuffer(learner::MyQLearning, environment, preprocessor)
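One detail worth noting in this hunk: `MyQLearning` as defined above stores its
table in a field named `Q`, yet the new `defaultpolicy` passes `learner.params`
to the value callback. If the struct is exactly as shown, the callback would
presumably have to read from `learner.Q` instead. A hedged variant (assuming
`EpsilonGreedyPolicy(epsilon, actionspace, f)` expects `f(state)` to return the
vector of action values for `state`, as the snippet suggests):
```julia
# Sketch only: epsilon-greedy over the learner's own Q table.
# Assumes the third argument maps a state to its vector of action values.
function defaultpolicy(learner::MyQLearning, actionspace, buffer)
    EpsilonGreedyPolicy(.1, actionspace, s -> learner.Q[:, s])
end
```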
@@ -46,10 +46,10 @@ function defaultbuffer(learner::MyQLearning, environment, preprocessor)
    Buffer(statetype = typeof(processedstate), capacity = 2)
end
```
-The function `defaultbuffer` gets called during the construction of an
-`RLSetup`. It returns a buffer that is filled with states, actions and rewards
-during interaction with the environment. Currently there are three types of
-Buffers implemented
+The functions `defaultpolicy` and `defaultbuffer` get called during the
+construction of an `RLSetup`. `defaultbuffer` returns a buffer that is filled
+with states, actions and rewards during interaction with the environment.
+Currently there are three types of Buffers implemented
```julia
import ReinforcementLearning: Buffer, EpisodeBuffer, ArrayStateBuffer
?Buffer
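Since `defaultpolicy` and `defaultbuffer` are invoked during the construction
of an `RLSetup`, wiring the learner above into a training run could look
roughly like the sketch below. This is hypothetical: the positional
`RLSetup(learner, environment, stoppingcriterion)` form, `ConstantNumberSteps`,
and `learn!` follow the package's usage examples and may differ, and
`MyQLearning` is assumed to carry an `alpha` field after `Q`, as its use of
`learner.alpha` in `update!` suggests.
```julia
na, ns = 4, 100                            # illustrative sizes
learner = MyQLearning(zeros(na, ns), 0.1)  # assumes fields (Q, alpha)
env = SomeEnvironment()                    # placeholder, see the environment interface below
x = RLSetup(learner, env, ConstantNumberSteps(10^5))
learn!(x)                                  # fills the buffer and calls update!
```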
@@ -65,7 +65,7 @@ reset!(environment) # returns state

Optionally you may also implement the function
```
-plotenv(environment, state, action, reward, done)
+plotenv(environment)
```

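For orientation, a minimal custom environment satisfying this interface might
look like the sketch below. The type and its dynamics are invented for
illustration, and of the return conventions only `reset!` returning a state is
shown in the hunk context above; the tuple returned by `interact!` and the
value returned by `getstate` are assumptions.
```julia
import ReinforcementLearning: interact!, getstate, reset!

# Illustrative chain environment: every action moves one step to the right.
mutable struct ChainEnv
    state::Int
    nstates::Int
end
function interact!(env::ChainEnv, action)
    env.state = min(env.state + 1, env.nstates)
    done = env.state == env.nstates
    env.state, done ? 1.0 : 0.0, done   # assumed convention: (state, reward, done)
end
getstate(env::ChainEnv) = env.state     # assumed to return the current state
reset!(env::ChainEnv) = (env.state = 1) # returns the reset state, as listed above
```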
Please have a look at the
@@ -82,9 +82,11 @@ preprocess(preprocessor, reward, state, done) # returns a preprocessed (state, r
```

## Policies
+Policies are function-like objects. To implement, for example, a policy that
+returns the action `42` for every possible input `state`, one could write
```
-selectaction(policy, values) # returns an action
-getactionprobabilities(policy, state) # returns a normalized (1-norm) vector with non-negative entries
+struct MyPolicy end
+(p::MyPolicy)(state) = 42
```
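Because the policy is just a callable, selecting an action is a plain function
call:
```julia
p = MyPolicy()
p(17)   # returns 42 regardless of the state
```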

## Callbacks