
Reliable way to access true underlying Markov state #3

Closed
smorad opened this issue Nov 4, 2022 · 13 comments
@smorad
Collaborator

smorad commented Nov 4, 2022

A user has requested a method to access the true underlying state for each environment. We should return an info dict {state: ...} from env.step for all envs.
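
Roughly like this (just a sketch with a made-up env, not final code; it assumes the env keeps its full Markov state in a numpy array and uses the classic 4-tuple step signature for brevity):

import gym
import numpy as np

class ExampleEnv(gym.Env):  # hypothetical env, only to illustrate the idea
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(2)
        self.state = np.zeros(4, dtype=np.float32)  # true underlying Markov state

    def reset(self):
        self.state = np.zeros(4, dtype=np.float32)
        return 0

    def step(self, action):
        obs, reward, done = 0, 0.0, False  # placeholder dynamics
        info = {"state": self.state.copy()}  # expose the Markov state in the info dict
        return obs, reward, done, info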

@mgerstgrasser

That was me! And yes - that could be useful to compare against a fully-observed "optimum". Do some environments already return this in the info dict?

@smorad
Collaborator Author

smorad commented Nov 9, 2022

Yes.

The labyrinth environments and the bandit environment already return it.

@raphaelavalos
Contributor

Hello @smorad,
I have been working on that. I have already done some of the environments and was planning to continue soon. If you want, I would be happy to handle this issue :)

@smorad
Collaborator Author

smorad commented Nov 20, 2022

That would be fantastic, thanks @raphaelavalos! For most of the envs, the state is in an underlying np.array so returning it as state in the info dict wouldn't be too hard.

There are also some decisions to make on what exactly the state is. For example, in Battleship, is the state the result of all tiles we've fired upon, or is it the position of all the ships? Similarly, in Higher Lower, is the state the next card in the deck, or all cards we have already seen?

I'd argue the first case for both Battleship and Higher Lower (the fired-upon tiles, the next card) is probably more interesting from a learning perspective (i.e. if you were training a policy on the state).

@mgerstgrasser

Hi both, I'd be happy to help too, though I won't have time until Christmas at the earliest. One thought from my side in the meantime: it might be useful to optionally return the state as part of the observation, either instead of the partial observation or together with it in a Dict. I think that could enable some interesting applications, e.g. ablation studies comparing how much is lost relative to the fully observable case. What do you think?

@smorad
Collaborator Author

smorad commented Nov 20, 2022

I considered something similar, but messing with the gym.step return values prevents people from plugging POPGym envs straight into StableBaselines3, CleanRL, RLlib, etc. Making obs a dict would mean the aforementioned libraries would operate on both the observation and the Markov state by default, which I don't think we want.

@mgerstgrasser

Oh, no, I mean purely optionally, with the option set when the environment is instantiated. By default return just the partial observation as-is, but have an option in env.__init__() that changes it to a dict with the state.
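
Roughly something like this (flag name and spaces are just placeholders, not a concrete proposal):

import gym

class ExampleEnv(gym.Env):  # hypothetical env, only to show the constructor option
    def __init__(self, return_state: bool = False):
        super().__init__()
        self.return_state = return_state
        partial_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))  # the usual partial observation
        state_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,))    # the full Markov state
        if return_state:
            # Dict with both the partial observation and the state
            self.observation_space = gym.spaces.Dict({"obs": partial_space, "state": state_space})
        else:
            # Default: exactly the same observation as today
            self.observation_space = partial_space
        self.action_space = gym.spaces.Discrete(2)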

@raphaelavalos
Contributor

For the environments done before this issue I was doing it with a Dict, just like @mgerstgrasser's suggestion. I do agree that PopGym should work easily with the other libraries, so both options will be available with init flags.
For environments where there could be different state representations, we can discuss them in this issue or implement both.

@raphaelavalos
Contributor

raphaelavalos commented Dec 1, 2022

Hey!
I have already implemented the state for these environments:

  • Concentration (half-way done)
  • Higher Lower
  • Battleship
  • Multiarmed Bandit
  • Minesweeper
  • Repeat Previous
  • Repeat First
  • Autoencode
  • Stateless Cartpole
  • Noisy Stateless Cartpole
  • Stateless Pendulum
  • Noisy Stateless Pendulum
  • Count Recall
  • Labyrinth Escape
  • Labyrinth Explore

Currently, I have added a get_state method:

class Autoencode(gym.Env):
    ...

    def get_state(self):
        # Build and return the underlying Markov state (e.g. as an np.array)
        state = ...
        return state

I was thinking about creating a class PopgymEnv(gym.Env) from which all the environments would inherit, to standardize the API. What do you think @smorad?

import enum
import gym

class IncludeState(enum.IntEnum):
    NO = 0
    INFO = 1
    DICT = 2

class PopgymEnv(gym.Env):
    def __init__(self, include_state: IncludeState = IncludeState.NO):
        super().__init__()
        self.include_state = include_state
        self.state_space = None  # to be set by each environment

@raphaelavalos
Contributor

Another possibility is to create two wrappers StateInfo and StateDict instead of the flag mentioned above.
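
The info variant could be as simple as this (sketch only, building on the get_state method above and assuming the 4-tuple step signature):

import gym

class StateInfo(gym.Wrapper):
    # Copy the underlying Markov state into the info dict at every step.
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["state"] = self.env.get_state()
        return obs, reward, done, info

# usage: env = StateInfo(Autoencode())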

As I had to modify the code of all the environments, the PR will also contain some small modifications to the original code, such as:

  • using max_episode_length for all environments (currently some are labelled episode_length)
  • replacing np.random with self.np_random to ensure proper seeding (the only environments where this is not possible yet are the mazelib-based ones - see this issue); some environments already used it, but not all
  • removing the last action from the observations and relying on a wrapper to include it instead

@smorad would that work for you?

@smorad
Collaborator Author

smorad commented Dec 2, 2022

Fantastic, thanks again for all the hard work!

I think the popgym.Env base class and the wrappers are both good approaches. It's a value judgement between inheritance and functional approaches, so I defer to you. I think we should probably be a bit more verbose for less familiar users:

class IncludeMState(enum.IntEnum): # State is an overloaded term, maybe MState == MarkovState?
  NO = 0
  IN_INFO_DICT = 1
  IN_OBSERVATION = 2

or maybe even

class Observability(enum.IntEnum):
  PARTIAL = 0
  FULL_IN_INFO_DICT = 1
  FULL = 2 # Maybe we should have an option for returning JUST the Markov state as an observation. This could be the upper bound for what we want a partially observable agent to achieve.
  FULL_AND_PARTIAL = 3 # I think this is closer to your idea of `Dict({"mstate": state, "obs": obs})`
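
Whichever naming we go with, the base class step could then dispatch on the flag, roughly like this (sketch only; _step, get_state and self.observability are placeholders for whatever we end up implementing):

class PopgymEnv(gym.Env):
    ...

    def step(self, action):
        obs, reward, done, info = self._step(action)  # placeholder: per-env step logic
        state = self.get_state()
        if self.observability == Observability.FULL_IN_INFO_DICT:
            info["state"] = state
        elif self.observability == Observability.FULL:
            obs = state  # return only the Markov state
        elif self.observability == Observability.FULL_AND_PARTIAL:
            obs = {"obs": obs, "mstate": state}
        return obs, reward, done, info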

Modifications sound great. I am considering deprecating the maze environments anyway, because the paper shows they are actually a pretty bad benchmark for memory. I don't want to mislead users into thinking their memory is working when the task can be solved by an MLP.

The only suggestion I have is with the third point. Most of the environments where I include the previous action are very difficult or impossible to learn without observing the previous action. We should probably edit the documentation to make it clear that environments like MineSweeper or LabyrinthEscape really need the wrapper (unless you've found a way to train policies that don't need it!). Or perhaps even add a class variable like MineSweeper.previous_action_wrapper_suggested = True so we can apply the wrappers programmatically.

@raphaelavalos
Contributor

Hey,
The possibility of getting back the underlying MDP is a good idea, I will add that.
Regarding the last action: I agree that in the general case of POMDPs you need to condition your policy on the observation-action history, so there is no reason not to add the action as input. My point was more that, depending on the framework you use, the last action is already added, and you don't want to have it twice.
I think that for best flexibility we should make a LastActionWrapper, and maybe add a warning in the README about that.
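
Something along these lines (rough sketch, assuming a Discrete action space and the 4-tuple step signature):

import gym

class LastActionWrapper(gym.Wrapper):
    # Append the previous action to the observation as a Dict.
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Dict(
            {"obs": env.observation_space, "last_action": env.action_space}
        )

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # No previous action at the start of an episode, use 0 as a placeholder
        return {"obs": obs, "last_action": 0}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return {"obs": obs, "last_action": action}, reward, done, info
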
I will try to finish this PR this week :)

@smorad
Collaborator Author

smorad commented Dec 20, 2022

Closing this as #5 and c165868 implement support for underlying states using the Markov wrapper.

@smorad smorad closed this as completed Dec 20, 2022