
Reliable way to access true underlying Markov state #3

Closed
smorad opened this issue Nov 4, 2022 · 13 comments
@smorad
Collaborator

smorad commented Nov 4, 2022

A user has requested a method to access the true underlying state for each environment. We should return an info dict {state: ...} from env.step for all envs.
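
Roughly like this (just a sketch with a made-up env, not final code; it assumes the env keeps its full Markov state in a numpy array and uses the classic 4-tuple step signature for brevity):

import gym
import numpy as np

class ExampleEnv(gym.Env):  # hypothetical env, only to illustrate the idea
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(2)
        self.state = np.zeros(4, dtype=np.float32)  # true underlying Markov state

    def reset(self):
        self.state = np.zeros(4, dtype=np.float32)
        return 0

    def step(self, action):
        obs, reward, done = 0, 0.0, False  # placeholder dynamics
        info = {"state": self.state.copy()}  # expose the Markov state in the info dict
        return obs, reward, done, info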

@mgerstgrasser

That was me! And yes - that could be useful to compare against a fully-observed "optimum". Do some environments already return this in the info dict?

@smorad
Collaborator Author

smorad commented Nov 9, 2022

Yes.

The labyrinth environments and the bandit environment already return it.

@raphaelavalos
Contributor

Hello @smorad,
I have been working on that. I have already done some of the environments and was planning to continue soon. If you want, I would be happy to handle this issue :)

@smorad
Collaborator Author

smorad commented Nov 20, 2022

That would be fantastic, thanks @raphaelavalos! For most of the envs, the state is in an underlying np.array so returning it as state in the info dict wouldn't be too hard.

There are also some decisions to make on what exactly the state is. For example, in Battleship, is the state the result of all tiles we've fired upon, or is it the position of all the ships? Similarly, in Higher Lower, is the state the next card in the deck, or all cards we have already seen?

I'd argue the first case for both Battleship and Higher Lower (the fired-upon tiles, the next card) is probably more interesting from a learning perspective (i.e. if you were training a policy on the state).

@mgerstgrasser

Hi both, I'd be happy to help too, though I won't have time until Christmas at the earliest. One thought from my side in the meantime: it might be useful to optionally return the state as part of the observation, either instead of the partial observation or together with it in a Dict. I think that could enable some interesting applications, e.g. ablation studies comparing how much is lost relative to the fully observable case. What do you think?

@smorad
Collaborator Author

smorad commented Nov 20, 2022

I considered something similar, but messing with the gym.step return values prevents people from plugging POPGym envs straight into StableBaselines3, CleanRL, RLlib, etc. Making obs a dict would mean the aforementioned libraries would operate on both the observation and the Markov state by default, which I don't think we want.

@mgerstgrasser

Oh, no, I mean purely optionally, with the option set when the environment is instantiated. By default return just the partial observation as-is, but have an option in env.__init__() that changes it to a dict with the state.
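
Roughly something like this (flag name and spaces are just placeholders, not a concrete proposal):

import gym

class ExampleEnv(gym.Env):  # hypothetical env, only to show the constructor option
    def __init__(self, return_state: bool = False):
        super().__init__()
        self.return_state = return_state
        partial_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))  # the usual partial observation
        state_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,))    # the full Markov state
        if return_state:
            # Dict with both the partial observation and the state
            self.observation_space = gym.spaces.Dict({"obs": partial_space, "state": state_space})
        else:
            # Default: exactly the same observation as today
            self.observation_space = partial_space
        self.action_space = gym.spaces.Discrete(2)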

@raphaelavalos
Contributor

For the environments done before this issue I was doing it with a Dict, just like @mgerstgrasser's suggestion. I do agree that PopGym should work easily with the other libraries, so both options will be available with init flags.
For environments where there could be different state representations, we can discuss them in this issue or implement both.

@raphaelavalos
Contributor

raphaelavalos commented Dec 1, 2022

Hey!
I have already implemented the state for these environments:

  • Concentration (half-way done)
  • Higher Lower
  • Battleship
  • Multiarmed Bandit
  • Minesweeper
  • Repeat Previous
  • Repeat First
  • Autoencode
  • Stateless Cartpole
  • Noisy Stateless Cartpole
  • Stateless Pendulum
  • Noisy Stateless Pendulum
  • Count Recall
  • Labyrinth Escape
  • Labyrinth Explore

Currently, I have added a get_state method:

class Autoencode(gym.Env):
    ...

    def get_state(self):
        # Build and return the underlying Markov state (e.g. as an np.array)
        state = ...
        return state

I was thinking about creating a class PopgymEnv(gym.Env) from which all the environments would inherit, to standardize the API. What do you think @smorad?

import enum
import gym

class IncludeState(enum.IntEnum):
    NO = 0
    INFO = 1
    DICT = 2

class PopgymEnv(gym.Env):
    def __init__(self, include_state: IncludeState = IncludeState.NO):
        super().__init__()
        self.include_state = include_state
        self.state_space = None  # to be set by each environment

@raphaelavalos
Contributor

Another possibility is to create two wrappers StateInfo and StateDict instead of the flag mentioned above.
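
The info variant could be as simple as this (sketch only, building on the get_state method above and assuming the 4-tuple step signature):

import gym

class StateInfo(gym.Wrapper):
    # Copy the underlying Markov state into the info dict at every step.
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["state"] = self.env.get_state()
        return obs, reward, done, info

# usage: env = StateInfo(Autoencode())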

As I had to modify the code of all the environments, the PR will also contain some small modifications to the original code, such as:

  • using max_episode_length for all environments (currently some are labelled episode_length)
  • replacing np.random with self.np_random to ensure proper seeding (the only environments where this is not possible yet are the mazelib-based ones - see this issue); some environments already used it, but not all
  • removing the last action from the observations and relying on a wrapper to include it instead

@smorad would that work for you?

@smorad
Collaborator Author

smorad commented Dec 2, 2022

Fantastic, thanks again for all the hard work!

I think the popgym.Env base class and the wrappers are both good approaches. It's a value judgement between inheritance and functional approaches, so I defer to you. I think we should probably be a bit more verbose for less familiar users:

class IncludeMState(enum.IntEnum): # State is an overloaded term, maybe MState == MarkovState?
  NO = 0
  IN_INFO_DICT = 1
  IN_OBSERVATION = 2

or maybe even

class Observability(enum.IntEnum):
  PARTIAL = 0
  FULL_IN_INFO_DICT = 1
  FULL = 2 # Maybe we should have an option for returning JUST the Markov state as an observation. This could be the upper bound for what we want a partially observable agent to achieve.
  FULL_AND_PARTIAL = 3 # I think this is closer to your idea of `Dict({"mstate": state, "obs": obs})`
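
Whichever naming we go with, the base class step could then dispatch on the flag, roughly like this (sketch only; _step, get_state and self.observability are placeholders for whatever we end up implementing):

class PopgymEnv(gym.Env):
    ...

    def step(self, action):
        obs, reward, done, info = self._step(action)  # placeholder: per-env step logic
        state = self.get_state()
        if self.observability == Observability.FULL_IN_INFO_DICT:
            info["state"] = state
        elif self.observability == Observability.FULL:
            obs = state  # return only the Markov state
        elif self.observability == Observability.FULL_AND_PARTIAL:
            obs = {"obs": obs, "mstate": state}
        return obs, reward, done, info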

Modifications sound great. I am considering deprecating the maze environments anyway, because the paper shows they are actually a pretty bad benchmark for memory. I don't want to mislead users into thinking their memory is working when the task can be solved by an MLP.

The only suggestion I have is with the third point. Most of the environments where I include the previous action are very difficult or impossible to learn without observing the previous action. We should probably edit the documentation to make it clear that environments like MineSweeper or LabyrinthEscape really need the wrapper (unless you've found a way to train policies that don't need it!). Or perhaps even add a class variable like MineSweeper.previous_action_wrapper_suggested = True so we can apply the wrappers programmatically.

@raphaelavalos
Contributor

Hey,
The possibility of getting back the underlying MDP is a good idea, I will add that.
Regarding the last action: I agree that in the general case of POMDPs you need to condition your policy on the observation-action history, so there is no reason not to add the action as input. My point was more that, depending on the framework you use, the last action is already added, and you don't want to have it twice.
I think that for best flexibility we should make a LastActionWrapper, and maybe add a warning in the README about that.
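
Something along these lines (rough sketch, assuming a Discrete action space and the 4-tuple step signature):

import gym

class LastActionWrapper(gym.Wrapper):
    # Append the previous action to the observation as a Dict.
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Dict(
            {"obs": env.observation_space, "last_action": env.action_space}
        )

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # No previous action at the start of an episode, use 0 as a placeholder
        return {"obs": obs, "last_action": 0}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return {"obs": obs, "last_action": action}, reward, done, info
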
I will try to finish this PR this week :)

@smorad
Collaborator Author

smorad commented Dec 20, 2022

Closing this as #5 and c165868 implement support for underlying states using the Markov wrapper.

@smorad smorad closed this as completed Dec 20, 2022