Atari Pong Single-Agent Classic Reinforcement Learning (no Deep RL), developed as a course project for Distributed Artificial Intelligence at the University of Modena and Reggio Emilia, Italy.
The screen pixel observation is downsampled by a factor of 3 on the rows and 2 on the columns, reaching a shape of 53 x 80. I consider only the pixels from 35 to 92, i.e. I cut out the side walls and the scores to reduce the number of pixels.
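A minimal sketch of this preprocessing, assuming the standard 210x160x3 gym Pong observation (the exact crop bounds below are illustrative, chosen only to reproduce the 53 x 80 shape mentioned above; the repository may crop differently):

```python
import numpy as np

def preprocess(frame):
    """Downsample a 210x160x3 Pong frame: rows by a factor of 3,
    columns by a factor of 2, keeping only the playing field
    (side walls and scores cropped out)."""
    gray = frame.mean(axis=2)        # collapse the RGB channels
    cropped = gray[35:193, :]        # hypothetical crop of the play area
    small = cropped[::3, ::2]        # subsample rows by 3, columns by 2
    return small                     # roughly 53 x 80
```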
The states are computed from the resized screen values described in the previous section.
I made the assumption that the agent does not need to know the opponent's position in order to win the game, so I computed the states only for agent_0. This assumption makes the game partially observable.
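Since only agent_0 is considered, the state can be thought of as the pair (racket position, ball position). A rough sketch under that assumption (the racket column index, field columns, and intensity threshold below are hypothetical, not taken from the repository):

```python
import numpy as np

FIELD_COLS = slice(10, 68)     # hypothetical columns of the playing field

def extract_state(small_frame, racket_col=70, threshold=100):
    """Build a discrete state from the resized frame using only agent_0's
    racket position and the ball position (the opponent is ignored)."""
    # rows where the right-hand racket is visible in its column
    racket_rows = np.where(small_frame[:, racket_col] > threshold)[0]
    racket_y = int(racket_rows.mean()) if racket_rows.size else 0

    # brightest pixel inside the field is taken as the ball
    field = small_frame[:, FIELD_COLS]
    ball_y, ball_x = np.unravel_index(int(field.argmax()), field.shape)

    return (racket_y, int(ball_y), int(ball_x))   # tuple used to index the Q-table
```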
In this project I investigated the potential of Q-learning (RL) for extracting smart behaviours. I focused mainly on the hard convergence problem caused by sparsity, i.e. the Q-tables are big. To tackle this problem I experimented with the effects of a Gaussian (smoother) reward and of the Q-table initialization.
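The learning rule itself is standard tabular Q-learning; a minimal sketch follows (the hyperparameter values are placeholders, not the ones used in the experiments):

```python
import random
import numpy as np

def q_learning_step(q_table, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state][action] += alpha * (td_target - q_table[state][action])

def epsilon_greedy(q_table, state, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_table[state]))
```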
At first I was convinced that initializing the Q-table with non-zero values could be a good solution, as happens in neural networks. I soon realized that random initialization was not actually good: it introduced noise into the Q-learning convergence (since the update relies on the Q-table values).
The image above shows this behaviour: random initialization performs worse than zero initialization.
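For reference, the two initialization schemes compared above can be sketched as follows (the table shape and the scale of the random values are illustrative assumptions):

```python
import numpy as np

N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS = 53, 53, 80, 3   # illustrative sizes

# zero initialization: unseen (state, action) pairs start neutral
q_zero = np.zeros((N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS))

# random initialization: small noise on every entry, which in practice
# biases the greedy policy toward arbitrary actions and slows convergence
q_random = np.random.uniform(-0.1, 0.1,
                             size=(N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS))
```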
To address the sparsity problem, I implemented Gaussian smoothing on the reward signal. Since there is a close relationship between the states and the screen's pixels, it makes sense to spread the reward spatially by smoothing (e.g. if a specific pixel is a great location to catch the ball, then it is reasonable that the nearby ones are good positions too).
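A possible way to implement this spatial spreading is sketched below; the kernel construction and the way the reward is written into neighbouring ball positions reflect the general idea, not necessarily the repository's exact code. With `size=3` or `size=5` this corresponds to the 3x3 and 5x5 settings compared next.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.0):
    """Build a size x size Gaussian kernel (e.g. 3x3 or 5x5) with peak 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.max()

def spread_reward(reward, ball_y, ball_x, shape, size=3, sigma=1.0):
    """Return a reward map where the scalar reward is smeared over the
    states spatially close to the ball, instead of hitting a single
    (sparse) entry."""
    reward_map = np.zeros(shape)
    kernel = gaussian_kernel(size, sigma) * reward
    half = size // 2
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            y, x = ball_y + dy, ball_x + dx
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                reward_map[y, x] += kernel[dy + half, dx + half]
    return reward_map
```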
It shows that the Gaussian reward converges faster to a defined threshold; mCR10 is the mean of the cumulative reward signal over the last 10 steps. It also shows that the 5x5 reward converges faster than the 3x3 one. The following images show the Q-table (in the 3x3 smoothed reward setting) for each action of the racket.
The title of each subplot indicates the coordinate position of the racket when the action is performed, while the subplot itself shows the ball position. Basically, it tells whether it is good (white) or bad (black) for the racket to be in that position (the subplot title) and perform that action.
The following images show the Q-table for each action of the Pong racket after training with the 5x5 smoothed reward. The meaning of the images is the same as described for the 3x3 reward.
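As a sketch of how such plots could be generated, assuming a Q-table laid out as (racket_y, ball_y, ball_x, action) (this layout and the subplot arrangement are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_qtable_action(q_table, action, n_cols=8):
    """For a fixed action, draw one subplot per racket position; each
    subplot shows the Q-values over all ball positions (white = good,
    black = bad), mirroring the figures above."""
    n_racket = q_table.shape[0]
    n_rows = int(np.ceil(n_racket / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(2 * n_cols, 2 * n_rows))
    for racket_y, ax in zip(range(n_racket), axes.ravel()):
        ax.imshow(q_table[racket_y, :, :, action], cmap="gray")
        ax.set_title(str(racket_y))       # racket coordinate, as in the figures
        ax.axis("off")
    for ax in axes.ravel()[n_racket:]:
        ax.axis("off")                    # hide unused subplots
    plt.tight_layout()
    plt.show()
```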