Atari Pong Single-Agent Classic Reinforcement Learning (no Deep RL), developed as a course project for Distributed Artificial Intelligence at the University of Modena and Reggio Emilia, Italy.
The screen pixel observation is downsampled by a factor of 3 on the rows and 2 on the columns, reaching a shape of 53 x 80. I consider only the pixels from 35 to 92, i.e. I cut out the side walls and the scores to reduce the number of pixels.
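A minimal sketch of this preprocessing, assuming the standard 210x160x3 gym Pong observation (the exact crop bounds below are illustrative, chosen only to reproduce the 53 x 80 shape mentioned above; the repository may crop differently):

```python
import numpy as np

def preprocess(frame):
    """Downsample a 210x160x3 Pong frame: rows by a factor of 3,
    columns by a factor of 2, keeping only the playing field
    (side walls and scores cropped out)."""
    gray = frame.mean(axis=2)        # collapse the RGB channels
    cropped = gray[35:193, :]        # hypothetical crop of the play area
    small = cropped[::3, ::2]        # subsample rows by 3, columns by 2
    return small                     # roughly 53 x 80
```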
The states are computed from the resized screen values described in the previous section.
I made the assumption that the agent does not need to know the opponent's position in order to win the game, so I computed the states only for agent_0. This assumption makes the game partially observable.
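Since only agent_0 is considered, the state can be thought of as the pair (racket position, ball position). A rough sketch under that assumption (the racket column index, field columns, and intensity threshold below are hypothetical, not taken from the repository):

```python
import numpy as np

FIELD_COLS = slice(10, 68)     # hypothetical columns of the playing field

def extract_state(small_frame, racket_col=70, threshold=100):
    """Build a discrete state from the resized frame using only agent_0's
    racket position and the ball position (the opponent is ignored)."""
    # rows where the right-hand racket is visible in its column
    racket_rows = np.where(small_frame[:, racket_col] > threshold)[0]
    racket_y = int(racket_rows.mean()) if racket_rows.size else 0

    # brightest pixel inside the field is taken as the ball
    field = small_frame[:, FIELD_COLS]
    ball_y, ball_x = np.unravel_index(int(field.argmax()), field.shape)

    return (racket_y, int(ball_y), int(ball_x))   # tuple used to index the Q-table
```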
In this project I investigated the potential of Q-learning (RL) for extracting smart behaviours. I focused mainly on the hard convergence problem caused by sparsity, i.e. the Q-tables are big. To tackle this problem I experimented with the effects of a Gaussian (smoother) reward and of the Q-table initialization.
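The learning rule itself is standard tabular Q-learning; a minimal sketch follows (the hyperparameter values are placeholders, not the ones used in the experiments):

```python
import random
import numpy as np

def q_learning_step(q_table, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state][action] += alpha * (td_target - q_table[state][action])

def epsilon_greedy(q_table, state, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_table[state]))
```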
At first I was convinced that initializing the Q-table with non-zero values could be a good solution, as happens in neural networks. I soon realized that random initialization was not actually good: it introduced noise into the Q-learning convergence (since the update relies on the Q-table values).
The image above shows this behaviour: random initialization performs worse than zero initialization.
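For reference, the two initialization schemes compared above can be sketched as follows (the table shape and the scale of the random values are illustrative assumptions):

```python
import numpy as np

N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS = 53, 53, 80, 3   # illustrative sizes

# zero initialization: unseen (state, action) pairs start neutral
q_zero = np.zeros((N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS))

# random initialization: small noise on every entry, which in practice
# biases the greedy policy toward arbitrary actions and slows convergence
q_random = np.random.uniform(-0.1, 0.1,
                             size=(N_RACKET, N_BALL_Y, N_BALL_X, N_ACTIONS))
```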
To address the sparsity problem, I implemented Gaussian smoothing on the reward signal. Since there is a close relationship between the states and the screen's pixels, it makes sense to spread the reward spatially by smoothing (e.g. if a specific pixel is a great location to catch the ball, then it is reasonable that the nearby ones are good positions too).
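A possible way to implement this spatial spreading is sketched below; the kernel construction and the way the reward is written into neighbouring ball positions reflect the general idea, not necessarily the repository's exact code. With `size=3` or `size=5` this corresponds to the 3x3 and 5x5 settings compared next.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.0):
    """Build a size x size Gaussian kernel (e.g. 3x3 or 5x5) with peak 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.max()

def spread_reward(reward, ball_y, ball_x, shape, size=3, sigma=1.0):
    """Return a reward map where the scalar reward is smeared over the
    states spatially close to the ball, instead of hitting a single
    (sparse) entry."""
    reward_map = np.zeros(shape)
    kernel = gaussian_kernel(size, sigma) * reward
    half = size // 2
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            y, x = ball_y + dy, ball_x + dx
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                reward_map[y, x] += kernel[dy + half, dx + half]
    return reward_map
```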
It shows that the Gaussian reward converges faster to a defined threshold; mCR10 is the mean of the cumulative reward signal over the last 10 steps. It also shows that the 5x5 reward converges faster than the 3x3 one. The following images show the Q-table (in the 3x3 smoothed reward setting) for each action of the racket.
The title of each subplot indicates the coordinate position of the racket when the action is performed, while the subplot itself shows the ball position. Basically, it tells whether it is good (white) or bad (black) for the racket to be in that position (the subplot title) and perform that action.
The following images show the Q-table for each action of the Pong racket after training with the 5x5 smoothed reward. The meaning of the images is the same as described for the 3x3 reward.
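As a sketch of how such plots could be generated, assuming a Q-table laid out as (racket_y, ball_y, ball_x, action) (this layout and the subplot arrangement are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_qtable_action(q_table, action, n_cols=8):
    """For a fixed action, draw one subplot per racket position; each
    subplot shows the Q-values over all ball positions (white = good,
    black = bad), mirroring the figures above."""
    n_racket = q_table.shape[0]
    n_rows = int(np.ceil(n_racket / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(2 * n_cols, 2 * n_rows))
    for racket_y, ax in zip(range(n_racket), axes.ravel()):
        ax.imshow(q_table[racket_y, :, :, action], cmap="gray")
        ax.set_title(str(racket_y))       # racket coordinate, as in the figures
        ax.axis("off")
    for ax in axes.ravel()[n_racket:]:
        ax.axis("off")                    # hide unused subplots
    plt.tight_layout()
    plt.show()
```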