\chapter{Reinforcement Learning}
\label{chap:r-learning}
We have seen how the policy and value iteration methods require either a model or a full and sufficiently rich trajectory, and how MC learning forces us to wait until the end of the episode to improve the policy. In this chapter, we want to learn from the data as it is collected along a single trajectory, while also controlling the system.
In MC learning, to construct the $Q$ function from specific realizations, we could not make use of the Bellman equation, eq.~\eqref{eq:Q-function-policy-evaluation}, because it is only satisfied in the limit of an infinitely long episode. For a finite episode, the \emph{temporal difference (TD) error} is the amount by which the Bellman equation is not satisfied,
\begin{equation}
\label{eq:TD-error}
e = r + \gamma Q(x^\prime,u^\prime) - Q(x,u)
\end{equation}
for two subsequent data points
\begin{equation*}
% (x_k,u_k,r_k)\quad (x_{k+1},u_{k+1}).
(x,u,r)\quad (x^\prime,u^\prime).
\end{equation*}
In general the error $e$ is not zero but its average must be, $\mathbb{E}[e] = 0$, according to the Bellman equation.
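Explicitly, taking the expectation of the TD error over the reward and the next state--action pair, with the next action drawn from the policy under evaluation,
\begin{equation*}
\mathbb{E}[e] = \mathbb{E}_{r,x^\prime,u^\prime}\left[r + \gamma Q(x^\prime,u^\prime)\right] - Q(x,u) = 0,
\end{equation*}
which is exactly the policy-evaluation Bellman equation eq.~\eqref{eq:Q-function-policy-evaluation}.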
An iterative way to find the $Q$ function that solves $\mathbb{E}[e] = 0$ is provided by stochastic approximation. Let $q$ be a decision variable and $e(q)$ a random variable depending on it; we seek the $q^\star$ that solves $\mathbb{E}[e(q^\star)]=0$. Then the iteration\footnote{The lecture wrote the iteration as $q_{k+1} \leftarrow q_k -\alpha_k e(q_k)$; here it is written as an equality between successive iterates.}
\begin{equation}
\label{eq:TD-error-iterative-solution}
q_{k+1} = q_k -\alpha_k e(q_k)
\end{equation}
converges to the solution $q^\star$ of $\mathbb{E}[e(q)]=0$ provided
\begin{itemize}
\item $e(q)$ is bounded;
\item $\mathbb{E}[e(q)]$ is non-decreasing in $q$;
\item $\mathbb{E}[e^\prime(q^\star)]$ exists and is positive;
\item the sequence $\alpha_k$ satisfies
\begin{equation*}
\sum_{k=0}^\infty \alpha_k = \infty, \quad \sum_{k=0}^\infty \alpha_k^2 < \infty
\end{equation*}
\end{itemize}
The iteration is a gradient-like update with a time-varying gain $\alpha_k$, which can be taken to be $\alpha_k = \frac{1}{k}$. At each iteration, $q$ is updated using a contribution $\alpha_k e(q_k)$ that, statistically, decreases in absolute value: if $\alpha_k$ decreases too slowly, convergence to $q^\star$ is slow (in the limiting case of a constant $\alpha_k=\alpha$, the iterates do not converge exactly but keep fluctuating around $q^\star$); if $\alpha_k$ decreases too fast (\textit{e.g.} $\alpha_k=\tfrac{1}{k^2}$), the sequence $\{q_k\}$ converges to a wrong value that depends on the initial value of the sequence.
Intuitively, each step nudges $q$ against the observed error; since the error has zero mean at $q^\star$, the decreasing gain averages the noise out over the iterations without freezing the iteration too early.
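As a toy illustration (a sketch, not taken from the lecture), the following Python snippet estimates the mean of noisy samples $z$, i.e.\ solves $\mathbb{E}[e(q)] = \mathbb{E}[q - z] = 0$; with $\alpha_k = \tfrac{1}{k}$ the iteration reduces to the running sample mean of the observed $z$'s.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_error(q):
    # One realization of e(q) = q - z, with z noisy around the unknown mean 2.0,
    # so that E[e(q)] = q - 2 and q* = 2.
    return q - (2.0 + rng.normal(scale=1.0))

q = 0.0                     # initial guess q_0
for k in range(1, 10001):
    alpha = 1.0 / k         # decreasing gain: sum alpha = inf, sum alpha^2 < inf
    q = q - alpha * sample_error(q)

print(q)                    # close to q* = 2.0 (the running sample mean)
\end{verbatim}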
\subsection{TD Learning: SARSA}
\label{sec:TD-learning-sarsa}
The SARSA algorithm\footnote{SARSA derives its name from the use in the reinforcement learning community of the tuple
\begin{equation*}
(s,a,r,s^\prime,a^\prime)
\end{equation*}
for the update iteration, where $s$ and $a$ are the state and action and $s^\prime$ and $a^\prime$ the next state and action.} uses the empirical evaluation of the TD error eq.~\eqref{eq:TD-error} to update the $Q$ function via the stochastic approximation iteration eq.~\eqref{eq:TD-error-iterative-solution}\footnote{The sign is $+$ because the TD error of eq.~\eqref{eq:TD-error} is non-increasing in $Q(x,u)$, so it plays the role of $-e(q)$ in eq.~\eqref{eq:TD-error-iterative-solution}. In the original notes the index $k$ labelled both the state and the iteration number, while $\alpha$ is not state-dependent; here $k$ is dropped from the states.}
\begin{equation}
\label{eq:TD-learning-iterative-update}
Q(x,u) \leftarrow Q(x,u) + \alpha_k\left[r + \gamma Q(x^\prime,u^\prime) - Q(x,u)\right]
\end{equation}
The SARSA algorithm interleaves policy evaluation with policy improvement: at every time $k$, it
\begin{itemize}
\item selects $u$ via an $\epsilon$-greedy policy eq.~\eqref{eq:epsilon-greedy-policy-improvement} based on the latest estimate of $Q(x,u)$; and
\item performs the iterative update eq.~\eqref{eq:TD-learning-iterative-update}, typically with a constant $\alpha_k = \alpha$ (which leads to a non-zero steady-state variance but is suitable for non-stationary problems), while letting $\epsilon\rightarrow 0$ as $k\rightarrow \infty$ to control the convergence of the learning.
\end{itemize}
Contrast this with MC learning, where the algorithm needs to wait until the end of the episode to estimate $Q$.
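As an illustration only (not part of the lecture material), a tabular SARSA sketch in Python follows; the environment interface \texttt{env.reset()}/\texttt{env.step(u)}, the state and action counts, and all parameter values are hypothetical, and the stage reward is treated as a cost to be minimized, consistently with the $\min$ used in the next subsection.
\begin{verbatim}
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular SARSA sketch (cost-minimization convention).

    env is assumed to expose reset() -> x and step(u) -> (x_next, r, done);
    this interface and the default parameters are illustrative only.
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(x):
        # Explore with probability eps, otherwise pick the minimal-cost action.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmin(Q[x]))

    for _ in range(episodes):
        x = env.reset()
        u = eps_greedy(x)
        done = False
        while not done:
            x_next, r, done = env.step(u)
            u_next = eps_greedy(x_next)   # on-policy: the action actually taken next
            target = r if done else r + gamma * Q[x_next, u_next]
            Q[x, u] += alpha * (target - Q[x, u])
            x, u = x_next, u_next
        # In practice eps is decreased over the episodes (eps -> 0 as k -> inf).
    return Q
\end{verbatim}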
\subsection{TD Learning: $Q$ learning}
\label{sec:TD-learning-Q-learning}
Value iteration solves the Bellman equation without a policy improvement step. We can apply TD learning directly to the Bellman optimality principle because, in eq.~\eqref{eq:Q-function-Bellman-optimality}, the optimal action at $x^\prime$ is the one that minimizes $Q^\star(x^\prime,u^\prime)$.
In our case, we hope to select such an action by maintaining an \emph{estimate} of the optimal $Q$ function that is continuously updated from the individual realizations
\begin{equation*}
% Q^\star(x_k,u_k) \approx r_k + \gamma \min_{u^\prime} Q^\star(x_{k+1},u^\prime)
Q^\star(x,u) \approx r + \gamma \min_{u^\prime} Q^\star(x^\prime,u^\prime)
\end{equation*}
because, in expectation, the RHS coincides with the RHS of eq.~\eqref{eq:Q-function-Bellman-optimality}:
\begin{equation*}
Q^\star(x,u) = \mathbb{E}_{r,x^\prime} \left[r + \gamma \min_{u^\prime} Q^\star(x^\prime,u^\prime)\right].
\end{equation*}
The $Q$ learning algorithm updates the estimate of the $Q$ function using the stochastic approximation\footnote{As for SARSA, $\alpha_k$ is often taken constant in practice, at the price of a non-zero steady-state variance; the convergence guarantee requires the decreasing-gain conditions above.}
\begin{equation}
\label{eq:Q-learning-stochastic-iterate}
Q(x,u) \leftarrow Q(x,u) + \alpha_k\left(r + \gamma \min_{u^\prime} Q(x^\prime,u^\prime) - Q(x,u)\right).
\end{equation}
There are a few things to notice:
\begin{itemize}
\item the next applied input does not appear in eq.~\eqref{eq:Q-learning-stochastic-iterate}: the minimization is over all actions, so the update is independent of the action actually taken at $x^\prime$. $Q$ learning is an \emph{off-policy} algorithm because it learns the optimal policy even from data generated by a random input (the sketch after this list makes this explicit);
\item convergence to the optimal $Q$ is guaranteed under the same assumptions as for the stochastic approximation;
\item the policy is typically updated as $\epsilon$-greedy;
\item for large state and input spaces, one usually approximates the $Q$ function with linear approximators or neural networks;
\item the stochastic approximation step eq.~\eqref{eq:Q-learning-stochastic-iterate} is \emph{forward} dynamic programming, going forward in time.
\end{itemize}
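A tabular $Q$ learning sketch (again purely illustrative, with the same hypothetical environment interface and parameters as the SARSA sketch above) makes the off-policy nature explicit: the behaviour policy is $\epsilon$-greedy, but the target uses the minimizing action at $x^\prime$.
\begin{verbatim}
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning sketch (cost minimization); interface as in the SARSA sketch."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # Behaviour policy: eps-greedy (any sufficiently exploring policy would do).
            if rng.random() < eps:
                u = int(rng.integers(n_actions))
            else:
                u = int(np.argmin(Q[x]))
            x_next, r, done = env.step(u)
            # Target uses the minimizing action at x_next, not the action taken there.
            target = r if done else r + gamma * Q[x_next].min()
            Q[x, u] += alpha * (target - Q[x, u])
            x = x_next
    return Q
\end{verbatim}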
\subsection{$Q$ Learning with Linear Approximation}
We want to develop\footnote{This derivation differs from the one given in class, which I did not fully understand; the exchange on Moodle did not help either, despite the TA's best effort.} a version of $Q$ learning that uses the value function approximation $Q_\theta$~\cite[Sect.~17.6]{decision-making-kochenderfer}. To do this, we search for the approximation $\Phi^\top(x,u)\theta$ of $Q^\star(x,u)$ that minimizes\footnote{This is only my speculation: the average over the policy gives the expected value without the $\rho$-norm of Sect.~\ref{sec:RL-linear-parametrization}. \emph{I guess} this is because the approximation only needs to be accurate on the subset of states that are relevant and well explored by the policy.}
\begin{equation*}
L(\theta) = \frac{1}{2} \underset{\pi^\star}{\mathbb{E}}\left[\left(Q^\star(x,u) - \Phi^\top(x,u)\theta\right)^2\right]
\end{equation*}
on the state-action pairs generated by the optimal policy $\pi^\star$. A solution to the minimization is found iteratively by using gradient descent: the update rule is
\begin{align*}
\theta \leftarrow \theta - \alpha \nabla L(\theta) = \theta + \alpha \underset{\pi^\star}{\mathbb{E}}\left[\left(Q^\star(x,u) - \Phi^\top(x,u)\theta\right)\Phi(x,u)\right]
\end{align*}
with $\alpha>0$ the step size. We now introduce two simplifications:
\begin{itemize}
\item in the update rule, we replace the expectation with the experimental samples of our state-action pairs (a stochastic-gradient step); and
\item similarly to what is done in $Q$ learning, we use the individual realization
\begin{equation*}
Q^\star(x,u) \approx r + \gamma \min_{u^\prime} \Phi^\top(x^\prime,u^\prime)\theta
\end{equation*}
as an estimate for $Q^\star$.
\end{itemize}
With these two approximations, the update rule becomes
\begin{equation*}
\theta \leftarrow\theta + \alpha \left[\left(r + \gamma \min_{u^\prime} \Phi^\top(x^\prime,u^\prime)\theta - \Phi^\top(x,u)\theta\right)\Phi(x,u)\right].
\end{equation*}
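A minimal sketch of this update in Python is given below; the feature map \texttt{phi(x, u)} and the finite action set \texttt{actions} are placeholders for problem-specific choices.
\begin{verbatim}
import numpy as np

def linear_q_update(theta, phi, x, u, r, x_next, actions,
                    alpha=0.01, gamma=0.95):
    """One Q-learning step with the linear approximation Q(x, u) ~ phi(x, u)^T theta.

    phi(x, u) is a user-supplied feature map returning a vector of len(theta);
    actions is the finite set of admissible inputs.  Illustrative sketch only.
    """
    # Bootstrapped target: one realization of the Bellman optimality RHS.
    target = r + gamma * min(phi(x_next, up) @ theta for up in actions)
    td_error = target - phi(x, u) @ theta
    # Gradient-like step along the feature vector, as in the update rule above.
    return theta + alpha * td_error * phi(x, u)
\end{verbatim}
In a full algorithm this update is interleaved with an $\epsilon$-greedy choice of $u$ based on the current $\theta$, exactly as in the tabular case.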
\begin{comment}
\section{Are MC and TD Model Free?}
What is needed to know
\begin{itemize}
\item We assume that the model is not time-dependent;
\item the state must be known
\end{itemize}
Is it possible
To control a system one tries a policy
\subsection{Policy Gradient}
In RL there is a subsection called policy gradient methods.
The associated cost
\begin{equation*}
R(\tau) = \sum_{t=0}^T \gamma^t r_t
\end{equation*}
Consider a stochastic policy $\pi_\theta(x,u)$ parametrized by $\theta$. The goal is the minimization of the expected cost
\begin{equation*}
J(\theta) \doteq \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \sum_\tau P_\theta(\tau) R(\tau)
\end{equation*}
where $P_\theta(\tau)$ is the probability that trajectory $\tau$ happens when the policy $\pi_\theta$ is used.
There are two sources of complexity
\begin{itemize}
\item there is a sum over all possible trajectories
\item $P_\theta(\tau)$ is unknown and depends on both the transition probabilities of the system and the policy $\pi_\theta$.
\end{itemize}
\end{comment}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "notes"
%%% End: