\chapter{Dynamic Programming and LQR}
\label{sec:dynamic-programming-lqr}
In this chapter we review the basics of dynamic programming and show how it can be used to derive the optimal control law for a linear time-invariant system with quadratic cost. The chapter serves three purposes:
\begin{enumerate}
\item it formulates a control problem in discrete time with a state-space representation of the plant, which will be the formulation that we will use for most of the course;
\item it reviews some tools and concepts that will be used multiple times in the rest of the course, such as backward induction and the Bellman optimality principle;
\item it introduces the first computational control strategy of the course, \textit{i.e.}, the linear quadratic regulator (LQR).
\end{enumerate}
\section{The Optimal Control Problem}
In the most general sense, optimally controlling a (discrete-time) dynamical system consists in finding a sequence of control inputs $\bsu = \{u_t\}_{t=0,1,\ldots}$ that minimizes a cost function $J(\bsu)$,
\begin{equation*}
\min_{\bsu\in\mathcal{U}} J(\bsu)
\end{equation*}
with the control input constrained to the space $\mathcal{U}$. This is generally a hard problem for the following reasons:
\begin{itemize}
\item the dimension of the decision variable $\bsu$ is large;
\item a fairly accurate model of the system is required. This can be obtained from first principles, from numerical simulation, or it can be a black box (oracle) that, given the control input $\bsu$, computes the system trajectory and evaluates the cost;
\item the optimization problem is in general non-convex. For convex problems, which have a single (global) minimum\footnote{Not all functions with a single minimum are convex: for instance
\begin{equation*}
f(x) = -\frac{1}{1+x^2}.
\end{equation*}}, there exist efficient solvers.
\end{itemize}
Moreover, from a practical point of view, the open-loop nature of the optimal control sequence makes it less reliable in the presence of disturbances and uncertainties.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Dynamic Programming}
\label{sec:bellmans-principle}
The optimal control problem becomes tractable under the following conditions:
\begin{itemize}
\item there exists a Markovian state that evolves according to
\begin{equation*}
x_{t+1} = f_t(x_t,u_t).
\end{equation*}
In other words, the complete state of the system is described by $x_t$ which, together with $u_t$, is all that is needed to determine the future evolution;
\item the initial condition $x_0$ is known;
\item the cost is additive over time
\begin{equation*}
V(\bsx,\bsu) = \sum_t g_t(x_t,u_t).
\end{equation*}
\end{itemize}
Under these assumptions, the optimal control problem can be rewritten as the following optimization problem\footnote{In the class notes the minimization is done over both states $\bsx$ and control inputs $\bsu$, as they are both variables to be determined by the solver. In these notes, on the contrary, I write the minimization only over $\bsu$ and consider the states $\bsx$ to be determined by the initial condition $x_0$ and the system dynamics.}
\begin{equation}
\label{eq:general-optimization-problem}
\begin{aligned}
\min_{\bsu} \quad & \sum_{t=0}^{T-1} g_t(x_t,u_t) + g_T(x_T) \\
\text{subject to} \quad & x_{t+1} = f_t(x_t,u_t)\\
& x_0=X_0 \\
&x_t \in \mathcal{X}_t\quad \forall t \\
&u_t \in \mathcal{U}_t\quad \forall t.
\end{aligned}
\end{equation}
\subsection{Dynamic Programming}
\label{sec:dynamic-programming}
By employing Bellman's optimality principle, we can approach the problem by iterating backwards from the last stage $T$: at $T$ the problem is trivial since there is no control input to optimize over. Therefore, if the system is in state $x_T$ at time $T$, the remaining cost to be incurred is simply
\begin{equation*}
V_T(x_T) = g_T(x_T).
\end{equation*}
At time $T-1$, we need to minimize
\begin{align*}
V_{T-1}(x_{T-1}) &= \left\{
\begin{aligned}
\min_{u_{T-1}} &\ g_{T-1}(x_{T-1},u_{T-1}) + V_T(x_T) \\
\text{subject to } & x_T = f_{T-1}(x_{T-1},u_{T-1})
\end{aligned}
\right. \\
&= \min_{u_{T-1}} \big( g_{T-1}(x_{T-1},u_{T-1}) + V_T(f_{T-1}(x_{T-1},u_{T-1})) \big)
\end{align*}
and at a generic time $t$
\begin{equation}
\label{eq:backward-induction}
V_t(x_t) = \min_{u_t} \left\{ g_t(x_t,u_t) + V_{t+1}(f_t(x_t,u_t)) \right\}.
\end{equation}
In other words, the problem is decomposed into stage problems that can be solved by backward induction.
The problem is computationally easy to solve if
\begin{itemize}
\item the decision space at every stage is convex;
\item $g_t(x_t,\cdot)$ is convex in $u_t$. Note that we do not require $g_t$ to be convex in $x_t$ because $x_t$ is a parameter of the problem;
\item $V_{t+1}(f_t(x_t,\cdot))$ is convex in $u_t$.
\end{itemize}
In practice the problem is tractable only for quadratic cost $g$ and linear dynamics $f$.
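When the state and the input take values in finite sets, the backward induction of eq.~\eqref{eq:backward-induction} can also be implemented directly by tabulating $V_t$. The following Python sketch is only an illustration: the callables \texttt{f}, \texttt{g} and \texttt{g\_T} are hypothetical placeholders for the dynamics, the stage cost and the terminal cost, and \texttt{f} is assumed to map back into the finite state set.
\begin{verbatim}
def backward_induction(states, inputs, f, g, g_T, T):
    """Tabular dynamic programming by backward induction.

    states, inputs: finite lists of admissible states and inputs
    f(t, x, u): dynamics, returns the successor state (must lie in states)
    g(t, x, u): stage cost; g_T(x): terminal cost
    Returns value tables V[t][x] and an optimal policy pi[t][x].
    """
    V = [dict() for _ in range(T + 1)]
    pi = [dict() for _ in range(T)]
    for x in states:                      # V_T(x) = g_T(x)
        V[T][x] = g_T(x)
    for t in reversed(range(T)):          # backward recursion, eq. (backward-induction)
        for x in states:
            costs = {u: g(t, x, u) + V[t + 1][f(t, x, u)] for u in inputs}
            u_best = min(costs, key=costs.get)
            V[t][x], pi[t][x] = costs[u_best], u_best
    return V, pi
\end{verbatim}
Memory and run time grow with the product of the number of states, inputs and stages, which is why this tabular approach is viable only for small discrete problems.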
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear-Quadratic Regulator (LQR)}
\label{sec:lqr}
LQR problems have quadratic cost functions\footnote{A more general quadratic cost includes the cross terms in $x$ and $u$
\begin{equation*}
\begin{bmatrix}
x^\top & u^\top
\end{bmatrix}
\begin{bmatrix}
Q & N \\ N^\top & R
\end{bmatrix}
\begin{bmatrix}
x \\ u
\end{bmatrix} = x^\top Qx + u^\top Ru + 2x^\top N u.
\end{equation*}
This type of quadratic cost function can be used to account for the control effort of a pre-existing proportional control law.
Consider for example the case in which an additional control term $-Kx$ is present. In that case the total control input is $-Kx + u$ and its cost is
\begin{equation*}
(-Kx + u)^\top R (-Kx + u) = x^\top K^\top R K x + u^\top R u - 2 x^\top K^\top R u
\end{equation*}
which corresponds to the quadratic form above when $Q=K^\top R K$ and $N = -K^\top R$.}%
\footnote{Were $Q = Q_S+Q_A$ (or $R$) to have an antisymmetric component $Q_A\doteq\tfrac{1}{2}\left(Q-Q^\top\right)$, this would not contribute to the cost function. Remembering that $Q_A=-Q_A^\top$ and that $x^\top Q_Ax = x^\top Q_A^\top x$ since it is a scalar, one has
\begin{equation*}
x^\top Q_Ax = x^\top Q_A^\top x = -x^\top Q_Ax
\end{equation*}
and then $x^\top Q_Ax=0$.}, linear dynamics and no constraints
\begin{equation}
\label{eq:LQR-formulation}
\begin{aligned}
\min_{\bsu} \quad & \sum_{t=0}^{T-1} \left(||x_t||_Q^2 + ||u_t||_R^2\right) + ||x_T||_S^2,\quad Q,S \succeq 0,\ R \succ 0 \\
\text{subject to} \quad & x_{t+1} = Ax_t + Bu_t.
\end{aligned}
\end{equation}
One imposes a \emph{strictly} positive cost on the control inputs, $R\succ 0$, to reflect the practically relevant case in which there is a cost associated with actuating the system.
The optimal control in this specific case is a linear feedback of the state.
\begin{proof}
We start by showing that the remaining cost until time $T$ can be expressed in the form
\begin{equation}
\label{eq:LQR-generic-solution}
V_t(x) = x^\top P_tx,\quad P_t\succeq 0.
\end{equation}
We show this by induction starting from $t=T$ and recursing backwards.
The initial case is satisfied because $V_T(x) = x^\top Sx$ is quadratic and $P_T = S$. At time $t$, assuming that $V_{t+1}(x) = x^\top P_{t+1} x$, we have $V_t(x) = \min_u J_t(x,u) = J_t(x,u^\star)$ where
\begin{align*}
J_t(x,u) &= x^\top Qx + u^\top Ru + V_{t+1}(Ax+Bu) \\
&= x^\top\left(Q + A^\top P_{t+1}A\right) x + u^\top\left(R + B^\top P_{t+1}B\right) u + 2u^\top B^\top P_{t+1}Ax.
\end{align*}
To find the minimizer $u^\star$ we set the gradient of $J_t$ with respect to $u$ to zero, $\nabla_{u} J_t(x,u) = 0$
\begin{align}
&\left(R + B^\top P_{t+1}B\right)u^\star + B^\top P_{t+1}A x = 0 \nonumber \\
\label{eq:LQR-optimal-control-input}
&\quad \rightarrow u^\star = -\underbrace{(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A}_{K_t}x.
\end{align}
This solution exists since $R+B^\top P_{t+1}B$, being a positive definite $m\times m$ matrix, can be inverted. We need to prove that $V_t(x)= x^\top P_t x$. We plug the optimal control $u^\star$ into the expression for $V_t(x)=J_t(x,u^\star)$ to obtain
\begin{align*}
V_t(x) &= x^\top \left(Q + A^\top P_{t+1}A\right) x - x^\top A^\top P_{t+1}B\left(R+B^\top P_{t+1}B\right)^{-1}B^\top P_{t+1}Ax
\end{align*}
which is of the form $V_t(x) = x^\top P_tx$ with
\begin{equation}
\label{eq:solution-timedependent-ricatti}
P_t = Q + A^\top P_{t+1}A - A^\top P_{t+1}B\left(R+B^\top P_{t+1}B\right)^{-1}B^\top P_{t+1}A.
\end{equation}
Positive semidefiniteness of $P_t$ follows since $x^\top P_tx = V_t(x) = \min_u J_t(x,u) \geq 0$ for every $x$, $J_t$ being a sum of nonnegative terms.
The optimal control sequence $\bsu^\star = \{u_0^\star, u_1^\star,\ldots,u_{T-1}^\star\}$ can be computed by evaluating the matrices $K_t$ by backward recursion, using eq.~\eqref{eq:solution-timedependent-ricatti} and eq.~\eqref{eq:LQR-optimal-control-input}.
\end{proof}
Note that the optimal control sequence is a feedback law, $u_t=-K_tx_t$.
One could compute the entire optimal sequence via forward integration of the system dynamics from the initial state $x_0$ to obtain the next states $x_t$ and the next control inputs $u_t^\star = -K_tx_t$.
However, it is more convenient to perform the computation of the optimal control input $u_t$ online, using the linear feedback gain $K_t$ and the measurement $x_t$ of the plant state, for the following reason.
In case of state perturbations, the system evolves differently from the predicted trajectory, but applying the input $u_t=-K_t\tilde{x}_t$ to the measured state $\tilde{x}_t$ still guarantees optimality by Bellman's principle: the computed policy is optimal from any state onwards. Contrast this with model mismatch: in that case the feedback law using $K_t$ is not optimal, because it relies on the matrices $A$ and $B$ being an accurate representation of the system.
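The backward recursion of eq.~\eqref{eq:LQR-optimal-control-input} and eq.~\eqref{eq:solution-timedependent-ricatti} is straightforward to implement numerically. The following Python sketch (using \texttt{numpy}; the system matrices in the example are arbitrary placeholders) computes the gains $K_0,\ldots,K_{T-1}$ and then applies the feedback online.
\begin{verbatim}
import numpy as np

def finite_horizon_lqr(A, B, Q, R, S, T):
    """Backward Riccati recursion; returns the gains [K_0, ..., K_{T-1}]."""
    P = S                                                  # P_T = S
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # eq. (LQR-optimal-control-input)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K              # eq. (solution-timedependent-ricatti)
        gains.append(K)
    return gains[::-1]                                     # gains[t] = K_t

# Placeholder discretized double integrator, controlled online as u_t = -K_t x_t
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R, S = np.eye(2), np.array([[1.0]]), np.eye(2)
K = finite_horizon_lqr(A, B, Q, R, S, T=50)
x = np.array([[1.0], [0.0]])
for t in range(50):
    u = -K[t] @ x          # feedback on the measured state
    x = A @ x + B @ u
\end{verbatim}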
\subsection{Infinite Horizon LQR}
\label{sec:infinite-horizon-LQR}
There are situations where one wants to maintain a stationary state. This is captured by making the horizon infinite,
\begin{equation}
\label{eq:infinite-horizon-LQR-formulation}
\begin{aligned}
\min_{\bsu} &\sum_{t=0}^\infty \left(||x_t||_Q^2 + ||u_t||_R^2 \right),\quad Q\succeq 0, R\succ 0 \\
\text{subject to } &x_{t+1} = Ax_t + Bu_t,\quad x_0=X_0.
\end{aligned}
\end{equation}
If the backward induction that we have used to compute the optimal feedback gain converges, then the problem becomes time-invariant and only one feedback matrix $K_\infty$ is required.
The convergence of eq.~\eqref{eq:infinite-horizon-LQR-formulation} is guaranteed by the following observations:
\begin{enumerate}
\item the cost is bounded from above: if the system is stabilizable\footnote{A system is stabilizable if all uncontrollable state variables have stable dynamics.}, there is an input sequence that yields a finite cost. To see why, let $u_t = -Kx_t$ be a linear feedback such that $A-BK$ has eigenvalues inside the unit circle, so that $x_t=(A-BK)^tX_0$. The cost to go
\begin{align*}
V(X_0) &= \sum_{t=0}^\infty x_t^\top Qx_t^{\phantom{\top}} + x_t^\top K^\top RKx_t \\
&= X_0^\top \sum_{t=0}^\infty (A-BK)^{t\top}\left(Q+K^\top RK\right)(A-BK)^tX_0
\end{align*}
is a convergent geometric series;
\item the sequence $P_t$ is non-decreasing, as it describes the cost of an increasingly long sum of stage costs.
Being non-decreasing and bounded above, the sequence $P_t$ converges to a limit.
\end{enumerate}
Since the problem is time-invariant, the feedback law is computed by solving the time-invariant algebraic Riccati equation (ARE), obtained by dropping the subscript $t$ in eq.~\eqref{eq:solution-timedependent-ricatti},
\begin{equation*}
P = Q + A^\top P A - A^\top P B\left(R + B^\top PB\right)^{-1}B^\top P A
\end{equation*}
using \textit{e.g.} an ARE solver or analytically. Note that AREs may have multiple solutions; for a controllable system, the \emph{optimal} stabilizing solution $P_S$ is the one that yields the minimum cost among the stabilizing feedback laws. An ARE solver will generally pick the correct (stabilizing) solution.
Alternatively, one can iterate eq.~\eqref{eq:solution-timedependent-ricatti}
\begin{equation}
\label{eq:solution-timeindependent-ricatti-iterate}
P \leftarrow Q + A^\top PA - A^\top PB\left(R+B^\top PB\right)^{-1}B^\top PA
\end{equation}
until convergence to $P_\infty$.
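Both routes can be sketched in a few lines of Python: the fixed-point iteration of eq.~\eqref{eq:solution-timeindependent-ricatti-iterate}, or a dedicated solver such as \texttt{scipy.linalg.solve\_discrete\_are}. The matrices below are placeholders, and the iteration is initialized at $Q$ purely for illustration.
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_discrete_are

def dare_by_iteration(A, B, Q, R, tol=1e-10, max_iter=10000):
    """Iterate eq. (solution-timeindependent-ricatti-iterate) until convergence."""
    P = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ A - A.T @ P @ B @ K
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Riccati iteration did not converge")

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
P_inf = dare_by_iteration(A, B, Q, R)
P_are = solve_discrete_are(A, B, Q, R)      # agrees with P_inf on this example
K_inf = np.linalg.solve(R + B.T @ P_are @ B, B.T @ P_are @ A)
\end{verbatim}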
The corresponding feedback gain $K_\infty$ does not, however, stabilize the system unless the uncontrollable modes are stable and the controllable unstable modes\footnote{$(Q^{1/2},A)$ being detectable is sufficient, although many textbooks, such as theorem~8.1 in ``Predictive Control for Linear and Hybrid Systems'' by Borrelli, Bemporad and Morari, require observability of the pair.} are weighted in $Q$:
\begin{theorem}
The feedback $K_\infty$ stabilizes the system if and only if $(A,B)$ is stabilizable and $(Q^{1/2},A)$ is observable\footnote{In class, the theorem was given for the observability of the pair $(C,A)$ where $Q=CC'$. When $Q$ is symmetric, its SVD has the simpler form $Q = U\Sigma U'$ since
\begin{equation*}
U\Sigma V' = Q = Q' = V\Sigma U'
\end{equation*}
and $C$ is set to $C \doteq U\sqrt{\Sigma}$ after dropping the zero singular values from $\Sigma$ and the corresponding rows from $U$.}.
\end{theorem}
The existence of a stabilizing controller $K$ is guaranteed by the fact that $(A,B)$ is stabilizable. Since $(Q^{1/2},A)$ is observable, the unstable states of $A$ appear at the output $y=Q^{1/2}x$, which also enters the objective function as the state-dependent term
\begin{equation*}
||y||^2 = \left(x^\top Q^{1/2}\right)\left(Q^{1/2} x^{\phantom{\top}}\right) = x^\top Qx.
\end{equation*}
The convergence of the infinite-horizon cost $V$ implies the convergence of $y$ (and, since $R\succ 0$, of $u$) and hence, by observability, the convergence of $x$.
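The two conditions of the theorem can also be checked numerically. A possible Python sketch is given below: the construction of a factor $C$ with $C^\top C = Q$, so that $y=Cx$ plays the role of $Q^{1/2}x$, mirrors the footnote (using the eigendecomposition, which coincides with the SVD for a symmetric positive semidefinite $Q$), observability is tested through the rank of the observability matrix, and the tolerance is an arbitrary choice.
\begin{verbatim}
import numpy as np

def sqrt_psd(Q, tol=1e-12):
    """Return C such that C.T @ C = Q, dropping numerically zero eigenvalues."""
    w, U = np.linalg.eigh(Q)                 # Q = U diag(w) U.T with w >= 0
    keep = w > tol
    return (U[:, keep] * np.sqrt(w[keep])).T

def is_observable(C, A):
    """Rank test on the observability matrix of the pair (C, A)."""
    n = A.shape[0]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return np.linalg.matrix_rank(O) == n
\end{verbatim}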
\subsection{Computation of a Stabilizing Controller}
\label{sec:stability-optimal-controller}
If $(Q^{1/2},A)$ has some unobservable states, the optimal control law $K_\infty$ may not stabilize the system, because the optimality of the controller is defined with respect to the cost function. Consider the unstable system
\begin{equation*}
x^+ =
\begin{bmatrix}
2 & 0 \\ 0 & 3
\end{bmatrix}x + u, \quad Q =
\begin{bmatrix}
1 & 0 \\ 0 & 0
\end{bmatrix}, \quad R =
\begin{bmatrix}
1 & 0 \\ 0 & 1
\end{bmatrix},\quad u =-Kx
\end{equation*}
where the diagonal $A$ matrix has eigenvalues outside of the unit circle, $\lambda=2,3$. The system can be stabilized by an appropriate choice of $K$, which however cannot be optimal: acting on the second state incurs a cost, since $R_{22}\neq 0$, while there is no cost associated with letting the second state diverge, because $Q_{22}=0$. The optimal solution will therefore apply zero control input to the second state.
However, even if a mode is not stable and not weighted in the cost function, it can still be stabilized by computing $P_S$ as the limit of the backward induction eq.~\eqref{eq:solution-timeindependent-ricatti-iterate} initialized at any $P_0\succ 0$.
This is equivalent to assuming a non-zero terminal cost\footnote{Strictly speaking there is no terminal cost in the infinite horizon LQR problem, but there is one in every finite horizon LQR problem that we consider when we perform the backward induction procedure.}.
The resulting limit will also be a solution of the Riccati equation and will be the optimal control among the stabilizing feedback laws.
The limit $P_\infty$ of the iteration starting with $P_0=0$ may not produce a stabilizing solution.
For the example above, the Riccati equation can be solved by hand: $P$ is diagonal, and its entries $p_1$ and $p_2$ satisfy
\begin{align*}
1 + 3p_1 - \frac{4p_1^2}{1+p_1} = 0\ &\rightarrow p_1 = 2\pm \sqrt{5} \\
8p_2 - \frac{9p_2^2}{1+p_2} = 0\ &\rightarrow p_2 = 0, 8
\end{align*}
and the optimal \emph{stabilizing} feedback solution is
\begin{equation*}
P_S =
\begin{bmatrix}
2 + \sqrt{5} & 0 \\ 0 & 8
\end{bmatrix},\quad K_S =
2\begin{bmatrix}
\frac{2+\sqrt{5}}{3+\sqrt{5}} & 0 \\ 0 & \frac{4}{3}
\end{bmatrix}.
\end{equation*}
A solution containing $p_1=2-\sqrt{5}$ must be rejected because it does not generate a positive semi-definite $P$.
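The example can also be verified numerically by running the Riccati iteration from the two different initializations. The short Python sketch below (assuming \texttt{numpy}; the helper function is only illustrative) reproduces the two limits and checks the closed-loop eigenvalues.
\begin{verbatim}
import numpy as np

A = np.diag([2.0, 3.0]); B = np.eye(2)
Q = np.diag([1.0, 0.0]); R = np.eye(2)

def riccati_limit(P, n_iter=200):
    """Iterate eq. (solution-timeindependent-ricatti-iterate) starting from P."""
    for _ in range(n_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

P0, K0 = riccati_limit(np.zeros((2, 2)))  # -> diag(2+sqrt(5), 0): no action on x_2
PS, KS = riccati_limit(np.eye(2))         # -> diag(2+sqrt(5), 8): stabilizing solution
print(np.abs(np.linalg.eigvals(A - B @ K0)))  # one closed-loop eigenvalue equals 3
print(np.abs(np.linalg.eigvals(A - B @ KS)))  # all eigenvalues inside the unit circle
\end{verbatim}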
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "notes"
%%% End: