\chapter{Data-Enabled Predictive Control}
\label{chap:deepc}
This lecture presents a data-based representation of a linear system. This description generates valid input and output sequences without using a state-based representation and avoids explicit system identification, \textit{e.g.}\ via the Ho-Kalman algorithm.
\section{State-Based System Response}
\label{sec:deepc-state-based-response}
The time evolution of the linear system
\begin{equation*}
\begin{aligned}
x_{t+1} &= Ax_t + Bu_t \\
y_t &= Cx_t + Du_t
\end{aligned}
\end{equation*}
from the initial state $x_0$ and under the control input $\bsu$ is
\begin{equation}
\label{eq:time-evolution-free-response-convolution}
y_t = \underbrace{CA^tx_0}_{\text{free response}} + \underbrace{\sum_{j=0}^{t-1}CA^{t-1-j}Bu_j}_{\text{convolution}} + \underbrace{Du_t}_{\text{feed through}}
\end{equation}
or, in matrix form,
\begin{equation}
\label{eq:time-evolution-free-response-convolution-matrix}
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{L-1}
\end{bmatrix} = \underbrace{
\begin{bmatrix}
C \\ CA \\ CA^2 \\ \vdots \\ CA^{L-1}
\end{bmatrix}
}_{\doteq\mathcal{O}_L} x_0 + \underbrace{
\begin{bmatrix}
D & 0 & 0 & \ldots & 0 \\
CB & D & 0 & \ldots & 0 \\
CAB & CB & D & & \vdots \\
\vdots \\
CA^{L-2}B & CA^{L-3}B & \ldots & CB & D
\end{bmatrix}}_{\doteq\mathcal{G}_L}
\begin{bmatrix}
u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{L-1}
\end{bmatrix}
\end{equation}
$\mathcal{O}_L$ is the observability matrix and $\mathcal{G}_L$ is the Toeplitz matrix of Markov parameters describing the input-output response of the system when the initial state is $x_0=0$.
For the rest of this chapter we will, for simplicity, consider SISO systems ($u\in \mathbb{R}$, $y\in \mathbb{R}$) that are both controllable and observable. The additional complication in the MIMO case arises from the need to keep track of $m$ inputs and $p$ outputs.
To keep the notation compact\footnote{Should we have this definition somewhere earlier?}, we will indicate a vector of length $L$ by
\begin{equation*}
\bsw_L \doteq
\begin{bmatrix}
w_0 \\ w_1 \\ \vdots \\ w_{L-1}
\end{bmatrix}.
\end{equation*}
Using this notation, eq.~\eqref{eq:time-evolution-free-response-convolution-matrix} is rewritten as
\begin{equation*}
\bsy_L = \mathcal{O}_Lx_0 + \mathcal{G}_L\bsu_L
\end{equation*}
where (for SISO) $\mathcal{O}_L\in\mathbb{R}^{L\times n}$, while $\mathcal{G}_L\in \mathbb{R}^{L\times L}$ is a square matrix.
We have seen in the previous chapter that the rank of $\mathcal{O}_L$ can at most be $n$: if $L=n$, $\mathcal{O}_L$ is invertible since the system is assumed to be observable, and the initial state $x_0$ is uniquely determined by
\begin{equation*}
x_0 = \mathcal{O}_L^{-1}(\bsy_L - \mathcal{G}_L\bsu_L)
\end{equation*}
if the control input $\bsu_L$ is known. If $L>n$ the system of equations is overdetermined and the initial state $x_0$ can be reconstructed only if $\bsy_L$ is a valid output sequence, that is, if the vector $\bsy_L - \mathcal{G}_L\bsu_L$ is in the column image of $\mathcal{O}_L$.
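As a sanity check, $\mathcal{O}_L$ and $\mathcal{G}_L$ can be assembled numerically from the state-space matrices, compared against a step-by-step simulation, and used to recover $x_0$ by least squares. The following sketch (in Python with NumPy) does this for an arbitrary second-order SISO system; all numerical values are placeholders chosen only for illustration.
\begin{verbatim}
import numpy as np

def observability(A, C, L):
    # stack C, CA, CA^2, ..., CA^{L-1}
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(L)])

def markov_toeplitz(A, B, C, D, L):
    # lower-triangular Toeplitz matrix with entries D, CB, CAB, CA^2 B, ...
    G = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            G[i, j] = D if i == j else \
                (C @ np.linalg.matrix_power(A, i - 1 - j) @ B).item()
    return G

# arbitrary observable second-order SISO system (illustration only)
A = np.array([[0.9, 0.2], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = 0.0

L = 6
x0 = np.array([1.0, -0.5])
u = np.random.default_rng(0).standard_normal(L)

O_L = observability(A, C, L)
G_L = markov_toeplitz(A, B, C, D, L)
y = O_L @ x0 + G_L @ u            # matrix form of the time evolution

# step-by-step simulation of the state-space recursion gives the same output
x, y_sim = x0.copy(), []
for k in range(L):
    y_sim.append((C @ x).item() + D * u[k])
    x = A @ x + B.flatten() * u[k]
assert np.allclose(y, y_sim)

# since L > n and the system is observable, x0 is recovered by least squares
x0_rec, *_ = np.linalg.lstsq(O_L, y - G_L @ u, rcond=None)
assert np.allclose(x0_rec, x0)
\end{verbatim}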
In the different representation of behavioral system theory, the terms are rearranged as
\begin{equation*}
\begin{bmatrix}
\bsu_L \\ \bsy_L
\end{bmatrix} = \underbrace{
\begin{bmatrix}
I_L & 0 \\ \mathcal{G}_L & \mathcal{O}_L
\end{bmatrix}}_{\Lambda_L}
\begin{bmatrix}
\bsu_L \\ x_0
\end{bmatrix}
\end{equation*}
where $\Lambda_L\in \mathbb{R}^{2L\times(L+n)}$ has rank $L+n$ when $L\ge n$, since one is free to choose the initial state $x_0\in \mathbb{R}^n$ and the $L$ control inputs.
This formulation highlights the fact that trajectories of inputs $\bsu_L$ and outputs $\bsy_L$ must lie in the column image of $\Lambda_L$. The size of the matrix $\Lambda_L$ grows with the length of the time sequence one tries to verify.
In this approach inputs and outputs are treated on an equal footing. This is convenient for systems where there is no such clear-cut distinction, for instance a complex electric circuit, where voltages and currents are mutually dependent.
The matrix $\Lambda_L$ is a complete description of the system, equivalent to the one given by the state-space matrices $A$, $B$, $C$, $D$, and can predict the future $K$ outputs, $\bsy_K^{(\textrm{future})} = \{y_n,\ldots,y_{n+K-1}\}$, given the future $K$ inputs, $\bsu_K^{(\textrm{future})} = \{u_n,\ldots,u_{n+K-1}\}$,
\begin{equation*}
%\label{eq:deepc-K-steps-predictor-Lambda}
\begin{bmatrix}
\bsu_n^{(\textrm{past})} \\ \bsu_K^{(\textrm{future})} \\ \bsy_n^{(\textrm{past})} \\ \bsy_K^{(\textrm{future})}
\end{bmatrix} = \underbrace{
\begin{bmatrix}
I_{n+K} & 0 \\ \mathcal{G}_{n+K} & \mathcal{O}_{n+K}
\end{bmatrix}}_{\Lambda_L}
\begin{bmatrix}
\bsu_n^{(\textrm{past})} \\ \bsu_K^{(\textrm{future})} \\ x_0
\end{bmatrix}
\end{equation*}
with the matrix $\Lambda_L$ appropriately extended. This can be done, since $\Lambda_L$ has been derived from the state space system matrices, and the initial state $x_0$ is either known or can be uniquely derived from the past inputs $\bsu_n^{(\textrm{past})} = \{u_0,\ldots,u_{n-1}\}$ and outputs $\bsy_n^{(\textrm{past})} = \{y_0,\ldots,y_{n-1}\}$.
\section{Behavioral Model Approach}
\label{sec:behavioral-model-approach}
There is an alternative approach to the prediction of the future $K$ outputs, which the behavioral formulation makes convenient to express. The main observation is that the inputs and outputs $
\begin{bmatrix}
\bsu_L& \bsy_L
\end{bmatrix}^\top
$ can only live in a subspace of dimension $2n+K=n+L$ of the space $\mathbb{R}^{2L}$: this subspace is spanned, for instance, by $n+L$ linearly independent \emph{experiments}
\begin{equation*}
\begin{bmatrix}
\bsu_L^{(1)} \\ \bsy_L^{(1)}
\end{bmatrix},\
\begin{bmatrix}
\bsu_L^{(2)} \\ \bsy_L^{(2)}
\end{bmatrix},\
\begin{bmatrix}
\bsu_L^{(3)} \\ \bsy_L^{(3)}
\end{bmatrix},\ldots,
\begin{bmatrix}
\bsu_L^{(n+L)} \\ \bsy_L^{(n+L)}
\end{bmatrix}.
\end{equation*}
Therefore all valid input-output sequences, and only those, are of the form
\begin{equation*}
\begin{bmatrix}
\bsu_L \\ \bsy_L
\end{bmatrix} = \underbrace{
\begin{bmatrix}
\bsu_L^{(1)} & \bsu_L^{(2)} & \bsu_L^{(3)} & \ldots & \bsu_L^{(n+L)} \\
\bsy_L^{(1)} & \bsy_L^{(2)} & \bsy_L^{(3)} & \ldots & \bsy_L^{(n+L)}
\end{bmatrix}}_{H}g
\end{equation*}
with $g\in\mathbb{R}^{n+L}$. Note that this is effectively a change of coordinates: $g$ is in general no longer the vector of control inputs and initial state\footnote{The new matrix $H$ corresponds to $\Lambda_L$ when the experiments are constructed in the following way: for the first $L$ experiments, the initial condition is $0$ and only one input at a time is activated with unit amplitude, i.e., the $i$-th experiment uses the impulse $u_t=\delta_{t,i-1}$. For the remaining $n$ experiments, the inputs are zero and the initial state is set to one element of the state-space basis at a time.}.
To construct $H$ while being able to predict $K$ steps in the future, at least $n+L=2n+K$ trajectories are needed, each of length $L=n+K$. An efficient way to produce the required data is to acquire a \emph{single} experiment of length $2L+n-1$ and to generate $n+L$ synthetic experiments by rearranging the control input sequence $\{u_0, u_1, u_2,\ldots\}$ as the Hankel matrix
\begin{equation*}
\mathcal{H}_L(\bsu) =
\begin{bmatrix}
u_0 & u_1 & u_2 & \ldots & u_{L+n-1} \\
u_1 & u_2 & u_3 \\
u_2 & u_3 & u_4 \\
\vdots \\
u_{L-1} & u_L & u_{L+1} & \ldots & u_{2L+n-2}
\end{bmatrix}
\end{equation*}
and, analogously, $\mathcal{H}_L(\bsy)$ from the measured output data $\bsy$. The $K$-step predictor becomes
\begin{equation*}
\label{eq:deepc-K-steps-predictor-H}
\begin{bmatrix}
\bsu_n^{(\textrm{past})} \\ \bsu_K^{(\textrm{future})} \\ \bsy_n^{(\textrm{past})} \\ \bsy_K^{(\textrm{future})}
\end{bmatrix} =
\begin{bmatrix}
\mathcal{H}_L(\bsu_\textrm{data}) \\ \mathcal{H}_L(\bsy_\textrm{data})
\end{bmatrix} g
\end{equation*}
with $g\in \mathbb{R}^{2n+K}$ and $\bsy_K^{(\textrm{future})}\in\mathbb{R}^K$ as unknowns: the problem is determined because there are $2(n+K)$ unknowns and an equal number of equations.
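The construction can be made concrete with the following sketch (Python with NumPy), which reuses the illustrative system of the previous sketch: a single experiment of length $2L+n-1$ is collected, the two Hankel matrices are built, and a new trajectory is predicted $K$ steps ahead. All numerical values are placeholders for illustration.
\begin{verbatim}
import numpy as np

def hankel(w, L):
    # Hankel matrix with L rows built from the scalar sequence w
    return np.column_stack([w[j:j + L] for j in range(len(w) - L + 1)])

# arbitrary SISO system used only to generate data for the illustration
A = np.array([[0.9, 0.2], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

def simulate(u, x0):
    x, y = x0.copy(), []
    for uk in u:
        y.append((C @ x).item())          # D = 0 in this example
        x = A @ x + B.flatten() * uk
    return np.array(y)

n, K = 2, 5              # state dimension and prediction horizon
L = n + K                # length of each column trajectory
T = 2 * L + n - 1        # length of the single experiment

rng = np.random.default_rng(0)
u_data = rng.standard_normal(T)
y_data = simulate(u_data, np.zeros(n))

H = np.vstack([hankel(u_data, L), hankel(y_data, L)])   # shape (2L, L + n)

# partition the rows into past (first n steps) and future (last K steps)
Up, Uf = H[:n], H[n:L]
Yp, Yf = H[L:L + n], H[L + n:]

# new trajectory: n past steps fix the initial condition, then the K future
# outputs are predicted from the chosen K future inputs
u_new = rng.standard_normal(L)
y_new = simulate(u_new, np.array([0.3, -0.1]))
u_past, y_past = u_new[:n], y_new[:n]
u_future = u_new[n:]

g, *_ = np.linalg.lstsq(np.vstack([Up, Uf, Yp]),
                        np.concatenate([u_past, u_future, y_past]), rcond=None)
y_predicted = Yf @ g
print(np.max(np.abs(y_predicted - y_new[n:])))  # numerically zero
\end{verbatim}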
\section{Model-Based Predictive Control}
\label{sec:model-based-predictive-control}
The behavioral model permits one to express MPC using the matrix $H$ that generates valid future control inputs $\bsu$ and future outputs $\bsy$ compatible with the past data:
\begin{equation*}
\begin{aligned}
\min_g\ & \sum_{k=0}^{K-1} g_k(y_k,u_k) + g_K(y_K) \\
\textrm{subject to } & \begin{bmatrix}
\mathcal{H}_L(\bsu_\textrm{data}) \\ \mathcal{H}_L(\bsy_\textrm{data})
\end{bmatrix} g =
\begin{bmatrix}
\bsu_\textrm{past} \\ \bsu \\ \bsy_\textrm{past} \\ \bsy
\end{bmatrix} \\
& u_k \in \mathcal{U}_k\ \forall k \\
& y_k \in \mathcal{Y}_k\ \forall k
\end{aligned}
\end{equation*}
The state $x$ of the system is not required by the DeePC approach and therefore it cannot appear in the cost function.
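As an illustration, the following sketch formulates one such optimization with CVXPY, assuming a quadratic tracking stage cost and box constraints (neither is prescribed by the general formulation above) and reusing the data blocks \texttt{Up}, \texttt{Uf}, \texttt{Yp}, \texttt{Yf} and the measured $(\bsu_\textrm{past},\bsy_\textrm{past})$ from the sketch of the previous section; all weights and bounds are placeholder values.
\begin{verbatim}
import cvxpy as cp
import numpy as np

# data blocks Up, Uf, Yp, Yf and the measurements u_past, y_past are assumed
# to be available from the previous sketch; the cost and constraints below
# are hypothetical choices made only for illustration
K = Uf.shape[0]
y_ref = np.ones(K)                     # hypothetical output reference
Q, R = 10.0, 0.1                       # hypothetical weights

g = cp.Variable(Up.shape[1])
u = cp.Variable(K)
y = cp.Variable(K)

cost = Q * cp.sum_squares(y - y_ref) + R * cp.sum_squares(u)
constraints = [
    Up @ g == u_past,                  # past inputs fix the initial condition
    Yp @ g == y_past,                  # past outputs fix the initial condition
    Uf @ g == u,                       # future inputs generated by g
    Yf @ g == y,                       # future outputs generated by g
    cp.abs(u) <= 1.0,                  # u_k in U_k
    cp.abs(y) <= 5.0,                  # y_k in Y_k
]
cp.Problem(cp.Minimize(cost), constraints).solve()
u_apply = u.value[0]                   # apply the first input, receding horizon
\end{verbatim}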
From a computational point of view, this problem is more complex than standard tracking MPC:
\begin{itemize}
\item it requires a much larger memory footprint to hold $H$ instead of the system matrices $A$, $B$, $C$, $D$; and
\item it needs to optimize over $g\in\mathbb{R}^{2n+K}$, where $K$ degrees of freedom are the future inputs $\bsu$ and the remaining $2n$ components of $g$ are determined by the initial conditions $
\begin{bmatrix}
\bsu_\textrm{past} & \bsy_\textrm{past}
\end{bmatrix}^\top
$. The variables $\bsu$ and $\bsy$ are fully determined by $g$ via the $H$ matrix: although they are redundant from a mathematical point of view, they are usually kept as explicit variables in the numerical solution of the optimization.
\end{itemize}
\section{Persistence of Excitation}
\label{sec:persistence-excitation}
To construct $H$, the experiment data $(\bsu_\textrm{data},\bsy_\textrm{data})$ must represent the dynamics of the system: dynamics that are not excited cannot be controlled.
The identification of unstable systems requires additional care: typically the unstable system is prestabilized and, for identification, the system is perturbed around the resulting equilibrium point.
Indeed, simply collecting input/output data from a stabilized system is not sufficient to guarantee identification, because inputs and outputs are constrained by the controller. For instance, with a proportional controller the input/output data would be of the form
\begin{equation*}
\begin{bmatrix}
\bsu_\textrm{data} \\ K\bsu_\textrm{data}
\end{bmatrix}
\end{equation*}
and therefore the experimental data span the subspace resulting from the intersection of the plant and the controller spaces. To probe the full system dynamics it is sufficient to introduce an external excitation $v$ that is sufficiently persistently exciting, such that the input to the plant becomes $v - Ky$.
The perturbation may also be applied to several inputs simultaneously to keep the system closer to a stable state: in a quadcopter, for instance, activating one propeller at a time would rapidly render the system uncontrollable, and the identification is instead done by activating all rotors or pairs of rotors\footnote{Not clear here: if I excite the sum of two inputs, would I not also need to perturb with the difference?}.
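The rank argument can be checked numerically: with a proportional output feedback and no external excitation, the stacked Hankel matrix of the closed-loop data is rank deficient, while adding a dither $v$ restores the full rank $L+n$. The sketch below does this for the illustrative system used earlier; the gain and signal lengths are arbitrary.
\begin{verbatim}
import numpy as np

# illustrative system, run in closed loop with u = v - k_p * y
A = np.array([[0.9, 0.2], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
k_p = 0.5                              # hypothetical proportional gain

def closed_loop(v, x0):
    x, u_log, y_log = x0.copy(), [], []
    for vk in v:
        yk = (C @ x).item()
        uk = vk - k_p * yk             # external excitation minus feedback
        u_log.append(uk); y_log.append(yk)
        x = A @ x + B.flatten() * uk
    return np.array(u_log), np.array(y_log)

def hankel(w, L):
    return np.column_stack([w[j:j + L] for j in range(len(w) - L + 1)])

n, L, T = 2, 7, 50
x0 = np.array([1.0, -0.5])
rng = np.random.default_rng(1)

u0, y0 = closed_loop(np.zeros(T), x0)              # no external excitation
u1, y1 = closed_loop(rng.standard_normal(T), x0)   # persistently exciting dither

for u, y in [(u0, y0), (u1, y1)]:
    H = np.vstack([hankel(u, L), hankel(y, L)])
    print(np.linalg.matrix_rank(H), "out of the required", L + n)
\end{verbatim}
Without the dither the rank stays at most at $n$, since every column is a window of the same autonomous closed-loop trajectory.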
\section{Extension of DeePC to the MIMO Case}
The \emph{lag} of a system is defined as the smallest $\ell$ such that the observability matrix
\begin{equation*}
\mathcal{O}_\ell =
\begin{bmatrix}
C \\ CA \\ \vdots \\ CA^{\ell-1}
\end{bmatrix}
\end{equation*}
has full rank $n$. For a SISO system, we saw that $\ell=n$, under the assumption that the system is observable.
For a MIMO system with $m$ inputs, $p$ outputs and state dimension $n$, a past trajectory of length $\ell$ is needed to determine the initial condition, and an input/output sequence of length
\begin{equation*}
T \ge (m+1)(\ell + K) + n -1
\end{equation*}
is needed for constructing $H$.
Since $n\ge \ell$ always holds, a practical approach to building $H$ is to assume that $\ell = n$, where $n$ is determined from first principles.
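For example, a system with $m=2$ inputs, $\ell=n=4$ and a prediction horizon $K=10$ requires a single experiment of length at least $T = (2+1)(4+10) + 4 - 1 = 45$ samples.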
\section{The Effect of Noise}
\label{sec:deepc-presence-of-noise}
The rank of the $H$ matrix is $n+L$ (or $n+mL$ for a MIMO system). If the data is noiseless, additional columns $
\begin{bmatrix}
\bsu_L & \bsy_L
\end{bmatrix}^\top
$ do not further increase the rank of $H$. The presence of noise in the data will spuriously increase its rank. The SVD of $H$ will however reveal this through the small singular values past the $(n+L)$-th one. The spurious increase of rank will also manifest itself in the controller using a large $g$, and therefore large control inputs, to affect outputs that are not actually controllable\footnote{We have assumed that the system is observable.}.
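Continuing the data-collection sketch of Section~\ref{sec:behavioral-model-approach}, the effect can be observed by adding measurement noise to the outputs and inspecting the singular values of the stacked Hankel matrix; the noise level below is an arbitrary choice.
\begin{verbatim}
import numpy as np

# u_data, y_data, hankel, L and n are assumed to come from the earlier sketch;
# measurement noise of standard deviation sigma is added to the outputs
sigma = 0.01
noise = sigma * np.random.default_rng(2).standard_normal(len(y_data))
H_noisy = np.vstack([hankel(u_data, L), hankel(y_data + noise, L)])

s = np.linalg.svd(H_noisy, compute_uv=False)
print(np.round(s, 3))
# expect L + n dominant singular values; the remaining ones are of the order
# of the noise level and only spuriously increase the numerical rank
\end{verbatim}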
One can force the $H$ matrix to have rank $n+L$ using the nested optimization
\begin{align}
\label{eq:deepc-cost-function}
\min_{\bsu,\bsy,g,\hat{\bsu}_\textrm{data},\hat{\bsy}_\textrm{data}} & \sum_{k=0}^{K-1} g_k(y_k,u_k) + g_K(y_K) \\
\textrm{subject to } &
\begin{aligned}
(\hat{\bsu}_\textrm{data},\hat{\bsy}_\textrm{data}) = &\ \arg \min ||\hat{\bsu}_\textrm{data} - \bsu_\textrm{data}||^2 + ||\hat{\bsy}_\textrm{data} - \bsy_\textrm{data}||^2 \\
& \textrm{subject to } \rank \left(
\begin{bmatrix}
\mathcal{H}_L(\hat{\bsu}_\textrm{data}) \\
\mathcal{H}_L(\hat{\bsy}_\textrm{data})
\end{bmatrix}
\right) = L+n
\label{eq:deepc-nested-optimization}
\end{aligned} \\
& \begin{bmatrix}
\mathcal{H}_L(\hat{\bsu}_\textrm{data}) \\
\mathcal{H}_L(\hat{\bsy}_\textrm{data})
\end{bmatrix}g =
\begin{bmatrix}
\bsu_\textrm{past} \\ \bsu \\ \bsy_\textrm{past} \\ \bsy
\end{bmatrix} \\
& \begin{aligned}
u_k \in \mathcal{U}_k\ \forall k \\
y_k \in \mathcal{Y}_k\ \forall k
\end{aligned}
\end{align}
by using the closest points $(\hat{\bsu}_\textrm{data},\hat{\bsy}_\textrm{data})$ to the real data $(\bsu_\textrm{data},\bsy_\textrm{data})$ to construct $H$. Using a quadratic cost in the nested optimization to remove the noise is arbitrary, but a different choice would make little practical difference.
The optimization is computationally hard because the rank constraint makes the problem non-convex. It can however be simplified by the following four transformations:
\begin{itemize}
\item impose the additional constraint that $g$ has at most $L+n$ nonzero elements: $||g||_0\le L+n$. This is equivalent to the rank constraint and therefore does not alter the original problem;
\item drop the $\rank$ constraint in eq.~\eqref{eq:deepc-nested-optimization}. The $\arg \min$ problem becomes trivial, because it simply picks $(\hat{\bsu}_\textrm{data},\hat{\bsy}_\textrm{data})$ to be $(\bsu_\textrm{data},\bsy_\textrm{data})$, but it may produce an infeasible problem. The solution still has a limited number of nonzero elements in $g$ because of the 0-norm constraint;
\item dispense with the nested optimization eq.~\eqref{eq:deepc-nested-optimization} altogether and replace the 0-norm with a 1-norm constraint $||g||_1\le \alpha$. This is the Lasso relaxation of the sparsity constraint and makes the problem convex. The value of $\alpha$ determines the sparsity of the solution and must be tuned, but this introduces no more unknowns than the previous formulation, since the system order $n$ is (often) not known either;
\item remove the 1-norm constraint and add it as a regularization term to the objective function eq.~\eqref{eq:deepc-cost-function}
\begin{equation*}
\sum_{k=0}^{K-1} g_k(y_k,u_k) + g_K(y_K) + \lambda_g||g||_1.
\end{equation*}
A larger $\lambda_g$ increases robustness to noise but affects optimality, and corresponds to selecting a model with smaller $n$. The a posteriori sparsity $||g||_0$ of the solution gives information about the model size.
\end{itemize}
The problem so transformed
\begin{equation*}
\begin{aligned}
\min_{\bsu,\bsy,g} & \sum_{k=0}^{K-1} g_k(y_k,u_k) + g_K(y_K) + \lambda_g||g||_1 \\
\textrm{subject to } & \begin{bmatrix}
\mathcal{H}_L(\bsu_\textrm{data}) \\
\mathcal{H}_L(\bsy_\textrm{data})
\end{bmatrix}g =
\begin{bmatrix}
\bsu_\textrm{past} \\ \bsu \\ \bsy_\textrm{past} \\ \bsy
\end{bmatrix} \\
& u_k \in \mathcal{U}_k\ \forall k \\
& y_k \in \mathcal{Y}_k\ \forall k
\end{aligned}
\end{equation*}
is efficient to solve and usable in practice.
Noise may also be present in the ``initial conditions'' $(\bsu_\textrm{past},\bsy_\textrm{past})$. This is taken care of by an additional slack variable $\sigma$ with regularization term $\lambda_\sigma ||\sigma||_1$:
\begin{equation*}
\begin{aligned}
\min_{\bsu,\bsy,g,\sigma} & \sum_{k=0}^{K-1} g_k(y_k,u_k) + g_K(y_K) + \lambda_g||g||_1 + \lambda_\sigma ||\sigma||_1\\
\textrm{subject to } & \begin{bmatrix}
\mathcal{H}_L(\bsu_\textrm{data}) \\
\mathcal{H}_L(\bsy_\textrm{data})
\end{bmatrix}g =
\begin{bmatrix}
\bsu_\textrm{past} \\ \bsu \\ \bsy_\textrm{past}+\sigma \\ \bsy
\end{bmatrix} \\
& u_k \in \mathcal{U}_k\ \forall k \\
& y_k \in \mathcal{Y}_k\ \forall k
\end{aligned}.
\end{equation*}
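A regularized DeePC step of this form can be sketched with CVXPY as follows, again assuming a quadratic tracking cost and box input constraints and reusing the (now noisy) data blocks \texttt{Up}, \texttt{Uf}, \texttt{Yp}, \texttt{Yf}; the weights $\lambda_g$ and $\lambda_\sigma$ are placeholder values that must be tuned.
\begin{verbatim}
import cvxpy as cp
import numpy as np

# Up, Uf, Yp, Yf, u_past, y_past are assumed to be available (built from noisy
# data as above); all numerical values are hypothetical tuning choices
K = Uf.shape[0]
y_ref = np.ones(K)
lambda_g, lambda_sigma = 10.0, 1e3

g = cp.Variable(Up.shape[1])
u = cp.Variable(K)
y = cp.Variable(K)
slack = cp.Variable(Yp.shape[0])   # sigma: softens the noisy initial condition

cost = (cp.sum_squares(y - y_ref) + 0.1 * cp.sum_squares(u)
        + lambda_g * cp.norm1(g) + lambda_sigma * cp.norm1(slack))
constraints = [
    Up @ g == u_past,
    Yp @ g == y_past + slack,      # past outputs matched only up to sigma
    Uf @ g == u,
    Yf @ g == y,
    cp.abs(u) <= 1.0,
]
cp.Problem(cp.Minimize(cost), constraints).solve()
\end{verbatim}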
\section{Concluding Remarks}
\label{sec:deepc-concluding-remarks}
\begin{itemize}
\item Data-driven control is a control approach that does not make use of the state of the system: if the state is known and its values need to be controlled, then a model-based approach may work better\footnote{If a state must be controlled, it must be measurable and it enters the constraint $y\in \mathcal{Y}_k$.}.
\item Sometimes prior information (weight, moment of inertia) is available: system identification can typically incorporate this information. DeePC, by contrast, is agnostic to it.
\item DeePC relies strongly on the system being LTI. An extension to non-linear problems, while possible, is cumbersome.
\item The computational complexity is larger, both in memory and in computation time, than for standard model-based MPC.
\end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "notes"
%%% End: