05-design-of-adaptivity.tex

\chapter{Design of Adaptivity}
\label{chap:design-of-adaptivity}

\itquote[1mm]{
We are developing what one could flippantly call stupid tutoring systems: tutors that are not, in and of themselves, behaving in an intelligent fashion. But tutors that are designed intelligently, and that leverage human intelligence.
}{R. Baker}

Among the strategies to support learning and motivation,
we focus on recommending tasks of optimal difficulty,
which is the secret sauce of adaptive learning systems.

%In the previous chapter, we have described how we incorporated many of the
%strategies to support learning and motivation into the game.
%In this chapter, we discuss another strategy which helps to achieve the state of
%flow, the one that is specific to ALS  % TODO:reformulate
%-- recommending tasks of optimal difficulty.
% ... "which is the secret sauce of adaptive learning systems"
% ... "which is powered by the AI"
% TODO: More links to learning/motivation \label{sec:motivation.challenge}
% TODO: link to the adaptivity chapter

\section{Expected Behavior}  % or "Specification", "Goals", "System Behavior", "Behavior specification"
\label{sec:robomission.behavior}

% NOTE: This section should provide a useful basis for:
% - guide how to design ALS components
% - setting rewards for a a RL/planning agent
% - defining "good recommendations" (~good performance) for supervised
%   learning and for use in monitoring/evaluation (domain metric)

% TODO: clarify the relationship between this section and AL.analysis (or
% possibly also this-chapter.analysis) sections.

A wide range of difficulties, combined with the adaptive behavior,
should make the system useful for everybody who wants to learn programming.
However, %this long-term goal requires many iterations; % and a lot of data.
currently the system targets children 10-15 years old.
% "between 10 and 15 years."
The main use case of the system is a 1-2 hour in-classroom tutorial.
% (Hour of Code format) %  as described in \cref{sec:hoc}),
% possibly followed by individual practice at home.
In line with the mission outlined in \cref{sec:mission}, goals of this tutorial
are to teach the student basic concepts of programming,
help them to spend the time in the state of flow,
%help them to spend most of the time in the state of flow,
and motivate them for the further learning of programming. %computer science.
% TODO: better link to fullfilling human needs (as stated in table in chapter 2)
% TODO: better terminoglogy, because "You cannot motivate people... (motivation
% goes fron inside)"
% NOTE: flow is a goal; but at the same time it supports the other 2 goals

After each solved task, the student is shown % a dialog window with one
a recommended task and a link to the page with an overview of all tasks.
% Terminology: \emph{good recommendation}
Ideally, each recommended task would lead to the state of flow.
% TODO: How to measure? One way: looking ath the long-term objectives
As we cannot directly observe whether the student is in the state of flow,
we must rely on proxy data.
Currently, we only use objective attributable metrics based on observed performance
% which should be neither too low, nor too high
(\cref{sec:robomission.student}). % TODO: specific subsection
%(details in \cref{sec:robomission.student}). % TODO: specific subsection
% "qualitative" vs. "quantitative"
% TODO: REF:performance
% as we have discussed in section on attributable metrics (\cref{sec:live-evaluation}).
% long-term or attributable metrics \cref{sec:long-term-objectives,sec:live-evaluation}
% TODO: the performance compression should be described in this chapter
% in student modeling section, ideally with an analysis supporting the
% decsions or sketching some possible directions.
From the general goals, we derive weaker,
\emph{necessary, but not sufficient} requirements, that can be more reliably observed,
% (with less noise)
to guide us during the design of the system,
and to provide us with more specific evaluation:  % more actionable
% NOTE: It reminds me of hypothesis testing of goodness-of-fit: there are
% various tests (metrics we can measure), that can tell us, that the given
% sample is definitely not from the compared distribution; some of these
% tests are very simple and intuitive (e.g., for exp. distribution, ratio
% of mean and stdev^2 should be 1), but they also have quite small power
% (i.e. they are often not able to reject false hypotheses).
% TODO: shrink vertical spaces
\begin{itemize}
\item All students are able to solve all tasks recommended by the system
  in a reasonable time (at most 15 minutes).
  Furthermore, they are able to solve the first few tasks quickly
  (each in 2 minutes), and progress to the second level in at most 10 minutes.
\item The best-performing students % (supposedly with a prior programming skill)
progress through initial levels quickly, spending at most 5 minutes
on tasks with only sequences of commands, and get to the more
challenging tasks containing both types of loops and conditional statements
in at most 20 minutes.  % 10, 15?
\item At most one new programming concept and one new game concept appear
  in a task. Another concept is not introduced until the student
  solves a task with the last introduced concept with a good
  performance. % (they must sometimes come together, it is not a problem)
\item All students should gradually start practicing all basic programming concepts
  (sequence, loop, conditional statement) during the first hour. % of the tutorial.
  % TODO: check gvid
% EXT:
%\item Average difficulty of the recommended tasks gradually increases.
%  (Occasionally, an easy tasks might be recommended in order to improve
%  exploration, but these easy tasks should not be too frequent.)
%  %, and they should be differentiated for the student,
%  %e.g., as \emph{speed challenges} in order to explain the sudden change in
%  %difficulty.)
% TODO: perform analysis if it's the case in the current system.
%Each new concept (e.g., block) is explained in the task where it appears for the first time. (REF: img), so the student understands all elements in the game world and blocks available in the toolbox.
\end{itemize}
% TODO: compare these requiremnts againts collected data and gvid-report

%% EXT:
%TODO: about subjective data (future)
%which can be either subjective
%(using perceived difficulty ranking),
%\item If asked after each task, students should report that the task was
%  neither too easy, nor too difficult for them in at least half of the tasks
%  during the tutorial.  % or another negative tag, such as boring, weird, ...
%\item Neither task that was reported as too easy should take them
%  more than 1 minute, and there should never be more then 3 too easy tasks
%  in a row (with the exception of the first level).
%  %(Student is not bored by neither too easy task requiring many commands (i.e.
%  %taking more than 1 minute to build the streightforward solution), nor by the
%  %sequence of either too easy or too similar tasks.
%  % TODO: Maybe it would be better to state it in combined max too-easy time?
%  % NOTE: It is ok to have sometimes an easy task (esp. at the beginning), but
%  % they should be quick to solve and not too frequent
%\item If asked at the beginning and at the end of the tutorial about their
%  interest in learning computer science, the end report should be statistically
%  higher.
%% TODO: Not sure if to mention the self-report points, as they are not
%% currently used in the system.
%% TODO: (Another flow-related) If the student marks the current task as too
%% easy, the next task is strictly more difficult. If the student marks the
%% current task as too difficult, the next task is as most as difficult as the
%% current one.

%\subsection{Main Use Case}
%\label{sec:robomission.use-case}
%
%\begin{itemize}
%\item Student visits the home page of the project, reads the "promotional slides" and tries the game with manual controls. On the last slide, they click on the recommended task from the 1st level.
%(REF: img)
%\item Student creates the program using Blockly blocks and can run the program as many times as needed. (REF: img) Program execution is visualized. Student can change the speed of the execution.
%\item After each unsuccessful execution, a short message explaining why the task was not solved is shown (e.g., "The spaceship must reach the final row.") (REF: img)
%\item Student is able to solve the first few tasks quickly (within 2 minutes).
%\item After solving each task, student is shown a visualization of obtaining points (called credits) (REF: printscreen). After a few solved tasks, student progresses to next level.
%\item After each solved task, student is shown a dialog with one recommended task, and also a link to the page with overview of all tasks (REF: img).
%\item Student is able to solve any task recommended by the system (within 15 minutes).
%\item Each new concept (e.g., block) is explained in the task where it appears for the first time. (REF: img), so the student understands all elements in the game world and blocks available in the toolbox.
%\item Students with some prior programming skill should progress through first few levels quickly (within 10 minutes) and get to the more challenging tasks containing both types of loops, conditionals etc.
%\item Student is not bored by neither too easy task requiring many commands (i.e. taking more than 1 minute to build the streightforward solution), nor by the sequence of either too easy or too similar tasks.
%\item Student can sign up (or log in without registration through their social accounts) at any moment to save their progress. Even without singing up, the system can associate the student with its progress using session cookie (but also provides a button to clear the history).
%\item Student can provide a feedback or report a bug easily (and the feedback is send to admins by an email).
%\end{itemize}
%(TBA: add diagram with images for all these steps linked by arrows showing transitions)

% TODO: consider some of the following notes
% - Design of tasks for the system is described in \cref{sec:robomission.tasks}.
% - Adaptive aspect of the behavior is described in \cref{sec:robomission.adaptability}.
%\item intuitive and simple user interface crucial (aiming at children, they need to focus on learning programming, it would be bad to waste their mental power on understanding a complex interface)
%\item mini-instructions (ref to the Google research on ignoring instructions, show how it was solved in Blockly Games; ref figure)
%\item mini-explations (difference from instructions: after the fact) (ref figure) (they also serve as a convenient mean to game resetting)
%\item motivation: intrinsic (fun challenging game + optimal difficulty) and simple external motivation scheme: credits and levels


%\subsection{Four Modes of Usage}
%\label{sec:robomission.use-cases}
%
%In addition to the main use case described in the previous section,
%which assumes a new student without any context (e.g., a classroom),
%the system can be potentially used in other (or more specific) ways.
%
%\begin{itemize}
%\item "Hour of Code" mode
%  \begin{itemize}
%  \item single hour
%  \item mainly as a motivation to programming
%  \item using RoboBlocks
%  \item directly at elemenatry and high schools, or at home
%  \item plus: MjUNI workshop
%  \item shorter promotianal version for DODs (?) ("10 minutes of code")
%  \item (should be strictly time-limited; certificate at the end)
%  \end{itemize}
%\item "Foundations mode" individual learning of elementary programming (individual at home or in a classroom, from several days to several weeks, depending on the prior skill); natural continuation of the first "Hour of Code" (next levels, with RoboBlocks)
%\item "University mode" -- levels with RoboCode/Python, at home / secondary schools, KSI (0th problem set), IB111 (0th/1st motivational lesson - needs Python and to be better than turtle)
%\item "Competition mode" -- competitions such as Purkiada, Pevnost FI, KSI
%(advanced problem sets), new FIBot (physical version already in InterSoB 2017,
%then in Sob 2018); this also includes testing mode for RH interns
%\end{itemize}
%
%All these modes can be naturally implemented as distinct levels,
%going from the easy tasks using RoboBlocks for "Hour of Code",
%gradually transitioning to the RoboCode during learning the "Foundations",
%using full-fledged Python for "University mode"
%and offering both blocks and Python for the individual competitions.
%Levels from the past competitions can be made public for all students.


\section{Domain Model}

Currently, we do not model overlapping concepts, because that would require
more data to properly analyze their interactions.
Instead, we divide tasks into linearly ordered disjoint hierarchical problem
sets (\cref{fig:robomission.domain}). % , which are easier to handle.
The hierarchy has two levels: the top-level problem sets (\emph{levels})  % "missions?"
contain about ten tasks, which are further split into three smaller
problem sets (\emph{phases}).

Levels and phases are ordered, while tasks within a given phase are not,
since all the tasks in a single phase should be similarly difficult.
The first two levels gradually introduce game elements
(e.g., diamonds, meteoroids, wormholes),
while the later levels usually focus on practicing one new programming
concept (e.g., repeat loop, while loop, if statement).
% (initialy human-estimated, later refined using collected data
% NOTE: Data collected befor division into phases (TODO: mention in the
% analysis chapter...)

The contribution of refining levels into phases is threefold:
First, phases enforce critical prerequisites within a level (e.g.,
introducing wormholes before using them in more advanced tasks).
Second, phases are approximately homogeneous, i.e. tasks in a phase %practice the same concepts, and
have similar difficulty,
which allows for simpler tutor models.
Third, phases help to achieve a balanced composition \cite{progression-analysis};
  many levels follow a pattern of
  introducing new concepts in the first phase,
  recombining them with previously learned concepts in the second phase,
  and further reinforcing known concepts in the third phase. % OR:practicing
% TODO: Although this division is only a guide, doesn't hold exatly...
% TODO: elaborate on the relevance of the paper
% TODO: check the paper (terms of the phases, their "oreder" and meaning)
% TODO: link to what presented in earlier chapters

% TODO: consider to increase font of the phase names
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{img/robomission-domain}
\caption{%
  Domain model used in RoboMission: ordered hierarchical problem sets (dark rectangles)
  containing unordered tasks (light squares).}
\label{fig:robomission.domain}
\end{figure}


\section{Student Model}
\label{sec:robomission.student}

% TODO: terminology and links from \cref{sec:student-modeling}
For each student, the system keeps track of her skill for each problem set. % phase and mission.
The skill $s \in [0, 1]$ represents manifested ability to solve tasks in
a given problem set. The initial skill is 0, and it is increased after each solved
task from the problem set. Once the skill reaches 1, the problem set is solved.

While the model has a structure of a dynamic Bayesian network
(\cref{fig:robomission.student-model}), %, \cref{sec:dbn}),
it does not model the probability of the student solving a task with some performance,
but rather an amount of verified skill;
i.e., the model is only descriptive.
An advantage is that this skill can be shown directly to the student,
e.g., visualized as a progress bar towards completion of the current level,
since it never decreases (which would be demotivating).

We use a discrete representation of performance with three levels (poor, good,
excellent). Currently, only solving time is considered for the performance
measurement: for each task, we define a threshold for a solution to be
considered as good or excellent
(1.5 and 0.75 multiples of median time).
% TODO: support the decision by an analysis (REF)
% TODO: Note that it's huge simplification and a subject of future research.

After each solved task session, the corresponding skill is increased by
an amount $p \in [0, 1]$ which depends only on the performance.
The increment does not depend on the specific task solved, because we assume homogeneous phases.
For each performance level, we set the increment such that $1/p$ tasks solved
with such performance are enough to manifest mastery in this phase.
%, and it does not depend on our previous estimate
Either a single task solved with an excellent performance, or two
tasks solved with a good performance are enough to solve the phase,
translating into $p_{excellent} = 1$ and $p_{good} = 0.5$.
Furthermore, solving all tasks in a phase (even if with a poor performance),
should be always enough to solve the phase.
%(this is a technical requirement, stemming from the fact that we do not want
%to present students with the same task they already solved again)
Therefore, the update rule is:
$s \leftarrow \min(1, s + \max(p, \frac{1}{n}))$,
where $n$ is the number of tasks in the phase.
% The $1/n$ term ensures that the student eventually masters the phase.
% NOTE: we never want to decrease skill of a student
% TODO: link to the student model described in theory part
To aggregate the skills of the phases into the skill of the level,
we average the phase skills, which means
that all phases must be solved in order to solve the level.
%We plan to use a more sophisticated aggregation in the future,
%which would allow to compensate unsolved tasks in the first phase
%by more difficult tasks in the later phases.

\begin{figure}[htb]
\centering
\includegraphics[height=31mm]{img/robomission-student-model}
\caption{%
  Student model used in RoboMission includes skill for each phase
  (computed from its previous value and a newly observed performance)
  and skill for each level (computed from skills of its phases).}
\label{fig:robomission.student-model}
\end{figure}

\section{Tutor Model}
\label{robomission.tutor}

We focus on the outer-loop tutor modeling;
concerning the inner loop, the system provides only a basic
rule-based tutor for displaying instructions and explanations,
as described in \cref{sec:game.explanations}.

The outer-loop tutor is hierarchical;
it first selects a problem set, then a task from this problem set.
In order to assess if a problem set is mastered,
the corresponding verified skill
computed by the student model described in the previous section
is compared to the threshold of 1. % ?
% TODO: diagram of PS, TS selection and mastery decison

The tutor selects for practice the first unsolved phase of the first
unsolved level, using the total ordering provided by the domain model.
The system contains only a small number of problem sets,
so it is reasonable to assume that the student would like to solve all of
them,
% TODO: further increased by the tendency to "get everything green"
in which case it is preferable to solve them in the order from the
easiest to the most difficult.
Homogeneity of phases allows to safely select a task uniformly at random from
all unsolved tasks in that phase, which is a strategy maximizing exploration.
% UI:
Recommendations provided by the system are \emph{soft};
i.e., students can ignore them and select any task
from the overview of all tasks.

Progression through the problem sets guarantees a gradual increase of difficulty,
but the increase is not monotonous.
% TODO: Reformulate not to use so many "increase"
Perceived difficulty rises after progress to a new phase,
but it decreases during solving tasks from the same phase,
as the student's skill improves, while the objective difficulty does not change
%stays approximately same
(\cref{fig:robomission.flow}).
On the one hand, it means that the difficulty does not match student's
skills perfectly, possibly slightly overshooting at the beginning of a phase,
and undershooting at the end.
On the other hand, the perceived difficulty level is not
same all the time; instead, it has a wavy character, which creates a more
engaging experience for students % \emph{dramaturgy of flow}
\cite{book-of-lenses}.

\begin{figure}[htb]
\centering
\begin{tikzpicture}[font=\sffamily,scale=4.1]
\large
\draw [dashed] (0.1,0) -- (1,0.8);
\draw [dashed] (0,0.1) -- (0.8,1);
\draw [thick] (0.0,0.05) -- (0.1,0.05)
           -- (0.1,0.15) -- (0.2,0.15)
           -- (0.2,0.25) -- (0.3,0.25)
           -- (0.3,0.375) -- (0.45,0.375)
           -- (0.45,0.505) -- (0.6,0.505)
           -- (0.6,0.7) -- (0.8,0.7)
           -- (0.8,0.9) -- (1.0,0.9);
\draw [thick, <->] (0,1) node [left] {\emph{difficulty}} -- (0,0) -- (1,0) node [below right] {\emph{time}, \emph{skill}};
\node at (0.27,0.82) {frustration};
%\node at (0.95,0.95) {flow};
\node at (0.7,0.2) {boredom};
\end{tikzpicture}
% result: waves, but more "spiky"
% (but the objective difficulty does not decrease, it stays constant
\caption{Dramaturgy of task difficulty in RoboMission.}
\label{fig:robomission.flow}
\end{figure}


% TODO? UI? Probably not enough material for this


\section{Analysis Layer}
\label{sec:robomission.analysis-layer}

In order to iteratively improve adaptivity, as well as the programming game
and other aspects of the learning system,
an analysis layer
that provides several different views on the behavior of the system
is crucial (\cref{sec:metrics-and-evaluation}).
Our analysis layer includes the following components:
Google Analytics, monitoring dashboard, error reports, user feedback,
data exports, and logs.

Google Analytics
  shows the distribution of users with respect to space and time
  (\cref{fig:google-analytics}).
  In addition to page views,
  it can also process events sent from the frontend
  (such as clicking on the execution button),
  and divide them into groups, e.g., by the task being solved.
  % or group for AB experiment.

\imgW{google-analytics}{%
  Preview from Google Analytics (January--March 2018) shows,
  for example,
  that an average user spends on the page over 20 minutes.} % (breakdown for "execution" event).}

Monitoring dashboard
  shows several long-term objectives, % metrics related to the long-term objectives
  %(\cref{sec:long-term-objectives})
  namely daily active students, daily solved tasks, and
  solving hours (total time spent on successful attempts).
  %(\cref{fig:monitoring-dashboard-fragment}).
  % + "success ratio" = proportion of successful attempts,
  The system also computes a few metrics for each task
  (solved count, median time, and success ratio) in order to detect
  issues with tasks. % such as a misplaced task or error in task setting
The dashboard is implemented as a weekly-recomputed Jupyter Notebook \cite{jupyter-notebooks},
  that performs several analyses
  and creates visualizations using the latest data
  (\cref{fig:dashboard-metrics,fig:monitoring-notebook}).
  It is easy to extend the dashboard simply by adding
  a cell in the notebook and testing it on historical data;
  no other modification to the backend or frontend is needed.
  %NOTE: + investigation notebook that with a command that generates exports
  %and serves them directly as pandas data frames

%\emph{Error Reports and User Feedback}.
  If an unhandled top-level error occurs on the server,
  it is not only logged but also sent to the administrators.
  Administrators also receive emails with messages provided by users via
  a feedback form that can be invoked on any page.
%\emph{Data Exports and Logs}.
%  In addition to the unhandled errors, all requests and performed actions are
%  logged to text files on the server for manual inspection.
  Collected data are exported every week as a zip bundle containing
  CSV files prepared for %convenient offline analysis.
  offline analysis.
  (Structure of these CSV files is described in \cref{sec:attachment.collected-data}.)

\begin{figure}[htb]
\centering
\frame{\includegraphics[width=\textwidth]{img/dashboard-metrics}}
\caption{%
  Number of students (users who attempted a task), weekly averaged, visualized
  in the monitoring dashboard (March 31, 2018)}.
\label{fig:dashboard-metrics}
\end{figure}

% TODO: check if the table lines looks ok when printed
\begin{figure}[htb]
\centering
\begin{subfigure}{.48\textwidth}
\centering
\includegraphics[height=56mm]{img/monitoring-notebook-levels}
\caption{Overview of all levels.}
\end{subfigure}
\begin{subfigure}{.51\textwidth}
\centering
\includegraphics[height=56mm]{img/monitoring-notebook-repeat}
\caption{Level 3: Repeat loop.}
\end{subfigure}
\caption{%
  Mean success and median time visualized in the monitoring notebook
  (March 31, 2018). This simple analysis reveals
  that the 7th level (Comparing) is probably more difficult than the 8th (If-else)
  % it's not clear, because the population of students is different; they
  % might already learn something important in the 7th level for the 8th
  % level
  and that Clean Your Path is significantly more difficult than
  the other tasks in the level.}
% TODO: observation about clean-your-path
\label{fig:monitoring-notebook}
\end{figure}

% TODO: update investigation notebook (fix error in last cell, current data;
% should show something interesting, or at least some descriptive analylis)
%\imgW[0.7]{investigation-notebook}{Template of jupyter notebook for investigation of live data.}


% NOTE on iterative development: first prototype: 2016, one year of
% development, thrown away; for testing our initial ideas and find what works
% and what not; 50 tasks with a robot in maze; problems with the robot in the
% maze: not fun (boring, not inovative, repetitiveness), did not allow for a
% plenty of diverse easy tasks (which is necessary for adaptive systems),
% required a lot of blocks even for simple programs (compared to the SpaceGame)
% + problems with the codebase (maintainability, extensibility - why?); and the
% good things? (SPA, explored/verified useful technologies, such as Blockly and
% Django)

%\imgW{prototype-task-environment}{First prototype of the system, with a classic robot-in-maze game.}


%\subsection{Admin Requirements}
%\label{sec:admin-requirements}
%Similarly to regular users, administrators also have requirements on the systems:
%\begin{itemize}
%\item Admin can immediately see how much is the system used and how the system behaves with respect to the short-term ("live-evaluation") metrics (...)
%\item Admin receives feedback from provided by users, and error reports on unhandled exceptions.
%\item Admin can see metrics on individual tasks (to quickly detect issues with a task).
%\end{itemize}


% TODO: consider if to include non-functional requirements or not
%\section{Non-functional Requirements}
%\label{sec:robomission.nonfunctional-requirements}
%
%\begin{itemize}
%\item easy to understand code, pleasure to read and write (extend)
%\item easy to refactor and add new things (new tasks, levels, recommendation strategies etc.)
%\item robust, efficient, interpretable behavior
%\end{itemize}