12 changes: 6 additions & 6 deletions examples/tutorial.ipynb
@@ -1660,7 +1660,7 @@
"> ⚠️ **NOTE**:\n",
"> You do not need to understand the content of the next code cell where a plotting function is defined.\n",
"\n",
"The `plot_battery_soc_profiles` function plots the building-level battery state of charge (SoC) profiles can also be used to compare different control agents:"
"The `plot_battery_soc_profiles` function plots the building-level battery state of charge (SoC) profiles. It can also be used to compare different control agents:"
]
},
{
@@ -1672,7 +1672,7 @@
"outputs": [],
"source": [
"def plot_battery_soc_profiles(envs: Mapping[str, CityLearnEnv]) -> plt.Figure:\n",
" \"\"\"Plots building-level battery SoC profiles fro different control agents.\n",
" \"\"\"Plots building-level battery SoC profiles for different control agents.\n",
"\n",
" Parameters\n",
" ----------\n",
@@ -2105,7 +2105,7 @@
"# An Introduction to Tabular Q-Learning Algorithm as an Adaptive Controller\n",
"---\n",
"\n",
"Tuning your RBC must have revealed that it is a cumbersome and labor intensive process, especially as the number of buildings, time period and variance in load profiles increase. What we will be ideal is an adaptive controller that can adjust to different occupant preferences and behaviors in each building that influence load profiles and adjust to different weather conditions that affect the seasonal variance in load profiles.\n",
"Tuning your RBC must have revealed that it is a cumbersome and labor intensive process, especially as the number of buildings, time period and variance in load profiles increase. What would be ideal is an adaptive controller that can adjust to different occupant preferences and behaviors in each building that influence load profiles and adjust to different weather conditions that affect the seasonal variance in load profiles.\n",
"\n",
"Moreover, we want a controller that is able to learn with little to no knowledge about the environment model it is controlling unlike the RBC tuning process where you probably chose your charge and discharge proportion by visually inspecting the building load and generation profiles. Instead, we want a controller that can learn those patterns in a data-driven fashion."
]
@@ -2125,7 +2125,7 @@
"Q(s, a) = Q(s, a) + \\alpha [r + \\gamma \\max_{a'} Q(s', a') - Q(s, a)]\n",
"$$\n",
"\n",
"where $Q(s, a)$ is the Q-value for taking action $a$ in state $s$, $\\alpha ∈ [0, 1]$ is the learning rate, which explicitly defines to what degree new knowledge overrides old knowledge: for $\\alpha = 0$, no learning happens, while for $\\alpha = 1$, all prior knowledge is lost. $\\gamma$ is the discount factor which allow to balance between an agent that considers only immediate rewards ($\\gamma$ = 0) and one that strives towards long term rewards ($\\gamma$ = 1). $\\max_{a'} Q(s', a')$ is the maximum Q-value for all actions $a'$ in the next state $s'$ that is reached after taking action $a$ in state $s$.\n",
"where $Q(s, a)$ is the Q-value for taking action $a$ in state $s$, $\\alpha ∈ [0, 1]$ is the learning rate, which explicitly defines to what degree new knowledge overrides old knowledge: for $\\alpha = 0$, no learning happens, while for $\\alpha = 1$, all prior knowledge is lost. $\\gamma$ is the discount factor which allows balancing between an agent that considers only immediate rewards ($\\gamma$ = 0) and one that strives towards long term rewards ($\\gamma$ = 1). $\\max_{a'} Q(s', a')$ is the maximum Q-value for all actions $a'$ in the next state $s'$ that is reached after taking action $a$ in state $s$.\n",
"\n",
"In other words, the optimal policy, $\\pi$, results from taking those actions $a$ that maximize the respective Q-values in each state, $s$. In order for the algorithm to converge to the optimal policy, the requirement is that each state-action pair $(s, a)$ be visited infinitely many times, such that the Q-values have converged."
]
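The update rule above translates almost line for line into code. The following is a minimal, illustrative sketch of a single tabular Q-learning update, assuming the state and action spaces have already been discretized into integer indices; the variable names and dimensions are hypothetical and are not taken from the tutorial's implementation:

```python
import numpy as np

# Illustrative sketch only (not the tutorial's code): a single tabular Q-learning update.
# Assumes discretized states and actions, indexed by integers.
n_states, n_actions = 10, 3
alpha, gamma = 0.1, 0.99                   # learning rate and discount factor
q_table = np.zeros((n_states, n_actions))  # Q(s, a) initialized to zero

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """Apply Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (td_target - q_table[s, a])

# One update after observing reward r = 0.5 for action a = 1 in state s = 0,
# with a transition to state s' = 2:
q_update(0, 1, 0.5, 2)
```

Note how the two extremes of the learning rate described above show up directly in the code: with $\alpha = 0$ the assignment leaves `q_table` unchanged, while with $\alpha = 1$ the old Q-value is discarded entirely in favor of the new target.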
@@ -2573,7 +2573,7 @@
"source": [
"## Replacing the Q-Table with a Function Approximator\n",
"\n",
"Tabular Q-Learning is affected by the curse of dimensionality: as the size of the state space increases due to, e.g., continuous sensor inputs, the size of the Q-table has to necessarily increase is well. In particular for building control, the curse of dimensionality is significant, considering the potentially large number of sensors measuring various quantities (temperature, humidity, energy consumption, etc.) continuously. This means that the agent has an exponentially increasing number of state-action pairs to explore before it can converge to an optimal solution. Function approximators, e.g., linear regression or artificial neural networks ([Haykin (2009)](https://www.pearson.com/en-us/subject-catalog/p/neural-networks-and-learning-machines/P200000003278/9780133002553)), have been proposed as solutions that allow generalization by directly mapping the state-action pairs, $(s, a)$, to their respective Q-value, $Q(s, a)$. Refer to [Reinforcement learning for intelligent environments](https://www.taylorfrancis.com/chapters/edit/10.4324/9781315142074-37/reinforcement-learning-intelligent-environments-zoltan-nagy-june-young-park-josé-ramón-vázquez-canteli) for more information on how to make use of function approximators to improve learning in reinforcement learning control (RLC).\n",
"Tabular Q-Learning is affected by the curse of dimensionality: as the size of the state space increases due to, e.g., continuous sensor inputs, the size of the Q-table has to necessarily increase as well. In particular for building control, the curse of dimensionality is significant, considering the potentially large number of sensors measuring various quantities (temperature, humidity, energy consumption, etc.) continuously. This means that the agent has an exponentially increasing number of state-action pairs to explore before it can converge to an optimal solution. Function approximators, e.g., linear regression or artificial neural networks ([Haykin (2009)](https://www.pearson.com/en-us/subject-catalog/p/neural-networks-and-learning-machines/P200000003278/9780133002553)), have been proposed as solutions that allow generalization by directly mapping the state-action pairs, $(s, a)$, to their respective Q-value, $Q(s, a)$. Refer to [Reinforcement learning for intelligent environments](https://www.taylorfrancis.com/chapters/edit/10.4324/9781315142074-37/reinforcement-learning-intelligent-environments-zoltan-nagy-june-young-park-josé-ramón-vázquez-canteli) for more information on how to make use of function approximators to improve learning in reinforcement learning control (RLC).\n",
"\n",
"In the next section, we will introduce the soft-actor critic (SAC) algorithm, which is a model-free Q-Learning algorithm, that uses a neural network to approximate the Q-values thus, reducing the cost of training compared to Tabular Q-Learning."
]
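As a hedged sketch of what "replacing the Q-table with a function approximator" means in practice, the code below defines a small neural network that maps a continuous state vector to one Q-value per discrete action. PyTorch, the layer sizes and the dimensions are assumptions for illustration only; the SAC implementation used later in the tutorial has its own network architecture:

```python
import torch
import torch.nn as nn

# Illustrative sketch: a function approximator that replaces the Q-table by
# mapping a continuous state vector to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns a tensor of shape (batch_size, n_actions) with Q(s, a) estimates.
        return self.net(state)

# Q-value estimates for a batch of two 8-dimensional observations:
q_network = QNetwork(state_dim=8, n_actions=3)
q_values = q_network(torch.randn(2, 8))
```

Because nearby states produce similar network outputs, the approximator generalizes across states it has never visited, which is exactly what the explicit Q-table cannot do.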
@@ -3277,7 +3277,7 @@
"source": [
"## Train\n",
"\n",
"Here we define one function that performs all the procedures we took to train the SAC agent from selecting buildings, simulation period and active observations to initializing and wrapping the environment, initializing the agent, training it a nd reporting it's results:"
"Here we define one function that performs all the procedures we took to train the SAC agent from selecting buildings, simulation period and active observations to initializing and wrapping the environment, initializing the agent, training it and reporting it's results:"
]
},
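One possible shape of such a helper is sketched below. The CityLearn and stable-baselines3 names used here (`CityLearnEnv`, `NormalizedObservationWrapper`, `StableBaselines3Wrapper`, `SAC`) and their keyword arguments are assumptions based on their public documentation, not a copy of the tutorial's function, and the exact signatures may differ between versions:

```python
# Hypothetical outline; class names and keyword arguments are assumptions and
# may differ from the tutorial's actual train function and CityLearn version.
from stable_baselines3 import SAC
from citylearn.citylearn import CityLearnEnv
from citylearn.wrappers import NormalizedObservationWrapper, StableBaselines3Wrapper

def train_sac(
    schema, buildings, simulation_start_time_step, simulation_end_time_step,
    active_observations, episodes=5,
):
    """Build, wrap and train a SAC agent on the selected buildings and period."""
    env = CityLearnEnv(
        schema,
        central_agent=True,
        buildings=buildings,
        simulation_start_time_step=simulation_start_time_step,
        simulation_end_time_step=simulation_end_time_step,
        active_observations=active_observations,
    )
    env = StableBaselines3Wrapper(NormalizedObservationWrapper(env))
    model = SAC('MlpPolicy', env)
    model.learn(total_timesteps=env.unwrapped.time_steps * episodes)
    return model, env
```

Reporting of results, e.g. KPIs and plots such as `plot_battery_soc_profiles`, would then follow the call to `model.learn` in the same way as in the earlier evaluation steps.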
{