Here I will be posting resources and projects related to my journey of learning artificial intelligence.
Originally, I started my Computer Science journey with web development.
| # | Project Name | Description | Demo | GitHub |
|---|---|---|---|---|
| 1 | DChat - PDF Q&A Application | Web application where users can upload PDFs and ask questions about their content using RAG (Retrieval-Augmented Generation) | LinkedIn Demo | DChat Repository |
| 2 | Handwritten Digit Recognition | CNN-based model for recognizing handwritten digits using the MNIST dataset, integrated with Streamlit for user interaction | - | Handwritten Digit Recognition |
| 3 | Gmail Agent - AI Email Assistant | AI agent built with LangChain, Gemini model, and Streamlit to assist users with email composition | LinkedIn Demo | Gmail Agent Repository |
| 4 | GPT from Scratch - Pretraining | Implementation of GPT model from scratch with pretraining capabilities | - | GPT Pretraining |
| 5 | GPT-2 from Scratch - Finetuning | GPT-2 implementation with classification finetuning for downstream tasks | - | Classification Finetuning |
| 6 | Minimal Lunar Lander - DQN | Reinforcement learning project using Deep Q-Networks with Gymnasium and Stable Baselines3 | - | Minimal Lunar Lander DQN |
| 7 | Q-Learning Implementation | From-scratch implementation of Q-learning algorithm in Python for educational purposes | - | Q-Learning Implementation |
| 8 | Weather Agent | A weather forecasting agent built with LangChain and Google's Gemini AI model | - | Weather Agent |
| 9 | Multiplication Tool | Integrating a multiplication tool with an LLM for exact arithmetic | - | Multiplication tool |
| 10 | Naive-Bayes-Spam-Detection | Implements a Naive Bayes classifier to detect spam messages in SMS text | - | Naive-Bayes-Spam-Detection |
| 11 | Q-Learning Agent playing FrozenLake-v1 | This is a trained model of a Q-Learning agent playing FrozenLake-v1 | - | Huggingface |
| 12 | Q-Learning Agent playing Taxi-v3 | This is a trained model of a Q-Learning agent playing Taxi-v3 | - | Huggingface |
| 13 | DQN Agent playing SpaceInvadersNoFrameskip-v4 | This is a trained model of a DQN agent playing SpaceInvadersNoFrameskip-v4 using the stable-baselines3 library and the RL Zoo. | - | Huggingface |
| 14 | A2C Agent playing PandaReachDense-v3 | Trained A2C agent that controls a robotic arm (moving the arm and using the end-effector) to reach a target position | - | Huggingface |
| # | Title | Link |
|---|---|---|
| 1 | Attention Is All You Need | arXiv:1706.03762 |
| 2 | ReAct: Synergizing Reasoning and Acting in Language Models | arXiv:2210.03629 |
| 3 | Foundations of Large Language Models | arXiv:2501.09223 |
| 4 | DeepSeek-R1 | arXiv:2501.12948 |
| 5 | Diffusion Models | arXiv:2209.00796 |
| 6 | Multimodal Large Language Models | arXiv:2408.01319 |
| 7 | An Introduction to Vision-Language Modeling | arXiv:2405.17247 |
| 8 | Denoising Diffusion Probabilistic Models (DDPM) | arXiv:2006.11239 |
| 9 | Training language models to follow instructions with human feedback | arXiv:2203.02155 |
| 10 | MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control | arXiv:2510.13794 |
| # | Title | Author/Source | Link |
|---|---|---|---|
| 1 | How Large Language Models Work | Microsoft Data Science Blog | Medium Article |
| 2 | Introduction to Machine Learning | Ethem Alpaydin | Google Books |
| 3 | LLMs from Scratch | Sebastian Raschka | GitHub Repository |
| 4 | Understanding Reasoning LLMs | Sebastian Raschka | Substack Article |
| 5 | Foundations of Machine Learning | Tong Xiao and Jingbo Zhu | Foundations of Machine Learning |
| 6 | Book on neural networks and large language models in NLP | Tong Xiao and Jingbo Zhu | Book on neural networks and large language models in NLP |
| 7 | PaliGemma | merve, Andreas P. Steiner, Pedro Cuenca | PaliGemma |
| # | Title | Platform/Author | Link |
|---|---|---|---|
| 1 | Deep Learning with PyTorch - Full Course | YouTube | YouTube |
| 2 | Attention in Transformers: Concepts and Code in PyTorch | DeepLearning.AI | Course Link |
| 3 | How Transformer LLMs Work | DeepLearning.AI | Course Link |
| 4 | Build LLM Apps with LangChain.js | DeepLearning.AI | Course Link |
| 5 | Reinforcement Learning - Developing Intelligent Agents | deeplizard | Course Link |
| 6 | Let's build GPT: from scratch | Andrej Karpathy | Link |
| 7 | Neural Networks: Zero to Hero | Andrej Karpathy | Link |
| 8 | Building a ChatGPT-like model | Stanford | Link |
| 9 | Machine Learning Specialization – Andrew Ng (Autumn 2018) | Stanford | Link |
| 10 | Build an LLM from Scratch | Sebastian Raschka | PlayList |
| 11 | Hands on Reinforcement Learning | Vizuara | Link |
| 12 | Neural Networks / Deep Learning | StatQuest with Josh Starmer | Playlist Neural Networks |
| 13 | Deep reinforcement learning (deep RL) | OpenAI | Spinning Up in Deep RL |
| 14 | Reinforcement Learning - Developing Intelligence | deeplizard | Course series |
| 15 | Deep Reinforcement Learning Course by Thomas Simonini | huggingface & Thomas Simonini | Course link |
| # | Topic | Link |
|---|---|---|
| 1 | Mixture of Experts (MoE) | Hugging Face Blog |
| 2 | Multi-agent | Multi-agent-LangGraph |
| 3 | ReAct agent from scratch with Gemini 2.5 and LangGraph | ReAct agent Gemini 2.5 |
| 4 | Humanoid Gymnasium | Humanoid Environment |
| 5 | Call tools LangGraph | Link |
I built an application called DChat, which is a web application where users can upload PDFs and ask questions related to them.
Here, I encountered LLM models like Gemini 2.0 Flash and Mistral AI, which I used in my application. From there, I got curious about how these models were able to give such great responses. After this, I wanted to know how LLMs work internally.
- Demo link: LinkedIn Demo
- GitHub link: DChat Repository
While building this RAG-based chatbot, I spent a large amount of time reading LangChain (Python and JavaScript) documentation and implementing concepts.
Here I was introduced to:
- Embedding data – converting words into numerical form
- Vector databases (Postgres + PGVector)
I found a great article that explained how LLMs work. This blog gave me an idea of the internal workings of LLMs.
I enrolled in the Zero to Hero playlist by Andrej Karpathy.
I started with:
- Let’s build GPT: from scratch, in code, spelled out
We built GPT following the paper “Attention is All You Need”.
Here I learned about:
- Tokenization, train/val split
- Self-attention
- Transformer, feedforward
- Multi-headed self-attention
It was a hands-on learning experience where I coded along in Colab using PyTorch. I was also introduced to the PyTorch documentation. Honestly, I didn’t understand everything, but I was able to get the intuition. I spent 3–4 days completing the 2-hour video.
I also did hands-on work with:
- The spelled-out intro to neural networks and backpropagation: building micrograd
- Learned manual backpropagation, derivation, and the importance of mathematics in neural networks.
Other useful resources:
In my opinion, these videos are great resources but require some prior knowledge to fully follow along.
In this project, I used the MNIST dataset consisting of handwritten digit images and created a Convolutional Neural Network (CNN) model.
- Trained the CNN model on the dataset, and the model was able to predict handwritten digits.
- Integrated the model with Streamlit, where users can upload an image and get the predicted digit.
- GitHub link: Handwritten Digit Recognition
Some great resources available online:
- Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)
- Introduction to Convolutional Neural Networks for Visual Recognition
- Image Classification
- Loss Functions and Optimization
I have completed 4 lectures so far and look forward to completing the rest soon.
I started learning PyTorch and spent about a week jumping between tutorials and the documentation.
Resources I followed:
I built a Gmail Agent project using LangChain, Gemini model, and Streamlit.
Users can simply provide information, and the agent assists accordingly.
- Demo link: LinkedIn Demo
- GitHub link: Gmail Agent Repository
Among all resources, this one is my favorite. I got the chance to build LLMs from scratch with step-by-step explanations and code.
What I learned:
- Tokenization, token IDs, special context tokens, byte pair encoding
- Data sampling with a sliding window
- Token embeddings, positional encoding
- Self-attention mechanism, causal attention mask, multi-head attention
- Transformer blocks with layer normalization, GELU activations, residual connections
- Implementing GPT model and text generation
- Loss functions (cross-entropy, perplexity), training/validation losses
- Saving/loading pretrained weights
- Finetuning LLMs for tasks like spam classification
- Supervised instruction finetuning on datasets
Resources:
This was hands-on and very clear, making it easier to understand concepts.
Some courses that really helped me understand concepts better:
- Attention in Transformers: Concepts and Code in PyTorch
- How Transformer LLMs Work
- Build LLM Apps with LangChain.js
I am reading the book “Introduction to Machine Learning” by Ethem Alpaydin.
Whenever I am not clear about a concept, I refer to this book to strengthen my understanding.
I watched this video to get introduced to Reinforcement Learning (RL):
I built this Reinforcement Learning project using Gymnasium and Stable Baselines3.
- GitHub link: Minimal Lunar Lander DQN
From today, I will document my daily learning progress.
Exploration vs. Exploitation
- Balancing exploration (trying new actions) and exploitation (choosing known best actions) is crucial in reinforcement learning.
- Q-Learning is a technique that helps agents learn optimal actions through experience.
Epsilon-Greedy Strategy
- The agent starts with a high exploration rate (ε = 1), taking random actions to discover new possibilities.
- As learning progresses, ε decreases, and the agent exploits its knowledge more by choosing actions with higher Q-values.
How Actions Are Chosen
- At the initial state, the agent selects actions randomly due to high exploration.
- Over time, the agent uses the epsilon-greedy strategy to balance exploration and exploitation.
- When exploiting, the agent picks the action with the highest Q-value for the current state from the Q-table.
Q-Learning Process
- After each action, the agent observes the next state and reward, then updates the Q-value in the Q-table for the previous state-action pair.
Resource:
Markov decision processes give us a way to formalize sequential decision making.
A Markov Decision Process models the sequential decision-making of an agent interacting with an environment. At each step, the agent selects an action from the current state, transitions to a new state, and receives a reward. This sequence of states, actions, and rewards forms a trajectory.
The agent’s objective is to maximize the cumulative reward over time, not just the immediate reward from each action. This encourages the agent to consider long-term benefits when making decisions.
MDP Mathematical Representation
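Since the notes above reference the formal definition, here it is for reference (standard notation, not specific to any one course):

$$\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad \tau = (S_0, A_0, R_1, S_1, A_1, R_2, \dots), \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $P(s' \mid s, a)$ is the transition probability, $R$ the reward function, $\gamma \in [0, 1)$ the discount factor, $\tau$ a trajectory, and $G_t$ the discounted return the agent tries to maximize.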
Resource:
Hands-on Q-Learning Implementation
- Implemented the Q-learning algorithm from scratch in Python to gain deeper intuitive understanding
- The practical coding experience clarified theoretical concepts and made the algorithm more concrete
- Translated the mathematical Q-table update formula into working code:
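As a rough illustration (not the exact code from my repository), the update formula can be translated into NumPy like this, assuming `q_table` is a 2D array indexed by (state, action) and `alpha`/`gamma` are illustrative values:

```python
import numpy as np

def update_q_value(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))"""
    td_target = reward + gamma * np.max(q_table[next_state])   # best estimate of future value
    td_error = td_target - q_table[state, action]              # how wrong the current Q-value is
    q_table[state, action] += alpha * td_error                 # move the Q-value toward the target
    return q_table
```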
Key Learning Outcomes:
- Better understanding of how Q-values are calculated and updated iteratively
- Practical experience with epsilon-greedy action selection
- Implementation of the reward feedback mechanism in reinforcement learning
Project: Q-Learning Implementation
Value Functions Overview
Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
State-Value Function
It tells us how good any given state is for an agent following a policy.
Action-Value Function (Q-Function)
It tells us how good it is for the agent to take any given action from a given state while following a policy.
Optimality in Reinforcement Learning
The goal of reinforcement learning algorithms is to find a policy that will yield a lot of reward for the agent if the agent follows that policy.
Value Iteration
- The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q*. This approach is called value iteration.
- Q-values are iteratively updated using value iteration; the Bellman optimality equation being iterated is shown below.
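$$q_*(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t = s,\ A_t = a\big]$$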
Resource:
Experience Replay & Replay Memory
With experience replay, we store the agent's experiences at each time step in a dataset called the replay memory. This replay memory is what we randomly sample from to train the network; the act of gaining experience and sampling from the replay memory that stores these experiences is called experience replay.
Why choose random samples for training?
If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient learning. Taking random samples from replay memory breaks this correlation.
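A minimal replay-memory sketch (my own simplification, not the deeplizard code): transitions go into a bounded buffer and are sampled uniformly at random.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) transitions and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are discarded once full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```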
- Training a deep Q-network with replay memory
The input state data then forward propagates through the network, using the same forward propagation technique that we've discussed for any other general neural network. The model then outputs an estimated Q-value for each possible action from the given input state. The loss is then calculated. We do this by comparing the Q-value output from the network for the action in the experience tuple we sampled and the corresponding optimal Q-value, or target Q-value, for the same action.
Training the policy network
Gradient descent is then performed to update the weights in the network in attempts to minimize the loss. We'll want to keep repeating this process until we've sufficiently minimized the loss. We do the first pass to calculate the Q-value for the relevant action, and then we do a second pass in order to calculate the target Q-value for this same action.
Potential training issues with deep Q-networks
Given this, when our weights update, our outputted Q-values will update, but so will our target Q-values since the targets are calculated using the same weights. So, our Q-values will be updated with each iteration to move closer to the target Q-values, but the target Q-values will also be moving in the same direction. This makes the optimization appear to be chasing its own tail, which introduces instability.
Solution
- Rather than doing a second pass through the policy network to calculate the target Q-values, we instead obtain the target Q-values from a completely separate network, appropriately called the target network.
- The target network is a clone of the policy network. Its weights are initially frozen to the policy network's weights, and we copy the policy network's new weights into the target network every certain number of time steps; this interval is yet another hyperparameter to tune. The first pass still occurs with the policy network, but the second pass, for the following state, occurs with the target network. With the target network, we obtain the max Q-value for the next state and plug this value into the Bellman equation to calculate the target Q-value for the first state (a minimal sketch of this follows below).
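A minimal PyTorch sketch of that idea, with an illustrative Q-network and made-up dimensions (`obs_dim`, `n_actions`, and the sync interval are assumptions):

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99    # illustrative values
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(policy_net)    # the target network starts as a clone
for p in target_net.parameters():
    p.requires_grad_(False)               # targets should not receive gradients

TARGET_UPDATE_EVERY = 1_000               # sync frequency: another hyperparameter to tune

def maybe_sync_target(step):
    # every N steps, copy the policy network's weights into the target network
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())

def compute_targets(rewards, next_states, dones):
    # second pass: the max Q-value for the next state comes from the *target* network
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * max_next_q * (1 - dones)
```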
Resource:
Implementing the Simplest Policy Gradient
Policy gradient methods are a class of reinforcement learning (RL) algorithms that directly optimize a parameterized policy, to maximize the expected cumulative reward.
1. Making the Policy Network
```python
# make core of policy network
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

# make function to compute action distribution
def get_policy(obs):
    logits = logits_net(obs)
    return Categorical(logits=logits)

# make action selection function (outputs int actions, sampled from policy)
def get_action(obs):
    return get_policy(obs).sample().item()
```

Resources:
2. Making the Loss Function.
```python
# make loss function whose gradient, for the right data, is policy gradient
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()
```

3. Running One Epoch of Training
```python
# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # render first episode of each epoch
    finished_rendering_this_epoch = False

    # collect experience by acting in the environment with current policy
    while True:

        # rendering
        if (not finished_rendering_this_epoch) and render:
            env.render()

        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # won't render again this epoch
            finished_rendering_this_epoch = True

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens
```

Resources:
Understanding Reasoning LLMs
- By Sebastian Raschka
Content:
- Explain the meaning of "reasoning model"
- Discuss the advantages and disadvantages of reasoning models
- Outline the methodology behind DeepSeek R1
- Describe the four main approaches to building and improving reasoning models
- Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases
- Provide tips for developing reasoning models on a tight budget
What is reasoning?
In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer.
When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks.
Resources:
A brief look at the DeepSeek training pipeline
The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques.
The 4 main ways to build and improve reasoning models
- Inference-time scaling
- Pure reinforcement learning (RL)
- Supervised finetuning and reinforcement learning (SFT + RL)
- Pure supervised finetuning (SFT) and distillation
Chain-of-thought
One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting, where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems.
For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward.
- The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses.
- The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside `<think>` tags.
Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below.
Resources:
Distillation LLM distillation is a technique that creates smaller, more efficient large language models (LLMs) by transferring knowledge from a large, high-performing "teacher" model to a smaller "student" model.
Why did they develop these distilled models? In my opinion, there are two key reasons:
- Smaller models are more efficient. This means they are cheaper to run, but they can also run on lower-end hardware, which makes these especially interesting for many researchers and tinkerers like me.
- A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning.
What is a Mixture of Experts (MoE)?
Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
What exactly is a MoE?
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router that determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network (a toy gating sketch follows below).
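A toy top-k gating layer, just to make the routing idea concrete (dimensions, expert count, and the looping style are my own simplifications, not how production MoE kernels are written):

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Sparse MoE layer: a learned router sends each token to its top-k expert FFNs."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)   # gate: trained with the rest of the network
        self.top_k = top_k

    def forward(self, x):                             # x: (n_tokens, d_model)
        gate_probs = self.router(x).softmax(dim=-1)
        weights, expert_idx = gate_probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

print(TinyMoELayer()(torch.randn(5, 64)).shape)        # torch.Size([5, 64])
```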
Resources:
Humanoid
- Exploring the Humanoid environment in Gymnasium.
- Humanoid is part of the MuJoCo environments.
What is MuJoCo?
MuJoCo is a free and open source physics engine that aims to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. Source
The 3D bipedal robot is designed to simulate a human.
Action Space & Observation Space
Why not DQN?
- Stable Baselines3 DQN algorithm only supports discrete action spaces.
- Humanoid-v5 environment from Gymnasium has a continuous action space.
- DQN requires a finite set of discrete actions to compute Q-values for each.
- Therefore, DQN cannot be used for this environment.
I used Soft Actor-Critic (SAC), and it worked.
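Roughly what I ran (a sketch: the timestep budget here is illustrative, not the value I actually used):

```python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Humanoid-v5")                 # continuous action space (MuJoCo)
model = SAC("MlpPolicy", env, verbose=1)      # SAC handles continuous actions, unlike DQN
model.learn(total_timesteps=100_000)
model.save("sac_humanoid")
```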
Resources
Recap
What is Action Space?
Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like where the agent controls a robot in a physical world, have continuous action spaces. Source
What is Policy?
A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by μ, or stochastic, usually denoted by π. Source
What is on-policy?
On-policy methods are about learning from what you are currently doing. Imagine you're trying to teach a robot to navigate a maze. In on-policy learning, the robot learns based on the actions it is currently taking.
What is off-policy?
Off-policy methods, on the other hand, are like learning from someone else's experience. In this approach, the robot might watch another robot navigate the maze and learn from its actions.
Why Multi-agent Systems?
An agent is a system that uses an LLM to decide the control flow of an application. As you develop these systems, they might grow more complex over time, making them harder to manage and scale.
Some problems:
- the agent has too many tools at its disposal and makes poor decisions about which tool to call next
- the context grows too complex for a single agent to keep track of
- there is a need for multiple specialization areas in the system (e.g. planner, researcher, math expert, etc.)
To tackle these, you might consider breaking your application into multiple smaller, independent agents and composing them into a multi-agent system. Source
Understanding ReAct
Yao et al., 2022 introduced a framework named ReAct where LLMs are used to generate both reasoning traces and task-specific actions in an interleaved manner.
Generating reasoning traces allows the model to induce, track, and update action plans, and even handle exceptions. The action step allows it to interface with and gather information from external sources such as knowledge bases or environments.
The ReAct framework can allow LLMs to interact with external tools to retrieve additional information that leads to more reliable and factual responses. Source
What are Handoffs?
In multi-agent architectures, agents can be represented as graph nodes. Each agent node executes its step(s) and decides whether to finish execution or route to another agent, including potentially routing to itself (e.g., running in a loop). A common pattern in multi-agent interactions is handoffs, where one agent hands off control to another. Source
Two of the most popular multi-agent architectures are:
- supervisor — individual agents are coordinated by a central supervisor agent. The supervisor controls all communication flow and task delegation, making decisions about which agent to invoke based on the current context and task requirements.
- swarm — agents dynamically hand off control to one another based on their specializations. The system remembers which agent was last active, ensuring that on subsequent interactions, the conversation resumes with that agent.
I implemented both architectures in code, following the LangGraph documentation.
Following ReAct agent from scratch with Gemini 2.5 and LangGraph from Google AI for Developers
- In this tutorial, I will create a simple agent whose goal is to use a tool to find the current weather for a specified location.
- LangGraph is a framework for building stateful LLM applications, making it a good choice for constructing ReAct (Reasoning and Acting) agents.
LangGraph models agents as graphs using three key components (a minimal sketch follows the list):
- State: Shared data structure (typically TypedDict or Pydantic BaseModel) representing the application's current snapshot.
- Nodes: Encodes logic of your agents. They receive the current State as input, perform some computation or side-effect, and return an updated State, such as LLM calls or tool calls.
- Edges: Define the next Node to execute based on the current State, allowing for conditional logic and fixed transitions.
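A minimal sketch of those three components (the state fields and node logic are made up for illustration; the real tutorial uses an LLM call and a weather tool inside the node):

```python
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):      # State: shared snapshot of the application
    question: str
    answer: str

def answer_node(state: AgentState) -> dict:   # Node: takes the State, returns an update
    return {"answer": f"You asked: {state['question']}"}

builder = StateGraph(AgentState)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")             # Edges: fixed (or conditional) transitions
builder.add_edge("answer", END)
graph = builder.compile()

print(graph.invoke({"question": "What's the weather in Kathmandu?", "answer": ""}))
```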
Output:
What are Tools?
Many AI applications interact with users via natural language. However, some use cases require models to interface directly with external systems—such as APIs, databases, or file systems—using structured input. In these scenarios, tool calling enables models to generate requests that conform to a specified input schema.
Question:
- Multiply 23746278364 * 23648723678 ?
Actual Answer:
Grok response:
- Grok didn’t use any tool for the calculation, so its response was incorrect: an LLM on its own can’t handle such complex arithmetic.
Gemini response:
- Gemini used Python as a tool/calculator, so it could handle the complex calculation.
I realized the importance of tools in LLM after watching “How I use LLMs” video from Andrej Karpathy.
I wanted to build a similar project where tools are integrated with an LLM to handle situations where the LLM would otherwise hallucinate.
Input: what's 8282 x 99191?
Output
I was able to build my custom tool and integrate it with the LLM by following this documentation. Resource
How is the model able to choose tools?
Tool calling is typically conditional. Based on the user input and available tools, the model may choose to issue a tool call request. This request is returned in an AIMessage object, which includes a tool_calls field that specifies the tool name and input arguments:
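A hedged sketch of that pattern (the model name and the `multiply` tool are my own choices for illustration; only the general `bind_tools` / `tool_calls` flow comes from the docs):

```python
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers exactly."""
    return a * b

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm_with_tools = llm.bind_tools([multiply])

ai_msg = llm_with_tools.invoke("what's 8282 x 99191?")
print(ai_msg.tool_calls)   # e.g. [{'name': 'multiply', 'args': {'a': 8282, 'b': 99191}, ...}]

# the model only *requests* the call; executing it is up to us (or an agent runtime)
if ai_msg.tool_calls:
    print(multiply.invoke(ai_msg.tool_calls[0]["args"]))   # 821499862
```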
If the input is unrelated to any tool, the model returns only a natural language message:
Importantly, the model does not execute the tool—it only generates a request. A separate executor (such as a runtime or agent) is responsible for handling the tool call and returning the result.
Tool execution
While the model determines when to call a tool, execution of the tool call must be handled by a runtime component. LangGraph provides prebuilt components for this (a small sketch follows the list):
- ToolNode: A prebuilt node that executes tools.
- create_react_agent: Constructs a full agent that manages tool calling automatically.
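A small sketch of the prebuilt agent (the `get_weather` tool is a stand-in I made up; the model choice is just an example):

```python
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Return a (fake) weather report for a city."""
    return f"It is sunny in {city}."

# create_react_agent wires up the LLM, the tool-executing node, and the control flow
agent = create_react_agent(ChatGoogleGenerativeAI(model="gemini-2.0-flash"), tools=[get_weather])
result = agent.invoke({"messages": [{"role": "user", "content": "Weather in Pokhara?"}]})
print(result["messages"][-1].content)
```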
Dynamically select tools
Revising maths concepts (a NumPy sketch follows the definitions)
Scalar
- A scalar is a number. It is a quantity that has a magnitude but has no direction
Vector
- A vector is an array of scalars, or simply a number list.
Matrix
Identity matrix
- It is a square matrix whose diagonal elements are all 1, and other elements are 0.
Row vector
- A vector with only one row is called a row vector.
Column vector
- A vector with only one column is called a column vector.
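Quick NumPy illustrations of these definitions (values are arbitrary):

```python
import numpy as np

scalar = 3.5                                  # a single number: magnitude, no direction
vector = np.array([1.0, 2.0, 3.0])            # an array of scalars
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2D array of scalars
identity = np.eye(2)                          # diagonal elements are 1, the rest are 0
row_vector = np.array([[1, 2, 3]])            # shape (1, 3): a single row
column_vector = np.array([[1], [2], [3]])     # shape (3, 1): a single column

print(matrix @ identity)                      # multiplying by the identity leaves the matrix unchanged
```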
Designing a text classifier
Document and Label sample
Each line of the corpus is a tuple of a piece of text (we simply call it a document) and a label that indicates whether the text is about food or not. We call such tuples samples, or more precisely labeled samples.
Next, let us assume that we have a classifier that learns from those samples the way of labeling documents. The classifier is then used to label every new document as “Food” or “Not-Food”. For example, for the text “Fruit is not my favorite but I can enjoy it.”, the classifier would categorize it as “Food”.
Modern classifier
Modern classifiers are not a system comprising a set of hand-crafted rules. They instead model the classification problem in a probabilistic manner, making it possible to learn the ability of classification from large-scale labeled data.
Problems in designing text classifier
- The first problem we confront in designing text classification models is how to represent a document.
The bag-of-words (BOW) model
- The bag-of-words model is a feature-based model of representing documents.
Feature in ML
One can define a feature not only as some concrete attribute, such as a name and a gender, but also as a quantity that is countable for machine learning systems, such as a real number
In bag-of-words, feature = occurrence times
- The bag-of-words model defines a vector space. In this space, the similarity of two vectors is measured in some way, like the dot product. It helps when one wants to establish the relationship between documents: two documents with more overlapping words are more similar (see the sketch below).
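A small sketch of the idea (scikit-learn is my choice here; the same thing can be done by hand with word-count dictionaries):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Fruit is not my favorite but I can enjoy it",
    "I enjoy fresh fruit every morning",
    "The stock market fell sharply today",
]
X = CountVectorizer().fit_transform(docs)    # each row = bag-of-words counts for one document

# documents that share more words have a larger dot product
print((X[0] @ X[1].T).toarray()[0, 0])       # food vs food: overlapping words
print((X[0] @ X[2].T).toarray()[0, 0])       # food vs stock market: no overlap, 0
```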
Sources:
Linear Classifiers
Naive Bayes
Great explanation on Naive Bayes
Naive Bayes Spam Detection Project
A machine learning project that implements a Naive Bayes classifier to detect spam messages in SMS text data. This project uses scikit-learn's MultinomialNB algorithm combined with feature engineering to achieve effective spam classification.
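A condensed version of the pipeline (toy messages made up here; the real project uses the SMS dataset plus extra feature engineering):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "WINNER!! Claim your free prize now",
    "Are we still meeting for lunch today?",
    "URGENT: your account has won a cash reward",
    "Can you send me the class notes?",
]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["Free cash prize waiting for you"]))   # likely ['spam']
```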
Output
Project Github Link
What is an estimator in ML?
An algorithm or function that takes input data and provides an estimate of an unknown parameter or function, often a target variable, to make predictions or train a model.
Two major types of models in NLP pre-training:
- Sequence Encoding Models
- Sequence Generation Models
Single-layer perceptrons
Single-layer perceptrons (or perceptrons for short) may be the simplest neural networks that have been developed for practical uses. Often, a perceptron is thought of as a biologically-inspired program that transforms some input to some output. A perceptron comprises a number of neurons connecting with input and output variables. The figure below shows a perceptron where there is only one neuron. In this example, there are two real-valued variables x1 and x2 for input and a binary variable y for output.
Activation function
There are many different ways to perform activation. For example, we can use the Softmax function if we want a probability distribution-like output; we can use the Sigmoid function if we want a monotonic, continuous, easy-to-optimize output; we can use the ReLU function if we want a ramp-shaped output.
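The three activations mentioned above, side by side (illustrative input values):

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

print(torch.softmax(z, dim=0))   # probability-distribution-like output (sums to 1)
print(torch.sigmoid(z))          # monotonic, squashes each value into (0, 1)
print(torch.relu(z))             # ramp-shaped: max(0, z)
```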
Sources
Neural Network
- In this multi-layer neural network, the output of every neuron of a layer is connected to all neurons of the following layer. So the network is fully connected.
- Depth of a neural network is measured in terms of the number of layers. It is called model depth sometimes.
- A common measure for the width of a layer is the number of neurons in the layer.
- Stacking layers results in a very common kind of neural network — feed-forward neural networks (FFNNs).
- These networks are called “feed-forward” because there are no cycles in connections between layers and all the data moves in one direction.
- A neural network can fit a squiggle to the data no matter how complex the data is.
- A neural network consists of nodes and connections between the nodes.
- The numbers along each connection represent parameter values that were estimated when the neural network was fit to the data.
Activation function
From the diagram above, I understood that activation functions are just curves of various shapes used to fit the data; the curved or bent lines are called activation functions.
The layers of Nodes between input and output nodes are called hidden layers.
How is the green squiggle (the curve that fits the data) drawn?
Finally, the green squiggle is made.
Parameters that we multiply by are called weights.
Parameters that we add are called biases.
Activation function
NN can fit a green squiggle to just about any dataset, no matter how complicated.
Sources:
If someone tells us their weight, we can predict their height by referring to the line.
Chain rule in maths
Gradient Descent
How Gradient descent can fit a line to data by finding the optimal values for the Intercept and the Slope.
Some steps in gradient descent:
- Fix the slope at 0.64.
- Pick a random initial value for the intercept.
- Use gradient descent to find the optimal value for the intercept.
Sum of the Squared Residuals & Loss Function (a residual is something that remains or is left over)
Sources:
Residual = Observed height - Predicted height (using a weight value read off the diagram)
Sum of square residual is one type of loss function
Gradient descent Algorithm
When we have millions of data points, it can take a long time.
So there is a thing called Stochastic Gradient Descent that uses a randomly selected subset of the data at every step rather than the full dataset. This reduces the time spent calculating the derivatives of Loss Function.
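A sketch of the procedure with the slope fixed at 0.64 (the data points and learning rate are made up for illustration):

```python
import numpy as np

weights = np.array([0.5, 2.3, 2.9])        # x values (illustrative)
heights = np.array([1.4, 1.9, 3.2])        # observed y values (illustrative)

slope = 0.64
intercept = 0.0                            # start from an initial guess
learning_rate = 0.1

for step in range(100):
    predicted = intercept + slope * weights
    residuals = heights - predicted
    gradient = np.sum(-2 * residuals)      # d(SSR)/d(intercept) = sum(-2 * residuals)
    intercept -= learning_rate * gradient  # step size = gradient * learning rate

print(round(intercept, 3))                 # settles near the least-squares intercept
```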
Source:
Backpropagation starts from the last parameter (here the bias b3), assuming the other parameters already have their optimal values.
We get the green squiggle with the help of 4 curves. The 4 curves are made by setting the values of the x and y coordinates. The maths is given below.
We give b3 = 0 as an initial value.
We update the bias to b3 = 1 and get SSR = 7.8, which is better than b3 = 0 (SSR = 20.4).
By continuing to update b3, we can find the value with the lowest SSR, which is of course closer to 0.
b3 = 0, SSR = 20.4
b3 = 1, SSR = 7.8
...
Summation notation
Green is the sum of the blue and orange curves.
green squiggle = blue + orange + b3
Derivative of SSR with respect to b3
We take the derivative of both parts using the chain rule.
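Written out, with the green squiggle as the sum of the curves plus b3:

$$\text{SSR} = \sum_i \big(\text{observed}_i - \text{predicted}_i\big)^2, \qquad \text{predicted}_i = \text{blue}_i + \text{orange}_i + b_3$$

$$\frac{d\,\text{SSR}}{d\,b_3} = \sum_i 2\big(\text{observed}_i - \text{predicted}_i\big) \cdot \frac{d}{d b_3}\big(\text{observed}_i - \text{predicted}_i\big) = \sum_i -2\big(\text{observed}_i - \text{predicted}_i\big)$$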
Source:
Understanding cosine similarity
Finding the cosine similarity of “Hello” and “Hello World”: we need to find the angle between the vectors for these two phrases. Cosine similarity = cos(θ).
- Note: cosine similarity is determined only by the angle, not the length.
Mathematical Formula for cosine similarity
Calculating cosine similarity
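A small sketch from the formula above, with made-up count vectors for the two phrases:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

hello = np.array([1, 0])            # counts of ("hello", "world") in "Hello"
hello_world = np.array([1, 1])      # counts in "Hello World"

print(cosine_similarity(hello, hello_world))        # ~0.707
print(cosine_similarity(hello, 10 * hello_world))   # same value: only the angle matters
```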
Source:
The vector difference between “man with hat” and “man without hat” is very close to the vector for “hat”.
With hat - Without hat = x
Closest matches to x: Hat = 0.165, Cap = 0.113
Adding noise DDPM
Without adding noise
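For reference, the standard DDPM forward (noise-adding) step from the paper:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.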
The DDPM equation can be expressed as a stochastic differential equation (SDE).
Blue line (first expression) = motion of our point along the vector field; grey line (second expression) = random motion.
DDIM needs fewer steps (less compute) to generate a high-quality image.
Source:
Gemini Robotics-ER 1.5 is a vision-language model (VLM)
It's designed for advanced reasoning in the physical world, allowing robots to interpret complex visual data and perform spatial reasoning.
Getting started: Finding objects in a scene
It shows how to pass an image and a text prompt to the model using the generateContent method to get a list of identified objects with their corresponding 2D points.
Code:
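A hedged sketch of the call (the model id, image path, and prompt wording are assumptions based on the tutorial, not copied verbatim):

```python
from google import genai
from PIL import Image

client = genai.Client()                      # reads the API key from the environment
image = Image.open("scene.png")              # hypothetical input image

prompt = (
    "Point to no more than 10 items in the image. "
    'Return JSON: [{"point": [y, x], "label": <name>}] with coordinates normalized to 0-1000.'
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model id
    contents=[image, prompt],
)
print(response.text)                         # list of identified objects with 2D points
```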
Input image
Final Output
Source:
Pointing to objects
Pointing and finding objects in images or video frames is a common use case for vision-and-language models (VLMs) in robotics. The following example asks the model to find specific objects within an image and return their coordinates in an image.
Code:
Output: the coordinates of the scissors, pen, and envelope objects in the image.
Trajectories
- Gemini Robotics-ER 1.5 can generate sequences of points that define a trajectory, useful for guiding robot movement.
- This example requests a trajectory to move a red pen to an organizer, including the starting point and a series of intermediate points.
Orchestration
Making room for a laptop
This example shows how Gemini Robotics-ER can reason about a space. The prompt asks the model to identify which object needs to be moved to create space for another item.
Source:
The CLIP model only goes one direction (from image to vector embedding): we can get embedding vectors from images and text, but we can't generate images or text from the embedding vectors.
Understanding the ReLU activation function
Source:
Multiple Inputs and Outputs
Multiply by the weights and add the biases; the value is then passed to ReLU, and we get 1.6 on the y-axis.
When petal width = (0, ..., 1) and sepal width = 0,
When petal width = (0, ..., 1) and sepal width = 0.2,
When petal width = (0, ..., 1) and sepal width = (0, ...,1)
When multiplied by (-0.1), all points drop toward the surface level.
We do the same for the orange dots and get the surface for orange.
We sum the orange and blue values and get the final value, the green dot.
We do this for every single point and get the green surface.
We get the final output for Setosa
When petal width is close to 1 (the widest), then we will get a high score for virginica.
Now, with petal and sepal width, we can predict the type of flower.
Source:
CNN
Take the dot product, add the bias, and add the result to the feature map.
Slide over by 1 pixel (other CNNs might move 2 or more pixels at a time) and fill up the feature map by moving 1 pixel at a time.
Apply the ReLU activation, which sets all negative numbers to 0: f(x) = max(0, x).
New feature map with max pooling (take the maximum value in each area).
The new feature map is flattened into 4 inputs and passed into a neural network with 2 outputs.
Multiply by the weights and add the bias; we get 0.34 on the y-axis from the ReLU activation function.
Multiply by the weight and add the bias; the final output is 0.99, which is essentially 1.
When the picture is of the letter 'O', the CNN predicts the letter 'O' by giving it a 1.
Filter or kernel (how a filter is made)
We also get the final output for the 'X' image.
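A compact PyTorch version of the pipeline described above, convolution -> ReLU -> max pooling -> flatten -> fully connected (sizes are made up):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=1)   # one 3x3 filter, sliding 1 pixel
        self.relu = nn.ReLU()                                   # negatives become 0
        self.pool = nn.MaxPool2d(2)                             # keep the max value of each area
        self.fc = nn.LazyLinear(n_classes)                      # flattened map -> 2 outputs

    def forward(self, x):
        feature_map = self.pool(self.relu(self.conv(x)))
        return self.fc(feature_map.flatten(start_dim=1))

image = torch.randn(1, 1, 6, 6)     # a tiny fake 'O'/'X'-style image
print(TinyCNN()(image))             # two scores, one per class
```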
Source:
What is continuous-control RL?
An agent's ability to produce actions that are not limited to a finite set of choices but instead can be any value within a range, such as motor torques or steering angles.
Unlike discrete actions (e.g., "left" or "right"), continuous actions require an agent to make fine-grained adjustments to achieve a goal in a dynamic environment.
Algorithms like Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC) are used to learn these continuous control policies by optimizing continuous outputs based on rewards received from the environment.
Markov Decision Process (MDP)
The Markov property implies that our agent needs only the current state to decide what action to take, and not the history of all the states and actions it took before.
Action Space types
Discrete space: the number of possible actions is finite.
- In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).
Continuous space: the number of possible actions is infinite.
- A self-driving car agent has an infinite number of possible actions, since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…
The Policy π: the agent’s brain
The Policy π is the brain of our Agent, it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.
There are two approaches to train our agent to find this optimal policy π*:
- Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
- Indirectly, teach the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.
Trained model and pushed in huggingface
Source:
Today I learned to:
- Push my trained model in huggingface.
- Use other trained model from huggingface using repo id.
Source:
Action-value function
It determines the value of being at a particular state and taking a specific action at that state.
A small recap of Q-Learning
Q-Learning is the RL algorithm that:
- Trains a Q-function, an action-value function that is encoded, in internal memory, by a Q-table containing all the state-action pair values.
- Given a state and action, our Q-function will search the Q-table for the corresponding value.
- When the training is done, we have an optimal Q-function, and therefore an optimal Q-table.
- And if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.
I read the article on Meta Superintelligence's surprising first paper and an overview of the paper from MSI.
Train Q-Learning agent
We can have two sizes of environment:
- map_name="4x4": a 4x4 grid version
- map_name="8x8": an 8x8 grid version
The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
- 0: GO LEFT
- 1: GO DOWN
- 2: GO RIGHT
- 3: GO UP
Reward function 💰:
- Reach goal: +1
- Reach hole: 0
- Reach frozen: 0
Define Greedy Policy
Remember we have two policies since Q-Learning is an off-policy algorithm. This means we're using a different policy for acting and updating the value function.
- Epsilon-greedy policy (acting policy)
- Greedy policy (updating policy)
The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.
Define the epsilon-greedy policy
Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off. The idea with epsilon-greedy:
- With probability 1 - ε: we do exploitation (i.e. our agent selects the action with the highest state-action pair value).
- With probability ε: we do exploration (trying a random action).
As training continues, we progressively reduce the epsilon value, since we will need less and less exploration and more exploitation (a minimal sketch of both policies follows below).
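A minimal sketch of the two policies, following the course's structure (a NumPy Q-table is assumed):

```python
import numpy as np

def greedy_policy(q_table, state):
    # exploitation: take the action with the highest state-action value
    return int(np.argmax(q_table[state]))

def epsilon_greedy_policy(q_table, state, epsilon, n_actions, rng=None):
    # with probability epsilon explore (random action), otherwise exploit
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_policy(q_table, state)
```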
Define the hyperparameters
The exploration-related hyperparameters are some of the most important ones.
- We need to make sure that our agent explores enough of the state space to learn a good value approximation. To do that, we need a progressive decay of epsilon.
- If you decrease epsilon too fast (too high a decay_rate), you risk your agent getting stuck, since it didn't explore enough of the state space and hence can't solve the problem.
- Agent training code: this implementation is based on the Hugging Face Deep Reinforcement Learning Course (a condensed sketch follows).
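A condensed sketch of that training loop (it reuses the `epsilon_greedy_policy` sketch above; the hyperparameter values are illustrative, not the course's exact ones):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

lr, gamma = 0.7, 0.95
max_eps, min_eps, decay_rate = 1.0, 0.05, 0.0005

for episode in range(10_000):
    epsilon = min_eps + (max_eps - min_eps) * np.exp(-decay_rate * episode)
    state, _ = env.reset()
    for _ in range(99):
        action = epsilon_greedy_policy(q_table, state, epsilon, env.action_space.n)
        new_state, reward, terminated, truncated, _ = env.step(action)
        # Q-learning update uses the greedy estimate of the next state's value
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if terminated or truncated:
            break
```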
Training Taxi-v3 environment
Before training Q-table
After training Q-table
Learned to Load Model from Hub
This is the architecture of our Deep Q-Learning network:
Why do we stack four frames together? A single frame gives no sense of motion, so we stack four consecutive frames to give the network temporal information about the direction and speed of moving objects.
Hands-on
Train Deep Q-Learning agent to play Atari Games
Adjusting hyperparameters to train our Deep Q-Learning agent to play Space Invaders.
Training a model
Evaluate agent
I have trained and uploaded my Deep Q-Learning agent using RL-Baselines-3 Zoo.
We’ll first try to optimize the parameters of the DQN studied in the last unit manually. We’ll then learn how to automate the search using Optuna.
Which algorithm should I use?
The first distinction comes from your action space: do you have discrete actions (e.g. LEFT, RIGHT, ...) or continuous actions (e.g. go at a certain speed)? The second consideration is whether you can parallelize your training or not.
Notes:
```python
# Define and train an A2C model
# verbose: 0 for no output, 1 for info messages, 2 for debug messages
# seed: fixes the random seed so the run is reproducible
a2c_model = A2C("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

# Evaluate the trained A2C model on eval_envs (n_envs parallel copies of the environment)
mean_reward, std_reward = evaluate_policy(a2c_model, eval_envs, n_eval_episodes=100, deterministic=True)
print(f"A2C Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

What is on-policy?
On-policy methods are about learning from what you are currently doing. Imagine you're trying to teach a robot to navigate a maze. In on-policy learning, the robot learns based on the actions it is currently taking. It's like learning to cook by trying out different recipes yourself.
What is off-policy?
Off-policy methods, on the other hand, are like learning from someone else's experience. In this approach, the robot might watch another robot navigate the maze and learn from its actions. It doesn't have to follow the same policy as the robot it's observing. It involves learning the value of the optimal policy independently of the agent's actions. These methods enable the agent to learn from observations about the optimal policy, even when it's not following it. This is useful for learning from a fixed dataset or a teaching policy.
In some cases training longer is not a solution
Training for “4000 steps (20 episodes)”
Training Longer PPO
Tuned Hyperparameters
Reward Increased drastically
Result:
Not tuned: -1158; after tuning: -159
We will create a script that allows us to search for the best hyperparameters automatically (a minimal Optuna sketch follows).
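A minimal sketch of what such a script can look like with Optuna (the objective, environment, and search ranges here are illustrative; the course notebook's version is more complete):

```python
import optuna
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    gamma = trial.suggest_float("gamma", 0.9, 0.9999, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)

    env = gym.make("Pendulum-v1")
    model = A2C("MlpPolicy", env, gamma=gamma, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=20_000)

    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```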
Hyperparameters
Policy Gradient with Pytorch
With policy-based methods, we want to optimize the policy directly without having an intermediate step of learning a value function. We’ll learn about policy-based methods and study a subset of these methods called policy gradient.
We’ll implement our first policy gradient algorithm called Monte Carlo Reinforce from scratch using PyTorch.
What are the policy-based methods?
The main goal of reinforcement learning is to find the optimal policy π* that will maximize the expected cumulative reward.
A stochastic policy in reinforcement learning (RL) dictates the probability of taking each action in a given state, rather than a single, predetermined action.
Value-based methods
- The idea is that an optimal value function leads to an optimal policy π*.
- Our objective is to minimize the loss between the predicted and target value to approximate the true action-value function.
Policy-based methods
- The idea is to parameterize the policy.
- Our objective then is to maximize the performance of the parameterized policy using gradient ascent.
Source
Reinforce Algorithm
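For reference, the Monte Carlo (REINFORCE) gradient estimate and the gradient-ascent update:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau), \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$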
What is Unity ML-Agents?
Unity ML-Agents is a toolkit for the game engine Unity that allows us to create environments using Unity or use pre-made environments to train our agents.
With Unity ML-Agents, you have six essential components:
- The first is the Learning Environment, which contains the Unity scene (the environment) and the environment elements (game characters).
- The second is the Python Low-level API, which contains the low-level Python interface for interacting and manipulating the environment. It’s the API we use to launch the training.
- Then, we have the External Communicator that connects the Learning Environment (made with C#) with the low level Python API (Python).
- The Python trainers: the Reinforcement algorithms made with PyTorch (PPO, SAC…).
- The Gym wrapper: to encapsulate the RL environment in a gym wrapper.
- The PettingZoo wrapper: PettingZoo is the multi-agents version of the gym wrapper.
The observation space
- Regarding observations, we don’t use normal vision (frame), but we use raycasts.
- Think of raycasts as lasers that will detect if they pass through an object.
Actor-Critic is a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance using:
- An Actor that controls how our agent behaves (Policy-Based method)
- A Critic that measures how good the taken action is (Value-Based method)
To understand the Actor-Critic, imagine you’re playing a video game. You can play with a friend that will provide you with some feedback. You’re the Actor and your friend is the Critic.
This is the idea behind Actor-Critic. We learn two function approximations:
- A policy that controls how our agent acts: π_θ(s)
- A value function to assist the policy update by measuring how good the taken action is: q̂_w(s, a)
The task is to control a robotic arm (moving the arm and using the end-effector) so that it reaches a target position.
- Source code
- Train A2C agent using Stable-Baselines3 in a robotic environment to move arm to the correct position.
- Huggingface A2C Agent
Multi-Agent System
Since the beginning of this course, we learned to train agents in a single-agent system where our agent was alone in its environment: it was not cooperating or collaborating with other agents.
But, as humans, we live in a multi-agent world. Our intelligence comes from interaction with other agents. And so, our goal is to create agents that can interact with other humans and other agents.
- Multi-agent example: a football match
We have two solutions to design this multi-agent reinforcement learning system (MARL).
Decentralized approach
The idea here is that our training agent will consider other agents as part of the environment dynamics, not as agents.
The benefit is that, since no information is shared between agents, each one (e.g. the robot vacuums in the course example) can be designed and trained just like we train a single agent.
Centralized approach
The intuition behind PPO
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: we want to avoid having too large of a policy update.
With PPO, the idea is to constrain our policy update with a new objective function called the Clipped surrogate objective function that will constrain the policy change in a small range using a clip.
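For reference, the clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$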
Implement PPO from scratch
Developing Research Project
- Start by exploring the literature to become aware of topics in the field.
- If you’re looking for inspiration, or just want to get a rough sense of what’s out there, check out Spinning Up’s key papers list.
- Find a paper that you enjoy on one of these subjects—something that inspires you—and read it thoroughly.
- Use the related work section and citations to find closely-related papers and do a deep dive in the literature.
- You’ll start to figure out where the unsolved problems are and where you can make an impact.
Approaches to idea-generation
Frame 1: Improving on an Existing Approach
- This is the incrementalist angle, where you try to get performance gains in an established problem setting by tweaking an existing algorithm.
Frame 2: Focusing on Unsolved Benchmarks
- Instead of thinking about how to improve an existing method, you aim to succeed on a task that no one has solved before.
Frame 3: Create a New Problem Setting
- Instead of thinking about existing methods or current grand challenges, think of an entirely different conceptual problem that hasn’t been studied yet.
- Avoid reinventing the wheel.
Model-based Vs Model-free RL
Model-based RL:
- The agent first builds an internal model of the environment, which predicts future states and rewards. It then uses this model to plan and simulate actions before acting in the real world.
Model-free RL:
- The agent skips the model-building step and learns directly from interacting with the environment. This can be simpler for environments where building an accurate model is difficult.
Source:
Most of these deep RL methods primarily focus on learning different tasks in isolation, making it challenging to utilize shared information between tasks to develop a generalized policy.
Multi-task reinforcement learning (MTRL) aims to master a set of RL tasks effectively. By leveraging the potential information sharing among different tasks, joint multi-task learning typically exhibits higher sample efficiency than training each task individually.
Challenge in MTRL
A significant challenge in MTRL lies in determining what information should be shared and how to share it effectively.
For instance, someone who can ride a bicycle can quickly learn to ride a motorcycle by referring to related skills, such as operating controls, maintaining balance, and executing turns. Likewise, a motorcyclist adept in these skills can also quickly learn to ride a bicycle. This ability allows humans to efficiently master multiple tasks by selectively referring to skills previously learned.
Cross-Task Policy Guidance (CTPG)
CTPG is a generalized MTRL framework that can be combined with various existing parameter sharing methods. Among these, we choose several classical approaches and integrate them with CTPG, achieving significant improvement in sample efficiency and final performance on both manipulation and locomotion MTRL benchmarks.
Source:
Supervised Fine-tuning(SFT) & Reinforcement Learning from Human Feedback(RLHF)
Aligning means making our model respond politely and helpfully.
Note: after pretraining, the model can predict the next word, but it is not yet aligned.
Supervised fine-tuning = using user prompts and desired responses to train the model.
Supervised fine-tuning allows the model to generate polite and helpful responses, but only for prompts similar to those it was trained on. Note: the model cannot reliably respond well to new prompts.
How do we train the model to respond well to new prompts?
Answer: a super-huge fine-tuning dataset. Note: collecting that data and training the model on such a huge dataset costs a huge amount of money.
Alternative: RLHF
Model generates multiple responses and human selects the best response.
- Train a reward model with a pairwise loss function (shown below).
- After the reward model is trained, the supervised fine-tuned model is trained further with reinforcement learning, using the reward model to score the responses it generates.
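The pairwise reward-model loss (as in the InstructGPT paper listed above), where y_w is the preferred response and y_l the rejected one:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$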
Source:
Stanford AI Club: Jason Wei on 3 Key Ideas in AI in 2025
Note: The impact of AI will be seen earliest on tasks that are digital, easy for humans, and data abundant. Implications:
- Certain fields will be heavily accelerated by AI (e.g., software development)
- Other fields will remain largely untouched (e.g., hairdressing)
AlphaEvolve
Understanding AlphaEvolve
Source:
Statistical Machine Learning
As intuitive as it sounds from its name, statistical machine learning involves using statistical techniques to develop models that can learn from data and make predictions or decisions.
The principles of statistics are the very pillars that uphold the structure of machine learning.
- Constructing machine learning models: Statistics provides the methodologies and principles for creating models in machine learning. For instance, the linear regression model leverages the statistical method of least squares to estimate the coefficients.
- Interpreting results: Statistical concepts allow us to interpret the results generated by machine learning models. Measures such as p-values, confidence intervals, R-squared, and others provide us with a statistical perspective on the machine learning model's performance.
- Validating models: Statistical techniques are essential for validating and refining machine learning models. For instance, techniques like hypothesis testing, cross-validation, and bootstrapping help us quantify the performance of models and avoid problems like overfitting.
- Underpinning advanced techniques: Even some of the more complex machine learning algorithms, such as neural networks, have statistical principles at their core. The optimization techniques, like gradient descent, used to train these models are based on statistical theory.
Source:
Trying to understand MimicKit
This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures.
DeepMimic
We show that well-known reinforcement learning (RL) methods can be adapted to learn robust control policies capable of imitating a broad range of example motion clips, while also learning complex recoveries, adapting to changes in morphology, and accomplishing user-specified goals.
Our method handles keyframed motions, highly-dynamic actions such as motion-captured flips and spins, and retargeted motions. By combining a motion-imitation objective with a task objective, we can train characters that react intelligently in interactive settings, e.g., by walking in a desired direction or throwing a ball at a user-specified target.