Here I will be posting resources and projects related to my journey of learning artificial intelligence.
Originally, I started my Computer Science journey with web development.
| # | Project Name | Description | Demo | GitHub |
|---|---|---|---|---|
| 1 | DChat - PDF Q&A Application | Web application where users can upload PDFs and ask questions about their content using RAG (Retrieval-Augmented Generation) | LinkedIn Demo | DChat Repository |
| 2 | Handwritten Digit Recognition | CNN-based model for recognizing handwritten digits using the MNIST dataset, integrated with Streamlit for user interaction | - | Handwritten Digit Recognition |
| 3 | Gmail Agent - AI Email Assistant | AI agent built with LangChain, Gemini model, and Streamlit to assist users with email composition | LinkedIn Demo | Gmail Agent Repository |
| 4 | GPT from Scratch - Pretraining | Implementation of GPT model from scratch with pretraining capabilities | - | GPT Pretraining |
| 5 | GPT-2 from Scratch - Finetuning | GPT-2 implementation with classification finetuning for downstream tasks | - | Classification Finetuning |
| 6 | Minimal Lunar Lander - DQN | Reinforcement learning project using Deep Q-Networks with Gymnasium and Stable Baselines3 | - | Minimal Lunar Lander DQN |
| 7 | Q-Learning Implementation | From-scratch implementation of Q-learning algorithm in Python for educational purposes | - | Q-Learning Implementation |
| 8 | Weather Agent | A weather forecasting agent built with LangChain and Google's Gemini AI model | - | Weather Agent |
| 9 | Multiplication Tool | Integrating a multiplication tool with an LLM for exact arithmetic | - | Multiplication tool |
| 10 | Naive-Bayes-Spam-Detection | Implements a Naive Bayes classifier to detect spam messages in SMS text | - | Naive-Bayes-Spam-Detection |
| 11 | Q-Learning Agent playing FrozenLake-v1 | This is a trained model of a Q-Learning agent playing FrozenLake-v1 | - | Huggingface |
| 12 | Q-Learning Agent playing Taxi-v3 | This is a trained model of a Q-Learning agent playing Taxi-v3 | - | Huggingface |
| 13 | DQN Agent playing SpaceInvadersNoFrameskip-v4 | This is a trained model of a DQN agent playing SpaceInvadersNoFrameskip-v4 using the stable-baselines3 library and the RL Zoo. | - | Huggingface |
| 14 | A2C Agent playing PandaReachDense-v3 | Trained A2C agent that controls a robotic arm (moving the arm and using the end-effector) to reach a target position | - | Huggingface |
| # | Title | Link |
|---|---|---|
| 1 | Attention Is All You Need | arXiv:1706.03762 |
| 2 | ReAct: Synergizing Reasoning and Acting in Language Models | arXiv:2210.03629 |
| 3 | Foundations of Large Language Models | arXiv:2501.09223 |
| 4 | DeepSeek-R1 | arXiv:2501.12948 |
| 5 | Diffusion Models | arXiv:2209.00796 |
| 6 | Multimodal Large Language Models | arXiv:2408.01319 |
| 7 | An Introduction to Vision-Language Modeling | arXiv:2405.17247 |
| 8 | Denoising Diffusion Probabilistic Models (DDPM) | arXiv:2006.11239 |
| 9 | Training language models to follow instructions with human feedback | arXiv:2203.02155 |
| 10 | MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control | arXiv:2510.13794 |
| # | Title | Author/Source | Link |
|---|---|---|---|
| 1 | How Large Language Models Work | Microsoft Data Science Blog | Medium Article |
| 2 | Introduction to Machine Learning | Ethem Alpaydin | Google Books |
| 3 | LLMs from Scratch | Sebastian Raschka | GitHub Repository |
| 4 | Understanding Reasoning LLMs | Sebastian Raschka | Substack Article |
| 5 | Foundations of Machine Learning | Tong Xiao and Jingbo Zhu | Foundations of Machine Learning |
| 6 | Book on neural networks and large language models in NLP | Tong Xiao and Jingbo Zhu | Book on neural networks and large language models in NLP |
| 7 | PaliGemma | merve, Andreas P. Steiner, Pedro Cuenca | PaliGemma |
| # | Title | Platform/Author | Link |
|---|---|---|---|
| 1 | Deep Learning with PyTorch - Full Course | YouTube | YouTube |
| 2 | Attention in Transformers: Concepts and Code in PyTorch | DeepLearning.AI | Course Link |
| 3 | How Transformer LLMs Work | DeepLearning.AI | Course Link |
| 4 | Build LLM Apps with LangChain.js | DeepLearning.AI | Course Link |
| 5 | Reinforcement Learning - Developing Intelligent Agents | deeplizard | Course Link |
| 6 | Let's build GPT: from scratch | Andrej Karpathy | Link |
| 7 | Neural Networks: Zero to Hero | Andrej Karpathy | Link |
| 8 | Building a ChatGPT-like model | Stanford | Link |
| 9 | Machine Learning Specialization – Andrew Ng (Autumn 2018) | Stanford | Link |
| 10 | Build an LLM from Scratch | Sebastian Raschka | PlayList |
| 11 | Hands on Reinforcement Learning | Vizuara | Link |
| 12 | Neural Networks / Deep Learning | StatQuest with Josh Starmer | Playlist Neural Networks |
| 13 | Deep reinforcement learning (deep RL) | OpenAI | Spinning Up in Deep RL |
| 14 | Reinforcement Learning - Developing Intelligence | deeplizard | Course series |
| 15 | Deep Reinforcement Learning Course by Thomas Simonini | huggingface & Thomas Simonini | Course link |
| # | Topic | Link |
|---|---|---|
| 1 | Mixture of Experts (MoE) | Hugging Face Blog |
| 2 | Multi-agent | Multi-agent-LangGraph |
| 3 | ReAct agent from scratch with Gemini 2.5 and LangGraph | ReAct agent Gemini 2.5 |
| 4 | Humanoid Gymnasium | Humanoid Environment |
| 5 | Call tools LangGraph | Link |
I built an application called DChat, which is a web application where users can upload PDFs and ask questions related to them.
Here, I encountered LLM models like Gemini 2.0 Flash and Mistral AI, which I used in my application. From there, I got curious about how these models were able to give such great responses. After this, I wanted to know how LLMs work internally.
- Demo link: LinkedIn Demo
- GitHub link: DChat Repository
While building this RAG-based chatbot, I spent a large amount of time reading LangChain (Python and JavaScript) documentation and implementing concepts.
Here I was introduced to:
- Embedding data – converting words into numerical form
- Vector databases (Postgres + PGVector)
I found a great article that explained how LLMs work. This blog gave me an idea of the internal workings of LLMs.
I enrolled in the Zero to Hero playlist by Andrej Karpathy.
I started with:
- Let’s build GPT: from scratch, in code, spelled out
We built GPT following the paper “Attention is All You Need”.
Here I learned about:
- Tokenization, train/val split
- Self-attention
- Transformer, feedforward
- Multi-headed self-attention
It was a hands-on learning experience where I coded along in Colab using PyTorch. I was also introduced to the PyTorch documentation. Honestly, I didn’t understand everything, but I was able to get the intuition. I spent 3–4 days completing the 2-hour video.
I also did hands-on work with:
- The spelled-out intro to neural networks and backpropagation: building micrograd
- Learned manual backpropagation, derivation, and the importance of mathematics in neural networks.
Other useful resources:
In my opinion, these videos are great resources but require some prior knowledge to fully follow along.
In this project, I used the MNIST dataset consisting of handwritten digit images and created a Convolutional Neural Network (CNN) model.
- Trained the CNN model on the dataset, and the model was able to predict handwritten digits.
- Integrated the model with Streamlit, where users can upload an image and get the predicted digit.
- GitHub link: Handwritten Digit Recognition
Some great resources available online:
- Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)
- Introduction to Convolutional Neural Networks for Visual Recognition
- Image Classification
- Loss Functions and Optimization
I have completed 4 lectures so far and look forward to completing the rest soon.
I started learning PyTorch and spent about a week jumping between tutorials and the documentation.
Resources I followed:
I built a Gmail Agent project using LangChain, Gemini model, and Streamlit.
Users can simply provide information, and the agent assists accordingly.
- Demo link: LinkedIn Demo
- GitHub link: Gmail Agent Repository
Among all resources, this one is my favorite. I got the chance to build LLMs from scratch with step-by-step explanations and code.
What I learned:
- Tokenization, token IDs, special context tokens, byte pair encoding
- Data sampling with a sliding window
- Token embeddings, positional encoding
- Self-attention mechanism, causal attention mask, multi-head attention
- Transformer blocks with layer normalization, GELU activations, residual connections
- Implementing GPT model and text generation
- Loss functions (cross-entropy, perplexity), training/validation losses
- Saving/loading pretrained weights
- Finetuning LLMs for tasks like spam classification
- Supervised instruction finetuning on datasets
Resources:
This was hands-on and very clear, making it easier to understand concepts.
Some courses that really helped me understand concepts better:
- Attention in Transformers: Concepts and Code in PyTorch
- How Transformer LLMs Work
- Build LLM Apps with LangChain.js
I am reading the book “Introduction to Machine Learning” by Ethem Alpaydin.
Whenever I am not clear about a concept, I refer to this book to strengthen my understanding.
I watched this video to get introduced to Reinforcement Learning (RL):
I built this Reinforcement Learning project using Gymnasium and Stable Baselines3.
- GitHub link: Minimal Lunar Lander DQN
From today, I will document my daily learning progress.
Exploration vs. Exploitation
- Balancing exploration (trying new actions) and exploitation (choosing known best actions) is crucial in reinforcement learning.
- Q-Learning is a technique that helps agents learn optimal actions through experience.
Epsilon-Greedy Strategy
- The agent starts with a high exploration rate (ε = 1), taking random actions to discover new possibilities.
- As learning progresses, ε decreases, and the agent exploits its knowledge more by choosing actions with higher Q-values.
How Actions Are Chosen
- At the initial state, the agent selects actions randomly due to high exploration.
- Over time, the agent uses the epsilon-greedy strategy to balance exploration and exploitation.
- When exploiting, the agent picks the action with the highest Q-value for the current state from the Q-table.
Q-Learning Process
- After each action, the agent observes the next state and reward, then updates the Q-value in the Q-table for the previous state-action pair.
Resource:
Markov decision processes give us a way to formalize sequential decision making.
A Markov Decision Process models the sequential decision-making of an agent interacting with an environment. At each step, the agent selects an action from the current state, transitions to a new state, and receives a reward. This sequence of states, actions, and rewards forms a trajectory.
The agent’s objective is to maximize the cumulative reward over time, not just the immediate reward from each action. This encourages the agent to consider long-term benefits when making decisions.
MDP Mathematical Representation
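Since the notes above reference the formal definition, here it is for reference (standard notation, not specific to any one course):

$$\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad \tau = (S_0, A_0, R_1, S_1, A_1, R_2, \dots), \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $P(s' \mid s, a)$ is the transition probability, $R$ the reward function, $\gamma \in [0, 1)$ the discount factor, $\tau$ a trajectory, and $G_t$ the discounted return the agent tries to maximize.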
Resource:
Hands-on Q-Learning Implementation
- Implemented the Q-learning algorithm from scratch in Python to gain deeper intuitive understanding
- The practical coding experience clarified theoretical concepts and made the algorithm more concrete
- Translated the mathematical Q-table update formula into working code:
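As a rough illustration (not the exact code from my repository), the update formula can be translated into NumPy like this, assuming `q_table` is a 2D array indexed by (state, action) and `alpha`/`gamma` are illustrative values:

```python
import numpy as np

def update_q_value(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))"""
    td_target = reward + gamma * np.max(q_table[next_state])   # best estimate of future value
    td_error = td_target - q_table[state, action]              # how wrong the current Q-value is
    q_table[state, action] += alpha * td_error                 # move the Q-value toward the target
    return q_table
```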
Key Learning Outcomes:
- Better understanding of how Q-values are calculated and updated iteratively
- Practical experience with epsilon-greedy action selection
- Implementation of the reward feedback mechanism in reinforcement learning
Project: Q-Learning Implementation
Value Functions Overview
Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
State-Value Function
It tells us how good any given state is for an agent following a policy.
Action-Value Function (Q-Function)
It tells us how good it is for the agent to take any given action from a given state while following a policy.
Optimality in Reinforcement Learning
The goal of reinforcement learning algorithms is to find a policy that will yield a lot of reward for the agent if the agent follows that policy.
Value Iteration
- The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q*. This approach is called value iteration.
- Q-values are iteratively updated using value iteration; the Bellman optimality equation being iterated is shown below.
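$$q_*(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t = s,\ A_t = a\big]$$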
Resource:
Experience Replay & Replay Memory
With experience replay, we store the agent's experiences at each time step in a dataset called the replay memory. This replay memory is what we randomly sample from to train the network; the act of gaining experience and sampling from the replay memory that stores these experiences is called experience replay.
Why choose random samples for training?
If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient learning. Taking random samples from replay memory breaks this correlation.
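A minimal replay-memory sketch (my own simplification, not the deeplizard code): transitions go into a bounded buffer and are sampled uniformly at random.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) transitions and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are discarded once full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```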
- Training a deep Q-network with replay memory
The input state data then forward propagates through the network, using the same forward propagation technique that we've discussed for any other general neural network. The model then outputs an estimated Q-value for each possible action from the given input state. The loss is then calculated. We do this by comparing the Q-value output from the network for the action in the experience tuple we sampled and the corresponding optimal Q-value, or target Q-value, for the same action.
Training the policy network
Gradient descent is then performed to update the weights in the network in attempts to minimize the loss. We'll want to keep repeating this process until we've sufficiently minimized the loss. We do the first pass to calculate the Q-value for the relevant action, and then we do a second pass in order to calculate the target Q-value for this same action.
Potential training issues with deep Q-networks
Given this, when our weights update, our outputted Q-values will update, but so will our target Q-values since the targets are calculated using the same weights. So, our Q-values will be updated with each iteration to move closer to the target Q-values, but the target Q-values will also be moving in the same direction. This makes the optimization appear to be chasing its own tail, which introduces instability.
Solution
- Rather than doing a second pass through the policy network to calculate the target Q-values, we instead obtain the target Q-values from a completely separate network, appropriately called the target network.
- The target network is a clone of the policy network. Its weights are initially frozen to the policy network's weights, and we copy the policy network's new weights into the target network every certain number of time steps; this interval is yet another hyperparameter to tune. The first pass still occurs with the policy network, but the second pass, for the following state, occurs with the target network. With the target network, we obtain the max Q-value for the next state and plug this value into the Bellman equation to calculate the target Q-value for the first state (a minimal sketch of this follows below).
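A minimal PyTorch sketch of that idea, with an illustrative Q-network and made-up dimensions (`obs_dim`, `n_actions`, and the sync interval are assumptions):

```python
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99    # illustrative values
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(policy_net)    # the target network starts as a clone
for p in target_net.parameters():
    p.requires_grad_(False)               # targets should not receive gradients

TARGET_UPDATE_EVERY = 1_000               # sync frequency: another hyperparameter to tune

def maybe_sync_target(step):
    # every N steps, copy the policy network's weights into the target network
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())

def compute_targets(rewards, next_states, dones):
    # second pass: the max Q-value for the next state comes from the *target* network
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * max_next_q * (1 - dones)
```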
Resource:
Implementing the Simplest Policy Gradient
Policy gradient methods are a class of reinforcement learning (RL) algorithms that directly optimize a parameterized policy, to maximize the expected cumulative reward.
1. Making the Policy Network
```python
# make core of policy network
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

# make function to compute action distribution
def get_policy(obs):
    logits = logits_net(obs)
    return Categorical(logits=logits)

# make action selection function (outputs int actions, sampled from policy)
def get_action(obs):
    return get_policy(obs).sample().item()
```

Resources:
2. Making the Loss Function.
```python
# make loss function whose gradient, for the right data, is policy gradient
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()
```

3. Running One Epoch of Training
```python
# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # render first episode of each epoch
    finished_rendering_this_epoch = False

    # collect experience by acting in the environment with current policy
    while True:

        # rendering
        if (not finished_rendering_this_epoch) and render:
            env.render()

        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # won't render again this epoch
            finished_rendering_this_epoch = True

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens
```

Resources:
Understanding Reasoning LLMs
- By Sebastian Raschka
Content:
- Explain the meaning of "reasoning model"
- Discuss the advantages and disadvantages of reasoning models
- Outline the methodology behind DeepSeek R1
- Describe the four main approaches to building and improving reasoning models
- Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases
- Provide tips for developing reasoning models on a tight budget
What is reasoning?
In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer.
When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks.
Resources:
A brief look at the DeepSeek training pipeline
The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques.
The 4 main ways to build and improve reasoning models
- Inference-time scaling
- Pure reinforcement learning (RL)
- Supervised finetuning and reinforcement learning (SFT + RL)
- Pure supervised finetuning (SFT) and distillation
Chain-of-thought
One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting, where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems.
For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward.
- The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses.
- The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside `<think>` tags.
Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below.
Resources:
Distillation LLM distillation is a technique that creates smaller, more efficient large language models (LLMs) by transferring knowledge from a large, high-performing "teacher" model to a smaller "student" model.
Why did they develop these distilled models? In my opinion, there are two key reasons:
- Smaller models are more efficient. This means they are cheaper to run, but they can also run on lower-end hardware, which makes these especially interesting for many researchers and tinkerers like me.
- A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning.
What is a Mixture of Experts (MoE)?
Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
What exactly is a MoE?
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router that determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network (a toy gating sketch follows below).
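A toy top-k gating layer, just to make the routing idea concrete (dimensions, expert count, and the looping style are my own simplifications, not how production MoE kernels are written):

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Sparse MoE layer: a learned router sends each token to its top-k expert FFNs."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)   # gate: trained with the rest of the network
        self.top_k = top_k

    def forward(self, x):                             # x: (n_tokens, d_model)
        gate_probs = self.router(x).softmax(dim=-1)
        weights, expert_idx = gate_probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

print(TinyMoELayer()(torch.randn(5, 64)).shape)        # torch.Size([5, 64])
```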
Resources:
Humanoid
- Exploring the Humanoid environment in Gymnasium.
- Humanoid is part of the MuJoCo environments.
What is MuJoCo?
MuJoCo is a free and open source physics engine that aims to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. Source
The 3D bipedal robot is designed to simulate a human.
Action Space & Observation Space
Why not DQN?
- Stable Baselines3 DQN algorithm only supports discrete action spaces.
- Humanoid-v5 environment from Gymnasium has a continuous action space.
- DQN requires a finite set of discrete actions to compute Q-values for each.
- Therefore, DQN cannot be used for this environment.
I used Soft Actor-Critic (SAC), and it worked.
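Roughly what I ran (a sketch: the timestep budget here is illustrative, not the value I actually used):

```python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Humanoid-v5")                 # continuous action space (MuJoCo)
model = SAC("MlpPolicy", env, verbose=1)      # SAC handles continuous actions, unlike DQN
model.learn(total_timesteps=100_000)
model.save("sac_humanoid")
```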
Resources
Recap
What is Action Space?
Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like where the agent controls a robot in a physical world, have continuous action spaces. Source
What is Policy?
A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by μ, or stochastic, usually denoted by π. Source
What is on-policy?
On-policy methods are about learning from what you are currently doing. Imagine you're trying to teach a robot to navigate a maze. In on-policy learning, the robot learns based on the actions it is currently taking.
What is off-policy?
Off-policy methods, on the other hand, are like learning from someone else's experience. In this approach, the robot might watch another robot navigate the maze and learn from its actions.
Why Multi-agent Systems?
An agent is a system that uses an LLM to decide the control flow of an application. As you develop these systems, they might grow more complex over time, making them harder to manage and scale.
Some problems:
- the agent has too many tools at its disposal and makes poor decisions about which tool to call next
- the context grows too complex for a single agent to keep track of
- there is a need for multiple specialization areas in the system (e.g. planner, researcher, math expert, etc.)
To tackle these, you might consider breaking your application into multiple smaller, independent agents and composing them into a multi-agent system. Source
Understanding ReAct
Yao et al., 2022 introduced a framework named ReAct where LLMs are used to generate both reasoning traces and task-specific actions in an interleaved manner.
Generating reasoning traces allows the model to induce, track, and update action plans, and even handle exceptions. The action step allows it to interface with and gather information from external sources such as knowledge bases or environments.
The ReAct framework can allow LLMs to interact with external tools to retrieve additional information that leads to more reliable and factual responses. Source
What are Handoffs?
In multi-agent architectures, agents can be represented as graph nodes. Each agent node executes its step(s) and decides whether to finish execution or route to another agent, including potentially routing to itself (e.g., running in a loop). A common pattern in multi-agent interactions is handoffs, where one agent hands off control to another. Source
Two of the most popular multi-agent architectures are:
- supervisor — individual agents are coordinated by a central supervisor agent. The supervisor controls all communication flow and task delegation, making decisions about which agent to invoke based on the current context and task requirements.
- swarm — agents dynamically hand off control to one another based on their specializations. The system remembers which agent was last active, ensuring that on subsequent interactions, the conversation resumes with that agent.
I implemented both architectures in code, following the LangGraph documentation.
Following ReAct agent from scratch with Gemini 2.5 and LangGraph from Google AI for Developers
- In this tutorial, I will create a simple agent whose goal is to use a tool to find the current weather for a specified location.
- LangGraph is a framework for building stateful LLM applications, making it a good choice for constructing ReAct (Reasoning and Acting) agents.
LangGraph models agents as graphs using three key components (a minimal sketch follows the list):
- State: Shared data structure (typically TypedDict or Pydantic BaseModel) representing the application's current snapshot.
- Nodes: Encodes logic of your agents. They receive the current State as input, perform some computation or side-effect, and return an updated State, such as LLM calls or tool calls.
- Edges: Define the next Node to execute based on the current State, allowing for conditional logic and fixed transitions.
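A minimal sketch of those three components (the state fields and node logic are made up for illustration; the real tutorial uses an LLM call and a weather tool inside the node):

```python
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):      # State: shared snapshot of the application
    question: str
    answer: str

def answer_node(state: AgentState) -> dict:   # Node: takes the State, returns an update
    return {"answer": f"You asked: {state['question']}"}

builder = StateGraph(AgentState)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")             # Edges: fixed (or conditional) transitions
builder.add_edge("answer", END)
graph = builder.compile()

print(graph.invoke({"question": "What's the weather in Kathmandu?", "answer": ""}))
```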
Output:
What are Tools?
Many AI applications interact with users via natural language. However, some use cases require models to interface directly with external systems—such as APIs, databases, or file systems—using structured input. In these scenarios, tool calling enables models to generate requests that conform to a specified input schema.
Question:
- Multiply 23746278364 * 23648723678 ?
Actual Answer:
Grok response:
- Grok didn’t use any tool for the calculation, so its response was incorrect: an LLM on its own can’t handle such complex arithmetic.
Gemini response:
- Gemini used Python as a tool/calculator, so it could handle the complex calculation.
I realized the importance of tools in LLM after watching “How I use LLMs” video from Andrej Karpathy.
I wanted to build a similar project where tools are integrated with an LLM to handle situations where the LLM would otherwise hallucinate.
Input: what's 8282 x 99191?
Output
I was able to build my custom tool and integrate it with the LLM by following this documentation. Resource
How is the model able to choose tools?
Tool calling is typically conditional. Based on the user input and available tools, the model may choose to issue a tool call request. This request is returned in an AIMessage object, which includes a tool_calls field that specifies the tool name and input arguments:
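A hedged sketch of that pattern (the model name and the `multiply` tool are my own choices for illustration; only the general `bind_tools` / `tool_calls` flow comes from the docs):

```python
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers exactly."""
    return a * b

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm_with_tools = llm.bind_tools([multiply])

ai_msg = llm_with_tools.invoke("what's 8282 x 99191?")
print(ai_msg.tool_calls)   # e.g. [{'name': 'multiply', 'args': {'a': 8282, 'b': 99191}, ...}]

# the model only *requests* the call; executing it is up to us (or an agent runtime)
if ai_msg.tool_calls:
    print(multiply.invoke(ai_msg.tool_calls[0]["args"]))   # 821499862
```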
If the input is unrelated to any tool, the model returns only a natural language message:
Importantly, the model does not execute the tool—it only generates a request. A separate executor (such as a runtime or agent) is responsible for handling the tool call and returning the result.
Tool execution
While the model determines when to call a tool, execution of the tool call must be handled by a runtime component. LangGraph provides prebuilt components for this (a small sketch follows the list):
- ToolNode: A prebuilt node that executes tools.
- create_react_agent: Constructs a full agent that manages tool calling automatically.
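A small sketch of the prebuilt agent (the `get_weather` tool is a stand-in I made up; the model choice is just an example):

```python
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Return a (fake) weather report for a city."""
    return f"It is sunny in {city}."

# create_react_agent wires up the LLM, the tool-executing node, and the control flow
agent = create_react_agent(ChatGoogleGenerativeAI(model="gemini-2.0-flash"), tools=[get_weather])
result = agent.invoke({"messages": [{"role": "user", "content": "Weather in Pokhara?"}]})
print(result["messages"][-1].content)
```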
Dynamically select tools
Revising maths concepts (a NumPy sketch follows the definitions)
Scalar
- A scalar is a number. It is a quantity that has a magnitude but has no direction
Vector
- A vector is an array of scalars, or simply a number list.
Matrix
Identity matrix
- It is a square matrix whose diagonal elements are all 1, and other elements are 0.
Row vector
- A vector with only one row is called a row vector.
Column vector
- A vector with only one column is called a column vector.
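Quick NumPy illustrations of these definitions (values are arbitrary):

```python
import numpy as np

scalar = 3.5                                  # a single number: magnitude, no direction
vector = np.array([1.0, 2.0, 3.0])            # an array of scalars
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2D array of scalars
identity = np.eye(2)                          # diagonal elements are 1, the rest are 0
row_vector = np.array([[1, 2, 3]])            # shape (1, 3): a single row
column_vector = np.array([[1], [2], [3]])     # shape (3, 1): a single column

print(matrix @ identity)                      # multiplying by the identity leaves the matrix unchanged
```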
Designing a text classifier
Document and Label sample
Each line of the corpus is a tuple of a piece of text (we simply call it a document) and a label that indicates whether the text is about food or not. We call such tuples samples, or more precisely labeled samples.
Next, let us assume that we have a classifier that learns from those samples the way of labeling documents. The classifier is then used to label every new document as “Food” or “Not-Food”. For example, for the text “Fruit is not my favorite but I can enjoy it.”, the classifier would categorize it as “Food”.
Modern classifier
Modern classifiers are not a system comprising a set of hand-crafted rules. They instead model the classification problem in a probabilistic manner, making it possible to learn the ability of classification from large-scale labeled data.
Problems in designing text classifier
- The first problem we confront in designing text classification models is how to represent a document.
The bag-of-words (BOW) model
- The bag-of-words model is a feature-based model of representing documents.
Feature in ML
One can define a feature not only as some concrete attribute, such as a name and a gender, but also as a quantity that is countable for machine learning systems, such as a real number
In bag-of-words, feature = occurrence times
- The bag-of-words model defines a vector space. In this space, the similarity of two vectors is measured in some way, like the dot product. It helps when one wants to establish the relationship between documents: two documents with more overlapping words are more similar (see the sketch below).
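A small sketch of the idea (scikit-learn is my choice here; the same thing can be done by hand with word-count dictionaries):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Fruit is not my favorite but I can enjoy it",
    "I enjoy fresh fruit every morning",
    "The stock market fell sharply today",
]
X = CountVectorizer().fit_transform(docs)    # each row = bag-of-words counts for one document

# documents that share more words have a larger dot product
print((X[0] @ X[1].T).toarray()[0, 0])       # food vs food: overlapping words
print((X[0] @ X[2].T).toarray()[0, 0])       # food vs stock market: no overlap, 0
```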
Sources:
Linear Classifiers
Naive Bayes
Great explanation on Naive Bayes
Naive Bayes Spam Detection Project
A machine learning project that implements a Naive Bayes classifier to detect spam messages in SMS text data. This project uses scikit-learn's MultinomialNB algorithm combined with feature engineering to achieve effective spam classification.
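A condensed version of the pipeline (toy messages made up here; the real project uses the SMS dataset plus extra feature engineering):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "WINNER!! Claim your free prize now",
    "Are we still meeting for lunch today?",
    "URGENT: your account has won a cash reward",
    "Can you send me the class notes?",
]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["Free cash prize waiting for you"]))   # likely ['spam']
```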
Output
Project Github Link
What is an estimator in ML?
An algorithm or function that takes input data and provides an estimate of an unknown parameter or function, often a target variable, to make predictions or train a model.
Two major types of models in NLP pre-training:
- Sequence Encoding Models
- Sequence Generation Models
Single-layer perceptrons
Single-layer perceptrons (or perceptrons for short) may be the simplest neural networks that have been developed for practical uses. Often, a perceptron is thought of as a biologically-inspired program that transforms some input to some output. A perceptron comprises a number of neurons connecting with input and output variables. The figure below shows a perceptron where there is only one neuron. In this example, there are two real-valued variables x1 and x2 for input and a binary variable y for output.
Activation function
There are many different ways to perform activation. For example, we can use the Softmax function if we want a probability distribution-like output; we can use the Sigmoid function if we want a monotonic, continuous, easy-to-optimize output; we can use the ReLU function if we want a ramp-shaped output.
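The three activations mentioned above, side by side (illustrative input values):

```python
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

print(torch.softmax(z, dim=0))   # probability-distribution-like output (sums to 1)
print(torch.sigmoid(z))          # monotonic, squashes each value into (0, 1)
print(torch.relu(z))             # ramp-shaped: max(0, z)
```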
Sources
Neural Network
- In this multi-layer neural network, the output of every neuron of a layer is connected to all neurons of the following layer. So the network is fully connected.
- Depth of a neural network is measured in terms of the number of layers. It is called model depth sometimes.
- A common measure for the width of a layer is the number of neurons in the layer.
- Stacking layers results in a very common kind of neural network — feed-forward neural networks (FFNNs).
- These networks are called “feed-forward” because there are no cycles in connections between layers and all the data moves in one direction.
- A neural network can fit a squiggle to the data no matter how complex the data is.
- A neural network consists of nodes and connections between the nodes.
- The numbers along each connection represent parameter values that were estimated when the neural network was fit to the data.
Activation function
From the diagram above, I understood that activation functions are just curves of various shapes used to fit the data; the curved or bent lines are called activation functions.
The layers of Nodes between input and output nodes are called hidden layers.
How is the green squiggle (the curve that fits the data) drawn?
Finally, the green squiggle is made.
Parameters that we multiply by are called weights.
Parameters that we add are called biases.
Activation function
NN can fit a green squiggle to just about any dataset, no matter how complicated.
Sources:
If someone tells us their weight, we can predict their height by referring to the line.
Chain rule in maths
Gradient Descent
How Gradient descent can fit a line to data by finding the optimal values for the Intercept and the Slope.
Some steps in gradient descent:
- Fix the slope at 0.64.
- Pick a random initial value for the intercept.
- Use gradient descent to find the optimal value for the intercept.
Sum of the Squared Residuals & Loss Function (a residual is something that remains or is left over)
Sources:
Residual = Observed height - Predicted height (using a weight value read off the diagram)
Sum of square residual is one type of loss function
Gradient descent Algorithm
When we have millions of data points, it can take a long time.
So there is a thing called Stochastic Gradient Descent that uses a randomly selected subset of the data at every step rather than the full dataset. This reduces the time spent calculating the derivatives of Loss Function.
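A sketch of the procedure with the slope fixed at 0.64 (the data points and learning rate are made up for illustration):

```python
import numpy as np

weights = np.array([0.5, 2.3, 2.9])        # x values (illustrative)
heights = np.array([1.4, 1.9, 3.2])        # observed y values (illustrative)

slope = 0.64
intercept = 0.0                            # start from an initial guess
learning_rate = 0.1

for step in range(100):
    predicted = intercept + slope * weights
    residuals = heights - predicted
    gradient = np.sum(-2 * residuals)      # d(SSR)/d(intercept) = sum(-2 * residuals)
    intercept -= learning_rate * gradient  # step size = gradient * learning rate

print(round(intercept, 3))                 # settles near the least-squares intercept
```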
Source:
Backpropagation starts from the last parameter (here the bias b3), assuming the other parameters already have their optimal values.
We get the green squiggle with the help of 4 curves. The 4 curves are made by setting the values of the x and y coordinates. The maths is given below.
We give b3 = 0 as an initial value.
We update the bias to b3 = 1 and get SSR = 7.8, which is better than b3 = 0 (SSR = 20.4).
By continuing to update b3, we can find the value with the lowest SSR, which is of course closer to 0.
b3 = 0, SSR = 20.4
b3 = 1, SSR = 7.8
...
Summation notation
Green is the sum of the blue and orange curves.
green squiggle = blue + orange + b3
Derivative of SSR with respect to b3
We take the derivative of both parts using the chain rule.
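Written out, with the green squiggle as the sum of the curves plus b3:

$$\text{SSR} = \sum_i \big(\text{observed}_i - \text{predicted}_i\big)^2, \qquad \text{predicted}_i = \text{blue}_i + \text{orange}_i + b_3$$

$$\frac{d\,\text{SSR}}{d\,b_3} = \sum_i 2\big(\text{observed}_i - \text{predicted}_i\big) \cdot \frac{d}{d b_3}\big(\text{observed}_i - \text{predicted}_i\big) = \sum_i -2\big(\text{observed}_i - \text{predicted}_i\big)$$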
Source:
Understanding cosine similarity
Finding the cosine similarity of “Hello” and “Hello World”: we need to find the angle between the vectors for these two phrases. Cosine similarity = cos(θ).
- Note: cosine similarity is determined only by the angle, not the length.
Mathematical Formula for cosine similarity
Calculating cosine similarity
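A small sketch from the formula above, with made-up count vectors for the two phrases:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

hello = np.array([1, 0])            # counts of ("hello", "world") in "Hello"
hello_world = np.array([1, 1])      # counts in "Hello World"

print(cosine_similarity(hello, hello_world))        # ~0.707
print(cosine_similarity(hello, 10 * hello_world))   # same value: only the angle matters
```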
Source:
The vector difference between “man with hat” and “man without hat” is very close to the vector for “hat”.
With hat - Without hat = x
Closest matches to x: Hat = 0.165, Cap = 0.113
Adding noise DDPM
Without adding noise
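For reference, the standard DDPM forward (noise-adding) step from the paper:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.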
The DDPM equation can be expressed as a stochastic differential equation (SDE).
Blue line (first expression) = motion of our point along the vector field; grey line (second expression) = random motion.
DDIM needs fewer steps (less compute) to generate a high-quality image.
Source:
Gemini Robotics-ER 1.5 is a vision-language model (VLM)
It's designed for advanced reasoning in the physical world, allowing robots to interpret complex visual data and perform spatial reasoning.
Getting started: Finding objects in a scene
It shows how to pass an image and a text prompt to the model using the generateContent method to get a list of identified objects with their corresponding 2D points.
Code:
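A hedged sketch of the call (the model id, image path, and prompt wording are assumptions based on the tutorial, not copied verbatim):

```python
from google import genai
from PIL import Image

client = genai.Client()                      # reads the API key from the environment
image = Image.open("scene.png")              # hypothetical input image

prompt = (
    "Point to no more than 10 items in the image. "
    'Return JSON: [{"point": [y, x], "label": <name>}] with coordinates normalized to 0-1000.'
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model id
    contents=[image, prompt],
)
print(response.text)                         # list of identified objects with 2D points
```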
Input image
Final Output
Source:
Pointing to objects
Pointing and finding objects in images or video frames is a common use case for vision-and-language models (VLMs) in robotics. The following example asks the model to find specific objects within an image and return their coordinates in an image.
Code:
Output: the coordinates of the scissors, pen, and envelope objects in the image.
Trajectories
- Gemini Robotics-ER 1.5 can generate sequences of points that define a trajectory, useful for guiding robot movement.
- This example requests a trajectory to move a red pen to an organizer, including the starting point and a series of intermediate points.
Orchestration
Making room for a laptop
This example shows how Gemini Robotics-ER can reason about a space. The prompt asks the model to identify which object needs to be moved to create space for another item.
Source:
The CLIP model only goes one direction (from image to vector embedding): we can get embedding vectors from images and text, but we can't generate images or text from the embedding vectors.
Understanding the ReLU activation function
Source:
Multiple Inputs and Outputs
Multiply by the weights and add the biases; the value is then passed to ReLU, and we get 1.6 on the y-axis.
When petal width = (0, ..., 1) and sepal width = 0,
When petal width = (0, ..., 1) and sepal width = 0.2,
When petal width = (0, ..., 1) and sepal width = (0, ...,1)
When multiplied by (-0.1), all points drop toward the surface level.
We do the same for the orange dots and get the surface for orange.
We sum the orange and blue values and get the final value, the green dot.
We do this for every single point and get the green surface.
We get the final output for Setosa
When petal width is close to 1 (the widest), then we will get a high score for virginica.
Now, with petal and sepal width, we can predict the type of flower.
Source:
CNN
Take the dot product, add the bias, and add the result to the feature map.
Slide over by 1 pixel (other CNNs might move 2 or more pixels at a time) and fill up the feature map by moving 1 pixel at a time.
Apply the ReLU activation, which sets all negative numbers to 0: f(x) = max(0, x).
New feature map with max pooling (take the maximum value in each area).
The new feature map is flattened into 4 inputs and passed into a neural network with 2 outputs.
Multiply by the weights and add the bias; we get 0.34 on the y-axis from the ReLU activation function.
Multiply by the weight and add the bias; the final output is 0.99, which is essentially 1.
When the picture is of the letter 'O', the CNN predicts the letter 'O' by giving it a 1.
Filter or kernel (how a filter is made)
We also get the final output for the 'X' image.
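A compact PyTorch version of the pipeline described above, convolution -> ReLU -> max pooling -> flatten -> fully connected (sizes are made up):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=1)   # one 3x3 filter, sliding 1 pixel
        self.relu = nn.ReLU()                                   # negatives become 0
        self.pool = nn.MaxPool2d(2)                             # keep the max value of each area
        self.fc = nn.LazyLinear(n_classes)                      # flattened map -> 2 outputs

    def forward(self, x):
        feature_map = self.pool(self.relu(self.conv(x)))
        return self.fc(feature_map.flatten(start_dim=1))

image = torch.randn(1, 1, 6, 6)     # a tiny fake 'O'/'X'-style image
print(TinyCNN()(image))             # two scores, one per class
```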
Source:
What is continuous-control RL?
An agent's ability to produce actions that are not limited to a finite set of choices but instead can be any value within a range, such as motor torques or steering angles.
Unlike discrete actions (e.g., "left" or "right"), continuous actions require an agent to make fine-grained adjustments to achieve a goal in a dynamic environment.
Algorithms like Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC) are used to learn these continuous control policies by optimizing continuous outputs based on rewards received from the environment.
Markov Decision Process (MDP)
The Markov property implies that our agent needs only the current state to decide what action to take, and not the history of all the states and actions it took before.
Action Space types
Discrete space: the number of possible actions is finite.
- In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).
Continuous space: the number of possible actions is infinite.
- A self-driving car agent has an infinite number of possible actions, since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…
The Policy π: the agent’s brain
The Policy π is the brain of our Agent, it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.
There are two approaches to train our agent to find this optimal policy π*:
- Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
- Indirectly, teach the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.
Trained model and pushed in huggingface
Source:
Today I learned to:
- Push my trained model in huggingface.
- Use other trained model from huggingface using repo id.
Source:
Action-value function
It determines the value of being at a particular state and taking a specific action at that state.
A small recap of Q-Learning
Q-Learning is the RL algorithm that:
- Trains a Q-function, an action-value function that is encoded, in internal memory, by a Q-table containing all the state-action pair values.
- Given a state and action, our Q-function will search the Q-table for the corresponding value.
- When the training is done, we have an optimal Q-function, and therefore an optimal Q-table.
- And if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.
I read the article on Meta Superintelligence's surprising first paper and an overview of the paper from MSI.
Train Q-Learning agent
We can have two sizes of environment:
- map_name="4x4": a 4x4 grid version
- map_name="8x8": an 8x8 grid version
The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
- 0: GO LEFT
- 1: GO DOWN
- 2: GO RIGHT
- 3: GO UP
Reward function 💰:
- Reach goal: +1
- Reach hole: 0
- Reach frozen: 0
Define Greedy Policy
Remember we have two policies since Q-Learning is an off-policy algorithm. This means we're using a different policy for acting and updating the value function.
- Epsilon-greedy policy (acting policy)
- Greedy policy (updating policy)
The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.
Define the epsilon-greedy policy
Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off. The idea with epsilon-greedy:
- With probability 1 - ε: we do exploitation (i.e. our agent selects the action with the highest state-action pair value).
- With probability ε: we do exploration (trying a random action).
As training continues, we progressively reduce the epsilon value, since we will need less and less exploration and more exploitation (a minimal sketch of both policies follows below).
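A minimal sketch of the two policies, following the course's structure (a NumPy Q-table is assumed):

```python
import numpy as np

def greedy_policy(q_table, state):
    # exploitation: take the action with the highest state-action value
    return int(np.argmax(q_table[state]))

def epsilon_greedy_policy(q_table, state, epsilon, n_actions, rng=None):
    # with probability epsilon explore (random action), otherwise exploit
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_policy(q_table, state)
```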
Define the hyperparameters
The exploration-related hyperparameters are some of the most important ones.
- We need to make sure that our agent explores enough of the state space to learn a good value approximation. To do that, we need a progressive decay of epsilon.
- If you decrease epsilon too fast (too high a decay_rate), you risk your agent getting stuck, since it didn't explore enough of the state space and hence can't solve the problem.
- Agent training code: this implementation is based on the Hugging Face Deep Reinforcement Learning Course (a condensed sketch follows).
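A condensed sketch of that training loop (it reuses the `epsilon_greedy_policy` sketch above; the hyperparameter values are illustrative, not the course's exact ones):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

lr, gamma = 0.7, 0.95
max_eps, min_eps, decay_rate = 1.0, 0.05, 0.0005

for episode in range(10_000):
    epsilon = min_eps + (max_eps - min_eps) * np.exp(-decay_rate * episode)
    state, _ = env.reset()
    for _ in range(99):
        action = epsilon_greedy_policy(q_table, state, epsilon, env.action_space.n)
        new_state, reward, terminated, truncated, _ = env.step(action)
        # Q-learning update uses the greedy estimate of the next state's value
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if terminated or truncated:
            break
```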
Training Taxi-v3 environment
Before training Q-table
After training Q-table
Learned to Load Model from Hub
This is the architecture of our Deep Q-Learning network:
Why do we stack four frames together? A single frame gives no sense of motion, so we stack four consecutive frames to give the network temporal information about the direction and speed of moving objects.
Hands-on
Train Deep Q-Learning agent to play Atari Games
Adjusting hyperparameters to train our Deep Q-Learning agent to play Space Invaders.
Training a model
Evaluate agent
I have trained and uploaded my Deep Q-Learning agent using RL-Baselines-3 Zoo.
We’ll first try to optimize the parameters of the DQN studied in the last unit manually. We’ll then learn how to automate the search using Optuna.
Which algorithm should I use?
The first distinction comes from your action space: do you have discrete actions (e.g. LEFT, RIGHT, ...) or continuous actions (e.g. go at a certain speed)? The second consideration is whether you can parallelize your training or not.
Notes:
```python
# Define and train an A2C model
# verbose: 0 for no output, 1 for info messages, 2 for debug messages
# seed: fixes the random seed so the run is reproducible
a2c_model = A2C("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

# Evaluate the trained A2C model on eval_envs (n_envs parallel copies of the environment)
mean_reward, std_reward = evaluate_policy(a2c_model, eval_envs, n_eval_episodes=100, deterministic=True)
print(f"A2C Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

What is on-policy?
On-policy methods are about learning from what you are currently doing. Imagine you're trying to teach a robot to navigate a maze. In on-policy learning, the robot learns based on the actions it is currently taking. It's like learning to cook by trying out different recipes yourself.
What is off-policy?
Off-policy methods, on the other hand, are like learning from someone else's experience. In this approach, the robot might watch another robot navigate the maze and learn from its actions. It doesn't have to follow the same policy as the robot it's observing. It involves learning the value of the optimal policy independently of the agent's actions. These methods enable the agent to learn from observations about the optimal policy, even when it's not following it. This is useful for learning from a fixed dataset or a teaching policy.
In some cases training longer is not a solution
Training for “4000 steps (20 episodes)”
Training Longer PPO
Tuned Hyperparameters
Reward Increased drastically
Result:
Not tuned: -1158; after tuning: -159
We will create a script that allows us to search for the best hyperparameters automatically (a minimal Optuna sketch follows).
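A minimal sketch of what such a script can look like with Optuna (the objective, environment, and search ranges here are illustrative; the course notebook's version is more complete):

```python
import optuna
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    gamma = trial.suggest_float("gamma", 0.9, 0.9999, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)

    env = gym.make("Pendulum-v1")
    model = A2C("MlpPolicy", env, gamma=gamma, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=20_000)

    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```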
Hyperparameters
Policy Gradient with Pytorch
With policy-based methods, we want to optimize the policy directly without having an intermediate step of learning a value function. We’ll learn about policy-based methods and study a subset of these methods called policy gradient.
We’ll implement our first policy gradient algorithm called Monte Carlo Reinforce from scratch using PyTorch.
What are the policy-based methods?
The main goal of reinforcement learning is to find the optimal policy π* that will maximize the expected cumulative reward.
A stochastic policy in reinforcement learning (RL) dictates the probability of taking each action in a given state, rather than a single, predetermined action.
Value-based methods
- The idea is that an optimal value function leads to an optimal policy π*.
- Our objective is to minimize the loss between the predicted and target value to approximate the true action-value function.
Policy-based methods
- The idea is to parameterize the policy.
- Our objective then is to maximize the performance of the parameterized policy using gradient ascent.
Source
Reinforce Algorithm
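For reference, the Monte Carlo (REINFORCE) gradient estimate and the gradient-ascent update:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau), \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$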
What is Unity ML-Agents?
Unity ML-Agents is a toolkit for the game engine Unity that allows us to create environments using Unity or use pre-made environments to train our agents.
With Unity ML-Agents, you have six essential components:
- The first is the Learning Environment, which contains the Unity scene (the environment) and the environment elements (game characters).
- The second is the Python Low-level API, which contains the low-level Python interface for interacting and manipulating the environment. It’s the API we use to launch the training.
- Then, we have the External Communicator that connects the Learning Environment (made with C#) with the low level Python API (Python).
- The Python trainers: the Reinforcement algorithms made with PyTorch (PPO, SAC…).
- The Gym wrapper: to encapsulate the RL environment in a gym wrapper.
- The PettingZoo wrapper: PettingZoo is the multi-agents version of the gym wrapper.
The observation space
- Regarding observations, we don’t use normal vision (frame), but we use raycasts.
- Think of raycasts as lasers that will detect if they pass through an object.
Actor-Critic is a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance using:
- An Actor that controls how our agent behaves (Policy-Based method)
- A Critic that measures how good the taken action is (Value-Based method)
To understand the Actor-Critic, imagine you’re playing a video game. You can play with a friend that will provide you with some feedback. You’re the Actor and your friend is the Critic.
This is the idea behind Actor-Critic. We learn two function approximations:
- A policy that controls how our agent acts: π_θ(s)
- A value function to assist the policy update by measuring how good the taken action is: q̂_w(s, a)
The task is to control a robotic arm (moving the arm and using the end-effector) so that it reaches a target position.
- Source code
- Train A2C agent using Stable-Baselines3 in a robotic environment to move arm to the correct position.
- Huggingface A2C Agent
Multi-Agent System
Since the beginning of this course, we learned to train agents in a single-agent system where our agent was alone in its environment: it was not cooperating or collaborating with other agents.
But, as humans, we live in a multi-agent world. Our intelligence comes from interaction with other agents. And so, our goal is to create agents that can interact with other humans and other agents.
- Multi-agent example: a football match
We have two solutions to design this multi-agent reinforcement learning system (MARL).
Decentralized approach
The idea here is that our training agent will consider other agents as part of the environment dynamics, not as agents.
The benefit is that, since no information is shared between agents, each one (e.g. the robot vacuums in the course example) can be designed and trained just like we train a single agent.
Centralized approach
The intuition behind PPO
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: we want to avoid having too large of a policy update.
With PPO, the idea is to constrain our policy update with a new objective function called the Clipped surrogate objective function that will constrain the policy change in a small range using a clip.
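For reference, the clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$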
Implement PPO from scratch
Developing Research Project
- Start by exploring the literature to become aware of topics in the field.
- If you’re looking for inspiration, or just want to get a rough sense of what’s out there, check out Spinning Up’s key papers list.
- Find a paper that you enjoy on one of these subjects—something that inspires you—and read it thoroughly.
- Use the related work section and citations to find closely-related papers and do a deep dive in the literature.
- You’ll start to figure out where the unsolved problems are and where you can make an impact.
Approaches to idea-generation
Frame 1: Improving on an Existing Approach
- This is the incrementalist angle, where you try to get performance gains in an established problem setting by tweaking an existing algorithm.
Frame 2: Focusing on Unsolved Benchmarks
- Instead of thinking about how to improve an existing method, you aim to succeed on a task that no one has solved before.
Frame 3: Create a New Problem Setting
- Instead of thinking about existing methods or current grand challenges, think of an entirely different conceptual problem that hasn’t been studied yet.
- Avoid reinventing the wheel.
Model-based Vs Model-free RL
Model-based RL:
- The agent first builds an internal model of the environment, which predicts future states and rewards. It then uses this model to plan and simulate actions before acting in the real world.
Model-free RL:
- The agent skips the model-building step and learns directly from interacting with the environment. This can be simpler for environments where building an accurate model is difficult.
Source:
Most of these deep RL methods primarily focus on learning different tasks in isolation, making it challenging to utilize shared information between tasks to develop a generalized policy.
Multi-task reinforcement learning (MTRL) aims to master a set of RL tasks effectively. By leveraging the potential information sharing among different tasks, joint multi-task learning typically exhibits higher sample efficiency than training each task individually.
Challenge in MTRL
A significant challenge in MTRL lies in determining what information should be shared and how to share it effectively.
For instance, someone who can ride a bicycle can quickly learn to ride a motorcycle by referring to related skills, such as operating controls, maintaining balance, and executing turns. Likewise, a motorcyclist adept in these skills can also quickly learn to ride a bicycle. This ability allows humans to efficiently master multiple tasks by selectively referring to skills previously learned.
Cross-Task Policy Guidance (CTPG)
CTPG is a generalized MTRL framework that can be combined with various existing parameter sharing methods. Among these, we choose several classical approaches and integrate them with CTPG, achieving significant improvement in sample efficiency and final performance on both manipulation and locomotion MTRL benchmarks.
Source:
Supervised Fine-tuning(SFT) & Reinforcement Learning from Human Feedback(RLHF)
Aligning means making our model respond politely and helpfully.
Note: after pretraining, the model can predict the next word, but it is not yet aligned.
Supervised fine-tuning = using user prompts and desired responses to train the model.
Supervised fine-tuning allows the model to generate polite and helpful responses, but only for prompts similar to those it was trained on. Note: the model cannot reliably respond well to new prompts.
How do we train the model to respond well to new prompts?
Answer: a super-huge fine-tuning dataset. Note: collecting that data and training the model on such a huge dataset costs a huge amount of money.
Alternative: RLHF
Model generates multiple responses and human selects the best response.
- Train a reward model with a pairwise loss function (shown below).
- After the reward model is trained, the supervised fine-tuned model is trained further with reinforcement learning, using the reward model to score the responses it generates.
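The pairwise reward-model loss (as in the InstructGPT paper listed above), where y_w is the preferred response and y_l the rejected one:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$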
Source:
Stanford AI Club: Jason Wei on 3 Key Ideas in AI in 2025
Note: The impact of AI will be seen earliest on tasks that are digital, easy for humans, and data abundant. Implications:
- Certain fields will be heavily accelerated by AI (e.g., software development)
- Other fields will remain largely untouched (e.g., hairdressing)
AlphaEvolve
Understanding AlphaEvolve
Source:
Statistical Machine Learning
As intuitive as it sounds from its name, statistical machine learning involves using statistical techniques to develop models that can learn from data and make predictions or decisions.
The principles of statistics are the very pillars that uphold the structure of machine learning.
- Constructing machine learning models: Statistics provides the methodologies and principles for creating models in machine learning. For instance, the linear regression model leverages the statistical method of least squares to estimate the coefficients.
- Interpreting results: Statistical concepts allow us to interpret the results generated by machine learning models. Measures such as p-values, confidence intervals, R-squared, and others provide us with a statistical perspective on the machine learning model's performance.
- Validating models: Statistical techniques are essential for validating and refining machine learning models. For instance, techniques like hypothesis testing, cross-validation, and bootstrapping help us quantify the performance of models and avoid problems like overfitting.
- Underpinning advanced techniques: Even some of the more complex machine learning algorithms, such as neural networks, have statistical principles at their core. The optimization techniques, like gradient descent, used to train these models are based on statistical theory.
Source:
Trying to understand MimicKit
This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures.
DeepMimic
We show that well-known reinforcement learning (RL) methods can be adapted to learn robust control policies capable of imitating a broad range of example motion clips, while also learning complex recoveries, adapting to changes in morphology, and accomplishing user-specified goals.
Our method handles keyframed motions, highly-dynamic actions such as motion-captured flips and spins, and retargeted motions. By combining a motion-imitation objective with a task objective, we can train characters that react intelligently in interactive settings, e.g., by walking in a desired direction or throwing a ball at a user-specified target.