diff --git a/.gitignore b/.gitignore index 45c8c19..e6bf6ce 100644 --- a/.gitignore +++ b/.gitignore @@ -11,7 +11,9 @@ test-trainer bert_dataset trained_model results +results_qlora output +qlora_final_model src tmp diff --git a/README.md b/README.md index 9091e98..5670640 100644 --- a/README.md +++ b/README.md @@ -16,4 +16,4 @@ A sandbox environment to experiment with large language models and various NLP t - **Fine-tuning Series**: Develop a series of proof-of-concept (POC) repositories for fine-tuning LLMs, hosted on GitHub (e.g.,[ft2](https://github.com/locchh/ft2), [ft3](https://github.com/locchh/ft3), etc.). 🔥 - **LLM Blog**: Write and publish insightful blog posts on LinkedIn, sharing perspectives and research related to LLMs (e.g.,[Artificial Intelligence For Programming](https://www.linkedin.com/posts/chuong-loc_artificialintelligence-softwaredevelopment-activity-7278039943480856576-2o3E?utm_source=share), [The Limitation of Context](https://www.linkedin.com/posts/chuong-loc_%F0%9D%90%93%F0%9D%90%A1%F0%9D%90%9E-%F0%9D%90%8B%F0%9D%90%A2%F0%9D%90%A6%F0%9D%90%A2%F0%9D%90%AD%F0%9D%90%AC-%F0%9D%90%A8%F0%9D%90%9F-%F0%9D%90%86%F0%9D%90%A2%F0%9D%90%AF%F0%9D%90%A2%F0%9D%90%A7%F0%9D%90%A0-%F0%9D%90%82%F0%9D%90%A8%F0%9D%90%A7%F0%9D%90%AD%F0%9D%90%9E%F0%9D%90%B1%F0%9D%90%AD-activity-7274037081000034304-tTY4?utm_source=share), etc.). 📝🌐 - **POC Repositories**: Create and share repositories showcasing new technologies and quick demonstrations for easy exploration (e.g.,[lr4lg](https://github.com/locchh/lr4lg), [debugger](https://github.com/locchh/debugger), etc.). 📂 -- **Build Continuously Growing Repositories**: Develop and maintain long-term, evolving repositories for ongoing projects (e.g., [Synthflow](https://github.com/locchh/synthflow), [Antflow](https://github.com/locchh/antflow), [Antelligence](https://github.com/locchh/antelligence), etc.). 
🔄 +- **Build Continuously Growing Repositories**: Develop and maintain long-term, evolving repositories for ongoing projects (e.g., [Synthflow](https://github.com/locchh/synthflow), [Antflow](https://github.com/locchh/antflow), [Antelligence](https://github.com/locchh/antelligence), etc.). 🔄 \ No newline at end of file diff --git a/docs/render.html b/docs/generative_AI _engineering_and_fine-tuning_transformers.html similarity index 100% rename from docs/render.html rename to docs/generative_AI _engineering_and_fine-tuning_transformers.html diff --git a/docs/instruction_tuning_best_practices.md b/docs/instruction_tuning_best_practices.md new file mode 100644 index 0000000..b5fd7a5 --- /dev/null +++ b/docs/instruction_tuning_best_practices.md @@ -0,0 +1,43 @@ + +# Best Practices for Instruction-Tuning Large Language Models + +Instruction-tuning is a powerful fine-tuning approach that adapts large language models (LLMs) to follow specific instructions more effectively, enhancing their usefulness in practical applications. Below, we outline best practices for optimizing instruction-tuning in LLMs. + +## Data Selection for Instruction-Tuning + +High-quality data is crucial for effective instruction-tuning. The selected data should reflect diverse instructions and responses to help the model generalize and respond accurately across varied scenarios. + +- **Diverse Dataset Collection**: Use datasets that cover a wide range of topics, contexts, and instructions. Including different prompt types and response styles helps the model handle a broader set of instructions. +- **Balance of Specialized and General Data**: While it's beneficial to include domain-specific instructions, balancing this with general data improves versatility, allowing the model to perform well across various domains. + +## Optimize Prompt Engineering + +Effective prompt engineering enables the model to understand and respond appropriately to different instructions. 
+ +- **Contextual Prompt Design**: Design prompts that reflect real-world use cases and specific contexts the model might encounter. For instance, instructions could vary in formality, complexity, or specificity, helping the model adapt to different audiences. +- **Testing Prompt Variability**: Experiment with different prompts to assess how well the model generalizes to unseen instructions. This helps ensure that the model doesn't overly rely on specific patterns or structures. + +## Measure Response Consistency + +Consistency in response quality is key to creating a reliable model. + +- **Evaluate Accuracy and Consistency**: Regularly test the model with similar instructions to measure consistency. Consistent and accurate responses to repeated instructions indicate a well-tuned model. +- **Monitor Task-Specific Performance**: If the model is tuned for a specialized application, evaluate its performance across task-specific scenarios to ensure consistency within that context. + +## Limit Overfitting on Instruction Style + +Overfitting on specific instruction styles or tones can reduce the model’s adaptability. + +- **Style Variety in Instructions**: Include a variety of tones and structures in the instruction dataset to avoid making the model too reliant on specific formats. +- **Balance Precision and Flexibility**: Fine-tune the model to be precise in its responses without limiting its ability to adapt to different instruction types. This balance helps create a model that is accurate yet flexible in understanding various instructions. + +## Implement Regular Evaluation Metrics + +Regular evaluation of the fine-tuned model ensures it meets the desired quality standards. + +- **Use Metrics for Instruction Adherence**: Implement metrics that evaluate how closely the model's responses align with provided instructions. 
+- **Human Review and Quality Checks**: Regular human review of model responses provides insights that are difficult to capture with automated metrics, adding another layer of evaluation for adherence and appropriateness. + +## Conclusion + +Following these best practices for instruction-tuning can significantly enhance an LLM's performance, enabling it to respond more accurately and flexibly to a wide array of instructions. By focusing on quality data, diverse prompt engineering, and regular evaluation, you can create an instruction-tuned model that is both effective and reliable in real-world applications. diff --git a/docs/peft.md b/docs/peft.md new file mode 100644 index 0000000..c64ae06 --- /dev/null +++ b/docs/peft.md @@ -0,0 +1,136 @@ + +### **LoRA (Low-Rank Adaptation):** + +- **Core Idea:** + LoRA introduces trainable low-rank matrices into the model, targeting specific weight matrices (e.g., attention layers). Instead of fine-tuning the full model, LoRA adds a low-rank decomposition to certain layers and fine-tunes only these low-rank matrices. + +- **Implementation:** + - Adds two small low-rank matrices \( A \) and \( B \) to the original weights \( W \) of the model, where \( \Delta W = AB^T \). + - The original pre-trained weights \( W \) remain frozen during training, and only \( A \) and \( B \) are updated. + +- **Advantages:** + - Memory efficient: Fewer parameters are updated. + - Doesn't require modifying the model architecture significantly. + - Easy to integrate into transformer-based models. + +- **Use Case:** + - Primarily used for fine-tuning language models (e.g., LLaMA, GPT). + - Common in NLP and generative tasks. + +--- + +### **Adapters:** + +- **Core Idea:** + Adapters are small trainable modules inserted into the layers of a pre-trained model. These modules are trained while keeping the original model weights frozen. 
+ +- **Implementation:** + - Adapters are typically lightweight neural network modules (e.g., feedforward layers) inserted between layers of a model, such as attention or feedforward blocks. + - During training, only the parameters of the adapters are updated, while the rest of the model remains fixed. + +- **Advantages:** + - Modular: Different tasks can have separate adapters without modifying the original model. + - Memory efficient: Reduces the need to fine-tune the entire model. + - Easy task-switching by replacing adapters. + +- **Use Case:** + - Popular in multi-task learning and scenarios requiring task-specific fine-tuning. + - Useful in both NLP and multimodal applications. + +--- + +### **Key Differences:** + +| Feature | LoRA | Adapters | +|-----------------------|-------------------------------|--------------------------------| +| **Architecture** | Adds low-rank matrices to weight updates. | Inserts small modules between layers. | +| **Frozen Parameters** | Original model weights are frozen. | Original model weights are frozen. | +| **Parameter Updates** | Updates low-rank matrices \( A, B \). | Updates adapter parameters only. | +| **Overhead** | Minimal: modifies specific weight matrices. | Moderate: introduces new modules into the model. | +| **Use Cases** | NLP fine-tuning, generative tasks. | Multi-task learning, task-specific adaptation. | + +--- + +Both methods are highly efficient and suitable for scenarios where training large models directly is infeasible. However, the choice depends on your use case, such as the number of tasks, modularity requirements, and computational constraints. + + +### **1. LoRA for Fine-Tuning** + +#### **Advantages:** + +1. **Efficiency:** + - LoRA modifies only a small subset of trainable parameters (the low-rank matrices), making it highly parameter-efficient. +2. **Minimal Overhead:** + - No additional modules are inserted into the architecture; only the weight matrices of certain layers are modified. +3. 
**Better for Single Tasks:** + - Works well for scenarios where you want to fine-tune on a single or specific domain/task without modularity concerns. +4. **Speed:** + - Inference can match the speed of the base model, because the low-rank update \( AB^T \) can be merged into \( W \) after training; adapters add extra modules to every forward pass. + +#### **Use Cases:** +- Single-task fine-tuning. +- Resource-constrained environments (e.g., limited GPU memory). +- Generative tasks such as text generation or summarization. + +#### **Limitations:** +- Less modular: Difficult to manage multiple tasks or transfer fine-tuned components across models. +- Potentially less effective when task-switching is required. + +--- + +### **2. Adapters for Fine-Tuning** + +#### **Advantages:** + +1. **Modularity:** + - Adapters are ideal for multi-task setups since each task can have its own adapter, enabling seamless task switching. +2. **Transfer Learning:** + - Adapters trained on one task can be reused or adapted for related tasks, making them versatile. +3. **Isolation:** + - Fine-tuning with adapters avoids interference between tasks, which is especially useful in multi-task or federated learning setups. + +#### **Use Cases:** +- Multi-task learning or scenarios requiring task switching. +- Incremental fine-tuning on new tasks/domains. +- Applications where modularity and reusability of components are important. + +#### **Limitations:** +- Higher computational overhead compared to LoRA due to added modules. +- Slightly more complex integration into existing architectures. 
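The efficiency trade-offs described above can be made concrete with a quick parameter count. The sketch below is illustrative only: the function names, the 4096-wide layer, the rank of 8, and the bottleneck size of 64 are assumptions chosen for the example, not values taken from this repository. It assumes LoRA trains \( A \) (shape d_out x r) and \( B \) (shape d_in x r) so that \( \Delta W = AB^T \), and that an adapter is a down-projection followed by an up-projection, each with a bias.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix W is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a LoRA update dW = A @ B.T.

    A has shape (d_out, rank) and B has shape (d_in, rank); W stays frozen.
    """
    return rank * (d_out + d_in)


def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Trainable weights for one adapter module.

    Assumed layout (hypothetical, but common): a down-projection and an
    up-projection, each a linear layer with a bias term.
    """
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size
    full = full_finetune_params(d, d)
    lora = lora_params(d, d, rank=8)
    adapter = bottleneck_adapter_params(d, bottleneck=64)
    for name, n in [("full", full), ("lora", lora), ("adapter", adapter)]:
        print(f"{name:8s} {n:>12,d}  ({100 * n / full:.2f}% of full)")
```

Actual counts depend on which layers are adapted and on the chosen rank or bottleneck size, so the comparison between the two methods can shift with those settings.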
+ +--- + +### **Comparison Table:** + +| Feature | LoRA | Adapters | +|--------------------------|---------------------------------|---------------------------------| +| **Parameter Efficiency** | High | Moderate | +| **Modularity** | Low | High | +| **Fine-Tuning Overhead** | Minimal | Moderate | +| **Use Case** | Single-task fine-tuning | Multi-task or modular setups | +| **Scalability** | Limited for multiple tasks | Scalable for multi-task setups | +| **Inference Efficiency** | Higher | Lower due to added modules | + +--- + +### **What’s Better?** + +- **LoRA** is generally better if: + - You are working on a single task or domain. + - You prioritize efficiency and minimal computational overhead. + - You have limited resources (e.g., GPU memory). + +- **Adapters** are better if: + - You need to fine-tune on multiple tasks or domains. + - Task modularity and reusability are important. + - You want a framework that supports incremental learning. + +--- + +### **Best Practices:** +- **Experimentation:** + If resources allow, experiment with both methods to see which performs better for your specific task. +- **Hybrid Approaches:** + Recent work combines both methods for more efficient and effective fine-tuning. +- **Task-Specific Considerations:** + Consider the complexity of the task, the expected level of generalization, and memory constraints. \ No newline at end of file diff --git a/docs/render.pdf b/docs/soft_prompts.pdf similarity index 100% rename from docs/render.pdf rename to docs/soft_prompts.pdf diff --git a/notebooks/LLM_Specialization/Adapters in PyTorch.ipynb b/notebooks/LLM_Specialization/Adapters in PyTorch.ipynb deleted file mode 100644 index 7f236d0..0000000 --- a/notebooks/LLM_Specialization/Adapters in PyTorch.ipynb +++ /dev/null @@ -1,3389 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "b75e0832-1651-4d74-85dd-14fd49ef56e3", - "metadata": {}, - "source": [ - "

\n", - "\n", - "# **Adapters in PyTorch**\n", - "\n", - "Estimated time needed: **45** minutes\n", - "\n", - "**_Note to advanced users_: If you are already familiar with classical fine-tuning and you only want to see the section that relates to adapters, skip forward to Adapters and run all of the cells above that section by going to _Run --> Run All Above Selected Cell_**\n", - "\n", - "You can fine-tune a neural network in several ways. Common strategies include adjusting only the final layer or fine-tuning all layers. However, these methods have their drawbacks: fine-tuning just the final layer often leads to less than optimal results, while fine-tuning all layers can be very time-consuming.\n", - "\n", - "To address these issues, researchers have developed various parameter efficient fine-tuning (PEFT) techniques. One such technique involves the use of adapters. Adapters enable modular training, where small, task-specific modules are trained within the model without changing the pre-existing pretrained parameters. This approach efficiently tailors the model to new tasks with a reduced risk of overfitting. However, adapters are not a cure-all solution. While they are less likely to overfit and are computationally efficient, they might not always reach the same level of accuracy as full model fine-tuning, particularly if the task necessitates substantial changes from the pretrained model's original capabilities.\n", - "\n", - "In this hands-on lab, you learn how to apply an adapter to a transformer-based neural network that has been trained on the AG News data set, with the aim of using this model on the IMDB data set. You also evaluate and compare the performance of this method with that of a fully fine-tuned model and a model where only the last layer is fine-tuned.\n", - "\n", - "---\n", - "\n", - "# __Table of contents__\n", - "\n", - "
1. Objectives\n", - "
2. Setup\n", - "
    1. Install required libraries\n", - "
    2. Import required libraries\n", - "
    3. Defining helper functions\n", - "
3. Positional encodings\n", - "
4. Import IMDB data set\n", - "
    1. IMDB data set overview\n", - "
        1. Data set composition\n", - "
        2. Applications\n", - "
        3. Challenges\n", - "
        4. Data set splits\n", - "
        5. Data loader\n", - "
        6. Neural network\n", - "
5. Training\n", - "
    1. Train IMDB\n", - "
    2. Fine-tune a model pretrained on the AG News data set\n", - "
    3. Fine-tune the final layer only\n", - "
6. Adapters\n", - "
    1. Benefits of using adapters in neural networks\n", - "
7. Exercise: Adapt linear layers in a different network
\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "9e4d071d-35d8-4d69-9168-a40d25df2205", - "metadata": {}, - "source": [ - "# Objectives\n", - "\n", - "After completing this lab, you are able to:\n", - "\n", - "- Define and pretrain a transformer-based neural network using PyTorch for a classification task [Optional]\n", - "- Fully fine-tune the pretrained model for a different classification task [Optional]\n", - "- Compare results by fine-tuning only the last layer of the pretrained model [Optional]\n", - "- Understand how adapters work\n", - "- Apply adapters to linear layers in a neural network\n", - "- Train a neural network in a parameter efficient way by training just the adapted layers\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "35a85867-b5fa-4780-b512-35bdf829b33c", - "metadata": {}, - "source": [ - "# Setup\n", - "\n", - "### Install required libraries\n", - "\n", - "For this lab, you use the following libraries, which are __not__ preinstalled in the Skills Network Labs environment. 
__You must run the code in the following cell__ to install them.\n", - "\n", - "```bash\n", - "!pip install --upgrade portalocker==2.8.2 torchtext==0.17.0 torchdata==0.7.1 pandas==2.2.2 matplotlib==3.9.0 scikit-learn==1.5.0 torch==2.2.0 numpy==1.26.4\n", - "```\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "a2d7c12f-2141-4af9-91a8-b0c1c8088d66", - "metadata": {}, - "source": [ - "### Import required libraries\n", - "\n", - "The following code imports the required libraries.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "81927908-a8c5-42d6-b24c-97b8b13a42e6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "True\n", - "Tesla P40\n", - "Import Successfully!\n" - ] - } - ], - "source": [ - "# Environment setup\n", - "import os\n", - "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", - "\n", - "# Suppress warnings\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "def warn(*args, **kwargs):\n", - " pass\n", - "warnings.warn = warn\n", - "\n", - "# PyTorch and related libraries\n", - "import torch\n", - "from torch import nn\n", - "from torch.utils.data import DataLoader, Dataset\n", - "from torch.utils.data.dataset import random_split\n", - "from torch.nn.utils.rnn import pad_sequence\n", - "\n", - "# TorchText for NLP tasks\n", - "from torchtext.datasets import AG_NEWS, IMDB\n", - "from torchtext.data.utils import get_tokenizer\n", - "from torchtext.vocab import build_vocab_from_iterator, GloVe, Vectors\n", - "from torchtext.data.functional import to_map_style_dataset\n", - "\n", - "# Utility libraries\n", - "import time\n", - "from itertools import accumulate\n", - "import math\n", - "import pickle\n", - "import io\n", - "from urllib.request import urlopen\n", - "import tarfile\n", - "import tempfile\n", - "\n", - "# Data manipulation and visualization\n", - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - 
"from tqdm import tqdm\n", - "\n", - "# Jupyter Notebook utilities\n", - "from IPython.display import Markdown as md\n", - "\n", - "# PyTorch-specific configurations\n", - "torch.set_num_threads(1)\n", - "\n", - "# CUDA-related checks\n", - "print(torch.cuda.is_available())\n", - "print(torch.cuda.get_device_name())\n", - "\n", - "print(\"Import Successfully!\")" - ] - }, - { - "cell_type": "markdown", - "id": "5e710480-51b4-46c7-a3b4-5c80021007b2", - "metadata": {}, - "source": [ - "### Define helper functions\n", - "\n", - "The following code shows some helper functions to help with plotting, saving, and loading files. These functions are not the main focus of this lab, so you do not have to dwell on these too long. However, do run the cells in this section to define these helper functions.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "57d49c2a-d309-4608-8e45-b9fa8ec63cbd", - "metadata": {}, - "outputs": [], - "source": [ - "def plot(COST,ACC):\n", - "\n", - " fig, ax1 = plt.subplots()\n", - " color = 'tab:red'\n", - " ax1.plot(COST, color=color)\n", - " ax1.set_xlabel('epoch', color=color)\n", - " ax1.set_ylabel('total loss', color=color)\n", - " ax1.tick_params(axis='y', color=color)\n", - "\n", - " ax2 = ax1.twinx()\n", - " color = 'tab:blue'\n", - " ax2.set_ylabel('accuracy', color=color) # you already handled the x-label with ax1\n", - " ax2.plot(ACC, color=color)\n", - " ax2.tick_params(axis='y', color=color)\n", - " fig.tight_layout() # otherwise the right y-label is slightly clipped\n", - "\n", - " plt.show()\n", - "\n", - "\n", - "def save_list_to_file(lst, filename):\n", - " \"\"\"\n", - " Save a list to a file using pickle serialization.\n", - "\n", - " Parameters:\n", - " lst (list): The list to be saved.\n", - " filename (str): The name of the file to save the list to.\n", - "\n", - " Returns:\n", - " None\n", - " \"\"\"\n", - " with open(filename, 'wb') as file:\n", - " pickle.dump(lst, file)\n", - "\n", - "\n", - "def 
load_list_from_file(filename):\n", - " \"\"\"\n", - " Load a list from a file using pickle deserialization.\n", - "\n", - " Parameters:\n", - " filename (str): The name of the file to load the list from.\n", - "\n", - " Returns:\n", - " list: The loaded list.\n", - " \"\"\"\n", - " with open(filename, 'rb') as file:\n", - " loaded_list = pickle.load(file)\n", - " return loaded_list" - ] - }, - { - "cell_type": "markdown", - "id": "fb4827b6-33c9-4370-bfbf-983d89623c98", - "metadata": {}, - "source": [ - "---" - ] - }, - { - "cell_type": "markdown", - "id": "0d6f6a86-020b-4ed5-8c52-5fa69dceca97", - "metadata": {}, - "source": [ - "# Positional encodings\n", - "\n", - "Positional encodings play a pivotal role in transformers and various sequence-to-sequence models, aiding in conveying critical information regarding the positions or sequencing of elements within a given sequence. To illustrate, let's examine the sentences: \"He painted the car red\" and \"He painted the red car.\" Despite their distinct meanings, it's worth noting that the embeddings for these sentences remain identical in the absence of positional encodings. 
The following class defines positional encodings by inheriting from PyTorch's `Module` class.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "88c9cebf-7dbe-46d0-81d2-5a116120c374", - "metadata": {}, - "outputs": [], - "source": [ - "class PositionalEncoding(nn.Module):\n", - " \"\"\"\n", - " https://pytorch.org/tutorials/beginner/transformer_tutorial.html\n", - " \"\"\"\n", - "\n", - " def __init__(self, d_model, vocab_size=5000, dropout=0.1):\n", - " super().__init__()\n", - " self.dropout = nn.Dropout(p=dropout)\n", - "\n", - " pe = torch.zeros(vocab_size, d_model)\n", - " position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n", - " div_term = torch.exp(\n", - " torch.arange(0, d_model, 2).float()\n", - " * (-math.log(10000.0) / d_model)\n", - " )\n", - " pe[:, 0::2] = torch.sin(position * div_term)\n", - " pe[:, 1::2] = torch.cos(position * div_term)\n", - " pe = pe.unsqueeze(0)\n", - " self.register_buffer(\"pe\", pe)\n", - "\n", - " def forward(self, x):\n", - " x = x + self.pe[:, : x.size(1), :]\n", - " return self.dropout(x)" - ] - }, - { - "cell_type": "markdown", - "id": "aadc814f-d060-47be-963f-c28cfd0618e4", - "metadata": {}, - "source": [ - "# Import IMDB data set\n", - "\n", - "The following code loads the IMDB data set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "c479b278-01b9-4863-9821-528b1607a74b", - "metadata": {}, - "outputs": [], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/35t-FeC-2uN1ozOwPs7wFg.gz')\n", - "tar = tarfile.open(fileobj=io.BytesIO(urlopened.read()))\n", - "tempdir = tempfile.TemporaryDirectory()\n", - "tar.extractall(tempdir.name)\n", - "tar.close()" - ] - }, - { - "cell_type": "markdown", - "id": "ec7531bd-8b33-45d9-8061-9de43bd188b8", - "metadata": {}, - "source": [ - "## IMDB data set overview\n", - "\n", - "The **IMDB data set** contains movie reviews from the Internet Movie Database (IMDB) and is 
commonly used for binary sentiment classification tasks. It's a popular data set for training and testing models in natural language processing (NLP), particularly in the context of sentiment analysis.\n", - "\n", - "### Data set composition\n", - "\n", - "- **Reviews**: The data set consists of 50,000 movie reviews, divided evenly into 25,000 training and 25,000 testing samples.\n", - "- **Sentiment labels**: Each review is labeled as either positive or negative, indicating the sentiment expressed in the review. The data set is balanced, with an equal number of positive and negative reviews in both the training and testing sets.\n", - "- **Text content**: Reviews are presented as plain text and have been preprocessed to some extent. For example, HTML tags are removed, but the text retains its original punctuation and capitalization.\n", - "- **Usage**: The data set is commonly used to train models for binary sentiment classification, where the goal is to predict whether a given review is positive or negative based on its text content.\n", - "\n", - "### Applications\n", - "\n", - "- **Sentiment analysis**: The primary application of the IMDB data set is in sentiment analysis, where it serves as a benchmark for various text classification algorithms.\n", - "- **Natural language processing**: The data set is widely used in NLP research and applications, providing a basis for testing the effectiveness of different models and approaches in understanding human language.\n", - "\n", - "### Challenges\n", - "\n", - "The data set is small, so it's hard to train a model from scratch.\n", - "\n", - "The following class is defined to traverse the IMDB data set. 
The need to define this class arises from the fact that the IMDB data set is split across a large number of files.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "51bc66a7-506d-4ac4-aa0b-1813c6a0e4c5", - "metadata": {}, - "outputs": [], - "source": [ - "class IMDBDataset(Dataset):\n", - " def __init__(self, root_dir, train=True):\n", - " \"\"\"\n", - " root_dir: The base directory of the IMDB dataset.\n", - " train: A boolean flag indicating whether to use training or test data.\n", - " \"\"\"\n", - " self.root_dir = os.path.join(root_dir, \"train\" if train else \"test\")\n", - " self.neg_files = [os.path.join(self.root_dir, \"neg\", f) for f in os.listdir(os.path.join(self.root_dir, \"neg\")) if f.endswith('.txt')]\n", - " self.pos_files = [os.path.join(self.root_dir, \"pos\", f) for f in os.listdir(os.path.join(self.root_dir, \"pos\")) if f.endswith('.txt')]\n", - " self.files = self.neg_files + self.pos_files\n", - " self.labels = [0] * len(self.neg_files) + [1] * len(self.pos_files)\n", - " self.pos_inx=len(self.pos_files)\n", - "\n", - " def __len__(self):\n", - " return len(self.files)\n", - "\n", - " def __getitem__(self, idx):\n", - " file_path = self.files[idx]\n", - " label = self.labels[idx]\n", - " with open(file_path, 'r', encoding='utf-8') as file:\n", - " content = file.read()\n", - " return label, content" - ] - }, - { - "cell_type": "markdown", - "id": "6bb31d42-b20e-413d-96f6-7d680eb83bb2", - "metadata": {}, - "source": [ - "The following code uses the `IMDBDataset` class previously defined to create iterators for the train and test data sets. 
In the latter part of the cell, you can return 20 examples from the train set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "01513044-f657-4203-b933-bad10ebeb4c8", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(0, 'It is important not to be insulted by lack of logic or common sense and those who have any \"gray matters\" will agree that this movie just doesn\\'t work.

The problems lay in the direction, cast selections and lack of depth in the character building. The word comedy was very hard thing to say when i expect to laugh when these words are used. Let\\'s look at the problems in direction/script.

Brother and sister both in their mid 30\\'s seem to be well adjusted. They meet a complete stranger at a park and Heather Graham character walks up to her and asks the most intimate questions that even half sane person would be running the other way or at least scream for a police officer. He then awkwardly walks over and makes some stupid statements and she falls for him. Then after ONE date were they all go out together he falls in love with her and decides to get married in Vegas in a week\\'s time???? Hello does anyone feel stupid yet? He goes out with thousands of women and he meets this one person who says about 10 words that WE see on the screen and he wants to marry her. Not only was there no chemistry it just doesn\\'t make sense. Sure it\\'s a romantic comedy and I want to believe it could, but the direction made it completely flat.

Now Heather falls head over heels with her too and when Heather Graham and Bridget Moynahan (very shallow character) kiss or more to the point it was sloppiest kiss ever that chemistry MIGHT be there. I found it unromantic and unfunny and while many say Heather cannot act i think the reality is Heather was clearly the wrong person for this role.

This was Sue Kramer debut as a director and to me it was just too much for her to chew. It would take a lot of craft to make this movie work and IMHO it could be done with better writers and casting and direction.')\n", - "(0, \"THE SCREAMING SKULL (1 outta 5 stars) This movie boasts some pretty cool opening credits (an offscreen narrator warning that movie patrons will be offered a free burial if they die of fright watching this movie, a scary shot of a skull emerging from a placid pool and the ubiquitous scary music) but, sadly, the movie is all downhill from there. A widowed man takes his new bride to his secluded mansion... admonishing his servants and friends that the new Mrs. has a very fragile disposition due to a tragedy in her past. Well, in no time at all she begins to see and hear mysterious things that no one else can. Her husband assures her that it's all merely in her mind and... well, you can probably see where this all is going. You will have figured out what's going on long before our hapless heroine... because you have probably seen the exact same plot in hundreds of other movies and TV shows (and done better, too). To add to the movie's myriad transgressions, most cuts of this movie (on numerous cheap DVD compilations) seem to be missing a few key scenes. You see the heroine slowly walking towards the window... she goes to open it... you know she is going to see something scary... and then... suddenly the scene cuts to her sobbing in her husband's arms. So what did she see??? I guess we'll never know.\")\n", - "(0, \"

Whether any indictment was intended must be taken into consideration. If in the year 2000 there were still rifts of feeling between Caucasian and Afro-Americans in Georgia, such as shown in this film, obviously there remains a somewhat backward mentality among a lot of people out there. It is rather hypocritical, to say the least, if everyone adores Halle Berry, Whoopie Goldberg, Beyoncé, Noemi Campbell, Denzel Washington, Will Smith, et. al., whilst out in the backs there persist manifest racial divides.

White grandmother suddenly gets black grand-daughter thrust upon her, only to meet up with black grandfather in a very white social backwater. The story is sweet, not lacking tragic overtones, and eminently predictable as in most of these kinds of TV films, though the final scene has you guessing............ will he? won't he.......?

Gena Rowlands in her typical style offers a sincere rendering, and Louis Gossett is a good match for her; the little Penny Bae fortunately does not steal the show.

A `nice' way of relaxing after Sunday lunch without having to force your mind too much, though you might just find yourself having a little siesta in the middle of it.\")\n", - "(0, 'First, it takes a full half hour to get Hackman out of jail and to start doing the job. What a waste of time, we all know Hackman is getting out to do some job for his masters, why waste almost a third of the movie on these sequences. Then Hackman stays in a hotel and the story arc again goes nowhere, simply proving to us that Hackman is under close watch and anything he says or does is know by the masters. Again, another 20 minutes. Then more wasted time showing the reunion with his wife. All of this should have taken 10-15 minutes at most simply as a set-up for the real action, intrigue and plot twists. By the time the real action gets going, I was so bored that I just wanted the movie to end. Hackman is great as usual, and the other actors as well, but this is a dud of the first magnitude.')\n", - "(0, \"Banned as a 'Video Nasty' in the UK, Unhinged has naturally gained quite a bit of notoriety. However, the most shocking thing I found about the film was its amateurishness in all departments. The bloodletting I could handle: the terrible acting, shoddy editing, awful direction, lousy script and abysmal soundtrack were much harder to take.

Three girls on their way to a music festival crash into a ravine during a storm. They are rescued by a friendly stranger who takes them to a nearby house. The owner of the house, a batty old lady, and her spinster daughter, welcome the girls in, allowing them to stay for a few days in order to recuperate. However, someone doesn't want the girls to leave\\x97ever! One by one they fall victim to an unseen assailant.

Taking a long time to get going and featuring some of the worst performances ever in a horror film (and that takes some doing), Unhinged is a truly awful film. The music is a total mess (it sounds like a three year old has been let loose on a synthesiser) and as such, it complements the movie perfectly. Only a couple of bloody scenes towards the end and a bit of gratuitous nudity save Unhinged from getting the lowest possible score.

If you are a horror completist (and unfortunately, I am), you will want to see this in order to tick it off the Video Nasty watch-list. But be warned\\x97it is really, really bad.\")\n", - "(0, \"When great director/actor combinations are talked about the team of J. Lee Thompson and Charles Bronson is not usually mentioned. Probably because the output of nine joint ventures between the two of them runs the gamut from the really good action entertainment to the mediocre. Unfortunately Kinjite: Forbidden Subjects falls in the latter.

That's sad because Kinjite could have been a whole lot better. But for the life of me I don't understand why it was necessary to make the father of the missing Japanese girl, a guy used to getting some cheap jollies because the romance in his marriage has run out. That might have been good for another film altogether, but it served no purpose here.

A straightforward cop drama with Charles Bronson as a vice cop who's seen a bit too much in his line of work and has a strong prejudice against orientals. That part could also have used a little explaining as well. But he's going to have to overcome it if he and patient partner Perry Lopez are going to locate a captured Japanese school girl.

Bronson's time in the vice squad have told him exactly where to look for the kidnapper. A stylish, murderous pimp played by Jaime Fernandez is the guy and he and Bronson have some history. In fact in the film's best scene, Bronson made him eat an expensive rolex watch and set his car on fire.

At one point Fernandez happens to spot Bronson and Lopez in an all night delicatessen and this being after his rolex snack, he sprays the place with an Uzi killing everyone, but Bronson and Lopez. I really think that little incident would have had more than a couple vice cops from the LAPD after Fernandez. But that's another terribly big hole in the plot.

Still there is a very rough justice in the end for Fernandez. I wish the whole film had been better though. This was the last film of the Bronson-Thompson team and J. Lee Thompson's last as a director. He should have gone out with something better.\")\n", - "(0, 'First off, let me say that I am a great believer in Fanpro stuff. I see it as a way to continue a good show long after it has been cancelled. Star Trek Voyages and Star Wars Revelations are examples of decent efforts. So I have a soft-spot for fanpro stuff that means I\\'ll overlook things that I would ordinarily slate badly.

So on to ST: HF. Well, first off the good things. Enthusiasm is a major part of making any show believable and, for the most part, the crew of the various ships all seem to be having a good time with their roles. Next, the effects aren\\'t bad for a home-brew effort, with nothing to make you really wince. The stories aren\\'t too bad either. Nothing particularly innovative, but solid enough stuff and at least there are ongoing story-arcs.

But it has a lot of faults.

First off, although they quite obviously HAVE to rip-off Star Trek footage, set backdrops, music and effects, I see no reason why they proceeded to rip off virtually every other sci-fi musical score ever made. Everything from Aliens to Starship Troopers rears it orchestral head at one point or another. Likewise, much of the footage is from other movies, dutifully CGI\\'d over to make it look different. The Grey warships, for instance, though disguised, are quite obviously Star Destroyers from Star Wars. And the station is also rather obviously Fleet Battle Station Ticonderoga from Starship Troopers. Likewise, sound effects from various Star Wars movies appear in space battles between fighters, as does animated over footage. In one scene in either first or second season, I think, you even see two TIE fighters fly past during a battle, which hardly does your suspension of disbelief any favours.

Acting varies from the reasonable to the hideously painful to watch. Everyone does improve as the seasons progress, though, but expect to grimace at the screen a lot, especially in the early seasons. They\\'ve also made some interesting acting choices. Let\\'s just say that the food replicators on this show seem permanently set to \"cake\" and leave it at that.

Make-up effects are generally quite effective on the whole. But they really ought to mercilessly club to death the person who decided to use cheap Ferengi and Cardassian masks for anything other than background use or \"passing\" shots. They are just beyond unrealistic. Every time I saw one of these (apart from trying not to laugh too much) I kept expecting the unfortunate soul wearing it to pull out a gun and announce that \"This is a stick-up!\" In one scene a \"Cardassian\" actually talks whilst wearing one of these. Not only do the lips not move, but the mask doesn\\'t even have an opening where the mouth should be. Someone needs to be slapped hard for that. Couldn\\'t they have taken a craft knife to it, for goodness\\' sake! There are also some well-done, but unintentionally funny make-up jobs, such as the Herman Munster look alike.

The writing, though coherent, is nothing new. Instead the script runs like a continuation of DS9, with the ships heading out from DS12 on various missions. The new enemy, \"The Grey\" aren\\'t very menacing and the plot line involving them is effectively a reworking of the Borg threads. i.e. Starfleet meet the Grey, the Grey are hugely powerful, Starfleet barely escape with their lives, then through technology they begin to find ways to combat the enemy etc etc. All done before with the Borg.

Another bone of contention is the dialogue. Star Trek writers have long had the ability to write \"insert technobabble here\" into a script. It usually means an exposition of the latest plan to combat the enemy using \"quantum phase discriminators\" or \"isolytic charges\" etc. In other words, nonsense that tells you that they are on the case and a resolution is at hand.

The words are just gibberish really. I\\'ve no problem with this, but where ST:HF makes a mess of it is where they include real-world comments into this concept.

Tactical advice such as \"We need to regroup\" sounds good, but not when uttered by trio of characters already standing in a group. Likewise when asked what the situation is, a tactical officer is heard to reply \"We count three battleships\". He actually needed to count them? C\\'mon! I expected the questioner to ask him \"Are you sure?\" or \"Can you double check\". But my all-time favourite comment is this:

Captain: \"Can we establish two-way communication?\"

Comms officer: \"No, we can only send and receive..\"

Well, duh!.....

Having said all the above, the show does improve as it goes along. Seasons 1 and 2 are pretty bad, 3 shows an improvement but 4 & 5 are where it starts to get noticeably better. Season 6 so far looks quite reasonable.

I do have a problem with their choice of media for the shows though. Quicktime sucks, quite frankly and the sooner they move to divx/avi format the better. Some of us like to actually take our downloaded shows and watch them on decent size screen and not peer at a tiny QT window on a computer monitor. Not only does Quicktime make this difficult, but the 320x180 resolution the shows are in does not scale at all well. In fact, it makes the shows pretty unwatchable, like they were a tenth-generation VHS tape copy. The least they could do was to include a hi-res downloadable option.

Anyway, the show has promise, and I\\'m even beginning to like some of the characters. But that\\'s 40 episodes on, so I\\'m not sure this says that much about character development at all.

But what can you say, it\\'s free....

PS: Out of 28 votes, 19 people rated this show as a 9 or 10. Hmmmm... were we watching the same show? Or are you 19 all three year olds?')\n", - "(0, 'I read in the papers that W.Snipes was broke so no wonder he would take parts in low budget projects like The Contractor.He is just the next action star to join a growing club:the penniless action stars of the 90s (Van Damme,Segal,Lundgren,Snipes). Here he stars the lead in a cheap action flick which was shot in Bulgaria( we are supposed to believe that the location is London, like only a complete moron would buy that)The story is the one of 1000 other movies: retired special forces good guy gets hired by the government again to do a wet job- after that government wants to get rid of him- good guy gets away after killing bad guys (was that a spoiler? guess not!) The star of the movie: the little girl (Eliza Bennett) outperforms everybody else of the cast!!!One star is for her plus one star for eye candy Lena Headey, makes 2 stars. Only for die hard Snipes fans!Everybody else:avoid!')\n", - "(0, \"Did HeidiJean really see this movie? A great Christmas movie? Not even close. Dull, bland and completely lacking in imagination and heart. I kept watching this movie wondering who the hell thought that Carly Pope could play the lead in this movie! The woman has no detectable personality and gives a completely lackluster performance. Baransky was great as usual and provided the only modicum of interesting the whole thing. Probably her involvement was the only reason this project was green lighted to begin with. 
Maybe I'm expecting too much for a Lifetime movie played 15 days from Christmas but I sat through this thing thinking that with a different director and a recasting JJ with an actress that at least could elicit sympathy this could have been quite a cute little movie.\")\n", - "(0, 'Band Camp was awful, The Naked Mile was a little better, and this third straight to DVD in the American Pie franchise seems the same quality as the predecessor. Basically Erik Stifler (John White) split from his girlfriend after losing his virginity, and now him and Mike \\'Cooze\\' Coozeman (Jake Siegel) are joining Erik\\'s cousin Dwight (Steve Talley) at college. With the promise of many parties, plenty of booze, and enough hot chicks at the Beta House, they only have fifty listed tasks to carry out to become official privileged members. But a threat comes into sight with the rivals, GEK (\"Geek\") House, led by power-hungry nerd (and sheep shagger) Edgar (Tyrone Savage) offering bigger and better than what Beta have. To settle it once and for all, Beta and Gek go into battle with the banned, for forty years, Greek Games to beat each other in, with the loser moving out. The last champion of the games, Noah Levenstein aka Jim\\'s Dad (the only regular Eugene Levy) runs the show, which sees the people unhooking bras, a gladiator duel floating on water, catching a greased pig, Russian Roulette in the mouth with cartridges of aged horse spunk, wife carrying and drinking a full keg of alcohol (with puking not disqualifying). It all comes to the sudden death, with a guy getting stripper lap dancing, and they have to resist cumming, Beta House win when Edgar cums with a girl dressed as a sheep on his lap. Also starring Flubber\\'s Christopher McDonald as Mr. 
Stifler, Meghan Heffern as Ashley, Dan Petronijevic as Bull, Nic Nac as Bobby, Christine Barger as Margie, Italia Ricci as Laura Johnson, Moshana Halbert as Sara Coleman, Sarah Power as Denise, Andreja Punkris as Stacy and Jordan Prentice as Rock. The nudity amount is very slightly increased, as is the grossness of the jokes, and I could guess it being rated one star out of five, but I like it. Adequate!')\n", - "(1, \"I saw this recently on a cable channel. The movie is great; it's one of the few musicals I have seen that doesn't shy away from the light and dark. It portrays some of the splendour of the age along with a lot of the squalor. Some of the set piece dance sequences so much is going on, I didn't know where to look next. One day I shall go and see this on the big screen, just so that I see what's happening. But what really lifts this to another level is Oliver Reed's performance as Bill Sykes. Not only is a thoroughly mean and menacing man but there is something else, some inner demons. He gave me the impression that if you pushed him into a corner, he was capable of anything. It was almost as if the Sykes character was on the edge of madness, just awaiting the trigger. I have seen the Robert Newton's Bill Sykes from the 1948 movie, and I thought he was 'just' a bad egg, but Oliver Reed's performance intimidated me in my own living room.\")\n", - "(1, '\"The Gingerbread Man is the first thriller I\\'ve ever done!\" \\x96 Robert Altman

In 1955 Charles Laughton directed \"The Night of the Hunter\", a spooky slice of Southern Gothic in which Robert Mitchum plays a scary serial killer. One of the film\\'s more famous sequences consists of two kids escaping from Mitchum on a rowboat, the kids frantically paddling whilst Mitchum wades after them like a monster.

Seven years later Mitchum played an equally spooky killer in \"Cape Fear\", another film set in the American South. That film featured a local attorney trying to protect his family and likewise ended with Mitchum terrorising folks on a boat. In 1991 Martin Scorsese, trying to branch out and tackle something more mainstream, remade \"Cape Fear\", boat scene and all.

Now we have Robert Altman\\'s \"The Gingerbread Man\", another slice of small town Southern Gothic. Altman says he consulted \"The Night of the Hunter\" for inspiration and tackled such a mainstream film purely because he wanted to \"spread his wings and try a popcorn picture\", but what he\\'s secretly attempting to do here is deconstruct the canonical films of the Southern Gothic genre.

So instead of a showdown on small boat, we get a showdown on a giant ship. Instead of two kids being kidnapped, we get two kids being safely returned to the police. Instead of money being hidden, we have money being readily given via a last will and testament. Instead of the righteous attorney of the 1961 film and the deplorable attorney of the 1991 remake, we get a rather three-dimensional lawyer in Kenneth Branagh. Instead of the monster chasing the family we get the hero chasing the bad guys. Instead of the monster breaking into the family\\'s house boat, we have the hero hunting the monster on board the monster\\'s \"house ship\". Similarly, instead of a murderous serial killer we get an innocent weirdo played by Robert Duvall. . .etc etc etc.

Altman goes on and on, reversing everything just a little slightly, pulling at the edges and doing his own thing. His touch is most apparent during the film\\'s first half-hour, the film existing in an uneasy space between conventional plot-driven movie storytelling and Altman\\'s fondness for overlapping dialogue, casual narratives, prowling camera movement and the way that characters aren\\'t so much introduced as they are simply part of what\\'s going on.

Still, despite Altman\\'s best intentions, the film never rises above mediocrity. Altman\\'s too bound to the conventions of the \"thriller format\" to do much damage, his style is too lethargic to generate tension and the film is simply not radical enough to counterpoint other canonical films in the genre. \"Gingerbread Man\" is thus too mainstream to work as a more pure Altman film and too Altman to work as a mainstream thriller.

The film\\'s not a complete waste, though. Robert Downey Junior, Kenneth Branagh and the usually intolerable Daryl Hannah, all turn in juicy performances. The film also has a nice atmosphere, set against a approaching hurricane, and the final act contains some interesting twists and turns. While it\\'s not the complete disaster that Scorsese\\'s \"Cape Fear\" was, the film still never amounts to anything special.

7/10 \\x96 In the late 90s Altman made 3 successive films set in the American South: \"Kansas City\", \"Gingerbread Man\" and \"Cookie\\'s Fortune\". Unlike \"Gingerbread Man\", both \"Kansas City\" and \"Cookie\\'s Fortune\" tackle the genre on the broader, more looser canvases that Altman was most comfortable with.

\"Kansas City\" is the more important of these two films, its hierarchies of class, politics and crime, and its desire to break radically away from typical gangster genre frameworks, would prove influential on all serious 21st century film crime writers (see, for example, \"The Wire\"). That said, \"Cookie\\'s Fortune\", while a much slighter tale, is perhaps the better picture.

Note: Altman claims that this is his first thriller, but he directed \"Images\", an art house thriller, in 1972.

Worth one viewing.')\n", - "(1, '\"Read My Lips (Sur mes lèvres)\" (which probably has different idiomatic resonance in its French title) is a nifty, twisty contemporary tale of office politics that unexpectedly becomes a crime caper as the unusually matched characters slide up and down an ethical and sensual slippery slope.

The two leads are magnetic, Emmanuelle Devos (who I\\'ve never seen before despite her lengthy resume in French movies) and an even more disheveled than usual Vincent Cassel (who has brought a sexy and/or threatening look and voice to some US movies).

The first half of the movie is on her turf in a competitive real estate office and he\\'s the neophyte. The second half is on his turf as an ex-con and her wrenching adaptation to that milieu.

Writer/director Jacques Audiard very cleverly uses the woman\\'s isolating hearing disability as an entrée for us into her perceptions, turning the sound up and down for us to hear as she does (so it\\'s even more annoying than usual when audience members talk), using visuals as sensory reactors as well.

None of the characters act as anticipated (she is not like that pliable victim from \"In the Company of Men,\" not in individual interactions, not in scenes, and not in the overall arc of the unpredictable story line (well, until the last shot, but heck the audience was waiting for that fulfillment) as we move from a hectic modern office, to a hectic disco to romantic and criminal stake-outs.

There is a side story that\\'s thematically redundant and unnecessary, but that just gives us a few minutes to catch our breaths.

This is one of my favorites of the year!

(originally written 7/28/2002)')\n", - "(1, \"This was a pretty decent movie. This movie is good to just sit down and watch and be entertained. Just a typical Hollywood film. This movie will never win an Oscar or anything and definitely doesn't deserve one, but I thought it was pretty good. It's kind of like the show 24 but set into movie format. If you like the whole we've got to stop the terrorist from killing the president kind of movie then you will enjoy this flick. I personally think that storyline has been done WAY too much, but The Sentinel does add a little twist with the mole in the Secret Service. All in all, this movie won't leave your jaw to the floor or change your life, but who says every single movie has to be like that to be good?\")\n", - "(1, 'Actually they could not have chosen a better diversified actor to portray Little Richard than Leon. He captures Little Richard to a most believable essence. The outfits where wonderful and any person watching this movie will definitely keep a smile on their face through the entire movie. Although the movie is a little long, it keeps your attention with the personality and outfits of Little Richard in mind. The ending should have taken a direction of moving Little Richard more into the present where you could see him as he has aged into this new millennium. He will always be the King of Rock-N-Roll as far as I am concerned regardless of what the other media says.')\n", - "(1, \"If you are a fan of Altman's large ensemble casts, as evidenced in major films like M.A.S.H., Nashville, Gosford Park, and lesser seen films like A Wedding, then you will no doubt be entertained by HealtH. Centered around a Health Convention where two women are running for President, HealtH contains many of Altman's latter 70s regulars like Paul Dooley (who helped write the film), Carol Burnett, and Henry Gibson, while also including top star Altman newcomers like Lauren Bacall, James Garner, and Glenda Jackson. 
Like a lot of Altman ensemble films there are numerous subplots in this film, but it is not nearly as overwhelming as films like Nashville or A Wedding, rather it has a more centered feel, perhaps like M.A.S.H. or Gosford Park. The whole thing is an obvious satire on the Health movement, filled with over-top, outlandish, contradictive characters, with guest stars like Dick Cavett providing a wry commentary on the whole thing. Underlining the whole election process is Altman's characteristic pessimism about politics and public appeal but what is most appealing about this film is the sheer fun most people seem to be having. This would be one of Altman's last films like this for a while!\")\n", - "(1, 'I saw the movie with two grown children. Although it was not as clever as Shrek, I thought it was rather good. In a movie theatre surrounded by children who were on spring break, there was not a sound so I know the children all liked it. There parents also seemed engaged. The death and apparent death of characters brought about the appropriate gasps and comments. Hopefully people realize this movie was made for kids. As such, it was successful although I liked it too. Personally I liked the Scrat!!')\n", - "(1, \"This show has been my escape from reality for the past ten years. I will sadly miss it. Although Atlantis has filled the hole a small bit.

The last ever episode of SG1(on television anyway)was beautifully done. Robert wrote something that felt close to reality. As though he was trying to explain what it was like on the set of the show. (Everyone working closely together for such a long time there are bound to up's and downs. But over the years they've turned into a family). I thought this was a wonderful way to end despite anyone else's criticisms.

SG1 was something special and time and time again it took me across thresholds of disbelief and amazement. The wonderful characters, stories, directors, writers. From episode one I was hooked. The blend of action, science, drama and especially comedy worked so well that made me keep wanting more.

There are no real words in which to completely express what this show meant to me. I can only thank those who kept the show so fresh and entertaining for so many years. It has inspired me to do many things that I thought was impossible.

I look forward to the movies next year and I really hope there will be a number of them. I never want the show to die.

Stargate SG1 - 1997 - 2007?\")\n", - "(1, 'For a long time it seemed like all the good Canadian actors had headed south of the border and (I guessed) all the second rank ones filled the top slots and that left the dregs for the sex comedies.

This film was a real surprise: despite the outlandish plots that are typical of farces, the actors seemed to be trying to put something into their characters and what we, the viewer, got back was almost true suspension of belief. When the extras from the music video attacked the evicting police, you almost believed it was possible.

If you are a fan of some of the better sex farces (Canadian or not) you should definitely seek this one out. And the big surprise, this sex farce is also loaded with some very good nudity.')\n", - "(1, 'Thank God this wasn\\'t based on a true story, because what a story it is. Populated by despicable characters whose depravity knows no bounds, Before The Devil is a mesmerizing, jaw-dropping excursion into perversion which would be laughable (and sometimes is, even with - or perhaps because of - the sickeningly tragic undercurrent of human dysfunction throughout) if it weren\\'t carried out with such magnificent, overwhelming conviction by its stars. The excellent script by Kelly Masterson and superb direction by none other than Sidney Lumet doesn\\'t hurt either.

The main dysfunction here is of a family nature, with the two majorly screwed up brothers (brilliant portrayals from Philip Seymour Hoffman and Ethan Hawke) deciding to rob their own parents\\' jewelry store, an attempt that goes pathetically awry.

The story is told with time-shifts (which are noted on screen, such as: \"Charlie: Two Days Before The Robbery\", so no one should be confused); some people have said they didn\\'t like this device but I thought it worked perfectly, adding to the skeweredness of the whole affair, considering that the two brothers in question are hardly playing with full decks - between them you couldn\\'t make a decent poker hand to save your life. Throw in these cheesy extra tidbits: one of the brothers is a drug addict, married to Gina (Marisa Tomei, also excellent), who is having an affair with the other brother, toss in some monumental sibling rivalry, along with the fact that said drug addict brother hates his father (a wrenching performance from Albert Finney), who has apparently caused him serious past pain, and you\\'ve got a Shakespearean/Greek tragedy on your hands. Proceed with caution.')\n" - ] - } - ], - "source": [ - "root_dir = tempdir.name + '/' + 'imdb_dataset'\n", - "train_iter = IMDBDataset(root_dir=root_dir, train=True) # For training data\n", - "test_iter = IMDBDataset(root_dir=root_dir, train=False) # For test data\n", - "\n", - "start=train_iter.pos_inx\n", - "for i in range(-10,10):\n", - " print(train_iter[start+i])" - ] - }, - { - "cell_type": "markdown", - "id": "1a6c647b-b8ab-49df-8434-becaa0dea775", - "metadata": {}, - "source": [ - "The following code defines the mapping of numeric labels to positive and negative reviews.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "3fca543a-ffc0-4079-bc30-7e3e05576623", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'positive review'" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imdb_label = {0: \" negative review\", 1: \"positive review\"}\n", - "imdb_label[1]" - ] - }, - { - "cell_type": "markdown", - "id": "ac4da5eb-a679-45d2-90bd-cfde475e0f8e", - "metadata": {}, - "source": [ - "The following code 
checks to ensure that there are exactly two classes in the train data set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "e78c80a0-0dee-4236-a382-dfe9f75a8fea", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "num_class = len(set([label for (label, text) in train_iter]))\n", - "num_class" - ] - }, - { - "cell_type": "markdown", - "id": "9c35ca39-77a1-40d9-9f16-fddb3259fd64", - "metadata": {}, - "source": [ - "The following code loads a basic English tokenizer and defines a function called ```yield_tokens``` that uses the tokenizer to break down text data yielded by an iterator into tokens.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "6a41ea50-7e8b-423b-9de8-36a88e97e96b", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer = get_tokenizer(\"basic_english\")\n", - "\n", - "def yield_tokens(data_iter):\n", - " \"\"\"Yield tokens for each data sample.\"\"\"\n", - " for _, text in data_iter:\n", - " yield tokenizer(text)" - ] - }, - { - "cell_type": "markdown", - "id": "e1853ae6-c596-4fac-a2f4-185c0a354510", - "metadata": {}, - "source": [ - " The following code loads a pretrained word embedding model called GloVe into a variable called `glove_embedding`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "0cbe5e7e-13d0-4d64-8c0c-86b74954c9b2", - "metadata": {}, - "outputs": [], - "source": [ - "# Note that GloVe embeddings are typically downloaded using:\n", - "#glove_embedding = GloVe(name=\"6B\", dim=100)\n", - "# However, the GloVe server is frequently down. 
The code below offers a workaround\n", - "\n", - "\n", - "class GloVe_override(Vectors):\n", - " url = {\n", - " \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", - " }\n", - "\n", - " def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", - " url = self.url[name]\n", - " name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", - " #name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", - " super(GloVe_override, self).__init__(name, url=url, **kwargs)\n", - "\n", - "class GloVe_override2(Vectors):\n", - " url = {\n", - " \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", - " }\n", - "\n", - " def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", - " url = self.url[name]\n", - " #name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", - " name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", - " super(GloVe_override2, self).__init__(name, url=url, **kwargs)\n", - "\n", - "try:\n", - " glove_embedding = GloVe_override(name=\"6B\", dim=100)\n", - "except:\n", - " try:\n", - " glove_embedding = GloVe_override2(name=\"6B\", dim=100)\n", - " except:\n", - " glove_embedding = GloVe(name=\"6B\", dim=100)" - ] - }, - { - "cell_type": "markdown", - "id": "e81eda58-67d1-4414-a91a-c238a1434729", - "metadata": {}, - "source": [ - "The following code builds a vocabulary object from a pretrained GloVe word embedding model and sets the default index to the <unk> token.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "5a3dbfda-7d7f-4256-937c-af441df1ee1b", - "metadata": {}, - "outputs": [], - "source": [ - "from torchtext.vocab import GloVe,vocab\n", - "# Build vocab from glove_vectors\n", - "vocab = vocab(glove_embedding .stoi, 0,specials=('<unk>', '<pad>'))\n", - "vocab.set_default_index(vocab[\"<unk>\"])" - ] - }, - { - "cell_type": "markdown", - "id": "e1fc78f6-0d04-4b35-afdd-a6193c66166e", - 
"metadata": {}, - "source": [ - "Let's count the number of words in the vocab:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "bc7de385-2867-4836-a6dc-f87a9f0c38f5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "400002" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "vocab_size=len(vocab)\n", - "vocab_size" - ] - }, - { - "cell_type": "markdown", - "id": "aa4eca48-cfb4-4fb4-8ca3-180319852e0d", - "metadata": {}, - "source": [ - "Let's test the ```vocab``` function:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "59689af6-a6ed-49eb-a12e-51303c1eb9da", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[20]" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "vocab(['he'])" - ] - }, - { - "cell_type": "markdown", - "id": "6ca4130b-f548-4b92-a68f-b2d83b6b19a5", - "metadata": {}, - "source": [ - "### Data set splits\n", - "\n", - "The following converts the data set into map-style data sets and then performs a random split to create separate training and validation data sets. The training data set will contain 95% of the samples in the original training set, while the validation data set will contain the remaining 5%. These data sets can be used for training and evaluating a machine learning model for text classification on the IMDB data set. 
The final performance of the model will be evaluated on the hold-out test set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "be0403dd-b319-406e-96ea-d215c0a42f6c", - "metadata": {}, - "outputs": [], - "source": [ - "# Convert the training and testing iterators to map-style datasets.\n", - "train_dataset = to_map_style_dataset(train_iter)\n", - "test_dataset = to_map_style_dataset(test_iter)\n", - "\n", - "# Determine the number of samples to be used for training and validation (5% for validation).\n", - "num_train = int(len(train_dataset) * 0.95)\n", - "\n", - "# Randomly split the training dataset into training and validation datasets using `random_split`.\n", - "# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n", - "split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])" - ] - }, - { - "cell_type": "markdown", - "id": "82b6143c-d493-49b1-a255-b5aae8f9280a", - "metadata": {}, - "source": [ - "Be aware that the Skills Network currently does not offer GPU access to learners. As a result, training on the full data set could be time-consuming. To address this, you further reduce the size of the training set. This approach helps you mimic the training process as if a GPU were available. 
However, if you want to train using the full IMDB data set, you must either comment out or remove the two lines in the following code block.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "0b86c021-06f7-463d-95d7-c83f45de8605", - "metadata": {}, - "outputs": [], - "source": [ - "num_train = int(len(train_dataset) * 0.05)\n", - "split_train_, _ = random_split(split_train_, [num_train, len(split_train_) - num_train])" - ] - }, - { - "cell_type": "markdown", - "id": "e2e0359b-b92d-4aae-9916-aaadbd206ddb", - "metadata": {}, - "source": [ - "The following code checks to see if a CUDA-compatible GPU is available in the system using PyTorch, a popular deep learning framework. If a GPU is available, it assigns the device variable to \"cuda\" (which stands for CUDA, the parallel computing platform and application programming interface model developed by NVIDIA). If a GPU is not available, it assigns the device variable to \"cpu\" (which means the code will run on the CPU instead).\n" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "3e37cbbd-1c52-4b02-b26c-c7f4b694e944", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "device(type='cuda')" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "device" - ] - }, - { - "cell_type": "markdown", - "id": "9c0631e5-5eb9-4e91-a84f-315381189cb9", - "metadata": {}, - "source": [ - "### Data loader\n", - "\n", - "The following code prepares the text processing pipeline with the tokenizer and vocabulary. 
The text pipeline is used to process the raw data strings from the data set iterators.\n", - "\n", - "The function **```text_pipeline```** first tokenizes the input text, then **```vocab```** is applied to get the token indices.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "6b1807bd-4548-435d-b763-cc1641c95c45", - "metadata": {}, - "outputs": [], - "source": [ - "def text_pipeline(x):\n", - " return vocab(tokenizer(x))" - ] - }, - { - "cell_type": "markdown", - "id": "58976c04-a071-4f9f-8e55-b0207f39a7d1", - "metadata": {}, - "source": [ - "In PyTorch, the **`collate_fn`** argument of a data loader customizes how batches are created from individual samples. The following code defines a `collate_batch` function for this purpose. It processes a batch of labels and text sequences, applies the `text_pipeline` function to each text, converts the results into PyTorch tensors, and pads the sequences to a common length with `pad_sequence`. It returns a tuple containing the label tensor and the padded text tensor. 
The function also ensures that the returned tensors are moved to the specified device (for example, GPU) for efficient computation.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "442be80e-ae39-4001-98f7-0a0c5266bb65", - "metadata": {}, - "outputs": [], - "source": [ - "from torch.nn.utils.rnn import pad_sequence\n", - "\n", - "def collate_batch(batch):\n", - " label_list, text_list = [], []\n", - " for _label, _text in batch:\n", - "\n", - " label_list.append(_label)\n", - " text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))\n", - "\n", - " label_list = torch.tensor(label_list, dtype=torch.int64)\n", - " text_list = pad_sequence(text_list, batch_first=True)\n", - "\n", - " return label_list.to(device), text_list.to(device)" - ] - }, - { - "cell_type": "markdown", - "id": "fb6f5466-b4b7-4772-9f09-9836249d4279", - "metadata": {}, - "source": [ - "You can convert the data set objects to data loaders by applying the `collate` function.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "95d54253-4781-4607-9f60-f4727117a520", - "metadata": {}, - "outputs": [], - "source": [ - "BATCH_SIZE = 32\n", - "\n", - "train_dataloader = DataLoader(\n", - " split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", - ")\n", - "valid_dataloader = DataLoader(\n", - " split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", - ")\n", - "test_dataloader = DataLoader(\n", - " test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "3cc82919-3aa8-4f02-9be6-696a71335672", - "metadata": {}, - "source": [ - "Let's check to see what these data loaders generate.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "0e6a5049-2b5e-4661-ab9a-30b6072da1bb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(tensor([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 
0, 0, 1, 1, 0,\n", - " 1, 0, 1, 1, 0, 1, 0, 0], device='cuda:0'),\n", - " tensor([[ 0, 3542, 0, ..., 0, 0, 0],\n", - " [ 40, 663, 838, ..., 0, 0, 0],\n", - " [ 39, 16, 2, ..., 0, 0, 0],\n", - " ...,\n", - " [16307, 0, 59, ..., 0, 0, 0],\n", - " [ 9, 12389, 1608, ..., 0, 0, 0],\n", - " [ 193, 44724, 144, ..., 0, 0, 0]], device='cuda:0'))" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "label,seqence=next(iter(valid_dataloader))\n", - "label,seqence" - ] - }, - { - "cell_type": "markdown", - "id": "1f0d739f-47a7-44e6-b61d-6ce1eecda895", - "metadata": {}, - "source": [ - "### Neural network\n", - "\n", - "This code defines a class called Net that represents a text classifier based on a PyTorch TransformerEncoder.\n", - "The constructor takes the following arguments:\n", - "\n", - "- `num_class`: The number of classes to classify\n", - "- `vocab_size`: The size of the vocabulary\n", - "- `freeze`: Whether to freeze the embedding layer\n", - "- `nhead`: The number of heads in the transformer encoder\n", - "- `dim_feedforward`: The dimension of the feedforward layer in the transformer encoder\n", - "- `num_layers`: The number of transformer encoder layers\n", - "- `dropout`: The dropout rate\n", - "- `activation`: The activation function to use in the transformer encoder\n", - "- `classifier_dropout`: The dropout rate for the classifier\n", - "\n", - "**Attributes:**\n", - "\n", - "- `emb`: An embedding layer that maps each word in the vocabulary to a dense vector representation\n", - "- `pos_encoder`: A positional encoding layer that adds positional information to the word vectors\n", - "- `transformer_encoder`: A transformer encoder layer that processes the sequence of word vectors and extracts high-level features\n", - "- `classifier`: A linear layer that maps the output of the transformer encoder to the desired number of classes\n", - "\n", - "---\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - 
"execution_count": 23, - "id": "7f290742-5806-47d8-8533-f3d92709eba9", - "metadata": {}, - "outputs": [], - "source": [ - "class Net(nn.Module):\n", - " \"\"\"\n", - " Text classifier based on a pytorch TransformerEncoder.\n", - " \"\"\"\n", - " def __init__(\n", - "\n", - " self,\n", - " num_class,vocab_size,\n", - " freeze=True,\n", - " nhead=2,\n", - " dim_feedforward=128,\n", - " num_layers=2,\n", - " dropout=0.1,\n", - " activation=\"relu\",\n", - " classifier_dropout=0.1):\n", - "\n", - " super().__init__()\n", - "\n", - " #self.emb = embedding=nn.Embedding.from_pretrained(glove_embedding.vectors,freeze=freeze)\n", - " self.emb = nn.Embedding.from_pretrained(glove_embedding.vectors,freeze=freeze)\n", - " embedding_dim = self.emb.embedding_dim\n", - "\n", - "\n", - " self.pos_encoder = PositionalEncoding(\n", - " d_model=embedding_dim,\n", - " dropout=dropout,\n", - " vocab_size=vocab_size,\n", - " )\n", - "\n", - " encoder_layer = nn.TransformerEncoderLayer(\n", - " d_model=embedding_dim,\n", - " nhead=nhead,\n", - " dim_feedforward=dim_feedforward,\n", - " dropout=dropout,\n", - " )\n", - " self.transformer_encoder = nn.TransformerEncoder(\n", - " encoder_layer,\n", - " num_layers=num_layers,\n", - " )\n", - " self.classifier = nn.Linear(embedding_dim, num_class)\n", - " self.d_model = embedding_dim\n", - "\n", - " def forward(self, x):\n", - " x = self.emb(x) * math.sqrt(self.d_model)\n", - " x = self.pos_encoder(x)\n", - " x = self.transformer_encoder(x)\n", - " x = x.mean(dim=1)\n", - " x = self.classifier(x)\n", - "\n", - " return x" - ] - }, - { - "cell_type": "markdown", - "id": "68664b81-817c-4633-bc71-9b32f95ee7f3", - "metadata": {}, - "source": [ - "The model can then be trained on labeled data from the IMDB data set with two classes.\n", - "Let's create the model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "545ebebe-2206-47ce-81ab-9f168a80e4bd", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - 
"Net(\n", - " (emb): Embedding(400000, 100)\n", - " (pos_encoder): PositionalEncoding(\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (transformer_encoder): TransformerEncoder(\n", - " (layers): ModuleList(\n", - " (0-1): 2 x TransformerEncoderLayer(\n", - " (self_attn): MultiheadAttention(\n", - " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", - " )\n", - " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", - " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (dropout1): Dropout(p=0.1, inplace=False)\n", - " (dropout2): Dropout(p=0.1, inplace=False)\n", - " )\n", - " )\n", - " )\n", - " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", - ")" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "model = Net(num_class=2,vocab_size=vocab_size).to(device)\n", - "model" - ] - }, - { - "cell_type": "markdown", - "id": "18ece646-3872-41e1-bedb-be1c7d8a37cc", - "metadata": {}, - "source": [ - "The following **`predict`** function takes in a text, a text pipeline, and a model as inputs. 
It uses a pretrained model passed as a parameter to predict the label of the text for text classification on the IMDB data set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "4cde3566-d2df-40b8-8fd1-944018d4b0ab", - "metadata": {}, - "outputs": [], - "source": [ - "def predict(text, text_pipeline, model):\n", - " with torch.no_grad():\n", - " text = torch.unsqueeze(torch.tensor(text_pipeline(text)),0).to(device)\n", - " model.to(device)\n", - " output = model(text)\n", - " return imdb_label[output.argmax(1).item()]" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "c7388103-5814-402d-a7de-ede861dc0927", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "' negative review'" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "predict(\"I like sports and stuff\", text_pipeline, model)" - ] - }, - { - "cell_type": "markdown", - "id": "40449174-d493-4521-8417-7950dafb0072", - "metadata": {}, - "source": [ - "You can create a function to evaluate the model's accuracy on a data set. 
Here, you define two nearly identical evaluation functions, one that provides a `tqdm` progress bar, and one that does not.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "1171efb1-cf5c-4a25-a0e0-91feb88e8ab7", - "metadata": {}, - "outputs": [], - "source": [ - "def evaluate(dataloader, model_eval):\n", - " model_eval.eval()\n", - " total_acc, total_count= 0, 0\n", - "\n", - " with torch.no_grad():\n", - " for label, text in tqdm(dataloader):\n", - " label, text = label.to(device), text.to(device)\n", - " output = model_eval(text)\n", - " predicted = torch.max(output.data, 1)[1]\n", - " total_acc += (predicted == label).sum().item()\n", - " total_count += label.size(0)\n", - " return total_acc / total_count" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "571400ae-320b-4d1a-a103-91592812d1c1", - "metadata": {}, - "outputs": [], - "source": [ - "def evaluate_no_tqdm(dataloader, model_eval):\n", - " model_eval.eval()\n", - " total_acc, total_count= 0, 0\n", - "\n", - " with torch.no_grad():\n", - " for label, text in dataloader:\n", - " label, text = label.to(device), text.to(device)\n", - " output = model_eval(text)\n", - " predicted = torch.max(output.data, 1)[1]\n", - " total_acc += (predicted == label).sum().item()\n", - " total_count += label.size(0)\n", - " return total_acc / total_count" - ] - }, - { - "cell_type": "markdown", - "id": "0092f33b-ac39-41d2-8525-443d2759b55e", - "metadata": {}, - "source": [ - "The following code evaluates the performance of your model. Note that this can take approximately 4 minutes on a CPU. **For efficiency, let's not run this cell now, but trust us that the performance of the untrained model is no better than average. 
If you wish to confirm this for yourself, you are free to uncomment this cell**:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "6b48b713-72a9-4267-b32e-dfab588c659c", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|█████████████████████████████████████████| 782/782 [00:09<00:00, 82.88it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.5" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader, model)" - ] - }, - { - "cell_type": "markdown", - "id": "b855a1d8-5e52-4fa2-a507-d7383b2ea73d", - "metadata": {}, - "source": [ - "Note that the current performance of the model is no better than random guessing. This outcome is expected, considering that the model has not undergone any training yet.\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "8a28c3ce-3828-4b70-8442-ae9b7757805c", - "metadata": {}, - "source": [ - "# Training\n", - "\n", - "The following code defines the training function used to train your model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "376d4c30-7f50-4c97-9fa2-667f3c1a47eb", - "metadata": {}, - "outputs": [], - "source": [ - "def train_model(model, optimizer, criterion, train_dataloader, valid_dataloader, epochs=1000, save_dir=\"\", file_name=None):\n", - " cum_loss_list = []\n", - " acc_epoch = []\n", - " acc_old = 0\n", - " model_path = os.path.join(save_dir, file_name)\n", - " acc_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + \"_acc\")\n", - " loss_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + \"_loss\")\n", - " time_start = time.time()\n", - "\n", - " for epoch in tqdm(range(1, epochs + 1)):\n", - " model.train()\n", - " #print(model)\n", - " #for parm in model.parameters():\n", - " # print(parm.requires_grad)\n", - " \n", - " cum_loss = 0\n", - " for idx, (label, text) in enumerate(train_dataloader):\n", 
- " optimizer.zero_grad()\n", - " label, text = label.to(device), text.to(device)\n", - "\n", - " predicted_label = model(text)\n", - " loss = criterion(predicted_label, label)\n", - " loss.backward()\n", - " #print(loss)\n", - " torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n", - " optimizer.step()\n", - " cum_loss += loss.item()\n", - " print(f\"Epoch {epoch}/{epochs} - Loss: {cum_loss}\")\n", - "\n", - " cum_loss_list.append(cum_loss)\n", - " accu_val = evaluate_no_tqdm(valid_dataloader,model)\n", - " acc_epoch.append(accu_val)\n", - "\n", - " if model_path and accu_val > acc_old:\n", - " print(accu_val)\n", - " acc_old = accu_val\n", - " if save_dir is not None:\n", - " #pass\n", - " print(\"save model epoch\",epoch)\n", - " torch.save(model.state_dict(), model_path)\n", - " save_list_to_file(lst=acc_epoch, filename=acc_dir)\n", - " save_list_to_file(lst=cum_loss_list, filename=loss_dir)\n", - "\n", - " time_end = time.time()\n", - " print(f\"Training time: {time_end - time_start}\")" - ] - }, - { - "cell_type": "markdown", - "id": "353d2098-4d5d-4a06-aa6a-c8866a6ecb0f", - "metadata": {}, - "source": [ - "### Train IMDB\n", - "\n", - "The following code sets the learning rate (LR) to 1, which determines the step size at which the optimizer updates the model's parameters during training. The CrossEntropyLoss criterion is used to calculate the loss between the model's predicted outputs and the ground truth labels. This loss function is commonly employed for multiclass classification tasks.\n", - "\n", - "The chosen optimizer is Stochastic Gradient Descent (SGD), which optimizes the model's parameters based on the computed gradients with respect to the loss function. The SGD optimizer uses the specified learning rate to control the size of the weight updates.\n", - "\n", - "Additionally, a learning rate scheduler is defined using StepLR. 
This scheduler adjusts the learning rate during training, reducing it by a factor (gamma) of 0.1 after every epoch (step) to improve convergence and fine-tune the model's performance. These components together form the essential setup for training a neural network using the specified learning rate, loss criterion, optimizer, and learning rate scheduler.\n", - "\n", - "For the sake of time efficiency, **the following lines are commented out and the model is not actually trained**. If you would like to get a glimpse of what training would look like, uncomment the following code block to train the model for 2 epochs. If you were to train this model in a real-world scenario, you would likely increase the number of epochs to a larger figure, such as 100 or more. Given the reduced training set defined earlier, it takes approximately 2 minutes to complete 2 epochs of training.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "03ab8dc6-6dda-44fa-b86c-1ad67e651463", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 0/2 [00:00" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/sybqacL5p1qeEO8d4xRZNg/model-IMDB%20dataset%20small2-acc')\n", - "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eOt6woGoaOB565T0RLH5WA/model-IMDB%20dataset%20small2-loss')\n", - "acc_epoch = pickle.load(acc_urlopened)\n", - "cum_loss_list = pickle.load(loss_urlopened)\n", - "plot(cum_loss_list,acc_epoch)" - ] - }, - { - "cell_type": "markdown", - "id": "37fe982b-b1a7-4d31-a5fc-21481d97fa4e", - "metadata": {}, - "source": [ - "The following code loads your pretrained model and evaluates its performance on the test set. **For efficiency, let's not run the evaluation because it can take approximately 4 minutes to run. 
Instead, report the result underneath the cell. If you would like to confirm the result for yourself, you are free to uncomment the last line in the following code block.**\n" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "id": "ca827b34-bd27-413f-ba76-8fde3d73c570", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/q66IH6a7lglkZ4haM6hB1w/model-IMDB%20dataset%20small2.pth')\n", - "model_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", - "model_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "id": "f7421582-5341-4554-bb43-93eaeec31de6", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|█████████████████████████████████████████| 782/782 [00:10<00:00, 71.81it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.83208" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader, model_)" - ] - }, - { - "cell_type": "markdown", - "id": "a91438ab-0a67-47b4-908c-2114cedb29e2", - "metadata": {}, - "source": [ - "As you can see, the pretrained model achieved an accuracy of approximately 83% on the test data.\n" - ] - }, - { - "cell_type": "markdown", - "id": "d8f01fc5-871b-4978-a2dd-724efa504014", - "metadata": {}, - "source": [ - "### Fine-tune a model pretrained on the AG News data set\n", - "\n", - "Rather than training a model on the IMDB data set as you did earlier, you can fine-tune a model that has been pretrained on the AG News data set, which is a collection of news articles. 
The goal of the AG News data set is to categorize news articles into one of four categories: Sports, Business, Sci/tech, or World. You’ll start training a model from scratch on the AG News data set. To save time, you can do this in just one cell. Also, for efficiency, ** comment out the training bit**. If you want to train the model for 2 epochs on a smaller data set to demonstrate what the training process would look like, uncomment the part that says `### Uncomment to Train ###` before running the cell. Training for 2 epochs on the reduced data set can take approximately 3 minutes.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "id": "4db8b115-13e1-4d42-99e0-fd09dc0527c6", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Net(\n", - " (emb): Embedding(400000, 100)\n", - " (pos_encoder): PositionalEncoding(\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (transformer_encoder): TransformerEncoder(\n", - " (layers): ModuleList(\n", - " (0-1): 2 x TransformerEncoderLayer(\n", - " (self_attn): MultiheadAttention(\n", - " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", - " )\n", - " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", - " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (dropout1): Dropout(p=0.1, inplace=False)\n", - " (dropout2): Dropout(p=0.1, inplace=False)\n", - " )\n", - " )\n", - " )\n", - " (classifier): Linear(in_features=100, out_features=4, bias=True)\n", - ")" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "train_iter_ag_news = AG_NEWS(split=\"train\")\n", - "\n", - "num_class_ag_news = len(set([label for (label, text) in train_iter_ag_news ]))\n", - 
"num_class_ag_news\n", - "\n", - "# Split the dataset into training and testing iterators.\n", - "train_iter_ag_news, test_iter_ag_news = AG_NEWS()\n", - "\n", - "# Convert the training and testing iterators to map-style datasets.\n", - "train_dataset_ag_news = to_map_style_dataset(train_iter_ag_news)\n", - "test_dataset_ag_news = to_map_style_dataset(test_iter_ag_news)\n", - "\n", - "# Determine the number of samples to be used for training and validation (5% for validation).\n", - "num_train_ag_news = int(len(train_dataset_ag_news) * 0.95)\n", - "\n", - "# Randomly split the training dataset into training and validation datasets using `random_split`.\n", - "# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n", - "split_train_ag_news_, split_valid_ag_news_ = random_split(train_dataset_ag_news, [num_train_ag_news, len(train_dataset_ag_news) - num_train_ag_news])\n", - "\n", - "# Make the training set smaller to allow it to run fast as an example.\n", - "# IF YOU WANT TO TRAIN ON THE AG_NEWS DATASET, COMMENT OUT THE 2 LINEs BELOW.\n", - "# HOWEVER, NOTE THAT TRAINING WILL TAKE A LONG TIME\n", - "num_train_ag_news = int(len(train_dataset_ag_news) * 0.05)\n", - "split_train_ag_news_, _ = random_split(split_train_ag_news_, [num_train_ag_news, len(split_train_ag_news_) - num_train_ag_news])\n", - "\n", - "\n", - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "device\n", - "\n", - "def label_pipeline(x):\n", - " return int(x) - 1\n", - "\n", - "from torch.nn.utils.rnn import pad_sequence\n", - "\n", - "def collate_batch_ag_news(batch):\n", - " label_list, text_list = [], []\n", - " for _label, _text in batch:\n", - " label_list.append(label_pipeline(_label))\n", - " text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))\n", - "\n", - "\n", - " label_list = torch.tensor(label_list, dtype=torch.int64)\n", - " text_list = pad_sequence(text_list, 
batch_first=True)\n", - "\n", - "\n", - " return label_list.to(device), text_list.to(device)\n", - "\n", - "BATCH_SIZE = 32\n", - "\n", - "train_dataloader_ag_news = DataLoader(\n", - " split_train_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", - ")\n", - "valid_dataloader_ag_news = DataLoader(\n", - " split_valid_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", - ")\n", - "test_dataloader_ag_news = DataLoader(\n", - " test_dataset_ag_news, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", - ")\n", - "\n", - "\n", - "model_ag_news = Net(num_class=4,vocab_size=vocab_size).to(device)\n", - "model_ag_news.to(device)" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "id": "1ee6601d-69a0-4274-a237-38f981a6153b", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 0/2 [00:00" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bQk8mJu3Uct3I4JEsEtRnw/model-AG%20News%20small1-acc')\n", - "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KNQkqJWWwY_XfbFBRFhZNA/model-AG%20News%20small1-loss')\n", - "acc_epoch = pickle.load(acc_urlopened)\n", - "cum_loss_list = pickle.load(loss_urlopened)\n", - "plot(cum_loss_list,acc_epoch)" - ] - }, - { - "cell_type": "markdown", - "id": "a17edde1-7268-468d-ad53-d1880848efc6", - "metadata": {}, - "source": [ - "The following code loads the pretrained model and evaluates its performance on the AG News test set. **For efficiency, let's not run the evaluation because it can take a few minutes. Instead, claim that the pretrained model works well on the AG News dataset. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the following code block.**\n" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "id": "e98831cf-bdb6-4e9c-9932-8ee7e6897450", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", - "model_ag_news_ = Net(vocab_size=vocab_size, num_class=4).to(device)\n", - "model_ag_news_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "id": "346d1f7a-2331-4010-a979-ae71efb00fcc", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|████████████████████████████████████████| 238/238 [00:00<00:00, 323.29it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.9046052631578947" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader_ag_news, model_ag_news_)" - ] - }, - { - "cell_type": "markdown", - "id": "49ecfc70-585f-4fd5-a2d3-3eb558705103", - "metadata": {}, - "source": [ - "As you can see, the pretrained model worked extremely well on the AG News data set. However, can this model be fine-tuned to perform well on the IMDB data set as well? Let's find out! 
You can begin by loading the pretrained AG News model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "id": "92336349-f915-4471-a3d2-64c90ea31171", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<All keys matched successfully>" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", - "model_fine1 = Net(vocab_size=vocab_size, num_class=4).to(device)\n", - "model_fine1.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))\n" - ] - }, - { - "cell_type": "markdown", - "id": "80e739c4-0b6b-44fa-8d5d-456c52576dd6", - "metadata": {}, - "source": [ - "The IMDB dataset is a binary classification task with only two classes (positive and negative reviews). Therefore, the output layer of the AG News model should be adjusted to have just two output neurons to reflect the binary nature of the IMDB dataset. 
This adjustment is essential for the model to accurately learn and predict the sentiment of movie reviews in the IMDB dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "id": "1190f1bf-a030-43b1-b146-5f27715ce6a4", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Original final layer: Linear(in_features=100, out_features=4, bias=True)\n", - "Input dimension final layer: 100\n" - ] - } - ], - "source": [ - "model_fine1.classifier\n", - "in_features = model_fine1.classifier.in_features\n", - "print(\"Original final layer:\", model_fine1.classifier)\n", - "print(\"Input dimension final layer:\", in_features)" - ] - }, - { - "cell_type": "markdown", - "id": "682f6a52-d8d8-4ed7-9162-1bf74dee5542", - "metadata": {}, - "source": [ - "You can replace the final layer with one that has two outputs to match the two-class problem.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "id": "46da3876-5b30-4174-bcf1-2702836f3d6d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Net(\n", - " (emb): Embedding(400000, 100)\n", - " (pos_encoder): PositionalEncoding(\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (transformer_encoder): TransformerEncoder(\n", - " (layers): ModuleList(\n", - " (0-1): 2 x TransformerEncoderLayer(\n", - " (self_attn): MultiheadAttention(\n", - " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", - " )\n", - " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", - " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (dropout1): Dropout(p=0.1, inplace=False)\n", - " (dropout2): Dropout(p=0.1, inplace=False)\n", - " )\n", - " )\n", - " )\n", - " (classifier): Linear(in_features=100, out_features=2, 
bias=True)\n", - ")" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_fine1.classifier = nn.Linear(in_features, 2)\n", - "model_fine1.to(device)" - ] - }, - { - "cell_type": "markdown", - "id": "672f77ee-1134-4656-9e85-c158cb4a64ea", - "metadata": {}, - "source": [ - "The following code shows the layers that are frozen (`requires_grad == False`) and unfrozen (`requires_grad == True`) in the model. The unfrozen layers will have their weights updated during fine-tuning.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "id": "9afe1e34-b1e3-42a8-b67d-5e99e761dbab", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "emb.weight requires_grad: False\n", - "transformer_encoder.layers.0.self_attn.in_proj_weight requires_grad: True\n", - "transformer_encoder.layers.0.self_attn.in_proj_bias requires_grad: True\n", - "transformer_encoder.layers.0.self_attn.out_proj.weight requires_grad: True\n", - "transformer_encoder.layers.0.self_attn.out_proj.bias requires_grad: True\n", - "transformer_encoder.layers.0.linear1.weight requires_grad: True\n", - "transformer_encoder.layers.0.linear1.bias requires_grad: True\n", - "transformer_encoder.layers.0.linear2.weight requires_grad: True\n", - "transformer_encoder.layers.0.linear2.bias requires_grad: True\n", - "transformer_encoder.layers.0.norm1.weight requires_grad: True\n", - "transformer_encoder.layers.0.norm1.bias requires_grad: True\n", - "transformer_encoder.layers.0.norm2.weight requires_grad: True\n", - "transformer_encoder.layers.0.norm2.bias requires_grad: True\n", - "transformer_encoder.layers.1.self_attn.in_proj_weight requires_grad: True\n", - "transformer_encoder.layers.1.self_attn.in_proj_bias requires_grad: True\n", - "transformer_encoder.layers.1.self_attn.out_proj.weight requires_grad: True\n", - "transformer_encoder.layers.1.self_attn.out_proj.bias requires_grad: True\n", - 
"transformer_encoder.layers.1.linear1.weight requires_grad: True\n", - "transformer_encoder.layers.1.linear1.bias requires_grad: True\n", - "transformer_encoder.layers.1.linear2.weight requires_grad: True\n", - "transformer_encoder.layers.1.linear2.bias requires_grad: True\n", - "transformer_encoder.layers.1.norm1.weight requires_grad: True\n", - "transformer_encoder.layers.1.norm1.bias requires_grad: True\n", - "transformer_encoder.layers.1.norm2.weight requires_grad: True\n", - "transformer_encoder.layers.1.norm2.bias requires_grad: True\n", - "classifier.weight requires_grad: True\n", - "classifier.bias requires_grad: True\n" - ] - } - ], - "source": [ - "for name, param in model_fine1.named_parameters():\n", - " print(f\"{name} requires_grad: {param.requires_grad}\")" - ] - }, - { - "cell_type": "markdown", - "id": "2f1597ed-a727-42d5-bee1-7aaabd6c7681", - "metadata": {}, - "source": [ - "The following code block simulates fine-tuning on the shortened training set for just 2 epochs. **For the sake of time efficiency, this code block has been commented out**. 
If you want to see what training looks like, uncomment the following code block, but remember that this code could take approximately 2 minutes to run.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "id": "3d5c5154-4b9d-4360-83d5-fc21a6ba930b", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 0/2 [00:00" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/3LEJw8BRgJJFGqlLxaETxA/model-fine1-acc')\n", - "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/-CT1h97vjv0TolY82Nw29g/model-fine1-loss')\n", - "acc_epoch = pickle.load(acc_urlopened)\n", - "cum_loss_list = pickle.load(loss_urlopened)\n", - "plot(cum_loss_list,acc_epoch)" - ] - }, - { - "cell_type": "markdown", - "id": "85866d1f-1a0d-487c-a426-9da326a06c1f", - "metadata": {}, - "source": [ - "The following code loads a model that was already fine-tuned for 100 epochs on the full IMDB training set and evaluates its performance on the IMDB test set. **For the sake of efficiency, let's not run the evaluation because it can take a few minutes. Instead, the result is reported underneath the cell. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the code block.**\n" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "id": "cda6a606-1a87-4846-9926-54edc577879a", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/e0WOHKh5dnrbC2lGhpsMMw/model-fine1.pth')\n", - "model_fine1_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", - "model_fine1_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "id": "e75cfe66-a265-4e7a-8d5f-a7c68f17c6e3", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|█████████████████████████████████████████| 782/782 [00:10<00:00, 77.06it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.8604" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader, model_fine1_)" - ] - }, - { - "cell_type": "markdown", - "id": "c9b127d6-b41d-4195-ac5a-3da2d8d70b33", - "metadata": {}, - "source": [ - "The fine-tuned model reaches an accuracy of 86% on the test data, which is higher than the 83% achieved by the model trained from scratch on the IMDB dataset. Fine-tuning was as time-intensive as training the model from scratch, but the higher accuracy shows the benefit of starting from the pretrained weights. Much of the computational effort was devoted to updating the transformer layers. 
To expedite the training process, one viable strategy is to focus on training the final layer only, which can significantly reduce the computational load but might compromise the model's accuracy.\n" - ] - }, - { - "cell_type": "markdown", - "id": "55331d63-1150-465b-b0bc-d13bbd24fb7c", - "metadata": {}, - "source": [ - "### Fine-tune the final layer only\n", - "\n", - "Fine-tuning the final output layer of a neural network is similar to fine-tuning the whole model. You can begin by loading the pretrained model that you would like to fine-tune. In this case, it is the same model pretrained on the AG News data set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 59, - "id": "c2ffcf34-695f-4b9d-b6c2-5529a74d568a", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 59, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", - "model_fine2 = Net(vocab_size=vocab_size, num_class=4).to(device)\n", - "model_fine2.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "markdown", - "id": "5fa5ba41-8d73-4649-b459-cbc321ca26e2", - "metadata": {}, - "source": [ - "Now, the key difference. You iterate through all of the parameters in the `model_fine2` model and set the `requires_grad` attribute of each parameter to `False`. 
This effectively freezes all of the layers in the model, meaning that their weights are not updated during training.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "id": "000978e5-6244-4be2-8fd5-3e0c7ed1c619", - "metadata": {}, - "outputs": [], - "source": [ - "# Freeze all layers in the model\n", - "for param in model_fine2.parameters():\n", - " param.requires_grad = False" - ] - }, - { - "cell_type": "markdown", - "id": "54b17c25-96f4-43b9-b0a8-963085cf5638", - "metadata": {}, - "source": [ - "Replace the final layer to reflect the fact that you are solving a two-class problem. Note that the new layer is unfrozen, because the parameters of a newly created `nn.Linear` layer have `requires_grad` set to `True` by default.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 61, - "id": "f0c29db4-3051-4247-8459-b7adab12fa45", - "metadata": {}, - "outputs": [], - "source": [ - "dim=model_fine2.classifier.in_features" - ] - }, - { - "cell_type": "code", - "execution_count": 62, - "id": "b79d7ed1-7a83-4140-ad93-b5e28cdab784", - "metadata": {}, - "outputs": [], - "source": [ - "model_fine2.classifier = nn.Linear(dim, 2)" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "id": "9387fde1-0a71-45b5-89cf-a2c90ff663d1", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Net(\n", - " (emb): Embedding(400000, 100)\n", - " (pos_encoder): PositionalEncoding(\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (transformer_encoder): TransformerEncoder(\n", - " (layers): ModuleList(\n", - " (0-1): 2 x TransformerEncoderLayer(\n", - " (self_attn): MultiheadAttention(\n", - " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", - " )\n", - " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", - " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " 
(dropout1): Dropout(p=0.1, inplace=False)\n", - " (dropout2): Dropout(p=0.1, inplace=False)\n", - " )\n", - " )\n", - " )\n", - " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", - ")" - ] - }, - "execution_count": 63, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_fine2.to(device)" - ] - }, - { - "cell_type": "markdown", - "id": "42123afb-6a78-4f98-87dd-e87c2cfa7c4f", - "metadata": {}, - "source": [ - "The following block simulates fine-tuning on the shortened training set for just 2 epochs. **For the sake of time efficiency, this code block has been commented out**. The following code should take less time to train than the full fine-tuning conducted previously because only the final layer is unfrozen.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "id": "3114fa2b-90ee-4955-8459-e01c1db83f2d", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 0/2 [00:00" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/UdR3ApQnxSeV2mrA0CbiLg/model-fine2-acc')\n", - "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/rWGDIF-uL2dEngWcIo9teQ/model-fine2-loss')\n", - "acc_epoch = pickle.load(acc_urlopened)\n", - "cum_loss_list = pickle.load(loss_urlopened)\n", - "plot(cum_loss_list,acc_epoch)" - ] - }, - { - "cell_type": "markdown", - "id": "3043392c-ee76-4103-a758-83e90f594604", - "metadata": {}, - "source": [ - "The following code loads the model whose final layer was already fine-tuned and evaluates its performance on the IMDB test set. **For efficiency, let's not run the evaluation because it can take a few minutes. Instead, the result is reported underneath the cell. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the following code block.**\n" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "id": "14e51cf2-4e3f-4adf-8c58-f65da0e74f65", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 68, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/B-1H6lpDg-A0zRwpB6Ek2g/model-fine2.pth')\n", - "model_fine2_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", - "model_fine2_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "id": "d7cf2dfa-3705-4f3e-a114-b4c56368cc77", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|█████████████████████████████████████████| 782/782 [00:09<00:00, 85.13it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.64144" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader, model_fine2_)" - ] - }, - { - "cell_type": "markdown", - "id": "dbd9b301-88ba-42d4-b8ca-6e00379fb3b4", - "metadata": {}, - "source": [ - "The previous code indicates that although fine-tuning the final layer takes a significantly smaller amount of time than fine-tuning the whole model, the performance of the model with just the last layer unfrozen is significantly worse (64% accuracy) than the fine-tuned model with all layers unfrozen (86% accuracy).\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "d23b9fe8-1655-4c4c-8ee2-f78db934d5f2", - "metadata": {}, - "source": [ - "# Adapters\n", - "FeatureAdapter is a neural network module that introduces a low-dimensional bottleneck in a transformer architecture to allow fine-tuning with fewer parameters. 
It compresses the original high-dimensional embeddings into a lower dimension, applies a non-linear transformation, and then expands it back to the original dimension. This process is followed by a residual\n", - "connection that adds the transformed output back to the original input to preserve information and\n", - "promote gradient flow.\n", - "\n", - "## Benefits of using adapters in neural networks\n", - "\n", - "- **Efficient fine-tuning**: Adapters allow for targeted updates to specific parts of the model, reducing the need to retrain large sections of the network.\n", - "\n", - "- **Parameter efficiency**: By adding only a few parameters, adapters make it feasible to modify large models without substantial computational overhead.\n", - "\n", - "- **Preservation of pretrained features**: Adapters enable the modification of a model while retaining the valuable features learned during extensive pretraining.\n", - "\n", - "- **Modularity and flexibility**: They enhance the modularity of models, allowing easy adaptation to various tasks without altering core architecture.\n", - "\n", - "- **Task-specific adaptation**: Adapters can be tailored to improve performance on particular tasks, optimizing the model’s effectiveness.\n", - "\n", - "- **Transfer learning and domain adaptation**: They facilitate the adaptation of models to new domains, bridging gaps between different data distributions.\n", - "\n", - "- **Continual learning**: Adapters support the model's ability to learn new information continuously without forgetting previous knowledge.\n", - "\n", - "- **Reduced risk of overfitting**: With fewer trainable parameters, adapters help prevent overfitting, especially on smaller data sets.\n", - "\n", - "The following code shows an adapter model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 70, - "id": "c46c86b1-eb04-4976-9fa7-c3fc06339bae", - "metadata": {}, - "outputs": [], - "source": [ - "class FeatureAdapter(nn.Module):\n", - " \"\"\"\n", - 
" Attributes:\n", - " size (int): The bottleneck dimension to which the embeddings are temporarily reduced.\n", - " model_dim (int): The original dimension of the embeddings or features in the transformer model.\n", - " \"\"\"\n", - " def __init__(self, bottleneck_size=50, model_dim=100):\n", - " super().__init__()\n", - " self.bottleneck_transform = nn.Sequential(\n", - " nn.Linear(model_dim, bottleneck_size), # Down-project to a smaller dimension\n", - " nn.ReLU(), # Apply non-linearity\n", - " nn.Linear(bottleneck_size, model_dim) # Up-project back to the original dimension\n", - " )\n", - "\n", - " def forward(self, x):\n", - " \"\"\"\n", - " Forward pass of the FeatureAdapter. Applies the bottleneck transformation to the input\n", - " tensor and adds a skip connection.\n", - "\n", - " Args:\n", - " x (Tensor): Input tensor with shape (batch_size, seq_length, model_dim).\n", - "\n", - " Returns:\n", - " Tensor: Output tensor after applying the adapter transformation and skip connection,\n", - " maintaining the original input shape.\n", - " \"\"\"\n", - " transformed_features = self.bottleneck_transform(x) # Transform features through the bottleneck\n", - " output_with_residual = transformed_features + x # Add the residual connection\n", - " return output_with_residual" - ] - }, - { - "cell_type": "markdown", - "id": "a5c12e9a-5bf1-4a4f-b792-25a264253d28", - "metadata": {}, - "source": [ - "The adapted class wraps this adapter functionality around any specified linear layer, enhancing its output with the non-linearity of a ReLU activation function. 
This setup is particularly useful for experimenting with subtle architectural modifications in deep learning models, facilitating fine-tuning and potentially improving model performance on complex tasks.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "id": "97d59ae0-610e-49bb-a9c4-1020c2dc468d", - "metadata": {}, - "outputs": [], - "source": [ - "class Adapted(nn.Module):\n", - " def __init__(self, linear,bottleneck_size=None):\n", - " super(Adapted, self).__init__()\n", - " self.linear = linear\n", - " model_dim = linear.out_features\n", - " if bottleneck_size is None:\n", - " bottleneck_size = model_dim//2 # Define default bottleneck size as half the model_dim\n", - "\n", - " # Initialize FeatureAdapter with calculated bottleneck_size and model_dim\n", - " self.adaptor = FeatureAdapter(bottleneck_size=bottleneck_size, model_dim=model_dim)\n", - "\n", - " def forward(self, x):\n", - " # First, the input x is passed through the linear layer\n", - " x=self.linear(x)\n", - " # Then it's adapted using FeatureAdapter\n", - " x= self.adaptor(x)\n", - " return x" - ] - }, - { - "cell_type": "markdown", - "id": "d8cfb80d-eb8c-4342-8066-33134c5b456d", - "metadata": {}, - "source": [ - "You load the pretrained transformer model that was trained on the AG News dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 74, - "id": "01070ca9-5037-470e-b8f7-ceddfc2f279b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 74, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", - "model_adapters = Net(vocab_size=vocab_size, num_class=4).to(device)\n", - "model_adapters.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "markdown", - "id": 
"e1df412e-51c5-4e8c-9dbe-c6d38f11a0cf", - "metadata": {}, - "source": [ - "\n", - "First, freeze the parameters of a model named model_adapters to prevent them from being updated during training. Then, retrieve the number of input features for the classifier, and replace the classifier with a new linear layer that outputs to two classes.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "id": "1820db89-b3b0-4cb2-b42f-1826ba1ec6d0", - "metadata": {}, - "outputs": [], - "source": [ - "for param in model_adapters.parameters():\n", - " param.requires_grad = False\n", - "\n", - "dim= model_adapters.classifier.in_features\n", - "\n", - "model_adapters.classifier = nn.Linear(dim, 2)" - ] - }, - { - "cell_type": "markdown", - "id": "905ad98d-2f62-475a-a7e0-a2f7119b5081", - "metadata": {}, - "source": [ - "Let's explore how to apply the adapted object to a linear layer to obtain the first output. You can obtain the unadapted linear layer for the first output by:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "id": "cbdcd100-7fda-45ee-a823-9233b7e68ca6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Linear(in_features=100, out_features=128, bias=True)\n" - ] - } - ], - "source": [ - "my_example_layer=model_adapters.transformer_encoder.layers[0].linear1\n", - "print(my_example_layer)" - ] - }, - { - "cell_type": "markdown", - "id": "1cf2017f-bb26-4b9c-86ea-f32cc5ce4450", - "metadata": {}, - "source": [ - "In the following code, you copy the linear layer and add an adapter layer to it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "id": "83fd78c1-111d-45f6-9adf-3ab5c7264a8b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, 
out_features=64, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=64, out_features=128, bias=True)\n", - " )\n", - " )\n", - ")\n" - ] - } - ], - "source": [ - "my_adapeted_layer=Adapted(my_example_layer)\n", - "print(my_adapeted_layer)" - ] - }, - { - "cell_type": "markdown", - "id": "50e35b98-bcbd-4aec-b039-464e87ddd6fb", - "metadata": {}, - "source": [ - "You can print the adapted layer and show that the new layers have their requires_grad attribute set to True, indicating that these layers will be updated during training.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 78, - "id": "d23c9515-4303-437d-8f02-99a3723e8b89", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "False\n", - "False\n", - "True\n", - "True\n", - "True\n", - "True\n" - ] - } - ], - "source": [ - "for parm in my_adapeted_layer.parameters():\n", - " print(parm.requires_grad)" - ] - }, - { - "cell_type": "markdown", - "id": "1a6bc1b1-f1bb-4c01-99ee-c2e16e5090c2", - "metadata": {}, - "source": [ - "You can set a layer in the model to the adapter layer, as shown in the following code in the commented-out line. However, because there are many layers, a more systematic approach would be to traverse the model and replace specific layers with the adapter layer. 
Note that you should set the bottleneck size to 24, ensuring that there are fewer parameters to train than during a full fine-tuning.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "id": "f49d7836-5487-4cb4-a177-797f4eb75cee", - "metadata": {}, - "outputs": [], - "source": [ - "# Adapt a specific layer\n", - "#model_adapters.transformer_encoder.layers[0].linear1=Adapted(my_example_layer)" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "id": "bc5a4706-ef6f-4d1d-9b44-4e6e3cfba925", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2" - ] - }, - "execution_count": 82, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find number of layers\n", - "N_layers=len(model_adapters.transformer_encoder.layers)\n", - "N_layers" - ] - }, - { - "cell_type": "code", - "execution_count": 83, - "id": "6af0d8e7-f38c-47a6-9ba4-7a74994e7f33", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " before linear1\n", - "Linear(in_features=100, out_features=128, bias=True)\n", - " after linear1\n", - "Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=128, bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear2\n", - "Linear(in_features=128, out_features=100, bias=True)\n", - " after linear2\n", - "Adapted(\n", - " (linear): Linear(in_features=128, out_features=100, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=100, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=100, bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear1\n", - "Linear(in_features=100, out_features=128, bias=True)\n", - 
" after linear1\n", - "Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=128, bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear2\n", - "Linear(in_features=128, out_features=100, bias=True)\n", - " after linear2\n", - "Adapted(\n", - " (linear): Linear(in_features=128, out_features=100, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=100, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=100, bias=True)\n", - " )\n", - " )\n", - ")\n" - ] - } - ], - "source": [ - "# Traverse model and adapt\n", - "for n in range(N_layers):\n", - " encoder=model_adapters.transformer_encoder.layers[n]\n", - " if encoder.linear1:\n", - " print(\" before linear1\")\n", - " print(encoder.linear1)\n", - " model_adapters.transformer_encoder.layers[n].linear1=Adapted(encoder.linear1, bottleneck_size=24)\n", - " print(\" after linear1\")\n", - " print(model_adapters.transformer_encoder.layers[n].linear1)\n", - "\n", - " if encoder.linear2:\n", - " print(\" before linear2\")\n", - " print(model_adapters.transformer_encoder.layers[n].linear2)\n", - " model_adapters.transformer_encoder.layers[n].linear2=Adapted(encoder.linear2, bottleneck_size=24)\n", - " print(\" after linear2\")\n", - " print(model_adapters.transformer_encoder.layers[n].linear2)" - ] - }, - { - "cell_type": "markdown", - "id": "1ea3e84d-649a-462d-8c21-ce1d4a915ce2", - "metadata": {}, - "source": [ - "The following code sends the model to the device.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 84, - "id": "67b7450d-b1f4-42c6-9995-b13da7a8a7d7", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Net(\n", - " (emb): 
Embedding(400000, 100)\n", - " (pos_encoder): PositionalEncoding(\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " )\n", - " (transformer_encoder): TransformerEncoder(\n", - " (layers): ModuleList(\n", - " (0-1): 2 x TransformerEncoderLayer(\n", - " (self_attn): MultiheadAttention(\n", - " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", - " )\n", - " (linear1): Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=128, bias=True)\n", - " )\n", - " )\n", - " )\n", - " (dropout): Dropout(p=0.1, inplace=False)\n", - " (linear2): Adapted(\n", - " (linear): Linear(in_features=128, out_features=100, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=100, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=100, bias=True)\n", - " )\n", - " )\n", - " )\n", - " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", - " (dropout1): Dropout(p=0.1, inplace=False)\n", - " (dropout2): Dropout(p=0.1, inplace=False)\n", - " )\n", - " )\n", - " )\n", - " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", - ")" - ] - }, - "execution_count": 84, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Send model to device\n", - "model_adapters.to(device)" - ] - }, - { - "cell_type": "markdown", - "id": "d233586b-c7ef-49e6-b684-b8d24997db80", - "metadata": {}, - "source": [ - "Finally, the following code simulates training of the adapted model on a shortened IMDB training set for 2 epochs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 85, - 
"id": "f275d5f0-ee19-469b-a957-98bf72d3f3c3", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 0%| | 0/2 [00:00" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/D49zrrMPWO_ktwQo7PSHIQ/model-adapters-acc')\n", - "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/RXWlmyaco695RiaoU7QsnA/model-adapters-loss')\n", - "acc_epoch = pickle.load(acc_urlopened)\n", - "cum_loss_list = pickle.load(loss_urlopened)\n", - "plot(cum_loss_list,acc_epoch)" - ] - }, - { - "cell_type": "markdown", - "id": "f7e5392a-549b-4b68-968a-8c8d70dbf05d", - "metadata": {}, - "source": [ - "The following code loads the adapted model fine-tuned for 100 epochs on the full IMDB train set and evaluates its performance on the IMDB test set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 89, - "id": "5caaa3f7-5c89-4f27-8ef0-3d3d29f79b66", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " before linear1\n", - "Linear(in_features=100, out_features=128, bias=True)\n", - " after linear1\n", - "Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=128, bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear2\n", - "Linear(in_features=128, out_features=100, bias=True)\n", - " after linear2\n", - "Adapted(\n", - " (linear): Linear(in_features=128, out_features=100, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=100, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=100, 
bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear1\n", - "Linear(in_features=100, out_features=128, bias=True)\n", - " after linear1\n", - "Adapted(\n", - " (linear): Linear(in_features=100, out_features=128, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=128, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=128, bias=True)\n", - " )\n", - " )\n", - ")\n", - " before linear2\n", - "Linear(in_features=128, out_features=100, bias=True)\n", - " after linear2\n", - "Adapted(\n", - " (linear): Linear(in_features=128, out_features=100, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=100, out_features=24, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=24, out_features=100, bias=True)\n", - " )\n", - " )\n", - ")\n" - ] - } - ], - "source": [ - "model_adapters_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", - "for n in range(N_layers):\n", - " encoder=model_adapters_.transformer_encoder.layers[n]\n", - " if encoder.linear1:\n", - " print(\" before linear1\")\n", - " print(encoder.linear1)\n", - " model_adapters_.transformer_encoder.layers[n].linear1=Adapted(encoder.linear1, bottleneck_size=24)\n", - " print(\" after linear1\")\n", - " print(model_adapters_.transformer_encoder.layers[n].linear1)\n", - "\n", - " if encoder.linear2:\n", - " print(\" before linear2\")\n", - " print(model_adapters_.transformer_encoder.layers[n].linear2)\n", - " model_adapters_.transformer_encoder.layers[n].linear2=Adapted(encoder.linear2, bottleneck_size=24)\n", - " print(\" after linear2\")\n", - " print(model_adapters_.transformer_encoder.layers[n].linear2)\n", - "\n", - "model_adapters_.to(device)\n", - "for param in model_adapters_.parameters():\n", - " param.requires_grad = False\n" - ] - }, - { - "cell_type": "code", - "execution_count": 90, - "id": 
"d582b1d3-0fcc-473e-82de-58f9ab1443ef", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 90, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/PGhd5G_NVrWNH-_jdjwNlw/model-adapters.pth')\n", - "model_adapters_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" - ] - }, - { - "cell_type": "code", - "execution_count": 91, - "id": "df69e146-1a1f-4267-9372-9834fcf6616d", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|█████████████████████████████████████████| 782/782 [00:11<00:00, 66.29it/s]\n" - ] - }, - { - "data": { - "text/plain": [ - "0.85608" - ] - }, - "execution_count": 91, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "evaluate(test_dataloader, model_adapters_)" - ] - }, - { - "cell_type": "markdown", - "id": "08262362-924a-42d0-b928-0cf6ffe5331c", - "metadata": {}, - "source": [ - "As you can see, the performance of the fine-tuned adapted model is nearly identical to the fully fine-tuned model, with both models achieving a roughly 86% accuracy. This is an especially surprising result because a significantly smaller number of weights were updated for the adapted model than the fully fine-tuned model. 
Note that only the adapter layers with a bottleneck size of 24 and the final classifier layer are unfrozen.\n", - "\n", - "The above shows that adapters can be used for parameter efficient fine-tuning (PEFT) and that the performance of a model fine-tuned using adapters can be almost as good as a fully fine-tuned model with all of the layers unfrozen!\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "3b255d65-8f64-4c95-8fd6-ac69d1a67a5a", - "metadata": {}, - "source": [ - "## Exercise: Adapt linear layers in a different network\n", - "\n", - "The following code defines a neural network called `NeuralNetwork`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 92, - "id": "82730ce9-d9a9-483f-8d00-e0cbd78de764", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NeuralNetwork(\n", - " (flatten): Flatten(start_dim=1, end_dim=-1)\n", - " (linear_relu_stack): Sequential(\n", - " (0): Linear(in_features=784, out_features=512, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=512, out_features=512, bias=True)\n", - " (3): ReLU()\n", - " (4): Linear(in_features=512, out_features=10, bias=True)\n", - " )\n", - ")\n" - ] - } - ], - "source": [ - "class NeuralNetwork(nn.Module):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.flatten = nn.Flatten()\n", - " self.linear_relu_stack = nn.Sequential(\n", - " nn.Linear(28*28, 512),\n", - " nn.ReLU(),\n", - " nn.Linear(512, 512),\n", - " nn.ReLU(),\n", - " nn.Linear(512, 10),\n", - " )\n", - "\n", - " def forward(self, x):\n", - " x = self.flatten(x)\n", - " logits = self.linear_relu_stack(x)\n", - " return logits\n", - "\n", - "exercise_model = NeuralNetwork()\n", - "\n", - "exercise_model.to(device)\n", - "for param in exercise_model.parameters():\n", - " param.requires_grad = False\n", - "\n", - "print(exercise_model)" - ] - }, - { - "cell_type": "markdown", - "id": "ed4e3b4b-a5df-4543-8d35-adac4ff4b093", - "metadata": {}, 
- "source": [ - "`NeuralNetwork` is a neural network that uses the `Sequential` container from PyTorch. Adapt the first two linear layers in the `Sequential` container by using the bottleneck adapter with a bottleneck size of 30. Also, change the last linear layer to a layer that has 5 outputs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 94, - "id": "2fde3cfa-d77d-4ad3-82e1-a3f7eb1d0cf0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NeuralNetwork(\n", - " (flatten): Flatten(start_dim=1, end_dim=-1)\n", - " (linear_relu_stack): Sequential(\n", - " (0): Adapted(\n", - " (linear): Linear(in_features=784, out_features=512, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=512, out_features=30, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=30, out_features=512, bias=True)\n", - " )\n", - " )\n", - " )\n", - " (1): ReLU()\n", - " (2): Adapted(\n", - " (linear): Linear(in_features=512, out_features=512, bias=True)\n", - " (adaptor): FeatureAdapter(\n", - " (bottleneck_transform): Sequential(\n", - " (0): Linear(in_features=512, out_features=30, bias=True)\n", - " (1): ReLU()\n", - " (2): Linear(in_features=30, out_features=512, bias=True)\n", - " )\n", - " )\n", - " )\n", - " (3): ReLU()\n", - " (4): Linear(in_features=512, out_features=5, bias=True)\n", - " )\n", - ")\n" - ] - } - ], - "source": [ - "### REPLACE THIS YOUR ANSWER ###\n", - "exercise_model.linear_relu_stack[0] = Adapted(exercise_model.linear_relu_stack[0], bottleneck_size=30)\n", - "exercise_model.linear_relu_stack[2] = Adapted(exercise_model.linear_relu_stack[2], bottleneck_size=30)\n", - "exercise_model.linear_relu_stack[4] = nn.Linear(512, 5)\n", - "print(exercise_model)" - ] - }, - { - "cell_type": "markdown", - "id": "f8244ee4-45b9-4713-8480-6aadf7a66b43", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "exercise_model.linear_relu_stack[0] = Adapted(exercise_model.linear_relu_stack[0], bottleneck_size=30)\n", - "exercise_model.linear_relu_stack[2] = Adapted(exercise_model.linear_relu_stack[2], bottleneck_size=30)\n", - "exercise_model.linear_relu_stack[4] = nn.Linear(512, 5)\n", - "print(exercise_model)\n", - "```\n", - "\n", - "
\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "151060fe-37c4-4ee1-8034-539b0e3a657a", - "metadata": {}, - "source": [ - "## Congratulations! You have completed the lab\n", - "\n", - "## Authors\n", - "\n", - "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo)\n", - "\n", - "Joseph has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n", - "\n", - "[Wojciech \"Victor\" Fulmyk](https://www.linkedin.com/in/wfulmyk) \n", - "\n", - "Wojciech \"Victor\" Fulmyk is a Data Scientist at IBM, and a PhD Candidate in economics at the University of Calgary.\n", - "\n", - "[Ashutosh Sagar](https://www.linkedin.com/in/ashutoshsagar/) is completing his MS in CS from Dalhousie University. He has previous experience working with Natural Language Processing and as a Data Scientist.\n", - "\n", - "## References\n", - "\n", - "\n", - "[TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)\n", - "\n", - "[Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)\n", - "\n", - "[Simple, Scalable Adaptation for Neural Machine Translation](https://arxiv.org/pdf/1909.08478)\n", - "\n", - "© Copyright IBM Corporation. 
All rights reserved.\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.20" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/notebooks/LLM_Specialization/Adapters_in_PyTorch.ipynb b/notebooks/LLM_Specialization/Adapters_in_PyTorch.ipynb index d63034b..7f236d0 100644 --- a/notebooks/LLM_Specialization/Adapters_in_PyTorch.ipynb +++ b/notebooks/LLM_Specialization/Adapters_in_PyTorch.ipynb @@ -1,12 +1,3368 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "b75e0832-1651-4d74-85dd-14fd49ef56e3", + "metadata": {}, + "source": [ + "


\n", + "\n", + "# **Adapters in PyTorch**\n", + "\n", + "Estimated time needed: **45** minutes\n", + "\n", + "**_Note to advanced users_: If you are already familiar with classical fine-tuning and you only want to see the section that relates to adapters, skip forward to Adapters and run all of the cells above that section by going to _Run --> Run All Above Selected Cell_**\n", + "\n", + "You can fine-tune a neural network in several ways. Common strategies include adjusting only the final layer or fine-tuning all layers. However, these methods have their drawbacks: fine-tuning just the final layer often leads to less than optimal results, while fine-tuning all layers can be very time-consuming.\n", + "\n", + "To address these issues, researchers have developed various parameter efficient fine-tuning (PEFT) techniques. One such technique involves the use of adapters. Adapters enable modular training, where small, task-specific modules are trained within the model without changing the pre-existing pretrained parameters. This approach efficiently tailors the model to new tasks with a reduced risk of overfitting. However, adapters are not a cure-all solution. While they are less likely to overfit and are computationally efficient, they might not always reach the same level of accuracy as full model fine-tuning, particularly if the task necessitates substantial changes from the pretrained model's original capabilities.\n", + "\n", + "In this hands-on lab, you learn how to apply an adapter to a transformer-based neural network that has been trained on the AG News data set, with the aim of using this model on the IMDB data set. You also evaluate and compare the performance of this method with that of a fully fine-tuned model and a model where only the last layer is fine-tuned.\n", + "\n", + "---\n", + "\n", + "# __Table of contents__\n", + "\n", + "
1. Objectives\n", + "
2. Setup\n", + "
    1. Install required libraries\n", + "
    2. Import required libraries\n", + "
    3. Defining helper functions\n", + "
3. Positional encodings\n", + "
4. Import IMDB data set\n", + "
    1. IMDB data set overview\n", + "
        1. Data set composition\n", + "
        2. Applications\n", + "
        3. Challenges\n", + "
        4. Data set splits\n", + "
        5. Data loader\n", + "
        6. Neural network\n", + "
5. Training\n", + "
    1. Train IMDB\n", + "
    2. Fine-tune a model pretrained on the AG News data set\n", + "
    3. Fine-tune the final layer only\n", + "
6. Adapters\n", + "
    1. Benefits of using adapters in neural networks\n", + "
7. Exercise: Adapt linear layers in a different network\n", + "
\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "9e4d071d-35d8-4d69-9168-a40d25df2205", + "metadata": {}, + "source": [ + "# Objectives\n", + "\n", + "After completing this lab, you are able to:\n", + "\n", + "- Define and pretrain a transformer-based neural network using PyTorch for a classification task [Optional]\n", + "- Fully fine-tune the pretrained model for a different classification task [Optional]\n", + "- Compare results by fine-tuning only the last layer of the pretrained model [Optional]\n", + "- Understand how adapters work\n", + "- Apply adapters to linear layers in a neural network\n", + "- Train a neural network in a parameter efficient way by training just the adapted layers\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "35a85867-b5fa-4780-b512-35bdf829b33c", + "metadata": {}, + "source": [ + "# Setup\n", + "\n", + "### Install required libraries\n", + "\n", + "For this lab, you use the following libraries, which are __not__ preinstalled in the Skills Network Labs environment. 
__You must run the code in the following cell__ to install them.\n", + "\n", + "```bash\n", + "!pip install --upgrade portalocker==2.8.2 torchtext==0.17.0 torchdata==0.7.1 pandas==2.2.2 matplotlib==3.9.0 scikit-learn==1.5.0 torch==2.2.0 numpy==1.26.4\n", + "```\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a2d7c12f-2141-4af9-91a8-b0c1c8088d66", + "metadata": {}, + "source": [ + "### Import required libraries\n", + "\n", + "The following code imports the required libraries.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "81927908-a8c5-42d6-b24c-97b8b13a42e6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n", + "Tesla P40\n", + "Import Successfully!\n" + ] + } + ], + "source": [ + "# Environment setup\n", + "import os\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", + "\n", + "# Suppress warnings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "def warn(*args, **kwargs):\n", + " pass\n", + "warnings.warn = warn\n", + "\n", + "# PyTorch and related libraries\n", + "import torch\n", + "from torch import nn\n", + "from torch.utils.data import DataLoader, Dataset\n", + "from torch.utils.data.dataset import random_split\n", + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "# TorchText for NLP tasks\n", + "from torchtext.datasets import AG_NEWS, IMDB\n", + "from torchtext.data.utils import get_tokenizer\n", + "from torchtext.vocab import build_vocab_from_iterator, GloVe, Vectors\n", + "from torchtext.data.functional import to_map_style_dataset\n", + "\n", + "# Utility libraries\n", + "import time\n", + "from itertools import accumulate\n", + "import math\n", + "import pickle\n", + "import io\n", + "from urllib.request import urlopen\n", + "import tarfile\n", + "import tempfile\n", + "\n", + "# Data manipulation and visualization\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + 
"from tqdm import tqdm\n", + "\n", + "# Jupyter Notebook utilities\n", + "from IPython.display import Markdown as md\n", + "\n", + "# PyTorch-specific configurations\n", + "torch.set_num_threads(1)\n", + "\n", + "# CUDA-related checks\n", + "print(torch.cuda.is_available())\n", + "print(torch.cuda.get_device_name())\n", + "\n", + "print(\"Import Successfully!\")" + ] + }, + { + "cell_type": "markdown", + "id": "5e710480-51b4-46c7-a3b4-5c80021007b2", + "metadata": {}, + "source": [ + "### Define helper functions\n", + "\n", + "The following code shows some helper functions to help with plotting, saving, and loading files. These functions are not the main focus of this lab, so you do not have to dwell on these too long. However, do run the cells in this section to define these helper functions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "57d49c2a-d309-4608-8e45-b9fa8ec63cbd", + "metadata": {}, + "outputs": [], + "source": [ + "def plot(COST,ACC):\n", + "\n", + " fig, ax1 = plt.subplots()\n", + " color = 'tab:red'\n", + " ax1.plot(COST, color=color)\n", + " ax1.set_xlabel('epoch', color=color)\n", + " ax1.set_ylabel('total loss', color=color)\n", + " ax1.tick_params(axis='y', color=color)\n", + "\n", + " ax2 = ax1.twinx()\n", + " color = 'tab:blue'\n", + " ax2.set_ylabel('accuracy', color=color) # you already handled the x-label with ax1\n", + " ax2.plot(ACC, color=color)\n", + " ax2.tick_params(axis='y', color=color)\n", + " fig.tight_layout() # otherwise the right y-label is slightly clipped\n", + "\n", + " plt.show()\n", + "\n", + "\n", + "def save_list_to_file(lst, filename):\n", + " \"\"\"\n", + " Save a list to a file using pickle serialization.\n", + "\n", + " Parameters:\n", + " lst (list): The list to be saved.\n", + " filename (str): The name of the file to save the list to.\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + " with open(filename, 'wb') as file:\n", + " pickle.dump(lst, file)\n", + "\n", + "\n", + "def 
load_list_from_file(filename):\n", + " \"\"\"\n", + " Load a list from a file using pickle deserialization.\n", + "\n", + " Parameters:\n", + " filename (str): The name of the file to load the list from.\n", + "\n", + " Returns:\n", + " list: The loaded list.\n", + " \"\"\"\n", + " with open(filename, 'rb') as file:\n", + " loaded_list = pickle.load(file)\n", + " return loaded_list" + ] + }, + { + "cell_type": "markdown", + "id": "fb4827b6-33c9-4370-bfbf-983d89623c98", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "0d6f6a86-020b-4ed5-8c52-5fa69dceca97", + "metadata": {}, + "source": [ + "# Positional encodings\n", + "\n", + "Positional encodings tell transformers and other sequence-to-sequence models where each element sits within a sequence. To illustrate, consider the sentences \"He painted the car red\" and \"He painted the red car.\" Despite their distinct meanings, both sentences contain exactly the same words, so without positional encodings the model receives the same set of token embeddings for each and cannot tell the word order apart. 
The following class defines positional encodings by inheriting from PyTorch's `Module` class.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "88c9cebf-7dbe-46d0-81d2-5a116120c374", + "metadata": {}, + "outputs": [], + "source": [ + "class PositionalEncoding(nn.Module):\n", + " \"\"\"\n", + " https://pytorch.org/tutorials/beginner/transformer_tutorial.html\n", + " \"\"\"\n", + "\n", + " def __init__(self, d_model, vocab_size=5000, dropout=0.1):\n", + " super().__init__()\n", + " self.dropout = nn.Dropout(p=dropout)\n", + "\n", + " pe = torch.zeros(vocab_size, d_model)\n", + " position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n", + " div_term = torch.exp(\n", + " torch.arange(0, d_model, 2).float()\n", + " * (-math.log(10000.0) / d_model)\n", + " )\n", + " pe[:, 0::2] = torch.sin(position * div_term)\n", + " pe[:, 1::2] = torch.cos(position * div_term)\n", + " pe = pe.unsqueeze(0)\n", + " self.register_buffer(\"pe\", pe)\n", + "\n", + " def forward(self, x):\n", + " x = x + self.pe[:, : x.size(1), :]\n", + " return self.dropout(x)" + ] + }, + { + "cell_type": "markdown", + "id": "aadc814f-d060-47be-963f-c28cfd0618e4", + "metadata": {}, + "source": [ + "# Import IMDB data set\n", + "\n", + "The following code loads the IMDB data set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c479b278-01b9-4863-9821-528b1607a74b", + "metadata": {}, + "outputs": [], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/35t-FeC-2uN1ozOwPs7wFg.gz')\n", + "tar = tarfile.open(fileobj=io.BytesIO(urlopened.read()))\n", + "tempdir = tempfile.TemporaryDirectory()\n", + "tar.extractall(tempdir.name)\n", + "tar.close()" + ] + }, + { + "cell_type": "markdown", + "id": "ec7531bd-8b33-45d9-8061-9de43bd188b8", + "metadata": {}, + "source": [ + "## IMDB data set overview\n", + "\n", + "The **IMDB data set** contains movie reviews from the Internet Movie Database (IMDB) and is 
commonly used for binary sentiment classification tasks. It's a popular data set for training and testing models in natural language processing (NLP), particularly in the context of sentiment analysis.\n", + "\n", + "### Data set composition\n", + "\n", + "- **Reviews**: The data set consists of 50,000 movie reviews, divided evenly into 25,000 training and 25,000 testing samples.\n", + "- **Sentiment labels**: Each review is labeled as either positive or negative, indicating the sentiment expressed in the review. The data set is balanced, with an equal number of positive and negative reviews in both the training and testing sets.\n", + "- **Text content**: Reviews are presented as plain text and have been preprocessed to some extent. For example, HTML tags are removed, but the text retains its original punctuation and capitalization.\n", + "- **Usage**: The data set is commonly used to train models for binary sentiment classification, where the goal is to predict whether a given review is positive or negative based on its text content.\n", + "\n", + "### Applications\n", + "\n", + "- **Sentiment analysis**: The primary application of the IMDB data set is in sentiment analysis, where it serves as a benchmark for various text classification algorithms.\n", + "- **Natural language processing**: The data set is widely used in NLP research and applications, providing a basis for testing the effectiveness of different models and approaches in understanding human language.\n", + "\n", + "### Challenges\n", + "\n", + "The data set is small, so it's hard to train a model from scratch.\n", + "\n", + "The following class is defined to traverse the IMDB data set. 
The need to define this class arises from the fact that the IMDB data set is split across a large number of files.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "51bc66a7-506d-4ac4-aa0b-1813c6a0e4c5", + "metadata": {}, + "outputs": [], + "source": [ + "class IMDBDataset(Dataset):\n", + " def __init__(self, root_dir, train=True):\n", + " \"\"\"\n", + " root_dir: The base directory of the IMDB dataset.\n", + " train: A boolean flag indicating whether to use training or test data.\n", + " \"\"\"\n", + " self.root_dir = os.path.join(root_dir, \"train\" if train else \"test\")\n", + " self.neg_files = [os.path.join(self.root_dir, \"neg\", f) for f in os.listdir(os.path.join(self.root_dir, \"neg\")) if f.endswith('.txt')]\n", + " self.pos_files = [os.path.join(self.root_dir, \"pos\", f) for f in os.listdir(os.path.join(self.root_dir, \"pos\")) if f.endswith('.txt')]\n", + " self.files = self.neg_files + self.pos_files\n", + " self.labels = [0] * len(self.neg_files) + [1] * len(self.pos_files)\n", + " self.pos_inx=len(self.pos_files)\n", + "\n", + " def __len__(self):\n", + " return len(self.files)\n", + "\n", + " def __getitem__(self, idx):\n", + " file_path = self.files[idx]\n", + " label = self.labels[idx]\n", + " with open(file_path, 'r', encoding='utf-8') as file:\n", + " content = file.read()\n", + " return label, content" + ] + }, + { + "cell_type": "markdown", + "id": "6bb31d42-b20e-413d-96f6-7d680eb83bb2", + "metadata": {}, + "source": [ + "The following code uses the `IMDBDataset` class previously defined to create iterators for the train and test data sets. 
In the latter part of the cell, you can return 20 examples from the train set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "01513044-f657-4203-b933-bad10ebeb4c8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(0, 'It is important not to be insulted by lack of logic or common sense and those who have any \"gray matters\" will agree that this movie just doesn\\'t work.

The problems lay in the direction, cast selections and lack of depth in the character building. The word comedy was very hard thing to say when i expect to laugh when these words are used. Let\\'s look at the problems in direction/script.

Brother and sister both in their mid 30\\'s seem to be well adjusted. They meet a complete stranger at a park and Heather Graham character walks up to her and asks the most intimate questions that even half sane person would be running the other way or at least scream for a police officer. He then awkwardly walks over and makes some stupid statements and she falls for him. Then after ONE date were they all go out together he falls in love with her and decides to get married in Vegas in a week\\'s time???? Hello does anyone feel stupid yet? He goes out with thousands of women and he meets this one person who says about 10 words that WE see on the screen and he wants to marry her. Not only was there no chemistry it just doesn\\'t make sense. Sure it\\'s a romantic comedy and I want to believe it could, but the direction made it completely flat.

Now Heather falls head over heels with her too and when Heather Graham and Bridget Moynahan (very shallow character) kiss or more to the point it was sloppiest kiss ever that chemistry MIGHT be there. I found it unromantic and unfunny and while many say Heather cannot act i think the reality is Heather was clearly the wrong person for this role.

This was Sue Kramer debut as a director and to me it was just too much for her to chew. It would take a lot of craft to make this movie work and IMHO it could be done with better writers and casting and direction.')\n", + "(0, \"THE SCREAMING SKULL (1 outta 5 stars) This movie boasts some pretty cool opening credits (an offscreen narrator warning that movie patrons will be offered a free burial if they die of fright watching this movie, a scary shot of a skull emerging from a placid pool and the ubiquitous scary music) but, sadly, the movie is all downhill from there. A widowed man takes his new bride to his secluded mansion... admonishing his servants and friends that the new Mrs. has a very fragile disposition due to a tragedy in her past. Well, in no time at all she begins to see and hear mysterious things that no one else can. Her husband assures her that it's all merely in her mind and... well, you can probably see where this all is going. You will have figured out what's going on long before our hapless heroine... because you have probably seen the exact same plot in hundreds of other movies and TV shows (and done better, too). To add to the movie's myriad transgressions, most cuts of this movie (on numerous cheap DVD compilations) seem to be missing a few key scenes. You see the heroine slowly walking towards the window... she goes to open it... you know she is going to see something scary... and then... suddenly the scene cuts to her sobbing in her husband's arms. So what did she see??? I guess we'll never know.\")\n", + "(0, \"

Whether any indictment was intended must be taken into consideration. If in the year 2000 there were still rifts of feeling between Caucasian and Afro-Americans in Georgia, such as shown in this film, obviously there remains a somewhat backward mentality among a lot of people out there. It is rather hypocritical, to say the least, if everyone adores Halle Berry, Whoopie Goldberg, Beyoncé, Noemi Campbell, Denzel Washington, Will Smith, et. al., whilst out in the backs there persist manifest racial divides.

White grandmother suddenly gets black grand-daughter thrust upon her, only to meet up with black grandfather in a very white social backwater. The story is sweet, not lacking tragic overtones, and eminently predictable as in most of these kinds of TV films, though the final scene has you guessing............ will he? won't he.......?

Gena Rowlands in her typical style offers a sincere rendering, and Louis Gossett is a good match for her; the little Penny Bae fortunately does not steal the show.

A `nice' way of relaxing after Sunday lunch without having to force your mind too much, though you might just find yourself having a little siesta in the middle of it.\")\n", + "(0, 'First, it takes a full half hour to get Hackman out of jail and to start doing the job. What a waste of time, we all know Hackman is getting out to do some job for his masters, why waste almost a third of the movie on these sequences. Then Hackman stays in a hotel and the story arc again goes nowhere, simply proving to us that Hackman is under close watch and anything he says or does is know by the masters. Again, another 20 minutes. Then more wasted time showing the reunion with his wife. All of this should have taken 10-15 minutes at most simply as a set-up for the real action, intrigue and plot twists. By the time the real action gets going, I was so bored that I just wanted the movie to end. Hackman is great as usual, and the other actors as well, but this is a dud of the first magnitude.')\n", + "(0, \"Banned as a 'Video Nasty' in the UK, Unhinged has naturally gained quite a bit of notoriety. However, the most shocking thing I found about the film was its amateurishness in all departments. The bloodletting I could handle: the terrible acting, shoddy editing, awful direction, lousy script and abysmal soundtrack were much harder to take.

Three girls on their way to a music festival crash into a ravine during a storm. They are rescued by a friendly stranger who takes them to a nearby house. The owner of the house, a batty old lady, and her spinster daughter, welcome the girls in, allowing them to stay for a few days in order to recuperate. However, someone doesn't want the girls to leave\\x97ever! One by one they fall victim to an unseen assailant.

Taking a long time to get going and featuring some of the worst performances ever in a horror film (and that takes some doing), Unhinged is a truly awful film. The music is a total mess (it sounds like a three year old has been let loose on a synthesiser) and as such, it complements the movie perfectly. Only a couple of bloody scenes towards the end and a bit of gratuitous nudity save Unhinged from getting the lowest possible score.

If you are a horror completist (and unfortunately, I am), you will want to see this in order to tick it off the Video Nasty watch-list. But be warned\\x97it is really, really bad.\")\n", + "(0, \"When great director/actor combinations are talked about the team of J. Lee Thompson and Charles Bronson is not usually mentioned. Probably because the output of nine joint ventures between the two of them runs the gamut from the really good action entertainment to the mediocre. Unfortunately Kinjite: Forbidden Subjects falls in the latter.

That's sad because Kinjite could have been a whole lot better. But for the life of me I don't understand why it was necessary to make the father of the missing Japanese girl, a guy used to getting some cheap jollies because the romance in his marriage has run out. That might have been good for another film altogether, but it served no purpose here.

A straightforward cop drama with Charles Bronson as a vice cop who's seen a bit too much in his line of work and has a strong prejudice against orientals. That part could also have used a little explaining as well. But he's going to have to overcome it if he and patient partner Perry Lopez are going to locate a captured Japanese school girl.

Bronson's time in the vice squad have told him exactly where to look for the kidnapper. A stylish, murderous pimp played by Jaime Fernandez is the guy and he and Bronson have some history. In fact in the film's best scene, Bronson made him eat an expensive rolex watch and set his car on fire.

At one point Fernandez happens to spot Bronson and Lopez in an all night delicatessen and this being after his rolex snack, he sprays the place with an Uzi killing everyone, but Bronson and Lopez. I really think that little incident would have had more than a couple vice cops from the LAPD after Fernandez. But that's another terribly big hole in the plot.

Still there is a very rough justice in the end for Fernandez. I wish the whole film had been better though. This was the last film of the Bronson-Thompson team and J. Lee Thompson's last as a director. He should have gone out with something better.\")\n", + "(0, 'First off, let me say that I am a great believer in Fanpro stuff. I see it as a way to continue a good show long after it has been cancelled. Star Trek Voyages and Star Wars Revelations are examples of decent efforts. So I have a soft-spot for fanpro stuff that means I\\'ll overlook things that I would ordinarily slate badly.

So on to ST: HF. Well, first off the good things. Enthusiasm is a major part of making any show believable and, for the most part, the crew of the various ships all seem to be having a good time with their roles. Next, the effects aren\\'t bad for a home-brew effort, with nothing to make you really wince. The stories aren\\'t too bad either. Nothing particularly innovative, but solid enough stuff and at least there are ongoing story-arcs.

But it has a lot of faults.

First off, although they quite obviously HAVE to rip-off Star Trek footage, set backdrops, music and effects, I see no reason why they proceeded to rip off virtually every other sci-fi musical score ever made. Everything from Aliens to Starship Troopers rears it orchestral head at one point or another. Likewise, much of the footage is from other movies, dutifully CGI\\'d over to make it look different. The Grey warships, for instance, though disguised, are quite obviously Star Destroyers from Star Wars. And the station is also rather obviously Fleet Battle Station Ticonderoga from Starship Troopers. Likewise, sound effects from various Star Wars movies appear in space battles between fighters, as does animated over footage. In one scene in either first or second season, I think, you even see two TIE fighters fly past during a battle, which hardly does your suspension of disbelief any favours.

Acting varies from the reasonable to the hideously painful to watch. Everyone does improve as the seasons progress, though, but expect to grimace at the screen a lot, especially in the early seasons. They\\'ve also made some interesting acting choices. Let\\'s just say that the food replicators on this show seem permanently set to \"cake\" and leave it at that.

Make-up effects are generally quite effective on the whole. But they really ought to mercilessly club to death the person who decided to use cheap Ferengi and Cardassian masks for anything other than background use or \"passing\" shots. They are just beyond unrealistic. Every time I saw one of these (apart from trying not to laugh too much) I kept expecting the unfortunate soul wearing it to pull out a gun and announce that \"This is a stick-up!\" In one scene a \"Cardassian\" actually talks whilst wearing one of these. Not only do the lips not move, but the mask doesn\\'t even have an opening where the mouth should be. Someone needs to be slapped hard for that. Couldn\\'t they have taken a craft knife to it, for goodness\\' sake! There are also some well-done, but unintentionally funny make-up jobs, such as the Herman Munster look alike.

The writing, though coherent, is nothing new. Instead the script runs like a continuation of DS9, with the ships heading out from DS12 on various missions. The new enemy, \"The Grey\" aren\\'t very menacing and the plot line involving them is effectively a reworking of the Borg threads. i.e. Starfleet meet the Grey, the Grey are hugely powerful, Starfleet barely escape with their lives, then through technology they begin to find ways to combat the enemy etc etc. All done before with the Borg.

Another bone of contention is the dialogue. Star Trek writers have long had the ability to write \"insert technobabble here\" into a script. It usually means an exposition of the latest plan to combat the enemy using \"quantum phase discriminators\" or \"isolytic charges\" etc. In other words, nonsense that tells you that they are on the case and a resolution is at hand.

The words are just gibberish really. I\\'ve no problem with this, but where ST:HF makes a mess of it is where they include real-world comments into this concept.

Tactical advice such as \"We need to regroup\" sounds good, but not when uttered by trio of characters already standing in a group. Likewise when asked what the situation is, a tactical officer is heard to reply \"We count three battleships\". He actually needed to count them? C\\'mon! I expected the questioner to ask him \"Are you sure?\" or \"Can you double check\". But my all-time favourite comment is this:

Captain: \"Can we establish two-way communication?\"

Comms officer: \"No, we can only send and receive..\"

Well, duh!.....

Having said all the above, the show does improve as it goes along. Seasons 1 and 2 are pretty bad, 3 shows an improvement but 4 & 5 are where it starts to get noticeably better. Season 6 so far looks quite reasonable.

I do have a problem with their choice of media for the shows though. Quicktime sucks, quite frankly and the sooner they move to divx/avi format the better. Some of us like to actually take our downloaded shows and watch them on decent size screen and not peer at a tiny QT window on a computer monitor. Not only does Quicktime make this difficult, but the 320x180 resolution the shows are in does not scale at all well. In fact, it makes the shows pretty unwatchable, like they were a tenth-generation VHS tape copy. The least they could do was to include a hi-res downloadable option.

Anyway, the show has promise, and I\\'m even beginning to like some of the characters. But that\\'s 40 episodes on, so I\\'m not sure this says that much about character development at all.

But what can you say, it\\'s free....

PS: Out of 28 votes, 19 people rated this show as a 9 or 10. Hmmmm... were we watching the same show? Or are you 19 all three year olds?')\n", + "(0, 'I read in the papers that W.Snipes was broke so no wonder he would take parts in low budget projects like The Contractor.He is just the next action star to join a growing club:the penniless action stars of the 90s (Van Damme,Segal,Lundgren,Snipes). Here he stars the lead in a cheap action flick which was shot in Bulgaria( we are supposed to believe that the location is London, like only a complete moron would buy that)The story is the one of 1000 other movies: retired special forces good guy gets hired by the government again to do a wet job- after that government wants to get rid of him- good guy gets away after killing bad guys (was that a spoiler? guess not!) The star of the movie: the little girl (Eliza Bennett) outperforms everybody else of the cast!!!One star is for her plus one star for eye candy Lena Headey, makes 2 stars. Only for die hard Snipes fans!Everybody else:avoid!')\n", + "(0, \"Did HeidiJean really see this movie? A great Christmas movie? Not even close. Dull, bland and completely lacking in imagination and heart. I kept watching this movie wondering who the hell thought that Carly Pope could play the lead in this movie! The woman has no detectable personality and gives a completely lackluster performance. Baransky was great as usual and provided the only modicum of interesting the whole thing. Probably her involvement was the only reason this project was green lighted to begin with. 
Maybe I'm expecting too much for a Lifetime movie played 15 days from Christmas but I sat through this thing thinking that with a different director and a recasting JJ with an actress that at least could elicit sympathy this could have been quite a cute little movie.\")\n", + "(0, 'Band Camp was awful, The Naked Mile was a little better, and this third straight to DVD in the American Pie franchise seems the same quality as the predecessor. Basically Erik Stifler (John White) split from his girlfriend after losing his virginity, and now him and Mike \\'Cooze\\' Coozeman (Jake Siegel) are joining Erik\\'s cousin Dwight (Steve Talley) at college. With the promise of many parties, plenty of booze, and enough hot chicks at the Beta House, they only have fifty listed tasks to carry out to become official privileged members. But a threat comes into sight with the rivals, GEK (\"Geek\") House, led by power-hungry nerd (and sheep shagger) Edgar (Tyrone Savage) offering bigger and better than what Beta have. To settle it once and for all, Beta and Gek go into battle with the banned, for forty years, Greek Games to beat each other in, with the loser moving out. The last champion of the games, Noah Levenstein aka Jim\\'s Dad (the only regular Eugene Levy) runs the show, which sees the people unhooking bras, a gladiator duel floating on water, catching a greased pig, Russian Roulette in the mouth with cartridges of aged horse spunk, wife carrying and drinking a full keg of alcohol (with puking not disqualifying). It all comes to the sudden death, with a guy getting stripper lap dancing, and they have to resist cumming, Beta House win when Edgar cums with a girl dressed as a sheep on his lap. Also starring Flubber\\'s Christopher McDonald as Mr. 
Stifler, Meghan Heffern as Ashley, Dan Petronijevic as Bull, Nic Nac as Bobby, Christine Barger as Margie, Italia Ricci as Laura Johnson, Moshana Halbert as Sara Coleman, Sarah Power as Denise, Andreja Punkris as Stacy and Jordan Prentice as Rock. The nudity amount is very slightly increased, as is the grossness of the jokes, and I could guess it being rated one star out of five, but I like it. Adequate!')\n", + "(1, \"I saw this recently on a cable channel. The movie is great; it's one of the few musicals I have seen that doesn't shy away from the light and dark. It portrays some of the splendour of the age along with a lot of the squalor. Some of the set piece dance sequences so much is going on, I didn't know where to look next. One day I shall go and see this on the big screen, just so that I see what's happening. But what really lifts this to another level is Oliver Reed's performance as Bill Sykes. Not only is a thoroughly mean and menacing man but there is something else, some inner demons. He gave me the impression that if you pushed him into a corner, he was capable of anything. It was almost as if the Sykes character was on the edge of madness, just awaiting the trigger. I have seen the Robert Newton's Bill Sykes from the 1948 movie, and I thought he was 'just' a bad egg, but Oliver Reed's performance intimidated me in my own living room.\")\n", + "(1, '\"The Gingerbread Man is the first thriller I\\'ve ever done!\" \\x96 Robert Altman

In 1955 Charles Laughton directed \"The Night of the Hunter\", a spooky slice of Southern Gothic in which Robert Mitchum plays a scary serial killer. One of the film\\'s more famous sequences consists of two kids escaping from Mitchum on a rowboat, the kids frantically paddling whilst Mitchum wades after them like a monster.

Seven years later Mitchum played an equally spooky killer in \"Cape Fear\", another film set in the American South. That film featured a local attorney trying to protect his family and likewise ended with Mitchum terrorising folks on a boat. In 1991 Martin Scorsese, trying to branch out and tackle something more mainstream, remade \"Cape Fear\", boat scene and all.

Now we have Robert Altman\\'s \"The Gingerbread Man\", another slice of small town Southern Gothic. Altman says he consulted \"The Night of the Hunter\" for inspiration and tackled such a mainstream film purely because he wanted to \"spread his wings and try a popcorn picture\", but what he\\'s secretly attempting to do here is deconstruct the canonical films of the Southern Gothic genre.

So instead of a showdown on small boat, we get a showdown on a giant ship. Instead of two kids being kidnapped, we get two kids being safely returned to the police. Instead of money being hidden, we have money being readily given via a last will and testament. Instead of the righteous attorney of the 1961 film and the deplorable attorney of the 1991 remake, we get a rather three-dimensional lawyer in Kenneth Branagh. Instead of the monster chasing the family we get the hero chasing the bad guys. Instead of the monster breaking into the family\\'s house boat, we have the hero hunting the monster on board the monster\\'s \"house ship\". Similarly, instead of a murderous serial killer we get an innocent weirdo played by Robert Duvall. . .etc etc etc.

Altman goes on and on, reversing everything just a little slightly, pulling at the edges and doing his own thing. His touch is most apparent during the film\\'s first half-hour, the film existing in an uneasy space between conventional plot-driven movie storytelling and Altman\\'s fondness for overlapping dialogue, casual narratives, prowling camera movement and the way that characters aren\\'t so much introduced as they are simply part of what\\'s going on.

Still, despite Altman\\'s best intentions, the film never rises above mediocrity. Altman\\'s too bound to the conventions of the \"thriller format\" to do much damage, his style is too lethargic to generate tension and the film is simply not radical enough to counterpoint other canonical films in the genre. \"Gingerbread Man\" is thus too mainstream to work as a more pure Altman film and too Altman to work as a mainstream thriller.

The film\\'s not a complete waste, though. Robert Downey Junior, Kenneth Branagh and the usually intolerable Daryl Hannah, all turn in juicy performances. The film also has a nice atmosphere, set against a approaching hurricane, and the final act contains some interesting twists and turns. While it\\'s not the complete disaster that Scorsese\\'s \"Cape Fear\" was, the film still never amounts to anything special.

7/10 \\x96 In the late 90s Altman made 3 successive films set in the American South: \"Kansas City\", \"Gingerbread Man\" and \"Cookie\\'s Fortune\". Unlike \"Gingerbread Man\", both \"Kansas City\" and \"Cookie\\'s Fortune\" tackle the genre on the broader, more looser canvases that Altman was most comfortable with.

\"Kansas City\" is the more important of these two films, its hierarchies of class, politics and crime, and its desire to break radically away from typical gangster genre frameworks, would prove influential on all serious 21st century film crime writers (see, for example, \"The Wire\"). That said, \"Cookie\\'s Fortune\", while a much slighter tale, is perhaps the better picture.

Note: Altman claims that this is his first thriller, but he directed \"Images\", an art house thriller, in 1972.

Worth one viewing.')\n", + "(1, '\"Read My Lips (Sur mes lèvres)\" (which probably has different idiomatic resonance in its French title) is a nifty, twisty contemporary tale of office politics that unexpectedly becomes a crime caper as the unusually matched characters slide up and down an ethical and sensual slippery slope.

The two leads are magnetic, Emmanuelle Devos (who I\\'ve never seen before despite her lengthy resume in French movies) and an even more disheveled than usual Vincent Cassel (who has brought a sexy and/or threatening look and voice to some US movies).

The first half of the movie is on her turf in a competitive real estate office and he\\'s the neophyte. The second half is on his turf as an ex-con and her wrenching adaptation to that milieu.

Writer/director Jacques Audiard very cleverly uses the woman\\'s isolating hearing disability as an entrée for us into her perceptions, turning the sound up and down for us to hear as she does (so it\\'s even more annoying than usual when audience members talk), using visuals as sensory reactors as well.

None of the characters act as anticipated (she is not like that pliable victim from \"In the Company of Men,\" not in individual interactions, not in scenes, and not in the overall arc of the unpredictable story line (well, until the last shot, but heck the audience was waiting for that fulfillment) as we move from a hectic modern office, to a hectic disco to romantic and criminal stake-outs.

There is a side story that\\'s thematically redundant and unnecessary, but that just gives us a few minutes to catch our breaths.

This is one of my favorites of the year!

(originally written 7/28/2002)')\n", + "(1, \"This was a pretty decent movie. This movie is good to just sit down and watch and be entertained. Just a typical Hollywood film. This movie will never win an Oscar or anything and definitely doesn't deserve one, but I thought it was pretty good. It's kind of like the show 24 but set into movie format. If you like the whole we've got to stop the terrorist from killing the president kind of movie then you will enjoy this flick. I personally think that storyline has been done WAY too much, but The Sentinel does add a little twist with the mole in the Secret Service. All in all, this movie won't leave your jaw to the floor or change your life, but who says every single movie has to be like that to be good?\")\n", + "(1, 'Actually they could not have chosen a better diversified actor to portray Little Richard than Leon. He captures Little Richard to a most believable essence. The outfits where wonderful and any person watching this movie will definitely keep a smile on their face through the entire movie. Although the movie is a little long, it keeps your attention with the personality and outfits of Little Richard in mind. The ending should have taken a direction of moving Little Richard more into the present where you could see him as he has aged into this new millennium. He will always be the King of Rock-N-Roll as far as I am concerned regardless of what the other media says.')\n", + "(1, \"If you are a fan of Altman's large ensemble casts, as evidenced in major films like M.A.S.H., Nashville, Gosford Park, and lesser seen films like A Wedding, then you will no doubt be entertained by HealtH. Centered around a Health Convention where two women are running for President, HealtH contains many of Altman's latter 70s regulars like Paul Dooley (who helped write the film), Carol Burnett, and Henry Gibson, while also including top star Altman newcomers like Lauren Bacall, James Garner, and Glenda Jackson. 
Like a lot of Altman ensemble films there are numerous subplots in this film, but it is not nearly as overwhelming as films like Nashville or A Wedding, rather it has a more centered feel, perhaps like M.A.S.H. or Gosford Park. The whole thing is an obvious satire on the Health movement, filled with over-top, outlandish, contradictive characters, with guest stars like Dick Cavett providing a wry commentary on the whole thing. Underlining the whole election process is Altman's characteristic pessimism about politics and public appeal but what is most appealing about this film is the sheer fun most people seem to be having. This would be one of Altman's last films like this for a while!\")\n", + "(1, 'I saw the movie with two grown children. Although it was not as clever as Shrek, I thought it was rather good. In a movie theatre surrounded by children who were on spring break, there was not a sound so I know the children all liked it. There parents also seemed engaged. The death and apparent death of characters brought about the appropriate gasps and comments. Hopefully people realize this movie was made for kids. As such, it was successful although I liked it too. Personally I liked the Scrat!!')\n", + "(1, \"This show has been my escape from reality for the past ten years. I will sadly miss it. Although Atlantis has filled the hole a small bit.

The last ever episode of SG1(on television anyway)was beautifully done. Robert wrote something that felt close to reality. As though he was trying to explain what it was like on the set of the show. (Everyone working closely together for such a long time there are bound to up's and downs. But over the years they've turned into a family). I thought this was a wonderful way to end despite anyone else's criticisms.

SG1 was something special and time and time again it took me across thresholds of disbelief and amazement. The wonderful characters, stories, directors, writers. From episode one I was hooked. The blend of action, science, drama and especially comedy worked so well that made me keep wanting more.

There are no real words in which to completely express what this show meant to me. I can only thank those who kept the show so fresh and entertaining for so many years. It has inspired me to do many things that I thought was impossible.

I look forward to the movies next year and I really hope there will be a number of them. I never want the show to die.

Stargate SG1 - 1997 - 2007?\")\n", + "(1, 'For a long time it seemed like all the good Canadian actors had headed south of the border and (I guessed) all the second rank ones filled the top slots and that left the dregs for the sex comedies.

This film was a real surprise: despite the outlandish plots that are typical of farces, the actors seemed to be trying to put something into their characters and what we, the viewer, got back was almost true suspension of belief. When the extras from the music video attacked the evicting police, you almost believed it was possible.

If you are a fan of some of the better sex farces (Canadian or not) you should definitely seek this one out. And the big surprise, this sex farce is also loaded with some very good nudity.')\n", + "(1, 'Thank God this wasn\\'t based on a true story, because what a story it is. Populated by despicable characters whose depravity knows no bounds, Before The Devil is a mesmerizing, jaw-dropping excursion into perversion which would be laughable (and sometimes is, even with - or perhaps because of - the sickeningly tragic undercurrent of human dysfunction throughout) if it weren\\'t carried out with such magnificent, overwhelming conviction by its stars. The excellent script by Kelly Masterson and superb direction by none other than Sidney Lumet doesn\\'t hurt either.

The main dysfunction here is of a family nature, with the two majorly screwed up brothers (brilliant portrayals from Philip Seymour Hoffman and Ethan Hawke) deciding to rob their own parents\\' jewelry store, an attempt that goes pathetically awry.

The story is told with time-shifts (which are noted on screen, such as: \"Charlie: Two Days Before The Robbery\", so no one should be confused); some people have said they didn\\'t like this device but I thought it worked perfectly, adding to the skeweredness of the whole affair, considering that the two brothers in question are hardly playing with full decks - between them you couldn\\'t make a decent poker hand to save your life. Throw in these cheesy extra tidbits: one of the brothers is a drug addict, married to Gina (Marisa Tomei, also excellent), who is having an affair with the other brother, toss in some monumental sibling rivalry, along with the fact that said drug addict brother hates his father (a wrenching performance from Albert Finney), who has apparently caused him serious past pain, and you\\'ve got a Shakespearean/Greek tragedy on your hands. Proceed with caution.')\n" + ] + } + ], + "source": [ + "root_dir = tempdir.name + '/' + 'imdb_dataset'\n", + "train_iter = IMDBDataset(root_dir=root_dir, train=True) # For training data\n", + "test_iter = IMDBDataset(root_dir=root_dir, train=False) # For test data\n", + "\n", + "start=train_iter.pos_inx\n", + "for i in range(-10,10):\n", + " print(train_iter[start+i])" + ] + }, + { + "cell_type": "markdown", + "id": "1a6c647b-b8ab-49df-8434-becaa0dea775", + "metadata": {}, + "source": [ + "The following code defines the mapping of numeric labels to positive and negative reviews.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "3fca543a-ffc0-4079-bc30-7e3e05576623", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'positive review'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imdb_label = {0: \" negative review\", 1: \"positive review\"}\n", + "imdb_label[1]" + ] + }, + { + "cell_type": "markdown", + "id": "ac4da5eb-a679-45d2-90bd-cfde475e0f8e", + "metadata": {}, + "source": [ + "The following code 
checks to ensure that there are exactly two classes in the train data set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e78c80a0-0dee-4236-a382-dfe9f75a8fea", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "num_class = len(set([label for (label, text) in train_iter]))\n", + "num_class" + ] + }, + { + "cell_type": "markdown", + "id": "9c35ca39-77a1-40d9-9f16-fddb3259fd64", + "metadata": {}, + "source": [ + "The following code loads a basic English tokenizer and defines a function called ```yield_tokens``` that uses the tokenizer to break down text data yielded by an iterator into tokens.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "6a41ea50-7e8b-423b-9de8-36a88e97e96b", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = get_tokenizer(\"basic_english\")\n", + "\n", + "def yield_tokens(data_iter):\n", + " \"\"\"Yield tokens for each data sample.\"\"\"\n", + " for _, text in data_iter:\n", + " yield tokenizer(text)" + ] + }, + { + "cell_type": "markdown", + "id": "e1853ae6-c596-4fac-a2f4-185c0a354510", + "metadata": {}, + "source": [ + " The following code loads a pretrained word embedding model called GloVe into a variable called `glove_embedding`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "0cbe5e7e-13d0-4d64-8c0c-86b74954c9b2", + "metadata": {}, + "outputs": [], + "source": [ + "# Note that GloVe embeddings are typically downloaded using:\n", + "#glove_embedding = GloVe(name=\"6B\", dim=100)\n", + "# However, the GloVe server is frequently down. 
The code below offers a workaround\n", + "\n", + "\n", + "class GloVe_override(Vectors):\n", + " url = {\n", + " \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", + " }\n", + "\n", + " def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", + " url = self.url[name]\n", + " name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", + " #name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", + " super(GloVe_override, self).__init__(name, url=url, **kwargs)\n", + "\n", + "class GloVe_override2(Vectors):\n", + " url = {\n", + " \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", + " }\n", + "\n", + " def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", + " url = self.url[name]\n", + " #name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", + " name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", + " super(GloVe_override2, self).__init__(name, url=url, **kwargs)\n", + "\n", + "try:\n", + " glove_embedding = GloVe_override(name=\"6B\", dim=100)\n", + "except Exception:\n", + " try:\n", + " glove_embedding = GloVe_override2(name=\"6B\", dim=100)\n", + " except Exception:\n", + " glove_embedding = GloVe(name=\"6B\", dim=100)" + ] + }, + { + "cell_type": "markdown", + "id": "e81eda58-67d1-4414-a91a-c238a1434729", + "metadata": {}, + "source": [ + "The following code builds a vocabulary object from a pretrained GloVe word embedding model and sets the default index to the `<unk>` token.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "5a3dbfda-7d7f-4256-937c-af441df1ee1b", + "metadata": {}, + "outputs": [], + "source": [ + "from torchtext.vocab import GloVe,vocab\n", + "# Build vocab from glove_vectors\n", + "vocab = vocab(glove_embedding.stoi, 0,specials=('<unk>', '<pad>'))\n", + "vocab.set_default_index(vocab[\"<unk>\"])" + ] + }, + { + "cell_type": "markdown", + "id": 
"e1fc78f6-0d04-4b35-afdd-a6193c66166e", + 
"metadata": {}, + "source": [ + "Let's count the number of words in the vocab:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "bc7de385-2867-4836-a6dc-f87a9f0c38f5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "400002" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vocab_size=len(vocab)\n", + "vocab_size" + ] + }, + { + "cell_type": "markdown", + "id": "aa4eca48-cfb4-4fb4-8ca3-180319852e0d", + "metadata": {}, + "source": [ + "Let's test the ```vocab``` function:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "59689af6-a6ed-49eb-a12e-51303c1eb9da", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[20]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vocab(['he'])" + ] + }, + { + "cell_type": "markdown", + "id": "6ca4130b-f548-4b92-a68f-b2d83b6b19a5", + "metadata": {}, + "source": [ + "### Data set splits\n", + "\n", + "The following converts the data set into map-style data sets and then performs a random split to create separate training and validation data sets. The training data set will contain 95% of the samples in the original training set, while the validation data set will contain the remaining 5%. These data sets can be used for training and evaluating a machine learning model for text classification on the IMDB data set. 
The final performance of the model will be evaluated on the hold-out test set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "be0403dd-b319-406e-96ea-d215c0a42f6c", + "metadata": {}, + "outputs": [], + "source": [ + "# Convert the training and testing iterators to map-style datasets.\n", + "train_dataset = to_map_style_dataset(train_iter)\n", + "test_dataset = to_map_style_dataset(test_iter)\n", + "\n", + "# Determine the number of samples to be used for training and validation (5% for validation).\n", + "num_train = int(len(train_dataset) * 0.95)\n", + "\n", + "# Randomly split the training dataset into training and validation datasets using `random_split`.\n", + "# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n", + "split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])" + ] + }, + { + "cell_type": "markdown", + "id": "82b6143c-d493-49b1-a255-b5aae8f9280a", + "metadata": {}, + "source": [ + "Be aware that the Skills Network currently does not offer GPU access to learners. As a result, training on the full data set could be time-consuming. To address this, you further reduce the size of the training set. This approach helps you mimic the training process as if a GPU were available. 
However, if you want to train using the full IMDB data set, you must either comment out or remove the two lines in the following code block.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "0b86c021-06f7-463d-95d7-c83f45de8605", + "metadata": {}, + "outputs": [], + "source": [ + "num_train = int(len(train_dataset) * 0.05)\n", + "split_train_, _ = random_split(split_train_, [num_train, len(split_train_) - num_train])" + ] + }, + { + "cell_type": "markdown", + "id": "e2e0359b-b92d-4aae-9916-aaadbd206ddb", + "metadata": {}, + "source": [ + "The following code checks to see if a CUDA-compatible GPU is available in the system using PyTorch, a popular deep learning framework. If a GPU is available, it assigns the device variable to \"cuda\" (which stands for CUDA, the parallel computing platform and application programming interface model developed by NVIDIA). If a GPU is not available, it assigns the device variable to \"cpu\" (which means the code will run on the CPU instead).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "3e37cbbd-1c52-4b02-b26c-c7f4b694e944", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "device(type='cuda')" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "device" + ] + }, + { + "cell_type": "markdown", + "id": "9c0631e5-5eb9-4e91-a84f-315381189cb9", + "metadata": {}, + "source": [ + "### Data loader\n", + "\n", + "The following code prepares the text processing pipeline with the tokenizer and vocabulary. 
The text pipeline is used to process the raw data strings from the data set iterators.\n", + "\n", + "The function **```text_pipeline```** first tokenizes the input text, then **```vocab```** is applied to get the token indices.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "6b1807bd-4548-435d-b763-cc1641c95c45", + "metadata": {}, + "outputs": [], + "source": [ + "def text_pipeline(x):\n", + " return vocab(tokenizer(x))" + ] + }, + { + "cell_type": "markdown", + "id": "58976c04-a071-4f9f-8e55-b0207f39a7d1", + "metadata": {}, + "source": [ + "In PyTorch, the **`collate_fn`** argument of a data loader customizes the way batches are created from individual samples. The provided code defines a `collate_batch` function for this purpose. It processes a batch of labels and text sequences, applying the `text_pipeline` function to convert each text into token indices. The sequences are then padded to a common length with `pad_sequence` and returned, together with the labels, as a tuple containing the label tensor and the padded text tensor. 
The function also ensures that the returned tensors are moved to the specified device (for example, GPU) for efficient computation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "442be80e-ae39-4001-98f7-0a0c5266bb65", + "metadata": {}, + "outputs": [], + "source": [ + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "def collate_batch(batch):\n", + " label_list, text_list = [], []\n", + " for _label, _text in batch:\n", + "\n", + " label_list.append(_label)\n", + " text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))\n", + "\n", + " label_list = torch.tensor(label_list, dtype=torch.int64)\n", + " text_list = pad_sequence(text_list, batch_first=True)\n", + "\n", + " return label_list.to(device), text_list.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "fb6f5466-b4b7-4772-9f09-9836249d4279", + "metadata": {}, + "source": [ + "You can convert the data set objects to data loaders by passing the `collate_batch` function as the `collate_fn` argument.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "95d54253-4781-4607-9f60-f4727117a520", + "metadata": {}, + "outputs": [], + "source": [ + "BATCH_SIZE = 32\n", + "\n", + "train_dataloader = DataLoader(\n", + " split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")\n", + "valid_dataloader = DataLoader(\n", + " split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")\n", + "test_dataloader = DataLoader(\n", + " test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3cc82919-3aa8-4f02-9be6-696a71335672", + "metadata": {}, + "source": [ + "Let's check to see what these data loaders generate.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "0e6a5049-2b5e-4661-ab9a-30b6072da1bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(tensor([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 
0, 0, 1, 1, 0,\n", + " 1, 0, 1, 1, 0, 1, 0, 0], device='cuda:0'),\n", + " tensor([[ 0, 3542, 0, ..., 0, 0, 0],\n", + " [ 40, 663, 838, ..., 0, 0, 0],\n", + " [ 39, 16, 2, ..., 0, 0, 0],\n", + " ...,\n", + " [16307, 0, 59, ..., 0, 0, 0],\n", + " [ 9, 12389, 1608, ..., 0, 0, 0],\n", + " [ 193, 44724, 144, ..., 0, 0, 0]], device='cuda:0'))" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "label,seqence=next(iter(valid_dataloader))\n", + "label,seqence" + ] + }, + { + "cell_type": "markdown", + "id": "1f0d739f-47a7-44e6-b61d-6ce1eecda895", + "metadata": {}, + "source": [ + "### Neural network\n", + "\n", + "This code defines a class called Net that represents a text classifier based on a PyTorch TransformerEncoder.\n", + "The constructor takes the following arguments:\n", + "\n", + "- `num_class`: The number of classes to classify\n", + "- `vocab_size`: The size of the vocabulary\n", + "- `freeze`: Whether to freeze the embedding layer\n", + "- `nhead`: The number of heads in the transformer encoder\n", + "- `dim_feedforward`: The dimension of the feedforward layer in the transformer encoder\n", + "- `num_layers`: The number of transformer encoder layers\n", + "- `dropout`: The dropout rate\n", + "- `activation`: The activation function to use in the transformer encoder\n", + "- `classifier_dropout`: The dropout rate for the classifier\n", + "\n", + "**Attributes:**\n", + "\n", + "- `emb`: An embedding layer that maps each word in the vocabulary to a dense vector representation\n", + "- `pos_encoder`: A positional encoding layer that adds positional information to the word vectors\n", + "- `transformer_encoder`: A transformer encoder layer that processes the sequence of word vectors and extracts high-level features\n", + "- `classifier`: A linear layer that maps the output of the transformer encoder to the desired number of classes\n", + "\n", + "---\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + 
"execution_count": 23, + "id": "7f290742-5806-47d8-8533-f3d92709eba9", + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + " \"\"\"\n", + " Text classifier based on a pytorch TransformerEncoder.\n", + " \"\"\"\n", + " def __init__(\n", + "\n", + " self,\n", + " num_class,vocab_size,\n", + " freeze=True,\n", + " nhead=2,\n", + " dim_feedforward=128,\n", + " num_layers=2,\n", + " dropout=0.1,\n", + " activation=\"relu\",\n", + " classifier_dropout=0.1):\n", + "\n", + " super().__init__()\n", + "\n", + " #self.emb = embedding=nn.Embedding.from_pretrained(glove_embedding.vectors,freeze=freeze)\n", + " self.emb = nn.Embedding.from_pretrained(glove_embedding.vectors,freeze=freeze)\n", + " embedding_dim = self.emb.embedding_dim\n", + "\n", + "\n", + " self.pos_encoder = PositionalEncoding(\n", + " d_model=embedding_dim,\n", + " dropout=dropout,\n", + " vocab_size=vocab_size,\n", + " )\n", + "\n", + " encoder_layer = nn.TransformerEncoderLayer(\n", + " d_model=embedding_dim,\n", + " nhead=nhead,\n", + " dim_feedforward=dim_feedforward,\n", + " dropout=dropout,\n", + " )\n", + " self.transformer_encoder = nn.TransformerEncoder(\n", + " encoder_layer,\n", + " num_layers=num_layers,\n", + " )\n", + " self.classifier = nn.Linear(embedding_dim, num_class)\n", + " self.d_model = embedding_dim\n", + "\n", + " def forward(self, x):\n", + " x = self.emb(x) * math.sqrt(self.d_model)\n", + " x = self.pos_encoder(x)\n", + " x = self.transformer_encoder(x)\n", + " x = x.mean(dim=1)\n", + " x = self.classifier(x)\n", + "\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "id": "68664b81-817c-4633-bc71-9b32f95ee7f3", + "metadata": {}, + "source": [ + "The model can then be trained on labeled data from the IMDB data set with two classes.\n", + "Let's create the model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "545ebebe-2206-47ce-81ab-9f168a80e4bd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + 
"Net(\n", + " (emb): Embedding(400000, 100)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (transformer_encoder): TransformerEncoder(\n", + " (layers): ModuleList(\n", + " (0-1): 2 x TransformerEncoderLayer(\n", + " (self_attn): MultiheadAttention(\n", + " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", + " )\n", + " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", + " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (dropout1): Dropout(p=0.1, inplace=False)\n", + " (dropout2): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " )\n", + " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "model = Net(num_class=2,vocab_size=vocab_size).to(device)\n", + "model" + ] + }, + { + "cell_type": "markdown", + "id": "18ece646-3872-41e1-bedb-be1c7d8a37cc", + "metadata": {}, + "source": [ + "The following **`predict`** function takes in a text, a text pipeline, and a model as inputs. 
It uses a pretrained model passed as a parameter to predict the label of the text for text classification on the IMDB data set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "4cde3566-d2df-40b8-8fd1-944018d4b0ab", + "metadata": {}, + "outputs": [], + "source": [ + "def predict(text, text_pipeline, model):\n", + " with torch.no_grad():\n", + " text = torch.unsqueeze(torch.tensor(text_pipeline(text)),0).to(device)\n", + " model.to(device)\n", + " output = model(text)\n", + " return imdb_label[output.argmax(1).item()]" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "c7388103-5814-402d-a7de-ede861dc0927", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "' negative review'" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predict(\"I like sports and stuff\", text_pipeline, model)" + ] + }, + { + "cell_type": "markdown", + "id": "40449174-d493-4521-8417-7950dafb0072", + "metadata": {}, + "source": [ + "You can create a function to evaluate the model's accuracy on a data set. 
Here, you define two nearly identical evaluation functions, one that provides a `tqdm` progress bar, and one that does not.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "1171efb1-cf5c-4a25-a0e0-91feb88e8ab7", + "metadata": {}, + "outputs": [], + "source": [ + "def evaluate(dataloader, model_eval):\n", + " model_eval.eval()\n", + " total_acc, total_count= 0, 0\n", + "\n", + " with torch.no_grad():\n", + " for label, text in tqdm(dataloader):\n", + " label, text = label.to(device), text.to(device)\n", + " output = model_eval(text)\n", + " predicted = torch.max(output.data, 1)[1]\n", + " total_acc += (predicted == label).sum().item()\n", + " total_count += label.size(0)\n", + " return total_acc / total_count" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "571400ae-320b-4d1a-a103-91592812d1c1", + "metadata": {}, + "outputs": [], + "source": [ + "def evaluate_no_tqdm(dataloader, model_eval):\n", + " model_eval.eval()\n", + " total_acc, total_count= 0, 0\n", + "\n", + " with torch.no_grad():\n", + " for label, text in dataloader:\n", + " label, text = label.to(device), text.to(device)\n", + " output = model_eval(text)\n", + " predicted = torch.max(output.data, 1)[1]\n", + " total_acc += (predicted == label).sum().item()\n", + " total_count += label.size(0)\n", + " return total_acc / total_count" + ] + }, + { + "cell_type": "markdown", + "id": "0092f33b-ac39-41d2-8525-443d2759b55e", + "metadata": {}, + "source": [ + "The following code evaluates the performance of your model. Note that this can take approximately 4 minutes on a CPU. **For efficiency, let's not run this cell now, but trust us that the performance of the untrained model is no better than average. 
If you wish to confirm this for yourself, you are free to uncomment this cell**:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "6b48b713-72a9-4267-b32e-dfab588c659c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████| 782/782 [00:09<00:00, 82.88it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.5" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader, model)" + ] + }, + { + "cell_type": "markdown", + "id": "b855a1d8-5e52-4fa2-a507-d7383b2ea73d", + "metadata": {}, + "source": [ + "Note that the current performance of the model is no better than random guessing. This outcome is expected, considering that the model has not undergone any training yet.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "8a28c3ce-3828-4b70-8442-ae9b7757805c", + "metadata": {}, + "source": [ + "# Training\n", + "\n", + "The following code defines the training function used to train your model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "376d4c30-7f50-4c97-9fa2-667f3c1a47eb", + "metadata": {}, + "outputs": [], + "source": [ + "def train_model(model, optimizer, criterion, train_dataloader, valid_dataloader, epochs=1000, save_dir=\"\", file_name=None):\n", + " cum_loss_list = []\n", + " acc_epoch = []\n", + " acc_old = 0\n", + " model_path = os.path.join(save_dir, file_name)\n", + " acc_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + \"_acc\")\n", + " loss_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + \"_loss\")\n", + " time_start = time.time()\n", + "\n", + " for epoch in tqdm(range(1, epochs + 1)):\n", + " model.train()\n", + " #print(model)\n", + " #for parm in model.parameters():\n", + " # print(parm.requires_grad)\n", + " \n", + " cum_loss = 0\n", + " for idx, (label, text) in enumerate(train_dataloader):\n", 
+ " optimizer.zero_grad()\n", + " label, text = label.to(device), text.to(device)\n", + "\n", + " predicted_label = model(text)\n", + " loss = criterion(predicted_label, label)\n", + " loss.backward()\n", + " #print(loss)\n", + " torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n", + " optimizer.step()\n", + " cum_loss += loss.item()\n", + " print(f\"Epoch {epoch}/{epochs} - Loss: {cum_loss}\")\n", + "\n", + " cum_loss_list.append(cum_loss)\n", + " accu_val = evaluate_no_tqdm(valid_dataloader,model)\n", + " acc_epoch.append(accu_val)\n", + "\n", + " if model_path and accu_val > acc_old:\n", + " print(accu_val)\n", + " acc_old = accu_val\n", + " if save_dir is not None:\n", + " #pass\n", + " print(\"save model epoch\",epoch)\n", + " torch.save(model.state_dict(), model_path)\n", + " save_list_to_file(lst=acc_epoch, filename=acc_dir)\n", + " save_list_to_file(lst=cum_loss_list, filename=loss_dir)\n", + "\n", + " time_end = time.time()\n", + " print(f\"Training time: {time_end - time_start}\")" + ] + }, + { + "cell_type": "markdown", + "id": "353d2098-4d5d-4a06-aa6a-c8866a6ecb0f", + "metadata": {}, + "source": [ + "### Train IMDB\n", + "\n", + "The following code sets the learning rate (LR) to 1, which determines the step size at which the optimizer updates the model's parameters during training. The CrossEntropyLoss criterion is used to calculate the loss between the model's predicted outputs and the ground truth labels. This loss function is commonly employed for multiclass classification tasks.\n", + "\n", + "The chosen optimizer is Stochastic Gradient Descent (SGD), which optimizes the model's parameters based on the computed gradients with respect to the loss function. The SGD optimizer uses the specified learning rate to control the size of the weight updates.\n", + "\n", + "Additionally, a learning rate scheduler is defined using StepLR. 
This scheduler adjusts the learning rate during training, reducing it by a factor (gamma) of 0.1 after every epoch (step) to improve convergence and fine-tune the model's performance. These components together form the essential setup for training a neural network using the specified learning rate, loss criterion, optimizer, and learning rate scheduler.\n", + "\n", + "For the sake of time efficiency, **the following lines are commented out and the model is not actually trained**. If you would like to get a glimpse of what training would look like, uncomment the following code block to train the model for 2 epochs. If you were to train this model in a real-world scenario, you would likely increase the number of epochs to a larger figure, such as 100 or more. Given the reduced training set defined earlier, it takes approximately 2 minutes to complete 2 epochs of training.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "03ab8dc6-6dda-44fa-b86c-1ad67e651463", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/2 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/sybqacL5p1qeEO8d4xRZNg/model-IMDB%20dataset%20small2-acc')\n", + "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eOt6woGoaOB565T0RLH5WA/model-IMDB%20dataset%20small2-loss')\n", + "acc_epoch = pickle.load(acc_urlopened)\n", + "cum_loss_list = pickle.load(loss_urlopened)\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "37fe982b-b1a7-4d31-a5fc-21481d97fa4e", + "metadata": {}, + "source": [ + "The following code loads your pretrained model and evaluates its performance on the test set. **For efficiency, let's not run the evaluation because it can take approximately 4 minutes to run. 
Instead, the result is reported underneath the cell. If you would like to confirm the result for yourself, you are free to uncomment the last line in the following code block.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "ca827b34-bd27-413f-ba76-8fde3d73c570", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/q66IH6a7lglkZ4haM6hB1w/model-IMDB%20dataset%20small2.pth')\n", + "model_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", + "model_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "f7421582-5341-4554-bb43-93eaeec31de6", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████| 782/782 [00:10<00:00, 71.81it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.83208" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader, model_)" + ] + }, + { + "cell_type": "markdown", + "id": "a91438ab-0a67-47b4-908c-2114cedb29e2", + "metadata": {}, + "source": [ + "As you can see, the pretrained model achieved an accuracy of approximately 83% on the test data.\n" + ] + }, + { + "cell_type": "markdown", + "id": "d8f01fc5-871b-4978-a2dd-724efa504014", + "metadata": {}, + "source": [ + "### Fine-tune a model pretrained on the AG News data set\n", + "\n", + "Rather than training a model on the IMDB data set as you did earlier, you can fine-tune a model that has been pretrained on the AG News data set, which is a collection of news articles. 
The goal of the AG News data set is to categorize news articles into one of four categories: Sports, Business, Sci/tech, or World. You’ll start training a model from scratch on the AG News data set. To save time, you can do this in just one cell. Also, for efficiency, **the training part is commented out**. If you want to train the model for 2 epochs on a smaller data set to demonstrate what the training process would look like, uncomment the part that says `### Uncomment to Train ###` before running the cell. Training for 2 epochs on the reduced data set can take approximately 3 minutes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "4db8b115-13e1-4d42-99e0-fd09dc0527c6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + " (emb): Embedding(400000, 100)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (transformer_encoder): TransformerEncoder(\n", + " (layers): ModuleList(\n", + " (0-1): 2 x TransformerEncoderLayer(\n", + " (self_attn): MultiheadAttention(\n", + " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", + " )\n", + " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", + " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (dropout1): Dropout(p=0.1, inplace=False)\n", + " (dropout2): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " )\n", + " (classifier): Linear(in_features=100, out_features=4, bias=True)\n", + ")" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_iter_ag_news = AG_NEWS(split=\"train\")\n", + "\n", + "num_class_ag_news = len(set([label for (label, text) in train_iter_ag_news ]))\n", + 
"num_class_ag_news\n", + "\n", + "# Split the dataset into training and testing iterators.\n", + "train_iter_ag_news, test_iter_ag_news = AG_NEWS()\n", + "\n", + "# Convert the training and testing iterators to map-style datasets.\n", + "train_dataset_ag_news = to_map_style_dataset(train_iter_ag_news)\n", + "test_dataset_ag_news = to_map_style_dataset(test_iter_ag_news)\n", + "\n", + "# Determine the number of samples to be used for training and validation (5% for validation).\n", + "num_train_ag_news = int(len(train_dataset_ag_news) * 0.95)\n", + "\n", + "# Randomly split the training dataset into training and validation datasets using `random_split`.\n", + "# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n", + "split_train_ag_news_, split_valid_ag_news_ = random_split(train_dataset_ag_news, [num_train_ag_news, len(train_dataset_ag_news) - num_train_ag_news])\n", + "\n", + "# Make the training set smaller to allow it to run fast as an example.\n", + "# IF YOU WANT TO TRAIN ON THE AG_NEWS DATASET, COMMENT OUT THE 2 LINEs BELOW.\n", + "# HOWEVER, NOTE THAT TRAINING WILL TAKE A LONG TIME\n", + "num_train_ag_news = int(len(train_dataset_ag_news) * 0.05)\n", + "split_train_ag_news_, _ = random_split(split_train_ag_news_, [num_train_ag_news, len(split_train_ag_news_) - num_train_ag_news])\n", + "\n", + "\n", + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "device\n", + "\n", + "def label_pipeline(x):\n", + " return int(x) - 1\n", + "\n", + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "def collate_batch_ag_news(batch):\n", + " label_list, text_list = [], []\n", + " for _label, _text in batch:\n", + " label_list.append(label_pipeline(_label))\n", + " text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))\n", + "\n", + "\n", + " label_list = torch.tensor(label_list, dtype=torch.int64)\n", + " text_list = pad_sequence(text_list, 
batch_first=True)\n", + "\n", + "\n", + " return label_list.to(device), text_list.to(device)\n", + "\n", + "BATCH_SIZE = 32\n", + "\n", + "train_dataloader_ag_news = DataLoader(\n", + " split_train_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", + ")\n", + "valid_dataloader_ag_news = DataLoader(\n", + " split_valid_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", + ")\n", + "test_dataloader_ag_news = DataLoader(\n", + " test_dataset_ag_news, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news\n", + ")\n", + "\n", + "\n", + "model_ag_news = Net(num_class=4,vocab_size=vocab_size).to(device)\n", + "model_ag_news.to(device)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "1ee6601d-69a0-4274-a237-38f981a6153b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/2 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bQk8mJu3Uct3I4JEsEtRnw/model-AG%20News%20small1-acc')\n", + "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KNQkqJWWwY_XfbFBRFhZNA/model-AG%20News%20small1-loss')\n", + "acc_epoch = pickle.load(acc_urlopened)\n", + "cum_loss_list = pickle.load(loss_urlopened)\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "a17edde1-7268-468d-ad53-d1880848efc6", + "metadata": {}, + "source": [ + "The following code loads the pretrained model and evaluates its performance on the AG News test set. **For efficiency, let's not run the evaluation because it can take a few minutes. Instead, claim that the pretrained model works well on the AG News dataset. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the following code block.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "e98831cf-bdb6-4e9c-9932-8ee7e6897450", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", + "model_ag_news_ = Net(vocab_size=vocab_size, num_class=4).to(device)\n", + "model_ag_news_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "346d1f7a-2331-4010-a979-ae71efb00fcc", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|████████████████████████████████████████| 238/238 [00:00<00:00, 323.29it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.9046052631578947" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader_ag_news, model_ag_news_)" + ] + }, + { + "cell_type": "markdown", + "id": "49ecfc70-585f-4fd5-a2d3-3eb558705103", + "metadata": {}, + "source": [ + "As you can see, the pretrained model performed well on the AG News data set, reaching an accuracy of about 90%. However, can this model be fine-tuned to perform well on the IMDB data set as well? Let's find out! 
You can begin by loading the pretrained AG News model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "92336349-f915-4471-a3d2-64c90ea31171", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", + "model_fine1 = Net(vocab_size=vocab_size, num_class=4).to(device)\n", + "model_fine1.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))\n" + ] + }, + { + "cell_type": "markdown", + "id": "80e739c4-0b6b-44fa-8d5d-456c52576dd6", + "metadata": {}, + "source": [ + "The IMDB dataset is a binary classification task with only two classes (positive and negative reviews). Therefore, the output layer of the AG NEWS model should be adjusted to have just two output neurons to reflect the binary nature of the IMDB dataset. 
This adjustment is essential for the model to accurately learn and predict the sentiment of movie reviews in the IMDB dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "1190f1bf-a030-43b1-b146-5f27715ce6a4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original final layer: Linear(in_features=100, out_features=4, bias=True)\n", + "Input dimension final layer: 100\n" + ] + } + ], + "source": [ + "model_fine1.classifier\n", + "in_features = model_fine1.classifier.in_features\n", + "print(\"Original final layer:\", model_fine1.classifier)\n", + "print(\"Input dimension final layer:\", in_features)" + ] + }, + { + "cell_type": "markdown", + "id": "682f6a52-d8d8-4ed7-9162-1bf74dee5542", + "metadata": {}, + "source": [ + "You can change the final layer into a two-class problem.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "46da3876-5b30-4174-bcf1-2702836f3d6d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + " (emb): Embedding(400000, 100)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (transformer_encoder): TransformerEncoder(\n", + " (layers): ModuleList(\n", + " (0-1): 2 x TransformerEncoderLayer(\n", + " (self_attn): MultiheadAttention(\n", + " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", + " )\n", + " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", + " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (dropout1): Dropout(p=0.1, inplace=False)\n", + " (dropout2): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " )\n", + " (classifier): Linear(in_features=100, out_features=2, 
bias=True)\n", + ")" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_fine1.classifier = nn.Linear(in_features, 2)\n", + "model_fine1.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "672f77ee-1134-4656-9e85-c158cb4a64ea", + "metadata": {}, + "source": [ + "The following code shows the layers that are frozen (`requires_grad == False`) and unfrozen (`requires_grad == True`) in the model. The unfrozen layers will have their weights updated during fine-tuning.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "9afe1e34-b1e3-42a8-b67d-5e99e761dbab", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "emb.weight requires_grad: False\n", + "transformer_encoder.layers.0.self_attn.in_proj_weight requires_grad: True\n", + "transformer_encoder.layers.0.self_attn.in_proj_bias requires_grad: True\n", + "transformer_encoder.layers.0.self_attn.out_proj.weight requires_grad: True\n", + "transformer_encoder.layers.0.self_attn.out_proj.bias requires_grad: True\n", + "transformer_encoder.layers.0.linear1.weight requires_grad: True\n", + "transformer_encoder.layers.0.linear1.bias requires_grad: True\n", + "transformer_encoder.layers.0.linear2.weight requires_grad: True\n", + "transformer_encoder.layers.0.linear2.bias requires_grad: True\n", + "transformer_encoder.layers.0.norm1.weight requires_grad: True\n", + "transformer_encoder.layers.0.norm1.bias requires_grad: True\n", + "transformer_encoder.layers.0.norm2.weight requires_grad: True\n", + "transformer_encoder.layers.0.norm2.bias requires_grad: True\n", + "transformer_encoder.layers.1.self_attn.in_proj_weight requires_grad: True\n", + "transformer_encoder.layers.1.self_attn.in_proj_bias requires_grad: True\n", + "transformer_encoder.layers.1.self_attn.out_proj.weight requires_grad: True\n", + "transformer_encoder.layers.1.self_attn.out_proj.bias requires_grad: True\n", + 
"transformer_encoder.layers.1.linear1.weight requires_grad: True\n", + "transformer_encoder.layers.1.linear1.bias requires_grad: True\n", + "transformer_encoder.layers.1.linear2.weight requires_grad: True\n", + "transformer_encoder.layers.1.linear2.bias requires_grad: True\n", + "transformer_encoder.layers.1.norm1.weight requires_grad: True\n", + "transformer_encoder.layers.1.norm1.bias requires_grad: True\n", + "transformer_encoder.layers.1.norm2.weight requires_grad: True\n", + "transformer_encoder.layers.1.norm2.bias requires_grad: True\n", + "classifier.weight requires_grad: True\n", + "classifier.bias requires_grad: True\n" + ] + } + ], + "source": [ + "for name, param in model_fine1.named_parameters():\n", + " print(f\"{name} requires_grad: {param.requires_grad}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2f1597ed-a727-42d5-bee1-7aaabd6c7681", + "metadata": {}, + "source": [ + "The following code block simulates fine-tuning on the shortened training set for just 2 epochs. **For the sake of time efficiency, this code block has been commented out**. 
If you want to see what training looks like, uncomment the following code block, but remember that this code could take approximately 2 minutes to run.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "3d5c5154-4b9d-4360-83d5-fc21a6ba930b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/2 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/3LEJw8BRgJJFGqlLxaETxA/model-fine1-acc')\n", + "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/-CT1h97vjv0TolY82Nw29g/model-fine1-loss')\n", + "acc_epoch = pickle.load(acc_urlopened)\n", + "cum_loss_list = pickle.load(loss_urlopened)\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "85866d1f-1a0d-487c-a426-9da326a06c1f", + "metadata": {}, + "source": [ + "The following line loads a prefine-tuned model that was trained for 100 epochs on the full IMDB training set and evaluates its performance on the IMDB test set. **For the sake of efficiency, let's not run the evaluation because it can take a few minutes to run. Instead, report the result underneath the cell. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the code block.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "cda6a606-1a87-4846-9926-54edc577879a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/e0WOHKh5dnrbC2lGhpsMMw/model-fine1.pth')\n", + "model_fine1_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", + "model_fine1_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "e75cfe66-a265-4e7a-8d5f-a7c68f17c6e3", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████| 782/782 [00:10<00:00, 77.06it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.8604" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader, model_fine1_)" + ] + }, + { + "cell_type": "markdown", + "id": "c9b127d6-b41d-4195-ac5a-3da2d8d70b33", + "metadata": {}, + "source": [ + "This model achieves an accuracy of 86% on the test data, which is higher than the 83% achieved by the model trained from scratch on the IMDB dataset. Fine-tuning was as time-intensive as training the model from scratch, but the higher accuracy shows the benefit of starting from pretrained weights. Much of the computational effort was devoted to updating the transformer layers. 
To expedite the training process, one viable strategy is to focus on training the final layer only, which can significantly reduce the computational load but might compromise the model's accuracy.\n" + ] + }, + { + "cell_type": "markdown", + "id": "55331d63-1150-465b-b0bc-d13bbd24fb7c", + "metadata": {}, + "source": [ + "### Fine-tune the final layer only\n", + "\n", + "Fine-tuning the final output layer of a neural network is similar to fine-tuning the whole model. You can begin by loading the pretrained model that you would like to fine-tune. In this case, it is the same model pretrained on the AG News data set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "c2ffcf34-695f-4b9d-b6c2-5529a74d568a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", + "model_fine2 = Net(vocab_size=vocab_size, num_class=4).to(device)\n", + "model_fine2.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "markdown", + "id": "5fa5ba41-8d73-4649-b459-cbc321ca26e2", + "metadata": {}, + "source": [ + "Now, the key difference. You iterate through all of the parameters in the `model_fine2` model and set the `requires_grad` attribute of each parameter to `False`. 
This effectively freezes all of the layers in the model, meaning that their weights are not updated during training.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "000978e5-6244-4be2-8fd5-3e0c7ed1c619", + "metadata": {}, + "outputs": [], + "source": [ + "# Freeze all layers in the model\n", + "for param in model_fine2.parameters():\n", + " param.requires_grad = False" + ] + }, + { + "cell_type": "markdown", + "id": "54b17c25-96f4-43b9-b0a8-963085cf5638", + "metadata": {}, + "source": [ + "Replace the final layer to reflect the fact that you are solving a two-class problem. Note that the new layer will be unfrozen.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "f0c29db4-3051-4247-8459-b7adab12fa45", + "metadata": {}, + "outputs": [], + "source": [ + "dim=model_fine2.classifier.in_features" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "b79d7ed1-7a83-4140-ad93-b5e28cdab784", + "metadata": {}, + "outputs": [], + "source": [ + "model_fine2.classifier = nn.Linear(dim, 2)" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "9387fde1-0a71-45b5-89cf-a2c90ff663d1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + " (emb): Embedding(400000, 100)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (transformer_encoder): TransformerEncoder(\n", + " (layers): ModuleList(\n", + " (0-1): 2 x TransformerEncoderLayer(\n", + " (self_attn): MultiheadAttention(\n", + " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", + " )\n", + " (linear1): Linear(in_features=100, out_features=128, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (linear2): Linear(in_features=128, out_features=100, bias=True)\n", + " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " 
(dropout1): Dropout(p=0.1, inplace=False)\n", + " (dropout2): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " )\n", + " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_fine2.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "42123afb-6a78-4f98-87dd-e87c2cfa7c4f", + "metadata": {}, + "source": [ + "The following block simulates fine-tuning on the shortened training set for just 2 epochs. **For the sake of time efficiency, this code block has been commented out**. The following code should take a shorter amount of time to train than the full fine-tuning conducted previously because only the final layer is unfrozen.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "3114fa2b-90ee-4955-8459-e01c1db83f2d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/2 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/UdR3ApQnxSeV2mrA0CbiLg/model-fine2-acc')\n", + "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/rWGDIF-uL2dEngWcIo9teQ/model-fine2-loss')\n", + "acc_epoch = pickle.load(acc_urlopened)\n", + "cum_loss_list = pickle.load(loss_urlopened)\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "3043392c-ee76-4103-a758-83e90f594604", + "metadata": {}, + "source": [ + "The following line loads the pretrained model and evaluates its performance on the test set. **For efficiency, let's not run the evaluation because it can take a few minutes to run. Instead, report the result underneath the cell. 
If you would like to confirm the result for yourself, feel free to uncomment the last line in the following code block.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "14e51cf2-4e3f-4adf-8c58-f65da0e74f65", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/B-1H6lpDg-A0zRwpB6Ek2g/model-fine2.pth')\n", + "model_fine2_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", + "model_fine2_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "d7cf2dfa-3705-4f3e-a114-b4c56368cc77", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████| 782/782 [00:09<00:00, 85.13it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.64144" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader, model_fine2_)" + ] + }, + { + "cell_type": "markdown", + "id": "dbd9b301-88ba-42d4-b8ca-6e00379fb3b4", + "metadata": {}, + "source": [ + "The previous code indicates that although fine-tuning the final layer takes a significantly smaller amount of time than fine-tuning the whole model, the performance of the model with just the last layer unfrozen is significantly worse (64% accuracy) than the fine-tuned model with all layers unfrozen (86% accuracy).\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "d23b9fe8-1655-4c4c-8ee2-f78db934d5f2", + "metadata": {}, + "source": [ + "# Adapters\n", + "FeatureAdapter is a neural network module that introduces a low-dimensional bottleneck in a transformer architecture to allow fine-tuning with fewer parameters. 
It compresses the original high-dimensional embeddings into a lower dimension, applies a non-linear transformation, and then expands it back to the original dimension. This process is followed by a residual\n", + "connection that adds the transformed output back to the original input to preserve information and\n", + "promote gradient flow.\n", + "\n", + "## Benefits of using adapters in neural networks\n", + "\n", + "- **Efficient fine-tuning**: Adapters allow for targeted updates to specific parts of the model, reducing the need to retrain large sections of the network.\n", + "\n", + "- **Parameter efficiency**: By adding only a few parameters, adapters make it feasible to modify large models without substantial computational overhead.\n", + "\n", + "- **Preservation of pretrained features**: Adapters enable the modification of a model while retaining the valuable features learned during extensive pretraining.\n", + "\n", + "- **Modularity and flexibility**: They enhance the modularity of models, allowing easy adaptation to various tasks without altering core architecture.\n", + "\n", + "- **Task-specific adaptation**: Adapters can be tailored to improve performance on particular tasks, optimizing the model’s effectiveness.\n", + "\n", + "- **Transfer learning and domain adaptation**: They facilitate the adaptation of models to new domains, bridging gaps between different data distributions.\n", + "\n", + "- **Continual learning**: Adapters support the model's ability to learn new information continuously without forgetting previous knowledge.\n", + "\n", + "- **Reduced risk of overfitting**: With fewer trainable parameters, adapters help prevent overfitting, especially on smaller data sets.\n", + "\n", + "The following code shows an adapter model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "c46c86b1-eb04-4976-9fa7-c3fc06339bae", + "metadata": {}, + "outputs": [], + "source": [ + "class FeatureAdapter(nn.Module):\n", + " \"\"\"\n", + 
" Attributes:\n", + " size (int): The bottleneck dimension to which the embeddings are temporarily reduced.\n", + " model_dim (int): The original dimension of the embeddings or features in the transformer model.\n", + " \"\"\"\n", + " def __init__(self, bottleneck_size=50, model_dim=100):\n", + " super().__init__()\n", + " self.bottleneck_transform = nn.Sequential(\n", + " nn.Linear(model_dim, bottleneck_size), # Down-project to a smaller dimension\n", + " nn.ReLU(), # Apply non-linearity\n", + " nn.Linear(bottleneck_size, model_dim) # Up-project back to the original dimension\n", + " )\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass of the FeatureAdapter. Applies the bottleneck transformation to the input\n", + " tensor and adds a skip connection.\n", + "\n", + " Args:\n", + " x (Tensor): Input tensor with shape (batch_size, seq_length, model_dim).\n", + "\n", + " Returns:\n", + " Tensor: Output tensor after applying the adapter transformation and skip connection,\n", + " maintaining the original input shape.\n", + " \"\"\"\n", + " transformed_features = self.bottleneck_transform(x) # Transform features through the bottleneck\n", + " output_with_residual = transformed_features + x # Add the residual connection\n", + " return output_with_residual" + ] + }, + { + "cell_type": "markdown", + "id": "a5c12e9a-5bf1-4a4f-b792-25a264253d28", + "metadata": {}, + "source": [ + "The adapted class wraps this adapter functionality around any specified linear layer, enhancing its output with the non-linearity of a ReLU activation function. 
This setup is particularly useful for experimenting with subtle architectural modifications in deep learning models, facilitating fine-tuning and potentially improving model performance on complex tasks.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "97d59ae0-610e-49bb-a9c4-1020c2dc468d", + "metadata": {}, + "outputs": [], + "source": [ + "class Adapted(nn.Module):\n", + " def __init__(self, linear,bottleneck_size=None):\n", + " super(Adapted, self).__init__()\n", + " self.linear = linear\n", + " model_dim = linear.out_features\n", + " if bottleneck_size is None:\n", + " bottleneck_size = model_dim//2 # Define default bottleneck size as half the model_dim\n", + "\n", + " # Initialize FeatureAdapter with calculated bottleneck_size and model_dim\n", + " self.adaptor = FeatureAdapter(bottleneck_size=bottleneck_size, model_dim=model_dim)\n", + "\n", + " def forward(self, x):\n", + " # First, the input x is passed through the linear layer\n", + " x=self.linear(x)\n", + " # Then it's adapted using FeatureAdapter\n", + " x= self.adaptor(x)\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "id": "d8cfb80d-eb8c-4342-8066-33134c5b456d", + "metadata": {}, + "source": [ + "You load the pretrained transformer model that was trained on the AG News dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "01070ca9-5037-470e-b8f7-ceddfc2f279b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/9c3Dh2O_jsYBShBuchUNlg/model-AG%20News%20small1.pth')\n", + "model_adapters = Net(vocab_size=vocab_size, num_class=4).to(device)\n", + "model_adapters.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "markdown", + "id": 
"e1df412e-51c5-4e8c-9dbe-c6d38f11a0cf", + "metadata": {}, + "source": [ + "\n", + "First, freeze the parameters of a model named model_adapters to prevent them from being updated during training. Then, retrieve the number of input features for the classifier, and replace the classifier with a new linear layer that outputs to two classes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "1820db89-b3b0-4cb2-b42f-1826ba1ec6d0", + "metadata": {}, + "outputs": [], + "source": [ + "for param in model_adapters.parameters():\n", + " param.requires_grad = False\n", + "\n", + "dim= model_adapters.classifier.in_features\n", + "\n", + "model_adapters.classifier = nn.Linear(dim, 2)" + ] + }, + { + "cell_type": "markdown", + "id": "905ad98d-2f62-475a-a7e0-a2f7119b5081", + "metadata": {}, + "source": [ + "Let's explore how to apply the adapted object to a linear layer to obtain the first output. You can obtain the unadapted linear layer for the first output by:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "cbdcd100-7fda-45ee-a823-9233b7e68ca6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Linear(in_features=100, out_features=128, bias=True)\n" + ] + } + ], + "source": [ + "my_example_layer=model_adapters.transformer_encoder.layers[0].linear1\n", + "print(my_example_layer)" + ] + }, + { + "cell_type": "markdown", + "id": "1cf2017f-bb26-4b9c-86ea-f32cc5ce4450", + "metadata": {}, + "source": [ + "In the following code, you copy the linear layer and add an adapter layer to it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "83fd78c1-111d-45f6-9adf-3ab5c7264a8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, 
out_features=64, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=64, out_features=128, bias=True)\n", + " )\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "my_adapeted_layer=Adapted(my_example_layer)\n", + "print(my_adapeted_layer)" + ] + }, + { + "cell_type": "markdown", + "id": "50e35b98-bcbd-4aec-b039-464e87ddd6fb", + "metadata": {}, + "source": [ + "You can print the adapted layer and show that the new layers have their requires_grad attribute set to True, indicating that these layers will be updated during training.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "d23c9515-4303-437d-8f02-99a3723e8b89", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "False\n", + "False\n", + "True\n", + "True\n", + "True\n", + "True\n" + ] + } + ], + "source": [ + "for parm in my_adapeted_layer.parameters():\n", + " print(parm.requires_grad)" + ] + }, + { + "cell_type": "markdown", + "id": "1a6bc1b1-f1bb-4c01-99ee-c2e16e5090c2", + "metadata": {}, + "source": [ + "You can set a layer in the model to the adapter layer, as shown in the following code in the commented-out line. However, because there are many layers, a more systematic approach would be to traverse the model and replace specific layers with the adapter layer. 
Note that you should set the bottleneck size to 24, ensuring that there are fewer parameters to train than during a full fine-tuning.\n" + ] + }, { "cell_type": "code", - "execution_count": null, - "id": "94133067-93b7-4652-b497-040866a7b78c", + "execution_count": 79, + "id": "f49d7836-5487-4cb4-a177-797f4eb75cee", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "# Adapt a specific layer\n", + "#model_adapters.transformer_encoder.layers[0].linear1=Adapted(my_example_layer)" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "bc5a4706-ef6f-4d1d-9b44-4e6e3cfba925", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Find number of layers\n", + "N_layers=len(model_adapters.transformer_encoder.layers)\n", + "N_layers" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "6af0d8e7-f38c-47a6-9ba4-7a74994e7f33", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " before linear1\n", + "Linear(in_features=100, out_features=128, bias=True)\n", + " after linear1\n", + "Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=128, bias=True)\n", + " )\n", + " )\n", + ")\n", + " before linear2\n", + "Linear(in_features=128, out_features=100, bias=True)\n", + " after linear2\n", + "Adapted(\n", + " (linear): Linear(in_features=128, out_features=100, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=100, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=100, bias=True)\n", + " )\n", + " )\n", + ")\n", + " 
before linear1\n", + "Linear(in_features=100, out_features=128, bias=True)\n", + " after linear1\n", + "Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=128, bias=True)\n", + " )\n", + " )\n", + ")\n", + " before linear2\n", + "Linear(in_features=128, out_features=100, bias=True)\n", + " after linear2\n", + "Adapted(\n", + " (linear): Linear(in_features=128, out_features=100, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=100, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=100, bias=True)\n", + " )\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "# Traverse model and adapt\n", + "for n in range(N_layers):\n", + " encoder=model_adapters.transformer_encoder.layers[n]\n", + " if encoder.linear1:\n", + " print(\" before linear1\")\n", + " print(encoder.linear1)\n", + " model_adapters.transformer_encoder.layers[n].linear1=Adapted(encoder.linear1, bottleneck_size=24)\n", + " print(\" after linear1\")\n", + " print(model_adapters.transformer_encoder.layers[n].linear1)\n", + "\n", + " if encoder.linear2:\n", + " print(\" before linear2\")\n", + " print(model_adapters.transformer_encoder.layers[n].linear2)\n", + " model_adapters.transformer_encoder.layers[n].linear2=Adapted(encoder.linear2, bottleneck_size=24)\n", + " print(\" after linear2\")\n", + " print(model_adapters.transformer_encoder.layers[n].linear2)" + ] + }, + { + "cell_type": "markdown", + "id": "1ea3e84d-649a-462d-8c21-ce1d4a915ce2", + "metadata": {}, + "source": [ + "The following code sends the model to the device.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "67b7450d-b1f4-42c6-9995-b13da7a8a7d7", + "metadata": {}, 
+ "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + " (emb): Embedding(400000, 100)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (transformer_encoder): TransformerEncoder(\n", + " (layers): ModuleList(\n", + " (0-1): 2 x TransformerEncoderLayer(\n", + " (self_attn): MultiheadAttention(\n", + " (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)\n", + " )\n", + " (linear1): Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=128, bias=True)\n", + " )\n", + " )\n", + " )\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (linear2): Adapted(\n", + " (linear): Linear(in_features=128, out_features=100, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=100, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=100, bias=True)\n", + " )\n", + " )\n", + " )\n", + " (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)\n", + " (dropout1): Dropout(p=0.1, inplace=False)\n", + " (dropout2): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " )\n", + " (classifier): Linear(in_features=100, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Send model to device\n", + "model_adapters.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "d233586b-c7ef-49e6-b684-b8d24997db80", + "metadata": {}, + "source": [ + "Finally, the following code simulates training of the adapted model by training on a shortened IMDB train set for 2 
epochs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "f275d5f0-ee19-469b-a957-98bf72d3f3c3", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/2 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/D49zrrMPWO_ktwQo7PSHIQ/model-adapters-acc')\n", + "loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/RXWlmyaco695RiaoU7QsnA/model-adapters-loss')\n", + "acc_epoch = pickle.load(acc_urlopened)\n", + "cum_loss_list = pickle.load(loss_urlopened)\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "f7e5392a-549b-4b68-968a-8c8d70dbf05d", + "metadata": {}, + "source": [ + "The following code loads the adapted model fine-tuned for 100 epochs on the full IMDB train set and evaluates its performance on the IMDB test set.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "5caaa3f7-5c89-4f27-8ef0-3d3d29f79b66", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " before linear1\n", + "Linear(in_features=100, out_features=128, bias=True)\n", + " after linear1\n", + "Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=128, bias=True)\n", + " )\n", + " )\n", + ")\n", + " before linear2\n", + "Linear(in_features=128, out_features=100, bias=True)\n", + " after linear2\n", + "Adapted(\n", + " (linear): Linear(in_features=128, out_features=100, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=100, out_features=24, 
bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=100, bias=True)\n", + " )\n", + " )\n", + ")\n", + " before linear1\n", + "Linear(in_features=100, out_features=128, bias=True)\n", + " after linear1\n", + "Adapted(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=128, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=128, bias=True)\n", + " )\n", + " )\n", + ")\n", + " before linear2\n", + "Linear(in_features=128, out_features=100, bias=True)\n", + " after linear2\n", + "Adapted(\n", + " (linear): Linear(in_features=128, out_features=100, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=100, out_features=24, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=24, out_features=100, bias=True)\n", + " )\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "model_adapters_ = Net(vocab_size=vocab_size, num_class=2).to(device)\n", + "for n in range(N_layers):\n", + " encoder=model_adapters_.transformer_encoder.layers[n]\n", + " if encoder.linear1:\n", + " print(\" before linear1\")\n", + " print(encoder.linear1)\n", + " model_adapters_.transformer_encoder.layers[n].linear1=Adapted(encoder.linear1, bottleneck_size=24)\n", + " print(\" after linear1\")\n", + " print(model_adapters_.transformer_encoder.layers[n].linear1)\n", + "\n", + " if encoder.linear2:\n", + " print(\" before linear2\")\n", + " print(model_adapters_.transformer_encoder.layers[n].linear2)\n", + " model_adapters_.transformer_encoder.layers[n].linear2=Adapted(encoder.linear2, bottleneck_size=24)\n", + " print(\" after linear2\")\n", + " print(model_adapters_.transformer_encoder.layers[n].linear2)\n", + "\n", + "model_adapters_.to(device)\n", + "for param in model_adapters_.parameters():\n", + " param.requires_grad = 
False\n" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "d582b1d3-0fcc-473e-82de-58f9ab1443ef", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 90, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/PGhd5G_NVrWNH-_jdjwNlw/model-adapters.pth')\n", + "model_adapters_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "df69e146-1a1f-4267-9372-9834fcf6616d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████| 782/782 [00:11<00:00, 66.29it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "0.85608" + ] + }, + "execution_count": 91, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader, model_adapters_)" + ] + }, + { + "cell_type": "markdown", + "id": "08262362-924a-42d0-b928-0cf6ffe5331c", + "metadata": {}, + "source": [ + "As you can see, the performance of the fine-tuned adapted model is nearly identical to the fully fine-tuned model, with both models achieving a roughly 86% accuracy. This is an especially surprising result because a significantly smaller number of weights were updated for the adapted model than the fully fine-tuned model. 
Note that only the adapter layers with a bottleneck size of 24 and the final classifier layer are unfrozen.\n", + "\n", + "This result shows that adapters can be used for parameter-efficient fine-tuning (PEFT) and that the performance of a model fine-tuned using adapters can be almost as good as that of a fully fine-tuned model with all of the layers unfrozen!\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "3b255d65-8f64-4c95-8fd6-ac69d1a67a5a", + "metadata": {}, + "source": [ + "## Exercise: Adapt linear layers in a different network\n", + "\n", + "The following code defines a neural network called `NeuralNetwork`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "82730ce9-d9a9-483f-8d00-e0cbd78de764", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NeuralNetwork(\n", + " (flatten): Flatten(start_dim=1, end_dim=-1)\n", + " (linear_relu_stack): Sequential(\n", + " (0): Linear(in_features=784, out_features=512, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=512, out_features=512, bias=True)\n", + " (3): ReLU()\n", + " (4): Linear(in_features=512, out_features=10, bias=True)\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "class NeuralNetwork(nn.Module):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.flatten = nn.Flatten()\n", + " self.linear_relu_stack = nn.Sequential(\n", + " nn.Linear(28*28, 512),\n", + " nn.ReLU(),\n", + " nn.Linear(512, 512),\n", + " nn.ReLU(),\n", + " nn.Linear(512, 10),\n", + " )\n", + "\n", + " def forward(self, x):\n", + " x = self.flatten(x)\n", + " logits = self.linear_relu_stack(x)\n", + " return logits\n", + "\n", + "exercise_model = NeuralNetwork()\n", + "\n", + "exercise_model.to(device)\n", + "for param in exercise_model.parameters():\n", + " param.requires_grad = False\n", + "\n", + "print(exercise_model)" + ] + }, + { + "cell_type": "markdown", + "id": "ed4e3b4b-a5df-4543-8d35-adac4ff4b093", + "metadata": {}, 
+ "source": [ + "`NeuralNetwork` is a neural network that uses the `Sequential` container from PyTorch. Adapt the first two linear layers in the `Sequential` container by using the bottleneck adapter with a bottleneck size of 30. Also, change the last linear layer to a layer that has 5 outputs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "2fde3cfa-d77d-4ad3-82e1-a3f7eb1d0cf0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NeuralNetwork(\n", + " (flatten): Flatten(start_dim=1, end_dim=-1)\n", + " (linear_relu_stack): Sequential(\n", + " (0): Adapted(\n", + " (linear): Linear(in_features=784, out_features=512, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=512, out_features=30, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=30, out_features=512, bias=True)\n", + " )\n", + " )\n", + " )\n", + " (1): ReLU()\n", + " (2): Adapted(\n", + " (linear): Linear(in_features=512, out_features=512, bias=True)\n", + " (adaptor): FeatureAdapter(\n", + " (bottleneck_transform): Sequential(\n", + " (0): Linear(in_features=512, out_features=30, bias=True)\n", + " (1): ReLU()\n", + " (2): Linear(in_features=30, out_features=512, bias=True)\n", + " )\n", + " )\n", + " )\n", + " (3): ReLU()\n", + " (4): Linear(in_features=512, out_features=5, bias=True)\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "### REPLACE THIS YOUR ANSWER ###\n", + "exercise_model.linear_relu_stack[0] = Adapted(exercise_model.linear_relu_stack[0], bottleneck_size=30)\n", + "exercise_model.linear_relu_stack[2] = Adapted(exercise_model.linear_relu_stack[2], bottleneck_size=30)\n", + "exercise_model.linear_relu_stack[4] = nn.Linear(512, 5)\n", + "print(exercise_model)" + ] + }, + { + "cell_type": "markdown", + "id": "f8244ee4-45b9-4713-8480-6aadf7a66b43", + "metadata": {}, + "source": [ + "
\n", + " Click here for the solution\n", + "\n", + "```python\n", + "exercise_model.linear_relu_stack[0] = Adapted(exercise_model.linear_relu_stack[0], bottleneck_size=30)\n", + "exercise_model.linear_relu_stack[2] = Adapted(exercise_model.linear_relu_stack[2], bottleneck_size=30)\n", + "exercise_model.linear_relu_stack[4] = nn.Linear(512, 5)\n", + "print(exercise_model)\n", + "```\n", + "\n", + "
\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "151060fe-37c4-4ee1-8034-539b0e3a657a", + "metadata": {}, + "source": [ + "## Congratulations! You have completed the lab\n", + "\n", + "## Authors\n", + "\n", + "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo)\n", + "\n", + "Joseph has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n", + "\n", + "[Wojciech \"Victor\" Fulmyk](https://www.linkedin.com/in/wfulmyk) \n", + "\n", + "Wojciech \"Victor\" Fulmyk is a Data Scientist at IBM, and a PhD Candidate in economics at the University of Calgary.\n", + "\n", + "[Ashutosh Sagar](https://www.linkedin.com/in/ashutoshsagar/) is completing his MS in CS from Dalhousie University. He has previous experience working with Natural Language Processing and as a Data Scientist.\n", + "\n", + "## References\n", + "\n", + "\n", + "[TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)\n", + "\n", + "[Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)\n", + "\n", + "[Simple, Scalable Adaptation for Neural Machine Translation](https://arxiv.org/pdf/1909.08478)\n", + "\n", + "© Copyright IBM Corporation. 
All rights reserved.\n" + ] } ], "metadata": { diff --git a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/Instruction fine-tuning-v1.ipynb b/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/Instruction fine-tuning-v1.ipynb deleted file mode 100755 index 3c3120d..0000000 --- a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/Instruction fine-tuning-v1.ipynb +++ /dev/null @@ -1,1370 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

\n", - " \n", - " \"Skills\n", - " \n", - "

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Instruction-Tuning with LLMs\n", - "\n", - "\n", - "Instruction-based fine-tuning, referred to as instruction GPT. It trains the language models to follow specific instructions and generate appropriate responses. For instruction-tuning, the dataset plays an important role as it provides structured examples of instructions, contexts, and responses, allowing the model to learn how to handle various tasks effectively. Instruction GPT often uses human feedback to refine and improve model performance; however, this lab doesn't cover this aspect.\n", - "\n", - "The context and instruction are concatenated to form a single input sequence that the model can understand and use to generate the correct response.\n", - "\n", - "#### Context and instruction\n", - "\n", - "\t•\tInstruction: A command to specify what the model should do\n", - "\t•\tContext: Additional information or background required for performing the instruction\n", - "\t•\tCombined input: The instruction and context combine together into a single input sequence\n", - " \n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's review certain examples for various templates:\n", - "\n", - "---\n", - "#### Response template\n", - "Template: `### Question: {question}\\n ### Answer: {answer}`\n", - "\n", - "Example:\n", - "```\n", - "### Question: What is the capital of France?\n", - "### Answer: Paris\n", - "```\n", - "\n", - "---\n", - "#### Conversation template\n", - "\n", - "Template: `### User: {user_input}\\n ### Bot: {bot_response}`\n", - "Example:\n", - "```\n", - "### User: How are you today?\n", - "### Bot: I'm doing great, thank you! 
How can I assist you today?\n", - "```\n", - "\n", - "---\n", - "#### Instruction and output template\n", - "\n", - "Template: `### Instruction: {instruction}\\n ### Output: {output}`\n", - "\n", - "Example:\n", - "```\n", - "### Instruction: Translate the following sentence to Spanish: \"Hello, how are you?\"\n", - "### Output: \"Hola, ¿cómo estás?\"\n", - "```\n", - "\n", - "---\n", - "#### Completion template\n", - "\n", - "Template: `{prompt} ### Completion: {completion}`\n", - "Example:\n", - "```\n", - "Once upon a time in a faraway land, ### Completion: there lived a wise old owl who knew all the secrets of the forest.\n", - "```\n", - "\n", - "#### Summarization template\n", - "\n", - "Template: `### Text: {text}\\n ### Summary: {summary}`\n", - "\n", - "Example:\n", - "```\n", - "### Text: The quick brown fox jumps over the lazy dog.\n", - "### Summary: A fox jumps over a dog.\n", - "```\n", - "\n", - "---\n", - "#### Dialogue template\n", - "\n", - "Template: `### Speaker 1: {utterance_1}\\n ### Speaker 2: {utterance_2}\\n ### Speaker 1: {utterance_3}`\n", - "\n", - "Example:\n", - "```\n", - "### Speaker 1: Hi, what are you doing today?\n", - "### Speaker 2: I'm going to the park.\n", - "### Speaker 1: That sounds fun!\n", - "```\n", - "\n", - "---\n", - "#### Code generation template\n", - "\n", - "Template: `### Task: {task_description}\\n ### Code: {code_output}`\n", - "\n", - "Example:\n", - "```\n", - "### Task: Write a function to add two numbers in Python.\n", - "### Code: def add(a, b):\\n return a + b\n", - "```\n", - "\n", - "---\n", - "#### Data analysis template\n", - "\n", - "Template: `### Analysis Task: {task_description}\\n ### Analysis: {analysis_output}`\n", - "\n", - "Example:\n", - "```\n", - "### Analysis Task: Provide insights from the sales data of Q1 2022.\n", - "### Analysis: The sales increased by 15% compared to Q4 2021, with the highest growth in the electronics category.\n", - "```\n", - "\n", - "---\n", - "#### Recipe 
template\n", - "\n", - "Template: `### Recipe Name: {recipe_name}\\n ### Ingredients: {ingredients}\\n ### Instructions: {instructions}`\n", - "\n", - "Example:\n", - "```\n", - "### Recipe Name: Chocolate Chip Cookies\n", - "### Ingredients: Flour, Sugar, Chocolate Chips, Butter, Eggs, Vanilla Extract\n", - "### Instructions: Mix the dry ingredients, add the wet ingredients, fold in the chocolate chips, and bake at 350°F for 10-12 minutes.\n", - "```\n", - "\n", - "---\n", - "#### Explanation template\n", - "\n", - "Template: `### Concept: {concept}\\n ### Explanation: {explanation}`\n", - "\n", - "Example:\n", - "```\n", - "### Concept: Photosynthesis\n", - "### Explanation: Photosynthesis is the process by which green plants use sunlight to synthesize nutrients from carbon dioxide and water.\n", - "```\n", - "\n", - "---\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Objectives\n", - "\n", - "After completing this lab, you will be able to:\n", - "\n", - " - Understand the various types of templates including instruction-response, question-answering, summarization, code generation, dialogue, data analysis, and explanation and their applications for fine-tuning large language models (LLMs).\n", - " - Create and apply different templates to fine-tune LLMs for various tasks.\n", - " - Format datasets based on the created templates to prepare them for effective model training\n", - " - Perform instruction fine-tuning using Hugging Face libraries and tools\n", - " - Apply Low-Rank Adaptation (LoRA) techniques to fine-tune LLMs efficiently\n", - " - Configure and use the SFTTrainer for supervised fine-tuning of instruction-following models\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The concepts presented in this lab would apply to the other template formats as well.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# __Table of contents__\n", - "\n", - "
    \n", - "
  1. \n", - " Setup\n", - "
      \n", - "
    1. Install required libraries
    2. \n", - "
    3. Import required libraries
    4. \n", - "
    5. Define the device
    6. \n", - "
    \n", - "
  2. \n", - "
  3. Dataset description
  4. \n", - "
  5. Model and tokenizer
  6. \n", - "
  7. Preprocessing the data
  8. \n", - "
  9. Test the base model
  10. \n", - "
      \n", - "
    1. BLEU score
    2. \n", - "
    \n", - "
  11. Perform instruction fine-tuning with LoRA
  12. \n", - "
  13. Exercises
  14. \n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setup\n", - "\n", - "### Install required libraries\n", - "\n", - "For this lab, use the following libraries, which are __not__ preinstalled in the Skills Network Labs environment. You can install libraries by running the code in the below cell. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq datasets==2.20.0 trl==0.9.6 transformers==4.42.3 peft==0.11.1 tqdm==4.66.4 numpy==1.26.4 pandas==2.2.2 matplotlib==3.9.1 seaborn==0.13.2 scikit-learn==1.5.1 sacrebleu==2.4.2 evaluate==0.4.2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Import required libraries\n", - "\n", - "The following code imports the required libraries.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "warnings.simplefilter('ignore')\n", - "\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n", - "from datasets import load_dataset\n", - "import torch\n", - "from torch.utils.data import Dataset\n", - "from tqdm import tqdm\n", - "import evaluate\n", - "from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM\n", - "\n", - "from peft import get_peft_model, LoraConfig, TaskType\n", - "\n", - "import pickle\n", - "import json\n", - "import matplotlib.pyplot as plt \n", - "\n", - "from urllib.request import urlopen\n", - "import io" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Define the device\n", - "\n", - "The below code will set your device to 'cuda' if your device is compatible with GPU, otherwise, you can use 'cpu'.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "device = torch.device('cuda' if torch.cuda.is_available() else 
'cpu')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Dataset description\n", - "\n", - "Use the below sentences to download the CodeAlpaca 20k dataset, a programming code dataset. This code is available [here](https://github.com/sahil280114/codealpaca?tab=readme-ov-file#data-release). The CodeAlpaca dataset contains the following elements:\n", - "\n", - "\n", - "- `instruction`: **str**, describes the task the model should perform. Each of the 20K instructions is unique.\n", - "- `input`: **str**, optional context or input for the task. For example, when the instruction is \"Amend the following SQL query to select distinct elements\", the input is the SQL query. Around 40% of the examples have an input.\n", - "- `output`: **str**, the answer to the instruction as generated by text-davinci-003.\n", - "\n", - "The following code block downloads the training split from the CodeAlpaca-20k dataset:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = load_dataset(\"lucasmccabe-lmi/CodeAlpaca-20k\", split=\"train\")\n", - "dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's look at the example in the dataset:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset[1000]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To keep things simple let's just focus on the examples that do not have any `input`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = dataset.filter(lambda example: example[\"input\"] == '')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The original CodeAlpaca dataset may not have been shuffled. 
The following line indicates how to shuffle a `datasets.arrow_dataset.Dataset()` object with a random seed:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = dataset.shuffle(seed=42)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The CodeAlpaca 20k dataset has a training and test set. You can split the original training data into a train and test set by assigning 80% of the data to the training set and 20% to the testing set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_split = dataset.train_test_split(test_size=0.2, seed=42)\n", - "train_dataset = dataset_split['train']\n", - "test_dataset = dataset_split['test']\n", - "dataset_split" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Select a small set of data for the resource limitation\n", - "# This dataset will be only used for evaluation parts, not for the training\n", - "tiny_test_dataset=test_dataset.select(range(10))\n", - "tiny_train_dataset=train_dataset.select(range(10))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Model and tokenizer\n", - "\n", - "In this exercise, let's fine-tune the [`opt-350m`](https://huggingface.co/facebook/opt-350m) model from Facebook. 
A description of this open-source model was published [here](https://arxiv.org/abs/2205.01068), and the model was originally made available on [metaseq's GitHub repository](https://github.com/facebookresearch/metaseq).\n", - "\n", - "The below lines load the base model from Hugging Face:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Base model\n", - "model = AutoModelForCausalLM.from_pretrained(\"facebook/opt-350m\").to(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This model comes with its own tokenizer, which you load here:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-350m\", padding_side='left')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's find the end of sentence (EOS) token. This is a special tokenizer token. Once this token is encountered, the model will stop generating further tokens:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer.eos_token" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Preprocessing the data\n", - "\n", - "To perform the fine-tuning, first preprocess the data by creating functions that generate the prompt.\n", - "\n", - "The `formatting_prompts_func` function takes a dataset as input. For every element in the dataset, it formats the instruction and the output into a template of the following form:\n", - "\n", - "```\n", - "### Instruction:\n", - "Translate the following sentence to Spanish: \"Hello, how are you?\"\n", - "\n", - "### Response:\n", - "\"Hola, ¿cómo estás?\"\n", - "```\n", - "\n", - "_**Note:**_ \n", - "1. The template provided in this section may differ from the **Instruction and output template** presented in the introduction of this lab. 
You can replace the `### Response:` with `### Output:` to generate similar results.\n", - "\n", - "2. Introducing the `</s>` end of sentence token at the end of the text informs the model to stop generating text beyond this point.\n", - "\n", - "Finally, the `formatting_prompts_func_no_response` function behaves similarly to the `formatting_prompts_func` except the response is not included.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def formatting_prompts_func(mydataset):\n", - " output_texts = []\n", - " for i in range(len(mydataset['instruction'])):\n", - " text = (\n", - " f\"### Instruction:\\n{mydataset['instruction'][i]}\"\n", - " f\"\\n\\n### Response:\\n{mydataset['output'][i]}</s>\"\n", - " )\n", - " output_texts.append(text)\n", - " return output_texts\n", - "\n", - "def formatting_prompts_func_no_response(mydataset):\n", - " output_texts = []\n", - " for i in range(len(mydataset['instruction'])):\n", - " text = (\n", - " f\"### Instruction:\\n{mydataset['instruction'][i]}\"\n", - " f\"\\n\\n### Response:\\n\"\n", - " )\n", - " output_texts.append(text)\n", - " return output_texts" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following code block generates the `instructions` (the part of the prompt that does not include the response), the `instructions_with_responses` (the full prompt with the response and `eos` token), and the `expected_outputs`, which are the parts of the `instructions_with_responses` that are between the `instructions` and the `eos` token.\n", - "\n", - "To find the `expected_outputs`, tokenize `instructions` and the `instructions_with_responses`. Then, count the number of tokens in `instructions`, and discard the equivalent amount of tokens from the beginning of the tokenized `instructions_with_responses` vector. Finally, discard the final token in `instructions_with_responses`, corresponding to the `eos` token. 
Decode the resulting vector using the tokenizer, resulting in the `expected_output`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "expected_outputs = []\n", - "instructions_with_responses = formatting_prompts_func(test_dataset)\n", - "instructions = formatting_prompts_func_no_response(test_dataset)\n", - "for i in tqdm(range(len(instructions_with_responses))):\n", - " tokenized_instruction_with_response = tokenizer(instructions_with_responses[i], return_tensors=\"pt\", max_length=1024, truncation=True, padding=False)\n", - " tokenized_instruction = tokenizer(instructions[i], return_tensors=\"pt\")\n", - " expected_output = tokenizer.decode(tokenized_instruction_with_response['input_ids'][0][len(tokenized_instruction['input_ids'][0])-1:], skip_special_tokens=True)\n", - " expected_outputs.append(expected_output)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's look at the example to view what `instructions` include, `instructions_with_responses`, and `expected_outputs`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print('############## instructions ##############\\n' + instructions[0])\n", - "print('############## instructions_with_responses ##############\\n' + instructions_with_responses[0])\n", - "print('\\n############## expected_outputs ##############' + expected_outputs[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instead of keeping the instructions as-is, it's beneficial to convert the `instructions` list into a `torch` `Dataset`. The following code defines a class called `ListDataset` that inherits from `Dataset` and creates a `torch` `Dataset` from a list. 
This class is then used to generate a `Dataset` object from `instructions`: \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "class ListDataset(Dataset):\n", - " def __init__(self, original_list):\n", - " self.original_list = original_list\n", - " \n", - " def __len__(self):\n", - " return len(self.original_list)\n", - " \n", - " def __getitem__(self, i):\n", - " return self.original_list[i]\n", - "\n", - "instructions_torch = ListDataset(instructions)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "instructions_torch[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Test the base model\n", - "\n", - "Let's see how the base model performs before any fine-tuning. This involves generating responses from the base model, that is, from the non-fine-tuned model. \n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code defines a text generation pipeline using the `pipeline` function from `transformers`. This pipeline generates text given a model and a tokenizer:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "gen_pipeline = pipeline(\"text-generation\",\n", - " model=model,\n", - " tokenizer=tokenizer,\n", - " device=device,\n", - " batch_size=2,\n", - " max_length=50,\n", - " truncation=True,\n", - " padding=False,\n", - " return_full_text=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**_Note:_** The generation pipeline can generate tokens or text. If ```return_tensors=True```, the pipeline returns token IDs; otherwise, it returns decoded text. Additionally, the generation pipeline returns both the instructions *and* the responses by default. 
However, to assess the model's performance, exclude the generated instructions and focus on the responses. To do this, set ```return_full_text=False```.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code leverages the pre-defined generation pipeline to generate outputs using the model. \n", - "\n", - "**_Note:_** Generation may take a long time on a CPU, so the code runs on only three records. Instead of generating outputs for the whole dataset here, you can load the output from this model later.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer.padding_side = 'left'\n", - "\n", - "with torch.no_grad():\n", - " # Due to resource limitations, only apply the pipeline to 3 records using \"instructions_torch[:3]\"\n", - " pipeline_iterator= gen_pipeline(instructions_torch[:3], \n", - " max_length=50, # this is set to 50 due to resource constraints; on a GPU, you can increase it to the length of your choice\n", - " num_beams=5,\n", - " early_stopping=True,)\n", - "\n", - "generated_outputs_base = []\n", - "for text in pipeline_iterator:\n", - " generated_outputs_base.append(text[0][\"generated_text\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code loads the responses for the whole dataset, which were generated on a machine that has access to a fast CUDA-enabled GPU:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VvQRrSqS1P0_GobqtL-SKA/instruction-tuning-generated-outputs-base.pkl')\n", - "generated_outputs_base = pickle.load(io.BytesIO(urlopened.read()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's look at the sample responses generated by the base model and the expected responses from the dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(3):\n", - " print('@@@@@@@@@@@@@@@@@@@@')\n", - " print('@@@@@ Instruction '+ str(i+1) +': ')\n", - " print(instructions[i])\n", - " print('\\n\\n')\n", - " print('@@@@@ Expected response '+ str(i+1) +': ')\n", - " print(expected_outputs[i])\n", - " print('\\n\\n')\n", - " print('@@@@@ Generated response '+ str(i+1) +': ')\n", - " print(generated_outputs_base[i])\n", - " print('\\n\\n')\n", - " print('@@@@@@@@@@@@@@@@@@@@')\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can see that the responses generated by the base model fall short of the expected responses. Also, the responses tend to extend and repeat the answers until they reach the maximum number of tokens. Later on, you can see that instruction-tuning can fix both of these issues. First, the instruction fine-tuned model will be able to provide more meaningful responses. Second, because you appended the `eos` token `</s>` to the output, instruction fine-tuning teaches the model to stop generating instead of continuing without bound.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## BLEU score\n", - "\n", - "Let's set up a metric that compares the generated responses and the expected responses in the test environment. In this lab, let's use the [BLEU score](https://en.wikipedia.org/wiki/BLEU), a metric originally intended to check the quality of translations made by translation models. You can calculate the BLEU scores for individual generated segments by comparing them with a set of expected outputs and average the scores for the individual segments. Depending on the implementation, BLEU scores range from 0 to 1 or from 0 to 100 (as in the implementation used herein), with higher scores indicating a better match between the model generated output and the expected output.\n", - "\n", - "_**Note:**_ \n", - "1. 
The BLEU score was originally implemented for assessing the quality of translations. However, it may not necessarily be the best metric for instruction fine-tuning in general, but it is nonetheless a useful metric that gives a sense of the alignment between the model generated output and the expected output.\n", - "2. BLEU scores are very challenging to compare from one study to the next because it is a parametrized metric. As a result, you can employ a variant of BLEU called [SacreBLEU](https://aclanthology.org/W18-6319/) invariant to the metric's parametrization.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sacrebleu = evaluate.load(\"sacrebleu\")\n", - "results_base = sacrebleu.compute(predictions=generated_outputs_base,\n", - " references=expected_outputs)\n", - "\n", - "print(list(results_base.keys()))\n", - "print(round(results_base[\"score\"], 1))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The SacreBLEU score of 0.4/100 indicates that there is very little alignment between the base model's generated responses and the expected responses for the examples in the test dataset.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Perform instruction fine-tuning with LoRA\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To save time, let's perform instruction fine-tuning using a parameter-efficient fine-tuning (PEFT) method called low-rank adaptation (LoRA).\n", - "First, convert the model into a PEFT model suitable for LoRA fine-tuning by defining a `LoraConfig` object from the `peft` library that outlines LoRA parameters, such as the LoRA rank and the target modules. 
Next, apply the LoRA configuration to the model using `get_peft_model()`, which effectively converts `model` into a LoRA `model`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "lora_config = LoraConfig(\n", - " r=16, # Low-rank dimension\n", - " lora_alpha=32, # Scaling factor\n", - " target_modules=[\"q_proj\", \"v_proj\"], # Modules to apply LoRA\n", - " lora_dropout=0.1, # Dropout rate\n", - " task_type=TaskType.CAUSAL_LM # Task type should be causal language model\n", - ")\n", - "\n", - "model = get_peft_model(model, lora_config)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instruction fine-tuning using the `SFTTrainer` has the effect of generating the instructions *and* the responses. However, for the purposes of assessing the quality of the generated text, consider only the quality of the response and not the quality of the instruction. For the purposes of calculating the BLEU score, drop the tokens corresponding to the instruction from the beginning of the tokenized model output. \n", - "\n", - "For example, suppose the tokenized instruction has a length of ten, but the generated text has a length of fourteen. Then the tokenized response kept for the purposes of calculating the BLEU score is just the four tokens at the end of the tokenized generated text, because the first ten tokens represent the model's generation of the tokenized instruction.\n", - "\n", - "Eliminating the first few tokens of the tokenized output works well for the purposes of calculating BLEU. During fine-tuning, however, the instruction tokens should not have an impact on the loss function either. You can mask those tokens by setting their labels to -100, the default `ignore_index` value of PyTorch loss functions such as [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). 
By masking the tokens corresponding to the instruction with -100, only the tokens associated with the response contribute to the loss.\n", - "\n", - "You can create such a mask manually by defining your own function. However, it is easier to instead use the `DataCollatorForCompletionOnlyLM` class from `trl`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "response_template = \"### Response:\\n\"\n", - "collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, pass the `collator`, a `DataCollatorForCompletionOnlyLM` object, as the data collator to `SFTTrainer`, so that the instruction tokens have no bearing on the loss.\n", - "\n", - "To perform the training, first configure the training arguments with `SFTConfig`, and then create the `SFTTrainer` object, passing it the `collator`:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_args = SFTConfig(\n", - " output_dir=\"/tmp\",\n", - " num_train_epochs=10,\n", - " save_strategy=\"epoch\",\n", - " fp16=True,\n", - " per_device_train_batch_size=2, # Reduce batch size\n", - " per_device_eval_batch_size=2, # Reduce batch size\n", - " max_seq_length=1024,\n", - " do_eval=True\n", - ")\n", - "\n", - "trainer = SFTTrainer(\n", - " model,\n", - " train_dataset=train_dataset,\n", - " eval_dataset=test_dataset,\n", - " formatting_func=formatting_prompts_func,\n", - " args=training_args,\n", - " packing=False,\n", - " data_collator=collator,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please ignore the above warning.\n", - "The line that runs the trainer is commented out below because training would take a long time on the CPU. 
Therefore, let's not run the trainer here.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#trainer.train()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you want to train the trainer, the `trainer` object would have a state history for every training step. You would be able to access this state history using the below commented out line:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#log_history_lora = trainer.state.log_history" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instead of extracting the state history above, let's load the state history of a model that was instruction fine-tuned to the above specifications on a GPU.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/49I70jQD0-RNRg2v-eOoxg/instruction-tuning-log-history-lora.json')\n", - "log_history_lora = json.load(io.BytesIO(urlopened.read()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can plot the training loss for each training step.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_loss = [log[\"loss\"] for log in log_history_lora if \"loss\" in log]\n", - "\n", - "# Plot the training loss\n", - "plt.figure(figsize=(10, 5))\n", - "plt.plot(train_loss, label='Training Loss')\n", - "\n", - "plt.xlabel('Steps')\n", - "plt.ylabel('Loss')\n", - "plt.title('Training Loss')\n", - "plt.legend()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you want to fine-tune the model, the fine-tuned model could be saved using the below commented out line:\n" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#trainer.save_model(\"./instruction_tuning_final_model_lora\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's redefine the text generation pipeline because the model has been changed to the LoRA model. Ignore the warning for the `PeftModelForCausalLM` not being supported for `text-generation`. However, if the PEFT model is supported, the warning is erroneous.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "gen_pipeline = pipeline(\"text-generation\", \n", - " model=model, \n", - " tokenizer=tokenizer, \n", - " device=device, \n", - " batch_size=2, \n", - " max_length=50, \n", - " truncation=True, \n", - " padding=False,\n", - " return_full_text=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code generates tokens with the pipeline using the instruction fine-tuned model. 
Only three records of data are used for demonstration because generating text is time-consuming on a CPU:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with torch.no_grad():\n", - " # Due to resource limitation, only apply the function on 3 records using \"instructions_torch[:3]\"\n", - " pipeline_iterator= gen_pipeline(instructions_torch[:3],\n", - " max_length=50, # this is set to 50 due to resource constraints; on a GPU, you can increase it to the length of your choice\n", - " num_beams=5,\n", - " early_stopping=True,)\n", - "generated_outputs_lora = []\n", - "for text in pipeline_iterator:\n", - " generated_outputs_lora.append(text[0][\"generated_text\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "generated_outputs_lora[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instead, you can load the texts that the fine-tuned LoRA model generated for the entire dataset when it was run on a GPU.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/o7uYxe15xvX4CN-6Lr10iA/instruction-tuning-generated-outputs-lora.pkl')\n", - "generated_outputs_lora = pickle.load(io.BytesIO(urlopened.read()))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's have a look at some of the responses from the instruction fine-tuned model and the expected responses.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(3):\n", - " print('@@@@@@@@@@@@@@@@@@@@')\n", - " print('@@@@@ Instruction '+ str(i+1) +': ')\n", - " print(instructions[i])\n", - " print('\\n\\n')\n", - " print('@@@@@ Expected response '+ str(i+1) +': ')\n", - " print(expected_outputs[i])\n", - " 
print('\\n\\n')\n", - " print('@@@@@ Generated response '+ str(i+1) +': ')\n", - " print(generated_outputs_lora[i])\n", - " print('\\n\\n')\n", - " print('@@@@@@@@@@@@@@@@@@@@')\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compared to the base model, you can see that the responses are much better. Additionally, the responses don't extend until the maximum number of tokens are generated.\n", - "\n", - "To confirm the responses generated by the instruction fine-tuned model align better with the expected output, let's calculate the SacreBLEU score:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sacrebleu = evaluate.load(\"sacrebleu\")\n", - "results_lora = sacrebleu.compute(predictions=generated_outputs_lora,\n", - " references=expected_outputs)\n", - "print(list(results_lora.keys()))\n", - "print(round(results_lora[\"score\"], 1))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can see that the fine-tuned model achieves a SacreBLEU score of 14.7/100, significantly better than the 0.4/100 achieved by the base model. \n", - "\n", - "Let's conclude. The instruction fine-tuned model generates responses that align much better with the expected responses in the dataset.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Exercises\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercise 1: Try with another response template (Question-Answering)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a `formatting_prompts_response_template` function to format the train_dataset in the Response Template. 
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Template: `### Question: {question}\\n ### Answer: {answer}`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "def formatting_prompts_response_template(mydataset):\n", - " output_texts = []\n", - " for i in range(len(mydataset['instruction'])):\n", - " text = (\n", - " f\"### Question:\\n{mydataset['instruction'][i]}\"\n", - " f\"\\n\\n### Answer:\\n{mydataset['output'][i]}\"\n", - " )\n", - " output_texts.append(text)\n", - " return output_texts\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a `formatting_prompts_response_template_no_response` function to format the `test_dataset` in the Response Template, excluding the response.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Template: `### Question: {question}\\n ### Answer: `\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "def formatting_prompts_response_template_no_response(mydataset):\n", - " output_texts = []\n", - " for i in range(len(mydataset['instruction'])):\n", - " text = (\n", - " f\"### Question:\\n{mydataset['instruction'][i]}\"\n", - " f\"\\n\\n### Answer:\\n\"\n", - " )\n", - " output_texts.append(text)\n", - " return output_texts\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercise 2: Try with another LLM (EleutherAI/gpt-neo-125m)\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The EleutherAI/gpt-neo-125m is a smaller variant of the GPT-Neo family of models developed by EleutherAI. With 125 million parameters, it is designed to be computationally efficient while still providing robust performance for various natural language processing tasks.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Download and load the `EleutherAI/gpt-neo-125m` model\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "model_name = \"EleutherAI/gpt-neo-125m\"\n", - "\n", - "# Load the model and tokenizer\n", - "model = AutoModelForCausalLM.from_pretrained(model_name)\n", - "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Initialize LoRA Configuration:\n", - "\n", - "- r: 8 (Low-rank dimension)\n", - "- lora_alpha: 16 (Scaling factor)\n", - "- target_modules: [\"q_proj\", \"v_proj\"] (Modules to apply LoRA)\n", - "- lora_dropout: 0.1 (Dropout rate)\n", - "- task_type: TaskType.CAUSAL_LM (Task type should be causal language model)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "\n", - "lora_config = LoraConfig(\n", - " r=8, # Low-rank dimension\n", - " lora_alpha=16, # Scaling factor\n", - " target_modules=[\"q_proj\", \"v_proj\"], # Modules to apply LoRA\n", - " lora_dropout=0.1, # Dropout rate\n", - " task_type=TaskType.CAUSAL_LM # Task type should be causal language model\n", - ")\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Apply LoRA Configuration to the model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for the solution\n", - "\n", - "```python\n", - "model = get_peft_model(model, lora_config)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Congratulations! You have completed the lab\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Authors\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Wojciech \"Victor\" Fulmyk](https://www.linkedin.com/in/wfulmyk) is a Data Scientist and a PhD Candidate in Economics at the University of Calgary.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a Ph.D. candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References\n", - "\n", - "[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)\n", - "\n", - "[Finetuning To Follow Instructions](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb)\n", - "\n", - "[Finetuning with LoRA -- A Hands-On Example](https://lightning.ai/lightning-ai/studios/code-lora-from-scratch)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```{## Change Log|Date (YYYY-MM-DD)|Version|Changed By|Change Description||-|-|-|-||2024-07-18|1.0|Wojciech \"Victor\" Fulmyk|Lab Written||2024-07-25|2.0|Fateme Akbari|Bugs Fixed||2024-07-31|3.0|Bhavika Chhatbar|ID reviewed|}\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "© Copyright IBM Corporation. 
All rights reserved.\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.19" - }, - "prev_pub_hash": "280a2cf79e2287085899526a711a657e3abe91f52fd641be6356c1ef9f2bafbd" - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/RewardTrainer-v1.ipynb b/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/RewardTrainer-v1.ipynb deleted file mode 100755 index 73a5bb1..0000000 --- a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Instruction-Tuning and Reward Modeling/RewardTrainer-v1.ipynb +++ /dev/null @@ -1,1111 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

\n", - " \n", - " \"Skills\n", - " \n", - "

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Reward Modeling\n", - "Imagine that you're working as a machine learning engineer for a large technology company that wants to integrate advanced language models into its suite of AI-powered products. Your task is to evaluate and select the best large language model (LLM) that can understand and follow complex instructions, improve the quality of automated customer service, and generate high-quality responses.\n", - "\n", - "However, simply choosing a powerful LLM isn't enough. To truly excel in these tasks, the model should be fine-tuned to align with specific goals and criteria. This is where reward models come into the picture. By training a reward model, you can guide the LLM to prioritize certain behaviors, ensuring it generates responses that not only meet technical standards but also align with the company's values and objectives.\n", - "\n", - "In this hands-on lab, you dive into the process of creating and training a reward model by using the transformer reinforcement learner (trl) library from Hugging Face. You learn how to set up the environment, define the rewards that shape the model's behavior, and fine-tune the LLM to perform with precision. This project equips you with the skills to implement reward models in real-world applications, enhancing the effectiveness and quality of AI-powered products.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## __Table of Contents__\n", - "\n", - "
    \n", - "
  1. Objectives
  2. \n", - " \n", - "
  3. Setup
  4. \n", - " \n", - "
  5. Data set
  6. \n", - " \n", - "
  7. Model and tokenizer setup
  8. \n", - "
  9. Preprocessing
  10. \n", - " \n", - "
  11. Evaluating the model
  12. \n", - "
  13. Exercise
  14. \n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Objectives\n", - "After completing this lab, you are able to:\n", - "\n", - "- Understand the concept of reward modeling in machine learning\n", - "- Explore and preprocess a data set for reward modeling tasks\n", - "- Set up and configure a GPT-2 model for sequence classification\n", - "- Tokenize and prepare text data for model training\n", - "- Evaluate model performance using pairwise comparison of responses\n", - "- Apply preprocessing and evaluation techniques to different subsets of data\n", - "- Understand important concepts related to transformers and reward modeling\n", - "- Implement special tokens in the tokenizer and configure the model accordingly\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Setup\n", - "## Installing required libraries\n", - "Before you start, make sure that you have all of the necessary libraries installed. You can run the following commands to install them:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install datasets\n", - "!pip install trl\n", - "!pip install datasets huggingface_hub\n", - "!pip install transformers\n", - "!pip install peft\n", - "!pip install nltk rouge_score\n", - "!pip install --upgrade transformers\n", - "!pip install --upgrade peft\n", - "!pip install bitsandbytes==0.43.1\n", - "!pip install matplotlib" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Importing required libraries\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "from datasets import load_dataset, DatasetDict\n", - "import torch\n", - "from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, TrainingArguments\n", - "from peft import LoraConfig, TaskType\n", - "from transformers import TrainingArguments\n", - "from trl import 
RewardTrainer\n", - "import matplotlib.pyplot as plt\n", - "import warnings\n", - "\n", - "warnings.filterwarnings('ignore')\n", - "# Disable warnings for a cleaner notebook or console experience\n", - "def warn(*args, **kwargs):\n", - " pass\n", - "warnings.warn = warn" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Defining helper functions\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def save_to_json(data, file_path):\n", - " \"\"\"\n", - " Save a dictionary to a JSON file.\n", - "\n", - " Args:\n", - " data (dict): The dictionary to save.\n", - " file_path (str): The path to the JSON file.\n", - " \"\"\"\n", - " with open(file_path, 'w') as json_file:\n", - " json.dump(data, json_file, indent=4)\n", - " print(f\"Data successfully saved to {file_path}\")\n", - " \n", - " \n", - "def load_from_json(file_path):\n", - " \"\"\"\n", - " Load data from a JSON file.\n", - "\n", - " Args:\n", - " file_path (str): The path to the JSON file.\n", - "\n", - " Returns:\n", - " dict: The data loaded from the JSON file.\n", - " \"\"\"\n", - " with open(file_path, 'r') as json_file:\n", - " data = json.load(json_file)\n", - " return data " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Data set\n", - "\n", - "In this section, you load a data set that is used for training the reward model. In this lab, you use the Dahoas/synthetic-instruct-gptj-pairwise data set from Hugging Face, a synthetic data set that is designed for training and evaluating instruction-following models. This data set includes pairs of prompts and responses, where one response is preferred over the other. 
The primary use case is to train models to distinguish between better and worse responses, essential for tasks like reinforcement learning with human feedback (RLHF).\n", - "\n", - "### Purpose\n", - "\n", - "This data set helps train models for better understanding and following instructions by learning from pairs of good and bad responses. This is particularly useful for improving the quality of generated responses in dialogue systems and other AI applications that require understanding and generating natural language instructions.\n", - "\n", - "### Applications\n", - "- **Reinforcement learning**: Enhancing models to prefer better responses based on feedback\n", - "- **Fine-tuning language models**: Improving the performance of models on instruction-following tasks\n", - "- **Evaluation**: Assessing the ability of models to distinguish between high- and low-quality responses\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the Dahoas/synthetic-instruct-gptj-pairwise dataset \n", - "dataset = load_dataset(\"Dahoas/synthetic-instruct-gptj-pairwise\")\n", - "# Display the dataset\n", - "print(dataset)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data set features\n", - "\n", - "To get a better understanding of the data set, let's inspect a few samples. First, print out the `prompt`, `chosen`, and `rejected` responses for the first 10 examples in the training set. 
This gives an insight into the type of data on which you are working and how it is structured.\n", - "\n", - "```Prompt:``` A text prompt that the model should respond to\n", - "\n", - "```Chosen:``` The preferred response to the prompt\n", - "\n", - "```Rejected:``` The less preferred response to the prompt\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(10): \n", - " print('prompt')\n", - " print(dataset[\"train\"][i]['prompt'],'\\n')\n", - " \n", - " print('chosen')\n", - " print(dataset[ 'train'][i]['chosen'],'\\n')\n", - "\n", - " print('rejected')\n", - " print(dataset[ 'train'][i]['rejected'],'\\n')\n", - " print('---------------------------\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Model and tokenizer setup\n", - "In this section, you set up the tokenizer and the model for training. You can use the GPT-2 model for sequence classification, which helps in determining the quality of responses.\n", - "\n", - "Next, specify the model name or path as \"gpt2\". To initialize the tokenizer and model, use `GPT2Tokenizer.from_pretrained` and `GPT2ForSequenceClassification.from_pretrained`, respectively, with `num_labels` set to 1 for ranking (a numerical score value). To handle padding, set the `pad_token` of the tokenizer to be the same as the `eos_token` (end-of-sequence token). Similarly, configure the model to use the `eos_token_id` as the `pad_token_id`. 
This setup ensures that the tokenizer and model are correctly initialized and prepared for sequence classification tasks with GPT-2.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the model name or path\n", - "model_name_or_path = \"gpt2\"\n", - "\n", - "# Initialize tokenizer and model\n", - "tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path, use_fast=True)\n", - "model = GPT2ForSequenceClassification.from_pretrained(model_name_or_path, num_labels=1)\n", - "\n", - "# Add special tokens if necessary\n", - "tokenizer.pad_token = tokenizer.eos_token\n", - "model.config.pad_token_id = model.config.eos_token_id\n", - "\n", - "# Define the maximum length\n", - "max_length = 1024" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, preprocess the data set for training. Then combine the prompt with the chosen and rejected responses into a format suitable for input into the model. This process helps create clear input-output pairs for the model to learn from.\n", - "\n", - "`Lambda Function`: Define a lambda function `get_res` that takes the data set and a response type (chosen or rejected) and combines the prompt with the respective response. 
Each entry is formatted as a dialogue between \"Human\" and \"Assistant\".\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "get_res=lambda dataset,res:[ \"\\n\\nHuman: \"+prompt + \"\\n\\nAssistant: \"+resp for prompt, resp in zip(dataset[\"train\"][\"prompt\"], dataset[\"train\"][res])]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`Chosen Samples`: Apply the `get_res` function to create a list of chosen samples.\n", - "\n", - "`Rejected Samples`: Similarly, create a list of rejected samples using the same function.\n", - "\n", - "After applying the function, you get the following results.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "chosen_samples=get_res( dataset,'chosen')\n", - "rejected_samples=get_res( dataset,'rejected')\n", - "print('chosen',chosen_samples[0])\n", - "print('rejected',rejected_samples[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To facilitate the training process, create new columns in the data set that combine the prompt with chosen and rejected responses. This combination helps in evaluating the responses in a structured dialogue format.\n", - "\n", - "**Function definition**: Define a function `add_combined_columns` that takes an example (a single data point) and adds two new columns:\n", - "- `prompt_chosen`: Combines the `prompt` with the `chosen` response in the same labeled format.\n", - "- `prompt_rejected`: Combines the `prompt` with the `rejected` response in the same labeled format.\n", - "\n", - "**Apply function**: The `map` method is used to apply this function to each example in the training split of the data set. 
This method iterates over all the examples and modifies them in place.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define a function to combine 'prompt' with 'chosen' and 'rejected' responses\n", - "def add_combined_columns(example):\n", - " # Combine 'prompt' with 'chosen' response, formatting it with \"Human:\" and \"Assistant:\" labels\n", - " example['prompt_chosen'] = \"\\n\\nHuman: \" + example[\"prompt\"] + \"\\n\\nAssistant: \" + example[\"chosen\"]\n", - " \n", - " # Combine 'prompt' with 'rejected' response, formatting it with \"Human:\" and \"Assistant:\" labels\n", - " example['prompt_rejected'] = \"\\n\\nHuman: \" + example[\"prompt\"] + \"\\n\\nAssistant: \" + example[\"rejected\"]\n", - " \n", - " # Return the modified example\n", - " return example\n", - "\n", - "# Apply the function to each example in the 'train' split of the dataset\n", - "dataset['train'] = dataset['train'].map(add_combined_columns)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When using pretrained transformers for classification tasks, understanding the maximum sequence length supported by the model is crucial, as pretrained transformers have a fixed maximum token length, for example, GPT-2 has 1024 tokens. Inputs longer than this are truncated, potentially losing important information. 
So a function is written to determine the maximum sample length. Note that `len` counts characters here, not tokens.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "get_max_len= lambda samples: max([len(sample) for sample in samples])\n", - "get_max_len" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"rejected samples length\",get_max_len(rejected_samples))\n", - "print(\"chosen samples length\",get_max_len(chosen_samples))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Sometimes, you might want to identify samples shorter than a specified maximum length. This can be useful for filtering or handling special cases during preprocessing.\n", - "\n", - "The lambda function `find_short` takes a data set and a maximum length (`max_length`) as input. It uses a list comprehension to iterate over each example in the data set, enumerating both the index and the (chosen, rejected) pair. It zips `prompt_chosen` and `prompt_rejected` to pair each chosen response with its corresponding rejected response. For each pair, it checks whether the length of either `chosen` or `rejected` is less than the specified `max_length`. If the condition is met, the index of that pair is included in the resulting list. 
The resulting list contains the indices of all examples where either `prompt_chosen` or `prompt_rejected` is shorter than the specified `max_length`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "find_short = lambda dataset, max_length: [\n", - " i for i, (chosen, rejected) in enumerate(zip(dataset['prompt_chosen'], dataset['prompt_rejected']))\n", - " if len(chosen) < max_length or len(rejected) < max_length\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To ensure that your data set only includes samples that meet the required length criteria, keep only the samples that are shorter than the specified `max_length` (measured in characters) and filter out the rest. This step is important for maintaining consistency in the input data for the model.\n", - "\n", - "Now, use the GPT-2 model for classification with a max length of 1024. First, set the maximum length (`max_length`) to 1024. The `find_short` function is then called with the training data set (`dataset['train']`) and this maximum length as arguments to find the indices of examples where either `prompt_chosen` or `prompt_rejected` is shorter than the specified `max_length`. The resulting list of indices (`subset_indices`) is used to create a subset of the training data set by selecting only the examples at these indices. 
The training data set (`dataset['train']`) is updated to this subset, and the first ten `subset_indices` are displayed.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "max_length=1024\n", - "subset_indices=find_short(dataset['train'], max_length)\n", - "dataset['train'] = dataset['train'].select(subset_indices)\n", - "subset_indices[0:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - " The ```preprocess_function``` tokenizes the ```prompt_chosen``` and ```prompt_rejected``` keys, which are crucial for the RewardTrainer. 
The ```chosen``` key represents the preferred responses, while the ```rejected``` key represents the less preferred responses.\n", - " Tokenizing these keys allows the model to process and understand the differences between high-quality and low-quality responses. By providing both ```chosen``` and ```rejected``` inputs, the RewardTrainer can learn to distinguish and prioritize better responses, which is essential for training models to follow instructions effectively.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define a preprocessing function to tokenize the 'prompt_chosen' and 'prompt_rejected' keys\n", - "def preprocess_function(examples):\n", - " # Tokenize the 'prompt_chosen' text with truncation and padding to the maximum length\n", - " tokenized_chosen = tokenizer(examples['prompt_chosen'], truncation=True, max_length=max_length, padding=\"max_length\")\n", - " \n", - " # Tokenize the 'prompt_rejected' text with truncation and padding to the maximum length\n", - " tokenized_rejected = tokenizer(examples['prompt_rejected'], truncation=True, max_length=max_length, padding=\"max_length\")\n", - " \n", - " # Return the tokenized inputs as a dictionary\n", - " return {\n", - " \"input_ids_chosen\": tokenized_chosen[\"input_ids\"], # Token IDs for 'chosen' responses\n", - " \"attention_mask_chosen\": tokenized_chosen[\"attention_mask\"], # Attention masks for 'chosen' responses\n", - " \"input_ids_rejected\": tokenized_rejected[\"input_ids\"], # Token IDs for 'rejected' responses\n", - " \"attention_mask_rejected\": tokenized_rejected[\"attention_mask\"], # Attention masks for 'rejected' responses\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "The `input_ids_chosen` and `input_ids_rejected` fields contain the token IDs for the `chosen` and `rejected` responses, respectively, which are the numerical representations of the text used by the model. 
The `attention_mask_chosen` and `attention_mask_rejected` fields contain the attention masks for the `chosen` and `rejected` responses, respectively, which mark the tokens the model should attend to (1) and the padding tokens it should ignore (0). These fields are crucial for the `RewardTrainer` because they provide the necessary tokenized inputs and attention masks for both the preferred and less preferred responses. During training, the `RewardTrainer` scores both sequences and optimizes the model so that the `chosen` response receives a higher score than the `rejected` one, thereby improving the model's ability to prioritize better responses in instruction-following tasks.\n", - "\n", - "You can apply the ```preprocess_function``` to one sample:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "example=preprocess_function(dataset['train'][0])\n", - "example.keys()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, create a dictionary with 'chosen' and 'rejected' samples from the training data set. This dictionary is created to make it easier to validate the model later.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_str={'chosen': [sample for sample in dataset['train']['prompt_chosen']], 'rejected':[sample for sample in dataset['train']['prompt_rejected']]}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following code applies `preprocess_function` to each example in the training data set using the `map` method, which tokenizes the ```prompt_chosen``` and ```prompt_rejected``` texts. The `batched=True` parameter allows the function to process multiple examples at once, improving efficiency. 
Additionally, the `remove_columns` parameter specifies a list of columns (```prompt```, ```chosen```, ```rejected```, ```prompt_chosen```, ```prompt_rejected```) to be removed from the data set after processing. This ensures that only the tokenized inputs and attention masks generated by `preprocess_function` are retained, simplifying the data set structure and making it more suitable for model training and validation.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset['train'] = dataset['train'].map(preprocess_function, batched=True, remove_columns=['prompt',\"chosen\", \"rejected\",'prompt_chosen', 'prompt_rejected'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The only columns left are the token IDs and attention masks.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset.column_names" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, split the data set into training and testing data sets.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "split_dataset = dataset['train'].train_test_split(test_size=0.2)\n", - "\n", - "# Create a DatasetDict to hold train and test splits\n", - "dataset_dict = DatasetDict({\n", - " 'train': split_dataset['train'],\n", - " 'test': split_dataset['test'],\n", - "})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## LoRA configuration\n", - "Now that the training data set is ready, you can start training the pretrained transformer model. However, rather than updating all of its weights, it is more efficient to fine-tune it with LoRA. Next, define the LoRA configuration and training arguments.\n", - "\n", - "First, initialize a `LoraConfig` configuration for low-rank adaptation (LoRA) in a sequence classification task. 
The configuration is created by using the `LoraConfig` class from the `peft` library and specifies several parameters:\n", - "\n", - "- **task_type=TaskType.SEQ_CLS**: Specifies the type of task, that is, the sequence classification for this lab.\n", - "- **inference_mode=False**: Indicates that the configuration is for training mode rather than inference.\n", - "- **r=8**: Sets the rank of the LoRA matrices.\n", - "- **lora_alpha=32**: Sets the alpha value for scaling the LoRA matrices.\n", - "- **lora_dropout=0.1**: Specifies the dropout rate for the LoRA layers, helping to prevent overfitting.\n", - "- **target_modules=[\"attn.c_attn\", \"attn.c_proj\"]**: Lists the specific attention layers in the model that will be adapted using LoRA. This includes the \"attn.c_attn\" and \"attn.c_proj\" modules.\n", - "\n", - "This configuration is useful for applying LoRA to the specific parts of the model, enabling efficient fine-tuning by adapting only a subset of the model parameters.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "peft_config = LoraConfig(\n", - " task_type=TaskType.SEQ_CLS,\n", - " inference_mode=False,\n", - " r=8,\n", - " lora_alpha=32,\n", - " lora_dropout=0.1,\n", - " target_modules=[\"attn.c_attn\", \"attn.c_proj\"] # Target attention layers\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Training arguments\n", - "\n", - "Define the training arguments by using the `TrainingArguments` class from the `transformers` library. 
These arguments configure various aspects of the training process:\n", - "\n", - "- **per_device_train_batch_size=3**: Sets the batch size per device (GPU/CPU) to 3\n", - "- **num_train_epochs=3**: Sets the number of training epochs to 3\n", - "- **gradient_accumulation_steps=8**: Accumulates gradients over 8 steps before each optimizer update, effectively increasing the batch size to 24 per device\n", - "- **learning_rate=1.41e-5**: Sets the learning rate for the optimizer to 1.41e-5\n", - "- **output_dir=\"./model_output3\"**: Specifies the directory where the model checkpoints and other outputs are saved\n", - "- **logging_steps=10**: Logs training progress every 10 steps\n", - "- **eval_strategy=\"steps\"**: Sets the evaluation strategy to evaluate the model at regular steps\n", - "- **eval_steps=500**: Evaluates the model every 500 steps\n", - "- **save_steps=500**: Saves the model checkpoint every 500 steps\n", - "- **save_total_limit=2**: Limits the number of saved checkpoints to 2, deleting older checkpoints to save space\n", - "\n", - "These arguments configure the training loop, including batch size, learning rate, logging, evaluation, and checkpoint-saving strategies.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define training arguments\n", - "\n", - "training_args = TrainingArguments(\n", - " per_device_train_batch_size=3, # Set to 3\n", - " num_train_epochs=3, # Set to 3\n", - " gradient_accumulation_steps=8,\n", - " learning_rate=1.41e-5,\n", - " output_dir=\"./model_output3\",\n", - " logging_steps=10,\n", - " eval_strategy=\"steps\",\n", - " eval_steps=500,\n", - " save_steps=500,\n", - " save_total_limit=2,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### RewardTrainer\n", - "\n", - "The `RewardTrainer` is a specialized trainer that is designed to train models with a reward signal. 
Reward models trained this way are often used in reinforcement learning from human feedback (RLHF), where their scores guide another model toward better responses. The `RewardTrainer` is initialized with several parameters:\n", - "\n", - "- **model**: The model to be trained\n", - "- **args**: The training arguments, typically an instance of `TrainingArguments`\n", - "- **tokenizer**: The tokenizer used to process the text inputs\n", - "- **train_dataset**: The training data set\n", - "- **eval_dataset**: The evaluation data set\n", - "- **peft_config**: The configuration for LoRA\n", - "\n", - "The `RewardTrainer` orchestrates the training process, handling tasks such as batching, optimization, evaluation, and saving model checkpoints. It is particularly useful for training models that need to learn from feedback signals, improving their ability to generate high-quality responses.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize RewardTrainer\n", - "trainer = RewardTrainer(\n", - " model=model,\n", - " args=training_args,\n", - " tokenizer=tokenizer,\n", - " train_dataset=dataset_dict['train'],\n", - " eval_dataset=dataset_dict['test'],\n", - " peft_config=peft_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: You can safely ignore the above warning.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The next step is training, saving, and evaluating a model by using the `RewardTrainer`. The `trainer.train()` method initiates the training process, where the model learns from the training data set, optimizing its parameters to improve performance. After training, the `trainer.save_model(output_dir)` method saves the trained model to the specified output directory, allowing for future use or deployment. Finally, the `trainer.evaluate()` method evaluates the model's performance on the evaluation data set, returning metrics that provide insights into how well the model performs. 
These metrics are then printed to give a detailed view of the model's evaluation results. \n", - "\n", - "Note: The training takes a very long time. Therefore, the model has already been trained and saved for you. If you want to train the model yourself, go ahead and uncomment the following cell.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# output_dir=\"./model_output3\"\n", - "\n", - "# # Train the model\n", - "# trainer.train()\n", - "\n", - "# # Save the model\n", - "# trainer.save_model(output_dir)\n", - "\n", - "# # Evaluate the model\n", - "# metrics = trainer.evaluate()\n", - "# print(metrics)\n", - "\n", - "# model.config.save_pretrained(\"./backup\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, download the pretrained model. If you have trained the model yourself, you can skip this step.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VZcK8FJ-kQ3nEJoxWGNYTQ/RetriverTrainerModel.zip" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!unzip -o RetriverTrainerModel.zip -d extracted_model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", - "model = GPT2ForSequenceClassification.from_pretrained(\"./extracted_model/model_output3\", num_labels=1).to(DEVICE)\n", - "model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: You can safely ignore the above warning.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, plot the loss. 
You can see it converges nicely.\n", - "\n", - "Run the code below to read the logged training loss from the checkpoint's `trainer_state.json` file and plot it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "log_file = \"extracted_model/model_output3/checkpoint-2500/trainer_state.json\"\n", - "\n", - "# Read the log file\n", - "with open(log_file, 'r') as f:\n", - " logs = json.load(f)\n", - "\n", - "# Extract training loss values\n", - "steps = []\n", - "losses = []\n", - "for log in logs[\"log_history\"]:\n", - " if \"loss\" in log:\n", - " steps.append(log[\"step\"])\n", - " losses.append(log[\"loss\"])\n", - "\n", - "# Plot the training loss\n", - "plt.figure(figsize=(10, 5))\n", - "plt.plot(steps, losses, label=\"Training Loss\")\n", - "plt.xlabel(\"Steps\")\n", - "plt.ylabel(\"Loss\")\n", - "plt.title(\"Training Loss Over Time\")\n", - "plt.legend()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Evaluating the model\n", - "The `RewardTrainer` uses pairwise comparison to measure the model's ability to distinguish between high- and low-quality responses. In pairwise comparison, the model is presented with two responses: one chosen as the preferred response and one as the less preferred (rejected) response. The model evaluates each response in the pair and assigns a score or reward based on its learned criteria. To perform an evaluation, each of the two responses is passed to the model separately, and the model generates a score (a single logit) for each response. 
The response with the higher score is selected as the preferred one, demonstrating the model's capability to accurately prioritize better responses.\n", - "\n", - "First, load the model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_=GPT2ForSequenceClassification.from_pretrained(\"./extracted_model/model_output3\", num_labels=1).to(DEVICE)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, define the device (CPU or GPU) for inference. Check whether a GPU is available; otherwise, use the CPU.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", - "DEVICE" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The code first tokenizes `text1` by using the tokenizer, converting it into the format required by the model. The `tokenizer` converts the input text into tensors, with padding and truncation to a maximum of 512 tokens. The `model` is moved to the GPU, if available, for faster computation, and every tensor in the `inputs` dictionary is moved to the same device.\n", - "\n", - "The model then generates outputs without computing gradients (`torch.no_grad()`), making the inference process faster and more memory-efficient. The score is the logit from the model's output, representing the raw, unnormalized reward. The code prints this logit directly without converting it into a probability; only the relative order of the scores matters for the comparison that follows.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text1=train_str['chosen'][0]\n", - "print(text1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "inputs = tokenizer(text1, return_tensors=\"pt\", padding=True, truncation=True, max_length=512)\n", - "\n", - "# Move inputs to the GPU if available\n", - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "model.to(device)\n", - "inputs = {k: v.to(device) for k, v in inputs.items()}\n", - "with torch.no_grad():\n", - " outputs = model(**inputs)\n", - "logit_1 = outputs.logits\n", - "print(\"Score :\",logit_1 )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Do the same for the rejected sample.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text2=train_str['rejected'][0]\n", - "print(text2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "inputs = tokenizer(text2, return_tensors=\"pt\", padding=True, truncation=True, max_length=512)\n", - "\n", - "# Move inputs to the GPU if available\n", - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "model.to(device)\n", - "inputs = {k: v.to(device) for k, v in inputs.items()}\n", - "with torch.no_grad():\n", - " outputs = model(**inputs)\n", - "logit_2 = outputs.logits\n", - "print(\"Score :\",logit_2 )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To demonstrate how pairwise comparison is useful for evaluating and ranking responses based on the model's predictions, use the following code that performs a pairwise comparison. 
To determine which of the two responses, represented by `logit_1` and `logit_2`, is preferred, compare the logits, which are raw scores output by the model that indicate the quality of the responses. If `logit_1` is greater than `logit_2`, the first response (`text1`) is selected as the better response, and the second response (`text2`) is rejected, printing both the selected and rejected responses along with their respective scores. Conversely, if `logit_2` is greater than or equal to `logit_1`, the second response (`text2`) is selected, and the first response (`text1`) is rejected, again printing both responses and their scores. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if logit_1 > logit_2:\n", - " print(\"--------selected---------\")\n", - " print(text1, logit_1.detach().item())\n", - " print(\"--------rejected---------\")\n", - " print(text2, logit_2.detach().item())\n", - "else:\n", - " print(\"--------selected---------\")\n", - " print(text2, logit_2.detach().item())\n", - " print(\"--------rejected---------\")\n", - " print(text1, logit_1.detach().item()) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, convert the process of tokenizing, generating a score, and comparing the outputs into two separate functions. The first function handles tokenizing the text and generating the model's output scores, while the second function performs the pairwise comparison of these scores. 
Structuring the code this way ensures that the process is modular and easier to manage, facilitating a clear and efficient workflow for evaluating and selecting the better response based on the model's predictions.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Function to make a prediction and get the logits\n", - "def predict_and_get_logits(text):\n", - " # Tokenize the input text\n", - " inputs = tokenizer(text, return_tensors=\"pt\", padding=True, truncation=True, max_length=512)\n", - " inputs = {k: v.to(device) for k, v in inputs.items()}\n", - "\n", - " # Perform the forward pass\n", - " with torch.no_grad():\n", - " outputs = model_(**inputs)\n", - " \n", - " # Extract the logits from the outputs\n", - " logits = outputs.logits.squeeze().item() # Assuming binary classification and batch size of 1\n", - " \n", - " return logits" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Function to compare two texts\n", - "def compare_texts(text1, text2):\n", - " logit1 = predict_and_get_logits(text1)\n", - " logit2 = predict_and_get_logits(text2)\n", - "\n", - " if logit1 > logit2:\n", - " print(\"selected---------\")\n", - " print(text1, f\"score: {logit1}\")\n", - "\n", - " return text1\n", - " else:\n", - " print(\"selected---------\")\n", - " print(text2, f\"score: {logit2}\")\n", - "\n", - " return text2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, evaluate the performance of a model by using a pairwise comparison approach over a subset of the data set. It begins by defining N, the number of samples to evaluate, and initializes a counter `correct_selections` to keep track of how many times the model correctly identifies the preferred response. 
The code then iterates over the first N pairs of chosen and rejected responses from the training data set (`train_str['chosen']` and `train_str['rejected']`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the number of samples to evaluate\n", - "N = 10\n", - "\n", - "# Initialize a counter for correct selections\n", - "correct_selections = 0\n", - "\n", - "# Iterate over the first N pairs of chosen and rejected responses\n", - "for chosen, rejected in zip(train_str['chosen'][0:N], train_str['rejected'][0:N]):\n", - " # Print the chosen response for reference\n", - " print(\"Chosen Response:\\n\", chosen)\n", - " \n", - " # Use the compare_texts function to determine which response is better\n", - " selected_text = compare_texts(chosen, rejected)\n", - " \n", - " # Check if the selected text is the chosen response\n", - " if selected_text == chosen:\n", - " correct_selections += 1\n", - "\n", - "# Calculate the accuracy as the ratio of correct selections to the total number of samples\n", - "accuracy = correct_selections / N\n", - "\n", - "# Print the accuracy\n", - "print(\"Accuracy:\", accuracy) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise \n", - "\n", - "#### Evaluate model's preference accuracy on a different subset of data\n", - "\n", - "1. Define a new variable `K` to set the number of samples for evaluation from a different subset of the data.\n", - "2. Initialize a counter to track the number of correct selections made by the model.\n", - "3. Iterate over the `K` pairs of chosen and rejected responses from a different subset of the data set (for example, from the middle of the data set).\n", - "4. For each pair, use the `compare_texts` function to determine which response is better.\n", - "5. Count the number of times the model correctly identifies the chosen response.\n", - "6. 
Calculate and print the accuracy of the model's preferences on this different subset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for Solution\n", - "\n", - "```python\n", - "# Define the number of samples to evaluate from a different subset\n", - "K = 50\n", - "\n", - "# Initialize a counter for correct selections\n", - "correct_selections = 0\n", - "\n", - "# Determine the starting index for the different subset (e.g., middle of the dataset)\n", - "start_index = len(train_str['chosen']) // 2\n", - "\n", - "# Iterate over K pairs of chosen and rejected responses from the different subset\n", - "for chosen, rejected in zip(train_str['chosen'][start_index:start_index + K], train_str['rejected'][start_index:start_index + K]):\n", - " # Use the compare_texts function to determine which response is better\n", - " selected_text = compare_texts(chosen, rejected)\n", - " \n", - " # Check if the selected text is the chosen response\n", - " if selected_text == chosen:\n", - " correct_selections += 1\n", - "\n", - "# Calculate the accuracy as the ratio of correct selections to the total number of samples\n", - "accuracy = correct_selections / K\n", - "\n", - "# Print the accuracy\n", - "print(\"Accuracy on different subset:\", accuracy)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Congratulations! You have completed the lab\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Authors\n", - "\n", - "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n", - "\n", - "[Ashutosh Sagar](https://www.linkedin.com/in/ashutoshsagar/) is completing his MS in CS from Dalhousie University. He has previous experience working with Natural Language Processing and as a Data Scientist.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Other Contributors\n", - "\n", - "[Hailey Quach](https://author.skills.network/instructors/hailey_quach) is a Data Scientist at IBM. She's completing her Bsc, Honors in Computer Science at Concordia University, Montreal.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "© Copyright IBM Corporation. 
All rights reserved.\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.19" - }, - "prev_pub_hash": "e304da04016a4b41134ab217e0a1138cc3a8efdada740e1e4f9acc4b4e744ba9" - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/DPO Fine-Tuning-v1.ipynb b/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/DPO Fine-Tuning-v1.ipynb deleted file mode 100755 index 38960d3..0000000 --- a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/DPO Fine-Tuning-v1.ipynb +++ /dev/null @@ -1,1199 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

\n", - " \n", - " \"Skills\n", - " \n", - "

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# **Direct Preference Optimization (DPO) Using Hugging Face**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Estimated time needed: **60** minutes\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Large language models (LLMs) have revolutionized the field of natural language processing (NLP) by achieving remarkable performance in various tasks. However, it is challenging to align these models with human preferences. Therefore, the direct preference optimization (DPO) method comes in place which directly optimizes LLMs based models on user preferences, enhancing their alignment with human expectations. In this hands-on lab, you'll use the transformer reinforcement learning (trl) library from Hugging Face to implement DPO and fine-tune LLMs.\n", - "\n", - "The objective of this lab is to provide a practical understanding of the DPO method and its implementation using the trl library. \n", - "\n", - "By the end of this lab, you'll have hands-on experience in creating a data set formatted for DPO, implementing the optimization process, and evaluating the enhanced performance of LLMs.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## __Table of Contents__\n", - "\n", - "
    \n", - "
  1. Objectives
  2. \n", - "
  3. \n", - " Setup\n", - "
      \n", - "
    1. Installing required libraries
    2. \n", - "
    3. Importing required libraries
    4. \n", - "
    \n", - "
  4. \n", - "
  5. \n", - " Create and configure the model and tokenizer\n", - "
      \n", - "
    1. Quantized model configuration (Optional)
    2. \n", - "
    \n", - "
  6. \n", - "
  7. Preprocess dataset
  8. \n", - "
  9. DPO configuration
  10. \n", - "
  11. DPO training
  12. \n", - "
  13. Exercise\n", - "
\n", - " \n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Objectives\n", - "\n", - "After completing this lab, you'll be able to: \n", - "- Understand the fundamentals of DPO and how it is different from proximal policy optimization (PPO)\n", - "- Set up an environment by installing and configuring necessary tools and libraries, such as trl library from Hugging Face\n", - "- Prepare a suitable environment for running DPO experiments with LLMs\n", - "- Create a data set for DPO\n", - "- Understand the required format for data sets used in DPO\n", - "- Create and preprocess a data set that includes user preferences\n", - "- Implement DPO by following a step-by-step guideline using the trl library\n", - "- Set training arguments, create a base quantized LoRA model, and train it using a DPO trainer\n", - "- Evaluate the performance of the LLM before and after applying DPO\n", - "- Analyze the impact of DPO on aligning the model with user preferences\n", - "\n", - "By the end of this hands-on lab, you will be equipped with the knowledge and skills needed to apply DPO for fine-tuning LLMs using the trl library. This will enable you to enhance LLMs' performance and user alignment in various NLP applications.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "----\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Installing required libraries\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following required libraries are __not__ pre-installed in the Skills Network Labs environment. 
You will need to run the following cell to install them.\n", - "\n", - "**Note:** This lab does not pin package versions so that it can demonstrate the latest functionality, but you can always pin versions in your own labs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install torch\n", - "!pip install trl # for optimization training\n", - "!pip install peft # for creating LoRA architecture\n", - "!pip install matplotlib" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Importing required libraries\n", - "\n", - "_It's recommended to import all required libraries in one place (here):_\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "##imports\n", - "import multiprocessing\n", - "import os\n", - "import requests\n", - "import tarfile\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "\n", - "import torch\n", - "from datasets import load_dataset\n", - "\n", - "from peft import LoraConfig\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer,TrainingArguments, GPT2Tokenizer, set_seed, GenerationConfig\n", - "from trl import DPOConfig, DPOTrainer\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create and configure the model and tokenizer\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Load the GPT-2 model\n", - "model = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n", - "\n", - "# Load a reference model \n", - "model_ref = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n", - "\n", - "# Load the GPT-2 tokenizer\n", - "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n", - "\n", - "# Set the pad token to the end-of-sequence token\n", - "tokenizer.pad_token = tokenizer.eos_token\n", - "# Set the padding side to \"right\" to fix the overflow issue with FP16 
training\n", - "tokenizer.padding_side = \"right\"\n", - "\n", - "# Disable the use of the cache during the model's forward pass\n", - "model.config.use_cache = False" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, you can check the model architecture.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Quantized model configuration (Optional)\n", - "If you want memory-efficient training and have access to a GPU-powered environment, you can download the complete lab, uncomment the following two code blocks to create a quantized model, and proceed with training the model on a GPU. A GPU is needed because the `bitsandbytes` package used for quantization requires one.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#!pip install -U bitsandbytes # this package is required for quantization" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**_Note:_** _After installing the package, restart the kernel so that it can be imported._\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "'''## Quantized model --only available on GPU\n", - "from transformers import BitsAndBytesConfig\n", - "\n", - "# Configure the quantization parameters\n", - "quantization_config = BitsAndBytesConfig(\n", - " # Load the model in 4-bit quantized format\n", - " load_in_4bit=True,\n", - " # Enable double quantization (quantizing the quantization constants) to save more memory\n", - " bnb_4bit_use_double_quant=True,\n", - " # Use the 4-bit NormalFloat (nf4) data type\n", - " bnb_4bit_quant_type=\"nf4\",\n", - " # Use bfloat16 as the computation data type\n", - " bnb_4bit_compute_dtype=torch.bfloat16\n", - ")\n", - "\n", - "# Load GPT-2 model with the specified quantization configuration\n", - "model = 
AutoModelForCausalLM.from_pretrained(\"gpt2\", quantization_config=quantization_config)\n", - "\n", - "# Load a reference model with the same quantization configuration\n", - "model_ref = AutoModelForCausalLM.from_pretrained(\"gpt2\", quantization_config=quantization_config)\n", - "\n", - "# Load GPT-2 tokenizer\n", - "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n", - "\n", - "# Set the pad token to the end-of-sequence token\n", - "tokenizer.pad_token = tokenizer.eos_token\n", - "# Set the padding side to \"right\" to fix the overflow issue with FP16 training\n", - "tokenizer.padding_side = \"right\"\n", - "\n", - "# Disable the use of the cache during the model's forward pass\n", - "model.config.use_cache = False'''" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preprocess data set\n", - "\n", - "The \"ultrafeedback_binarized\" data set on Hugging Face is a collection of prompts and responses. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the dataset from the specified location\n", - "ds = load_dataset(\"BarraHome/ultrafeedback_binarized\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This data set includes six splits. 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.keys()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Each record has several features, of which you need three: \"chosen,\" \"rejected,\" and \"prompt.\" This means that for each prompt, a preferred response and a rejected response are provided.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds[\"train_prefs\"][0].keys()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can inspect a sample record, which contains the prompt, the chosen response, and the rejected response along with the other features.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds[\"train_prefs\"][0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, put the data set in the format that the DPO trainer accepts.\n", - "\n", - "| Chosen | Rejected | Prompt |\n", - "| --- | --- | --- |\n", - " | Developing a daily habit of drawing can be challenging
but with consistent practice, and a few tips. | One way to develop a habit of drawing daily is
to allocate a specific time interval for drawing. | How can I develop a habit of drawing daily?|\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# You can reduce the volume of data (due to resource limitations) by selecting the first 5% examples from each split of the dataset\n", - "for key in ds:\n", - " #cnt = round(ds[key].__len__()*0.05)\n", - " cnt=50\n", - " ds[key] = ds[key].select(range(cnt))\n", - "\n", - "# Define a function to process the data\n", - "def process(row):\n", - " # delete unwanted columns\n", - " del row[\"prompt_id\"]\n", - " del row[\"messages\"]\n", - " del row[\"score_chosen\"]\n", - " del row[\"score_rejected\"]\n", - " # retrieve the actual response text\n", - " row[\"chosen\"] = row[\"chosen\"][-1][\"content\"]\n", - " row[\"rejected\"] = row[\"rejected\"][-1][\"content\"]\n", - "\n", - " return row\n", - "\n", - "# Apply the data processing function to the dataset\n", - "ds = ds.map(\n", - " process,\n", - " num_proc=multiprocessing.cpu_count(),\n", - " load_from_cache_file=False,\n", - ")\n", - "\n", - "# Split the dataset into training and evaluation sets\n", - "train_dataset = ds['train_prefs']\n", - "eval_dataset = ds['test_prefs']\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's check the data record.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_dataset[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, define LoRAConfig for efficient fine-tuning.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# PEFT (Parameter-Efficient Finetuning) configuration\n", - "peft_config = LoraConfig(\n", - " # The rank of the low-rank adaptation weights\n", - " r=4,\n", - " # The target modules to apply the low-rank adaptation to\n", - " 
target_modules=['c_proj','c_attn'],\n", - " # The task type for the low-rank adaptation\n", - " task_type=\"CAUSAL_LM\",\n", - " # The scaling factor for the low-rank adaptation weights\n", - " lora_alpha=8,\n", - " # The dropout probability for the low-rank adaptation weights\n", - " lora_dropout=0.1,\n", - " # The bias mode for the low-rank adaptation\n", - " bias=\"none\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### DPO configuration\n", - "\n", - "First, define training arguments.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# DPO configuration\n", - "training_args = DPOConfig(\n", - " # The beta parameter for the DPO loss function\n", - " #beta is the temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5 . \n", - " beta=0.1,\n", - " # The output directory for the training\n", - " output_dir=\"dpo\",\n", - " # The number of training epochs\n", - " num_train_epochs=5,\n", - " # The batch size per device during training\n", - " per_device_train_batch_size=1,\n", - " # The batch size per device during evaluation\n", - " per_device_eval_batch_size=1,\n", - " # Whether to remove unused columns from the dataset\n", - " remove_unused_columns=False,\n", - " # The number of steps between logging training progress\n", - " logging_steps=10,\n", - " # The number of gradient accumulation steps\n", - " gradient_accumulation_steps=1,\n", - " # The learning rate for the optimization\n", - " learning_rate=1e-4,\n", - " # The evaluation strategy (e.g., after each step or epoch)\n", - " evaluation_strategy=\"epoch\",\n", - " # The number of warmup steps for the learning rate scheduler\n", - " warmup_steps=2,\n", - " # Whether to use 16-bit (float16) precision\n", - " fp16=False,\n", - " # The number of steps between saving checkpoints\n", - " save_steps=500,\n", - " # The maximum number of checkpoints to keep\n", - " 
#save_total_limit=2,\n", - " # The reporting backend to use (set to 'none' to disable, you can also report to wandb or tensorboard)\n", - " report_to='none'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### DPO training\n", - "\n", - "The next step is to create the trainer using the `DPOTrainer` class.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer.pad_token = tokenizer.eos_token\n", - "\n", - "# Create a DPO trainer\n", - "# This trainer will handle the fine-tuning of the model using the DPO technique\n", - "trainer = DPOTrainer(\n", - " # The model to be fine-tuned\n", - " model,\n", - " # The reference model (not used in this case because LoRA has been used)\n", - " ref_model=None,\n", - " # The DPO training configuration\n", - " args=training_args,\n", - " # The beta parameter for the DPO loss function\n", - " beta=0.1,\n", - " # The training dataset\n", - " train_dataset=train_dataset,\n", - " # The evaluation dataset\n", - " eval_dataset=eval_dataset,\n", - " # The tokenizer for the model\n", - " tokenizer=tokenizer,\n", - " # The PEFT (Parameter-Efficient Fine-Tuning) configuration\n", - " peft_config=peft_config,\n", - " # The maximum prompt length\n", - " max_prompt_length=512,\n", - " # The maximum sequence length\n", - " max_length=512,\n", - " )\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please note that when the base model is trained with LoRA, it's efficient to leave the `ref_model` parameter as `None`, in which case the DPOTrainer will unload the adapter for reference inference.\n", - "\n", - "\n", - "Now, you're all set for training the model.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Training model\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Keep in mind that training the model on a CPU can be time-consuming and may cause the kernel to crash 
due to memory issues. If this happens, you can bypass training by loading the pre-trained model provided in the next section and proceed from there.**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Start the training process\n", - "trainer.train()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's retrieve and plot the training loss versus evaluation loss.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Retrieve log_history and save it to a dataframe\n", - "log = pd.DataFrame(trainer.state.log_history)\n", - "log_t = log[log['loss'].notna()]\n", - "log_e = log[log['eval_loss'].notna()]\n", - "\n", - "# Plot train and evaluation losses\n", - "plt.plot(log_t[\"epoch\"], log_t[\"loss\"], label = \"train_loss\") \n", - "plt.plot(log_e[\"epoch\"], log_e[\"eval_loss\"], label = \"eval_loss\") \n", - "plt.legend() \n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![image](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/7KEnvtpUyNcJTINdArLf7A/loss%20dpo.png)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the DPO model you just trained\n", - "dpo_model = AutoModelForCausalLM.from_pretrained('./dpo/checkpoint-250')\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Loading trained model\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you have difficulty running the training cell due to resource limitations, you can download the already fine-tuned model instead: \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the URL and the filename\n", - "url = 
'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/YIDeT3qihEpWChdXN_RmTg/DPO-tar.gz'\n", - "filename = './DPO.tar'\n", - "\n", - "# Download the file\n", - "response = requests.get(url)\n", - "\n", - "# Save the file locally\n", - "with open(filename, 'wb') as f:\n", - " f.write(response.content)\n", - "\n", - "# Extract the tar file\n", - "if tarfile.is_tarfile(filename):\n", - " with tarfile.open(filename, 'r') as tar:\n", - " tar.extractall()\n", - " print(\"Files extracted:\", tar.getnames())\n", - "else:\n", - " print(\"The downloaded file is not a tar file.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then, load it into the model for further inference:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the pre-trained DPO model you just downloaded\n", - "dpo_model = AutoModelForCausalLM.from_pretrained('./DPO')\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Generation\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the GPT-2 tokenizer\n", - "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Set a seed for reproducibility\n", - "set_seed(42)\n", - "\n", - "\n", - "# Define the generation configuration for the DPO model\n", - "# This sets the parameters for text generation\n", - "generation_config = GenerationConfig(\n", - " # Use sampling to generate diverse text\n", - " do_sample=True,\n", - " # Top-k sampling parameter\n", - " top_k=1,\n", - " # Temperature parameter to control the randomness of the generated text\n", - " temperature=0.1,\n", - " # Maximum number of new tokens to generate\n", - " max_new_tokens=25,\n", - " # Use the end-of-sequence token as the padding token\n", - " 
pad_token_id=tokenizer.eos_token_id\n", - " )\n", - "\n", - "# Define the input prompt for text generation\n", - "PROMPT = \"Is a higher octane gasoline better for your car?\"\n", - "# Encode the prompt using the tokenizer\n", - "inputs = tokenizer(PROMPT, return_tensors='pt')\n", - "\n", - "# Generate text using the DPO model\n", - "outputs = dpo_model.generate(**inputs, generation_config=generation_config)\n", - "# Decode the generated text and print it\n", - "print(\"DPO response:\\t\",tokenizer.decode(outputs[0], skip_special_tokens=True))\n", - "\n", - "# Load the pre-trained GPT-2 model\n", - "gpt2_model = AutoModelForCausalLM.from_pretrained('gpt2')\n", - "# Generate text using the GPT-2 model\n", - "outputs = gpt2_model.generate(**inputs, generation_config=generation_config)\n", - "# Decode the generated text and print it\n", - "print(\"\\nGPT2 response:\\t\",tokenizer.decode(outputs[0], skip_special_tokens=True))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Although the model is trained on a small data set for only 5 epochs, the response generated by the DPO-tuned model is more concise and straightforward.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Exercise\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercise 1: Preprocess the `argilla/ultrafeedback-binarized-preferences-cleaned` Dataset\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This data set comprises user-generated prompts along with corresponding responses categorized as either \"chosen\" or \"rejected.\" It provides a rich source of binary feedback, making it ideal for training models to align with user preferences.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Load the `argilla/ultrafeedback-binarized-preferences-cleaned` data set\n" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "dataset = load_dataset(\"argilla/ultrafeedback-binarized-preferences-cleaned\")\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset['train']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Set the variable cnt to 50 and then select the first 50 (cnt) examples to reduce the volume of data for resource limitations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "cnt = 50 # You can adjust this count based on your requirements\n", - "\n", - "# Select the first cnt examples\n", - "dataset['train'] = dataset['train'].select(range(cnt))\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Create a function named `process` that takes a row of data as input. Within this function, remove unwanted columns such as `source, chosen-rating, chosen-model, rejected-rating, and rejected-model`. Then, use the map function to apply the process function to each row in the training data set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "def process(row):\n", - " # Delete unwanted columns\n", - " del row[\"source\"]\n", - " del row[\"chosen-rating\"]\n", - " del row[\"chosen-model\"]\n", - " del row[\"rejected-rating\"]\n", - " del row[\"rejected-model\"]\n", - " \n", - " # Retrieve the actual response text\n", - " row[\"chosen\"] = row[\"chosen\"][-1][\"content\"]\n", - " row[\"rejected\"] = row[\"rejected\"][-1][\"content\"]\n", - " \n", - " return row\n", - "\n", - "# Apply the data processing function to the dataset\n", - "dataset['train'] = dataset['train'].map(\n", - " process,\n", - " num_proc=multiprocessing.cpu_count(),\n", - " load_from_cache_file=False,\n", - ")\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Split the data set into training and evaluation sets:\n", - "Calculate the size for the training set as 80% of the total data. The remaining 20% will be used for evaluation.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "train_size = int(0.8 * len(dataset['train'])) # 80% for training\n", - "eval_size = len(dataset['train']) - train_size # Remaining 20% for evaluation\n", - "\n", - "train_dataset = dataset['train'].select(range(train_size))\n", - "eval_dataset = dataset['train'].select(range(train_size, train_size + eval_size))\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_dataset[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercise 2: Prompt Inferencing and Comparison with GPT-2\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "PROMPT = input()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Initialize the GPT-2 Tokenizer\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Create a generation_config object to set the parameters for text generation.\n", - "- do_sample=True (It enables sampling, which allows for more diverse outputs.)\n", - "- top_k=1 (It specifies the number of highest probability vocabulary tokens to consider during generation.)\n", - "- temperature=0.1 (It controls the randomness of the output.)\n", - "- max_new_tokens=25 (It sets the maximum number of new tokens to generate during inference.)\n", - "- pad_token_id=tokenizer.eos_token_id (It specifies the token to use for padding.)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "generation_config = GenerationConfig(\n", - " # Use sampling to generate diverse text\n", - " do_sample=True,\n", - " # Top-k sampling parameter: controls the number of highest probability tokens to consider\n", - " top_k=1,\n", - " # Temperature parameter: controls the randomness of the generated text\n", - " temperature=0.1,\n", - " # Maximum number of new tokens to generate\n", - " max_new_tokens=25,\n", - " # Use the end-of-sequence token as the padding token\n", - " pad_token_id=tokenizer.eos_token_id\n", - ")\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Create a function named `generate_dpo_response` that takes a prompt as input and generates a response using the DPO model (`dpo_model`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "def generate_dpo_response(prompt):\n", - " # Tokenize the prompt\n", - " inputs = tokenizer(prompt, return_tensors='pt')\n", - "\n", - " # Generate text using the DPO model\n", - " outputs = dpo_model.generate(**inputs, generation_config=generation_config)\n", - " \n", - " # Decode and return the response\n", - " return tokenizer.decode(outputs[0], skip_special_tokens=True)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Create another function named `generate_gpt2_response` that takes a prompt as input and generates a response using the GPT-2 model (`gpt2_model`).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "def generate_gpt2_response(prompt):\n", - " # Tokenize the prompt\n", - " inputs = tokenizer(prompt, return_tensors='pt')\n", - "\n", - " # Generate text using the GPT-2 model\n", - " outputs = gpt2_model.generate(**inputs, generation_config=generation_config)\n", - " \n", - " # Decode and return the response\n", - " return tokenizer.decode(outputs[0], skip_special_tokens=True)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Call both functions with a prompt and compare the responses.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "
\n", - " Click here for hint\n", - "\n", - "```python\n", - "# Generate responses\n", - "dpo_response = generate_dpo_response(PROMPT)\n", - "gpt2_response = generate_gpt2_response(PROMPT)\n", - "\n", - "# Print the responses\n", - "print(\"DPO response:\\t\", dpo_response)\n", - "print(\"\\nGPT-2 response:\\t\", gpt2_response)\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Congratulations! You have completed the lab!\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Authors\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a Ph.D. candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.\n", - "\n", - "[Kunal Makwana](https://author.skills.network/instructors/kunal_makwana) is a Data Scientist at IBM and is currently pursuing his Master's in Computer Science at Dalhousie University.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References\n", - "[DPO Trainer](https://huggingface.co/docs/trl/main/en/dpo_trainer)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "© Copyright IBM Corporation. All rights reserved.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.19" - }, - "prev_pub_hash": "21ff78b44c97c4a9c4f0d7965c976d0f5a40a6c0de593f10a90787e44e4637df" - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/PPOTrainer-v1.ipynb b/notebooks/LLM_Specialization/Generative AI Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/PPOTrainer-v1.ipynb deleted file mode 100755 index 78c8832..0000000 --- a/notebooks/LLM_Specialization/Generative AI 
Advance Fine-Tuning for LLMs/Reinforcement Learning from Human Feedback/PPOTrainer-v1.ipynb +++ /dev/null @@ -1,2190 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

\n", - " \n", - " \"Skills\n", - " \n", - "

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Reinforcement Learning from Human Feedback Using PPO\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Estimated time needed: **30** minutes\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "Imagine you are an AI engineer who wants to train a \"Happy LLM\" and a \"Pessimistic LLM\" to train customer service agents. You have a reward function trained on the sentiment classifier from the IMDb dataset, and you will now use Reinforcement Learning (RL). RL is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent, in this case, will be the LLM, and the decisions will be about what text to output. Unlike supervised learning, which requires labeled input/output pairs, RL relies on the agent exploring the environment and learning from the feedback it receives in the form of rewards or penalties. This trial-and-error approach enables the agent to improve its decision-making strategy over time.\n", - "\n", - "Proximal Policy Optimization (PPO) is one of the most effective and widely used RL algorithms. Introduced by OpenAI, PPO strikes a balance between simplicity and performance, making it a popular choice for training RL agents. PPO optimizes the policy directly and employs mechanisms to ensure the updates are not too drastic, thereby maintaining stability and reliability during training.\n", - "\n", - "In this lab, you will be guided through the process of training an RL agent using the PPO algorithm with a focus on sentiment analysis. You will use the IMDb dataset, a large collection of movie reviews, to train your model. 
By the end of this lab, you will have a solid understanding of how to implement and train an RL agent using PPO, and you will be equipped with practical skills to apply RL techniques to other problems and datasets.\n", - "This lab is based on [a HF example code titled `Tune GPT2 to generate positive reviews`](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## __Table of Contents__\n", - "\n", - "
    \n", - "
  1. Objectives
  2. \n", - "
  3. \n", - " Setup\n", - "
      \n", - "
    1. Installing required libraries
    2. \n", - "
    3. Importing required libraries
    4. \n", - "
    5. Defining helper functions
    6. \n", - "
    \n", - "
  4. \n", - "
  5. Initializing the PPO configuration, model, and tokenizer
  6. \n", - "
  7. Dataset and dataset tokenization
  8. \n", - "
  9. Collator function
  10. \n", - "
  11. Initialize PPOTrainer
  12. \n", - "
  13. Reward function
  14. \n", - "
  15. \n", - " Generating responses using PPO\n", - "
      \n", - "
    1. Tokenizing and preparing the input batch
    2. \n", - "
    3. Scoring function
    4. \n", - "
    5. Proximal policy optimization
    6. \n", - "
    \n", - "
  16. \n", - "
  17. Plotting PPO training loss and mean
  18. \n", - "
  19. Generating and analyzing text with PPO and reference models
  20. \n", - "
  21. \n", - " Comparing PPO and reference models on\n", - "
      \n", - "
    \n", - "
  22. \n", - "
  23. Running the PPO model with negative sentiment
  24. \n", - "
  25. Comparing models with negative sentiment
  26. \n", - "
  27. Exercise: Comparing PPO models
  28. \n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Objectives\n", - "\n", - "After completing this lab you will be able to:\n", - "\n", - "- Apply the basics of reinforcement learning and proximal policy optimization (PPO).\n", - "- Set up the environment and load the IMDb dataset for training.\n", - "- Define and configure the PPO agent and tokenizer.\n", - "- Implement the PPO training loop.\n", - "- Generate and evaluate text responses from the trained model.\n", - "- Compare the performance of two models on the dataset.\n", - "- Save and load the trained model for future use.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "----\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For this lab, you will use the following libraries:\n", - "\n", - "* [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.\n", - "* [`torch`](https://pytorch.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for tensor operations and model training.\n", - "* [`tqdm`](https://tqdm.github.io/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for progress bars.\n", - "* [`transformers`](https://huggingface.co/transformers/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for pretrained language models.\n", - "* 
[`datasets`](https://huggingface.co/docs/datasets/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for loading and processing datasets.\n", - "* [`trl`](https://github.com/lvwerra/trl/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for Proximal Policy Optimization (PPO) training.\n", - "* [`matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for plotting tools.\n", - "* [`tarfile`](https://docs.python.org/3/library/tarfile.html/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for handling tar file operations.\n", - "* [`pickle`](https://docs.python.org/3/library/pickle.html/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for serializing and deserializing Python objects.\n", - "* [`json`](https://docs.python.org/3/library/json.html/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for parsing and writing JSON data.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Installing required libraries\n", - "\n", - "The following required libraries are __not__ preinstalled in the Skills Network Labs environment. 
__You must run the following cell__ to install them:\n", - "\n", - "**Note:** The version has been pinned to specify the version. It's recommended that you do this as well. Even if the library is updated in the future, the installed library could still support this lab work.\n", - "\n", - "This might take approximately 1 minute. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install datasets trl==0.11.0\n", - "!pip install --upgrade typing_extensions\n", - "!pip install matplotlib" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Importing required libraries\n", - "\n", - "_It is recommended that you import all required libraries in one place (here):_\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from tqdm import tqdm\n", - "import pandas as pd\n", - "\n", - "tqdm.pandas()\n", - "\n", - "from transformers import pipeline, AutoTokenizer,AutoModelForCausalLM\n", - "from datasets import load_dataset\n", - "\n", - "from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead\n", - "from trl.core import LengthSampler\n", - "import os\n", - "\n", - "import tarfile\n", - "import pickle\n", - "import json\n", - "import matplotlib.pyplot as plt\n", - "import torch\n", - "import pandas as pd\n", - "import warnings\n", - "\n", - "warnings.filterwarnings('ignore')\n", - "# Disable warnings for a cleaner notebook or console experience\n", - "def warn(*args, **kwargs):\n", - " pass\n", - "warnings.warn = warn" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Defining helper functions\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - 
"source": [ - "def save_to_json(data, file_path):\n", - " \"\"\"\n", - " Save a dictionary to a JSON file.\n", - "\n", - " Args:\n", - " data (dict): The dictionary to save.\n", - " file_path (str): The path to the JSON file.\n", - " \"\"\"\n", - " with open(file_path, 'w') as json_file:\n", - " json.dump(data, json_file, indent=4)\n", - " print(f\"Data successfully saved to {file_path}\")\n", - " \n", - " \n", - "def load_from_json(file_path):\n", - " \"\"\"\n", - " Load data from a JSON file.\n", - "\n", - " Args:\n", - " file_path (str): The path to the JSON file.\n", - "\n", - " Returns:\n", - " dict: The data loaded from the JSON file.\n", - " \"\"\"\n", - " with open(file_path, 'r') as json_file:\n", - " data = json.load(json_file)\n", - " return data \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def pad_sequence_to_length(tensor, length, pad_token_id):\n", - " padding_length = length - tensor.size(0)\n", - " if padding_length > 0:\n", - " padding = torch.full((padding_length,), pad_token_id, dtype=torch.long, device=tensor.device)\n", - " return torch.cat((tensor, padding))\n", - " return tensor\n", - "\n", - "def pad_list_to_batch_size(tensors, batch_size, pad_token_id):\n", - " max_length = max(t.size(0) for t in tensors)\n", - " padded_tensors = [pad_sequence_to_length(t, max_length, pad_token_id) for t in tensors]\n", - "\n", - " # Add additional padding-only tensors if needed\n", - " while len(padded_tensors) < batch_size:\n", - " padded_tensors.append(torch.full((max_length,), pad_token_id, dtype=torch.long, device=tensors[0].device))\n", - "\n", - " return padded_tensors[:batch_size]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def print_ppo_stats(stats, related_to_objective=False):\n", - " print(\"PPO Training Statistics\\n\")\n", - "\n", - " if related_to_objective:\n", - " print(\"Objective 
Statistics:\")\n", - " print(f\" KL Divergence (objective/kl): {stats['objective/kl']}\")\n", - " print(f\" KL Coefficient (objective/kl_coef): {stats['objective/kl_coef']}\")\n", - " print(f\" Entropy (objective/entropy): {stats['objective/entropy']}\\n\")\n", - " \n", - " print(\"PPO Losses (Related to Minimizing Objective Function):\")\n", - " print(f\" Policy Loss (ppo/loss/policy): {stats['ppo/loss/policy']}\")\n", - " print(f\" Value Loss (ppo/loss/value): {stats['ppo/loss/value']}\")\n", - " print(f\" Total Loss (ppo/loss/total): {stats['ppo/loss/total']}\\n\")\n", - " \n", - " print(\"PPO Policy Statistics:\")\n", - " print(f\" Policy Entropy (ppo/policy/entropy): {stats['ppo/policy/entropy']}\")\n", - " print(f\" Approx KL (ppo/policy/approxkl): {stats['ppo/policy/approxkl']}\")\n", - " print(f\" Clip Fraction (ppo/policy/clipfrac): {stats['ppo/policy/clipfrac']}\\n\")\n", - " else:\n", - " print(\"Reward and Value Function Estimation:\")\n", - " print(f\" Mean Non-Score Reward (ppo/mean_non_score_reward): {stats['ppo/mean_non_score_reward']}\")\n", - " print(f\" Mean Scores (ppo/mean_scores): {stats['ppo/mean_scores']}\")\n", - " print(f\" Std Scores (ppo/std_scores): {stats['ppo/std_scores']}\")\n", - " print(f\" Value Prediction (ppo/val/vpred): {stats['ppo/val/vpred']}\")\n", - " print(f\" Value Prediction Error (ppo/val/error): {stats['ppo/val/error']}\")\n", - " print(f\" Value Prediction Variance (ppo/val/var): {stats['ppo/val/var']}\")\n", - " print(f\" Value Prediction Mean (ppo/val/mean): {stats['ppo/val/mean']}\")\n", - " print(f\" Explained Variance (ppo/val/var_explained): {stats['ppo/val/var_explained']}\\n\")\n", - " \n", - " print(\"Token Lengths:\")\n", - " print(f\" Queries Length Mean (tokens/queries_len_mean): {stats['tokens/queries_len_mean']}\")\n", - " print(f\" Responses Length Mean (tokens/responses_len_mean): {stats['tokens/responses_len_mean']}\\n\")\n", - " \n", - " print(\"Time Statistics:\")\n", - " print(f\" Total Time 
(time/ppo/total): {stats['time/ppo/total']} seconds\\n\")\n", - "\n", - "# Example usage with the provided stats and the flag" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initializing the PPO configuration, model, and tokenizer\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `PPOConfig` class is used to specify the model and learning rate for the PPO training. In this case, the model is `\"lvwerra/gpt2-imdb\"` and the learning rate is set to `1.41e-5`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config = PPOConfig(\n", - " model_name=\"lvwerra/gpt2-imdb\",\n", - " learning_rate=1.41e-5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please ignore above warning as the trl version you installed supports this module.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`config.model_name` refers to the specific model identifier used in the configuration for loading the pretrained model. It specifies which model to load from the Hugging Face model repository. In this case, `config.model_name` is set to `\"lvwerra/gpt2-imdb\"`, indicating that the GPT-2 model fine-tuned on the IMDB dataset (by user lvwerra) should be used. 
This identifier is essential for loading the correct model architecture and weights during the fine-tuning or inference process.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "config.model_name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `sent_kwargs` dictionary contains parameters for the sentiment analysis pipeline, specifying that all scores should be returned, the function to apply is `\"none\"`, and the batch size is `2`.\n", - "python\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sent_kwargs = {\"top_k\":None, \"function_to_apply\": \"none\", \"batch_size\": 2}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `AutoModelForCausalLMWithValueHead` class is used to load the pretrained GPT-2 model with a value head for PPO training. The model is loaded from the specified model name in the configuration.\n", - "\n", - "The `AutoTokenizer` class is used to load the tokenizer corresponding to the pretrained model. 
The tokenizer's padding token is set to the end-of-sequence (EOS) token.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_1 = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)\n", - "\n", - "tokenizer = AutoTokenizer.from_pretrained(config.model_name)\n", - "tokenizer.pad_token = tokenizer.eos_token" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please ignore the above warning, as the trl version you installed handles it automatically.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# first model\n", - "model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "During PPO training, this model is updated. The reference model is not updated; it stabilizes training through the Kullback-Leibler (KL) divergence between the current policy and the reference policy, which acts as a regularization term.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dataset and dataset tokenization\n", - "\n", - "**Dataset Name:** IMDB\n", - "\n", - "**Description:** The IMDB dataset is a collection of 50,000 movie reviews labeled as \"positive\" or \"negative,\" indicating the sentiment of each review. 
This dataset is commonly used for sentiment analysis tasks.\n", - "\n", - "**Loading the Dataset:**\n", - "The dataset is loaded using the `load_dataset` function from the `datasets` library, specifically loading the \"train\" split.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_name = \"imdb\"\n", - "ds = load_dataset(dataset_name, split = \"train\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "N = 5\n", - "for sample in range(N):\n", - " print('text',ds[sample]['text'])\n", - " print('label',ds[sample]['label'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " Rename the column \"text\" to \"review\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds = ds.rename_columns({\"text\": \"review\"})\n", - "ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The dataset is filtered to include only reviews that are longer than 200 characters.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds = ds.filter(lambda x: len(x[\"review\"]) > 200, batched=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using a ```LengthSampler``` to sample different text lengths during data processing introduces variability, making the model more robust and capable of handling varying input lengths in real-world scenarios. This approach prevents overfitting by exposing the model to diverse input sizes, improving generalization to new data. It also ensures efficient training by managing the length of text inputs, maintaining practicality and performance. Overall, LengthSampler enhances model adaptability and effectiveness by simulating realistic, varied training conditions. 
The sampled length is at least ```input_min_text_length``` and less than ```input_max_text_length```.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_min_text_length, input_max_text_length = 2, 8" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a ```LengthSampler``` object\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_size = LengthSampler(input_min_text_length, input_max_text_length)\n", - "input_size" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This code uses the ```input_size``` object, an instance of ```LengthSampler```, to sample and print a random text length from 2 up to (but not including) 8 in each of 10 iterations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(10):\n", - " size=input_size()\n", - " print(f\"sample {i} has length {size}\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, you will need to sample tokens and obtain tokenized indexes. 
Let's verify this process with one sample.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample=ds[0]\n", - "sample" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, tokenize the ```review``` text into input IDs, truncate the tokenized sequence to the desired length, and assign it to ```input_ids```.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n", - "sample[\"input_ids\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Decode the truncated input IDs back into text and assign it to 'query'; you will need the raw text for the reward function.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n", - "sample[\"query\"] " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The function below combines the steps of tokenizing the 'review' text, truncating it to the desired length, and decoding it back to text. 
This allows you to apply it to the dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def tokenize(sample):\n", - " sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n", - " sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n", - " return sample" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can apply the ```tokenize``` function to the dataset:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds = ds.map(tokenize, batched=False)\n", - "ds.set_format(type=\"torch\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: you can safely ignore the above warning.\n", - "You can now inspect the first sample after tokenization:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can now iterate over the dataset, printing the first 5 samples with their 'review' and the added 'input_ids' and 'query':\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i, sample in enumerate(ds):\n", - " if i >= 5:\n", - " break\n", - " print(f\"Sample {i+1}:\")\n", - " print(f\"Review: {sample['review']}\")\n", - " print(f\"Input IDs: {sample['input_ids']}\")\n", - " print(f\"Query: {sample['query']}\")\n", - " print(\"-\" * 50)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The ```build_dataset``` function incorporates the necessary steps to build a dataset object for use as an input to ```PPOTrainer```. 
You will then reinstantiate the dataset object.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "del(ds)\n", - "dataset_name=\"imdb\"\n", - "ds = load_dataset(dataset_name, split=\"train\")\n", - "ds = ds.rename_columns({\"text\": \"review\"})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def build_dataset(config, dataset_name=\"imdb\", input_min_text_length=2, input_max_text_length=8,tokenizer=tokenizer):\n", - " \"\"\"\n", - " Build dataset for training. This builds the dataset from `load_dataset`, one should\n", - " customize this function to train the model on its own dataset.\n", - "\n", - " Args:\n", - " dataset_name (`str`):\n", - " The name of the dataset to be loaded.\n", - "\n", - " Returns:\n", - " ds (`datasets.Dataset`):\n", - " The filtered and tokenized dataset, formatted as torch tensors.\n", - " \"\"\"\n", - " \n", - " tokenizer = AutoTokenizer.from_pretrained(config.model_name)\n", - " tokenizer.pad_token = tokenizer.eos_token\n", - " # load imdb with datasets\n", - " ds = load_dataset(dataset_name, split=\"train\")\n", - " ds = ds.rename_columns({\"text\": \"review\"})\n", - " ds = ds.filter(lambda x: len(x[\"review\"]) > 200, batched=False)\n", - "\n", - " input_size = LengthSampler(input_min_text_length, input_max_text_length)\n", - "\n", - " def tokenize(sample):\n", - " sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n", - " sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n", - " return sample\n", - "\n", - " ds = ds.map(tokenize, batched=False)\n", - " ds.set_format(type=\"torch\")\n", - " return ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create the dataset object \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = build_dataset(config)" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "You can see each sample has ```input_ids``` and ```query```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Collator function \n", - "The collator function is crucial for preparing data batches in a format suitable for the PPOTrainer. It turns a list of per-sample dictionaries into a single dictionary that maps each key to the list of that key's values across the batch.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def collator(data):\n", - " return dict((key, [d[key] for d in data]) for key in data[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The collator function is best understood with an example. You can input two samples, each with 'input_ids', 'query', and 'review'.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data = [\n", - " {'input_ids': [1, 2, 3, 4], 'query': \"sample text\", 'review': \"This is a sample review.\"},\n", - " {'input_ids': [5, 6, 7, 8], 'query': \"another sample\", 'review': \"Another sample review.\"}\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Apply the collator function to the above data\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch = collator(data)\n", - "batch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, 'input_ids', 'query', and 'review' each map to a list holding the values from both samples.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initialize PPOTrainer \n", - "\n", - "Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that is particularly well-suited for training generative models, including those used for chatbots. 
It helps address specific challenges in training these models, such as maintaining coherent and contextually appropriate dialogues.\n", - "\n", - "Proximal Policy Optimization (PPO) improves policy gradient methods for chatbots by using a clipped objective function, which ensures gradual and stable policy updates. This helps maintain consistent dialogue quality. Traditional policy gradient methods can lead to high variance and instability, resulting in inconsistent chatbot behavior. By limiting how far each update can move the policy, PPO balances exploring new responses and exploiting known good ones, making it more reliable for training chatbots. \n", - "\n", - "The PPO Trainer collects dialogue samples, optimizes the chatbot's policy based on these samples, and manages the neural network models. This ensures stable and efficient training, leading to high-quality, coherent, and contextually appropriate chatbot responses. \n", - "\n", - "Let's initialize the PPOTrainer with the specified configuration and components:\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```config```: Configuration settings for PPO training, such as learning rate and model name\n", - "\n", - "```model```: The primary model to be fine-tuned using PPO\n", - "\n", - "```tokenizer```: Tokenizer corresponding to the model, used for processing input text\n", - "\n", - "```dataset```: Dataset to be used for training, providing the input data for the model\n", - "\n", - "```data_collator```: Data collator to handle batching and formatting of the input data\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)\n", - "print(\"ppo_trainer object \",ppo_trainer)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please ignore the above warnings, as the trl version you installed supports this module.\n" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "Determine the appropriate device (CPU or GPU) for training with the PPO Trainer.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "device = ppo_trainer.accelerator.device\n", - "if ppo_trainer.accelerator.num_processes == 1:\n", - " device = 0 if torch.cuda.is_available() else \"cpu\" \n", - "print(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Reward function\n", - "\n", - "In reinforcement learning with PPO (Proximal Policy Optimization), a reward function is used to provide feedback on the quality of the actions taken by the policy. For a generative model like a chatbot, the reward function can evaluate the quality of the generated responses. Here’s how the sentiment analysis pipeline can be used as a reward function:\n", - "\n", - "In reinforcement learning with PPO, the sentiment analysis pipeline serves as a reward function to evaluate a chatbot's responses. By analyzing the sentiment of each response and assigning a reward based on the sentiment score, the PPO Trainer can optimize the chatbot’s policy to generate more positively received and engaging responses. This approach leverages sentiment analysis to provide meaningful feedback, guiding the chatbot towards improved performance in dialogue generation. 
Although not a typical reward model, it allows you to train the chatbot in a simple and effective way.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, let's initialize a sentiment analysis pipeline using a pretrained model fine-tuned on IMDB reviews.\n", - "The model predicts the sentiment of text inputs, providing scores for positive and negative sentiments.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sentiment_pipe = pipeline(\"sentiment-analysis\", model=\"lvwerra/distilbert-imdb\", device=device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You'll get the sentiment value as negative here.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text = \"this movie was really bad!!\"\n", - "sentiment_pipe(text, **sent_kwargs)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `score` key represents the model's confidence in its prediction. Higher score values indicate greater confidence in the sentiment classification, such as \"POSITIVE\" or \"NEGATIVE\". Thus, the value for `POSITIVE` class can be used to determine the reward values. For example, a high score for \"POSITIVE\" means the model is confident, which can increase rewards. Conversely, if the model isn’t confident that a review is positive, it results in a negative reward, lowering the total reward. 
This means negative sentiment reviews decrease the overall reward, while positive ones increase it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text = \"this movie was really good!!\"\n", - "sentiment_pipe(text, **sent_kwargs)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Generating responses using PPO \n", - "\n", - "### Tokenizing and preparing the input batch\n", - "This section of code demonstrates how to generate responses using the PPO (Proximal Policy Optimization) Trainer. The process involves tokenizing the input, preparing the batch for training, generating responses, and decoding the generated tokens into readable text.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The code first retrieves a batch of data from the PPO Trainer's dataloader and selects the first two entries for processing.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch = next(iter(ppo_trainer.dataloader))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The batch contains ```label```, ```input_ids```, and ```query```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch.keys()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's create a new batch containing only the first two samples from the original batch \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Let's take the first two sample in the batch\n", - "batch = {key: batch[key][0:2] for key in batch}\n", - "batch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Initialize a list of ```response_tensors``` to store the responses for scoring\n" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "metadata": {}, - "outputs": [], - "source": [ - "response_tensors = []" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code extracts the `input_ids` from the `batch` and assigns them to `query_tensors`. These tensors represent the tokenized input sequences that will be used in the subsequent steps. They are called \"query tensors\" because they represent the initial input queries that will be processed by the model to generate responses.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "query_tensors = batch[\"input_ids\"]\n", - "query_tensors" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The below code defines a lambda function `get_text` that takes a list of responses (`response`) and decodes each tensor in the list using the tokenizer, converting the tensor back to readable text. The `squeeze()` method is used to remove any dimensions of size 1 from the tensor.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "get_text = lambda response:''.join([tokenizer.decode(r.squeeze()) for r in response])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can see the original input queries in their text form.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "get_text(query_tensors)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "The dictionary `generation_kwargs` sets the parameters for generating a sequence from the LLM (Language Model). 
The parameters include:\n", - "- `\"min_length\": -1` - No minimum length for the generated text.\n", - "- `\"top_k\": 0.0` - Disables top-k filtering, so no tokens are excluded by rank.\n", - "- `\"top_p\": 1.0` - No nucleus sampling, using the entire distribution.\n", - "- `\"do_sample\": True` - Enables sampling, allowing for varied responses.\n", - "- `\"pad_token_id\": 50256` - ID of the padding token; 50256 is GPT-2's EOS token, which this notebook also uses for padding.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "generation_kwargs = {\n", - " \"min_length\": -1,\n", - " \"top_k\": 0.0,\n", - " \"top_p\": 1.0,\n", - " \"do_sample\": True,\n", - " \"pad_token_id\": 50256,\n", - "}\n", - "generation_kwargs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `output_length_sampler` is initialized with `LengthSampler(output_min_length, output_max_length)`. This object is used to sample output lengths for the generated sequences, ensuring they fall within the specified minimum and maximum length range. Varying the lengths produces more diverse outputs from the language model and prevents overly short or excessively long sequences.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "output_min_length = 4\n", - "output_max_length = 16\n", - "output_length_sampler = LengthSampler(output_min_length, output_max_length)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The code calls the `output_length_sampler` to determine a length for the generated sequences. 
The sampled length is then stored in the variable `gen_len`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "gen_len = output_length_sampler()\n", - "gen_len " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, set the `max_new_tokens` parameter in the `generation_kwargs` dictionary to the value of `gen_len`, which was sampled from `output_length_sampler`. This caps the number of new tokens generated by the language model at the sampled length, keeping responses within the desired length range.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "generation_kwargs[\"max_new_tokens\"] = gen_len\n", - "generation_kwargs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's process one sample using PPO. Start by extracting the first query tensor.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "query=query_tensors[0]\n", - "query" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's generate a response for the extracted query using the PPO trainer with the specified generation parameters (`generation_kwargs`). The generated response tensor is stored in ```response```.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "response = ppo_trainer.generate(query, **generation_kwargs)\n", - "response " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: You can safely ignore the above warning.\n", - "\n", - "You can print the decoded text of the query and response tensors using the `get_text` function, converting the generated response back into a human-readable format. 
This demonstrates how the model has appended some text to the original query.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"query:\",get_text(query))\n", - "print(\"response:\", get_text(response))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, append the newly generated tokens to the ```response_tensors``` list. The ```squeeze()``` method removes any size-1 dimensions from the shape of the tensor, and the slice ```[-gen_len:]``` keeps only the newly generated tokens, dropping the query tokens that precede them.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "response_tensors.append(response.squeeze()[-gen_len:])\n", - "print(\"newly generated tokens from response:\", get_text(response_tensors[-gen_len:]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Repeat the process for the second sample. 
This section generates a response for a given query, decodes the relevant part, and appends it to the `response_tensors` list.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "query=query_tensors[1]\n", - "gen_len = output_length_sampler()\n", - "generation_kwargs[\"max_new_tokens\"] = gen_len\n", - "response = ppo_trainer.generate(query, **generation_kwargs)\n", - "tokenizer.decode(response.squeeze()[-gen_len:], skip_special_tokens=True)\n", - "print(\"query:\",get_text(query))\n", - "print(\"response output:\", get_text(response_tensors))\n", - "response_tensors.append(response.squeeze()[-gen_len:])\n", - "print(\"newly generated tokens from response:\", get_text(response_tensors[-gen_len:]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Convert each tensor in `response_tensors` into human-readable text and store it in the `batch` dictionary under the key `response`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch[\"response\"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]\n", - "batch[\"response\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The batch now contains both `response` and `query` keys.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Scoring function \n", - "\n", - "Next, prepare the text data for sentiment analysis, which can be a part of a reward function in a PPO setup where the sentiment analysis of interactions helps determine the reward.\n", - "\n", - "Now, concatenate each `query` in the batch with its `response` to form the texts to be scored.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "texts = [q + r for q, r in 
zip(batch[\"query\"], batch[\"response\"])]\n", - "texts" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The sentiment scores (`pipe_outputs`) can be used as feedback to update the policy\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n", - "pipe_outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "These scores can be used to evaluate the quality or relevance of the generated responses, indicating the model's confidence in the likelihood of the responses being positive. The scores for the generated responses are extracted from the `pipe_outputs` list. Each element in `pipe_outputs` contains a list of scores corresponding to the model's output.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This line iterates over the `pipe_outputs` list, extracts the score from each output, converts it into a tensor, and stores it in the `rewards` list. The scores represent the model's confidence in the likelihood of the responses being positive sentences.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "positive_scores = [\n", - " item[\"score\"]\n", - " for output in pipe_outputs\n", - " for item in output\n", - " if item[\"label\"] == \"POSITIVE\"\n", - "]\n", - "rewards = [torch.tensor(score) for score in positive_scores]\n", - "rewards" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Proximal policy optimization \n", - "\n", - "The training loop is responsible for performing a single update step of the PPO algorithm. 
The inputs to this process are the query, response, and score tensors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"query:\", get_text(query_tensors))\n", - "print(\"\\n\")\n", - "print(\"response:\", get_text(response_tensors))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To meet the PPO trainer's batch size requirement of 128, pad the query and response tensors with additional samples and pad the rewards with zeros.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "batch_size=128\n", - "pad_token_id = tokenizer.pad_token_id\n", - "\n", - "query_tensors = pad_list_to_batch_size(query_tensors, batch_size, pad_token_id)\n", - "\n", - "response_tensors = pad_list_to_batch_size(response_tensors, batch_size, pad_token_id)\n", - "rewards=rewards+[torch.tensor(0) for _ in range(batch_size-len(rewards))]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, call the PPO `step` method that updates the model using the PPO algorithm with `query_tensors`, `response_tensors`, and `rewards`.\n", - "\n", - "- It uses these inputs to calculate the policy and value function losses.\n", - "- It computes the gradients and updates the policy network parameters to improve the policy.\n", - "- It ensures that the policy update stays within a certain range to avoid large policy shifts, which is a core aspect of PPO.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Note: The following code is commented out to prevent the kernel from crashing due to the absence of a GPU in the current environment. To execute this code, please download the notebook and run it in an environment equipped with a GPU. 
Simply uncomment the code before running it.*\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `stats` variable is a dictionary containing various statistics from the PPO training step. You can print out its keys using the function `print_ppo_stats`. These keys can be organized into two main categories:\n", - "\n", - "- **Minimizing the language model loss**: `related_to_objective=True`\n", - " - This includes statistics related to optimizing the model parameters, such as policy loss and value loss.\n", - "\n", - "- **Calculating the reward**:\n", - " - This involves metrics more relevant to reinforcement learning, such as advantage estimates and reward calculations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# stats.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# print_ppo_stats(stats, related_to_objective = True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# print_ppo_stats(stats)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "all_stats = []" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Set `sentiment` to `POSITIVE` to reward positive responses, or to `NEGATIVE` to reward negative ones.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sentiment = \"POSITIVE\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "This code snippet represents a training loop for the PPO (Proximal Policy Optimization) algorithm using sentiment 
analysis. The loop iterates over batches of data from the `ppo_trainer` dataloader and performs the following steps:\n", - "\n", - "1. **Extract query tensors**:\n", - " - The input IDs (query tensors) are extracted from the batch.\n", - "\n", - "2. **Generate responses**:\n", - " - For each query tensor, a response is generated using the `ppo_trainer.generate` method with the specified `generation_kwargs`.\n", - " - The responses are then decoded and added to the batch under the `response` key.\n", - "\n", - "3. **Compute sentiment scores**:\n", - " - Text data is prepared by concatenating queries and responses.\n", - " - Sentiment analysis is performed on the combined texts to compute the sentiment scores.\n", - " - The scores are converted into tensors and stored in the `rewards` list.\n", - "\n", - "4. **Run PPO step**:\n", - " - The `ppo_trainer.step` method is called to update the model using the PPO algorithm with the `query_tensors`, `response_tensors`, and `rewards`.\n", - " - This step calculates the policy and value function losses, computes gradients, and updates the policy network parameters.\n", - " - The policy update is kept within a certain range to avoid large policy shifts.\n", - "\n", - "5. **Log statistics**:\n", - " - The statistics from the PPO training step are logged and stored in the `all_stats` list.\n", - " \n", - "**Note:** Training the model on a CPU will be very time-consuming. The model has been pretrained using a GPU and saved for your convenience. You can skip the training part, proceed to the next block of code, and load the saved model. 
You can uncomment the below block of code to train the model yourself.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):\n", - "# query_tensors = batch[\"input_ids\"]\n", - "# print(f\"epoch {epoch}\")\n", - "\n", - "# #### Get response from gpt2\n", - "# response_tensors = []\n", - "# for query in query_tensors:\n", - "# gen_len = output_length_sampler()\n", - "# generation_kwargs[\"max_new_tokens\"] = gen_len\n", - "# response = ppo_trainer.generate(query, **generation_kwargs)\n", - "# response_tensors.append(response.squeeze()[-gen_len:])\n", - "# batch[\"response\"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]\n", - "\n", - "# #### Compute sentiment score\n", - "# texts = [q + r for q, r in zip(batch[\"query\"], batch[\"response\"])]\n", - "# pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n", - "# positive_scores = [\n", - "# item[\"score\"]\n", - "# for output in pipe_outputs\n", - "# for item in output\n", - "# if item[\"label\"] == sentiment\n", - "# ]\n", - "# rewards = [torch.tensor(score) for score in positive_scores]\n", - "\n", - "# #### Run PPO step\n", - "# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)\n", - "# ppo_trainer.log_stats(stats, batch, rewards)\n", - " \n", - "# all_stats.append(stats)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# # Save the model\n", - "\n", - "# model_dir = \"ppo-good\"\n", - "# os.makedirs(model_dir, exist_ok=True)\n", - "\n", - "# # Save model configuration and weights\n", - "# model_1.save_pretrained(model_dir)\n", - "# tokenizer.save_pretrained(model_dir)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!wget 
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/gSWo8GeztngSmzHpqX_RaQ/ppo-good.pkl\n", - "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/we8t5N-45dVq3VhxGwYRAg/ppo-good-tar.gz" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# File name\n", - "file_name = \"ppo-good-tar.gz\"\n", - "\n", - "# Open the tar.gz file\n", - "with tarfile.open(file_name, \"r:gz\") as tar:\n", - " # Extract all the contents into the current directory\n", - " tar.extractall()\n", - "\n", - "print(\"Extraction completed.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_dir = \"ppov3new1\"\n", - "model_1 = AutoModelForCausalLMWithValueHead.from_pretrained(model_dir)\n", - "tokenizer = AutoTokenizer.from_pretrained(model_dir)\n", - "\n", - "# Load training stats\n", - "file_name = \"ppo-good.pkl\"\n", - "with open(file_name, 'rb') as f:\n", - " all_stats = pickle.load(f)\n", - "\n", - "model_1.to(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: You can safely ignore the above warning.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Plotting PPO training loss and mean \n", - "\n", - "1. **Extracting values**:\n", - " - `loss_values`: Total loss values from `all_stats`.\n", - " - `reward_values`: Mean reward values from `all_stats`.\n", - "\n", - "2. **Plotting the loss**:\n", - " - Line plot of total loss over epochs.\n", - "\n", - "3. **Plotting the rewards**:\n", - " - Line plot of mean reward over epochs.\n", - "\n", - "4. 
**Displaying the plots**:\n", - " - Arrange and show the plots using `plt.tight_layout()` and `plt.show()`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "loss_values = [stat['ppo/loss/total'] for stat in all_stats]\n", - "reward_values = [stat['ppo/mean_scores'] for stat in all_stats]\n", - "\n", - "# Plotting the loss\n", - "plt.figure(figsize=(12, 6))\n", - "plt.subplot(2, 1, 1)\n", - "plt.plot(loss_values, label='Total Loss', color='b')\n", - "plt.xlabel('Epoch')\n", - "plt.ylabel('Loss')\n", - "plt.title('PPO Training Loss over Time')\n", - "plt.legend()\n", - "plt.grid(True)\n", - "\n", - "# Plotting the rewards\n", - "plt.subplot(2, 1, 2)\n", - "plt.plot(reward_values, label='Mean Reward', color='g')\n", - "plt.xlabel('Epoch')\n", - "plt.ylabel('Reward')\n", - "plt.title('PPO Mean Reward over Time')\n", - "plt.legend()\n", - "plt.grid(True)\n", - "\n", - "# Show the plots\n", - "plt.tight_layout()\n", - "plt.show() " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Generating and analyzing text with PPO and reference models\n", - "**Device Setup**:\n", - " - Determine if CUDA is available and set the device accordingly.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - "# Set the pipeline device\n", - "pipeline_device = 0 if device.type == \"cuda\" else -1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Text generation function**:\n", - " - `generate_some_text(input_text, my_model)`: Tokenizes input text, generates a response, and decodes it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "gen_kwargs = {\"min_length\": -1, \"max_new_tokens\":20, \"top_k\": 0.0, \"top_p\": 1.0, \"do_sample\": True, \"pad_token_id\": 
tokenizer.eos_token_id}\n", - "def generate_some_text(input_text,my_model):\n", - "# Tokenize the input text\n", - " input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)\n", - " generated_ids = my_model.generate(input_ids,**gen_kwargs )\n", - "\n", - " # Decode the generated text\n", - " generated_text_ = tokenizer.decode(generated_ids[0], skip_special_tokens=True)\n", - "\n", - " return generated_text_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Generate text with PPO model**:\n", - " - Generate text using the PPO-trained model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "input_text = \"Once upon a time in a land far\"\n", - "\n", - "generated_text=generate_some_text(input_text,model_1)\n", - "generated_text" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Sentiment Analysis**:\n", - " - Analyze the sentiment of the generated text using `sentiment_pipe`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pipe_outputs = sentiment_pipe(generated_text, **sent_kwargs)\n", - "pipe_outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Generate text with reference model**:\n", - " - Generate text using the reference model.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "generated_text = generate_some_text(input_text,ref_model)\n", - "generated_text" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Comparing PPO and reference models on \n", - "\n", - "1. **Generation Parameters**:\n", - " - Define `gen_kwargs` for text generation.\n", - "\n", - "2. **Prepare Batch**:\n", - " - Sample a batch of size `bs` from the dataset and extract query tensors.\n", - "\n", - "3. 
**Generate Responses**:\n", - " - For each query tensor, generate responses using both the reference model and the PPO model.\n", - "\n", - "4. **Decode Responses**:\n", - " - Decode the generated response tensors into human-readable text.\n", - "\n", - "5. **Compute Sentiment Scores**:\n", - " - Prepare texts by concatenating queries and responses.\n", - " - Compute sentiment scores for the responses before and after training using `sentiment_pipe`.\n", - "\n", - "6. **Store Results**:\n", - " - Store queries, responses, and sentiment scores in `game_data`.\n", - " - Convert `game_data` into a DataFrame and return it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def compare_models_on_dataset(model, ref_model, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler):\n", - " gen_kwargs = {\n", - " \"min_length\": -1, \n", - " \"top_k\": 0.0, \n", - " \"top_p\": 1.0, \n", - " \"do_sample\": True, \n", - " \"pad_token_id\": tokenizer.eos_token_id\n", - " }\n", - " \n", - " bs = 16\n", - " game_data = dict()\n", - " dataset.set_format(\"pandas\")\n", - " df_batch = dataset[:].sample(bs)\n", - " game_data[\"query\"] = df_batch[\"query\"].tolist()\n", - " query_tensors = df_batch[\"input_ids\"].tolist()\n", - "\n", - " response_tensors_ref, response_tensors = [], []\n", - "\n", - " # Get maximum position embeddings for both models\n", - " max_position_embeddings_ref = ref_model.config.max_position_embeddings\n", - " max_position_embeddings_model = model.config.max_position_embeddings\n", - "\n", - " for i in range(bs):\n", - " gen_len = output_length_sampler()\n", - "\n", - " # Convert query tensors to input IDs\n", - " input_ids = torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device)\n", - "\n", - " # ********** Process for ref_model **********\n", - " total_length_ref = input_ids.shape[-1] + gen_len\n", - " if total_length_ref > max_position_embeddings_ref:\n", - " # 
Truncate input_ids to fit within the max length\n", - " max_input_length_ref = max_position_embeddings_ref - gen_len\n", - " input_ids_ref = input_ids[:, -max_input_length_ref:]\n", - " total_length_ref = input_ids_ref.shape[-1] + gen_len\n", - " else:\n", - " input_ids_ref = input_ids\n", - " \n", - " output = ref_model.generate(\n", - " input_ids_ref, \n", - " max_new_tokens=gen_len, \n", - " **gen_kwargs\n", - " ).squeeze()[-gen_len:]\n", - " response_tensors_ref.append(output)\n", - "\n", - " # ********** Process for model **********\n", - " total_length_model = input_ids.shape[-1] + gen_len\n", - " if total_length_model > max_position_embeddings_model:\n", - " max_input_length_model = max_position_embeddings_model - gen_len\n", - " input_ids_model = input_ids[:, -max_input_length_model:]\n", - " total_length_model = input_ids_model.shape[-1] + gen_len\n", - " else:\n", - " input_ids_model = input_ids\n", - " \n", - " output = model.generate(\n", - " input_ids_model, \n", - " max_new_tokens=gen_len, \n", - " **gen_kwargs\n", - " ).squeeze()[-gen_len:]\n", - " response_tensors.append(output)\n", - "\n", - " game_data[\"response (before)\"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]\n", - " game_data[\"response (after)\"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]\n", - "\n", - " texts_before = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (before)\"])]\n", - " game_data[\"rewards (before)\"] = [output[1][\"score\"] for output in sentiment_pipe(texts_before, **sent_kwargs)]\n", - "\n", - " texts_after = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (after)\"])]\n", - " game_data[\"rewards (after)\"] = [output[1][\"score\"] for output in sentiment_pipe(texts_after, **sent_kwargs)]\n", - "\n", - " df_results = pd.DataFrame(game_data)\n", - " return df_results" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df_results = compare_models_on_dataset(model_1, ref_model, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler)\n", - "df_results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Running the PPO model with negative sentiment\n", - "\n", - "This code runs the PPO training loop with the sentiment set to NEGATIVE, which evaluates the model's performance when negative sentiment scores are prioritized. The training loop generates responses, computes sentiment scores, updates the model, and logs the statistics for each epoch.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sentiment = \"NEGATIVE\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):\n", - "# query_tensors = batch[\"input_ids\"]\n", - "# print(f\"epoch {epoch}\")\n", - "\n", - "# #### Get response from gpt2\n", - "# response_tensors = []\n", - "# for query in query_tensors:\n", - "# gen_len = output_length_sampler()\n", - "# generation_kwargs[\"max_new_tokens\"] = gen_len\n", - "# response = ppo_trainer.generate(query, **generation_kwargs)\n", - "# response_tensors.append(response.squeeze()[-gen_len:])\n", - "# batch[\"response\"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]\n", - "\n", - "# #### Compute sentiment score\n", - "# texts = [q + r for q, r in zip(batch[\"query\"], batch[\"response\"])]\n", - "# pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n", - "# negative_scores = [\n", - "# item[\"score\"]\n", - "# for output in pipe_outputs\n", - "# for item in output\n", - "# if item[\"label\"] == sentiment\n", - "# ]\n", - "# rewards = [torch.tensor(score) for score in negative_scores]\n", - "\n", - "# #### Run PPO step\n", - "# stats = 
ppo_trainer.step(query_tensors, response_tensors, rewards)\n", - "# ppo_trainer.log_stats(stats, batch, rewards)\n", - " \n", - "# all_stats.append(stats)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# # Save the model\n", - "\n", - "# model_dir = \"ppo-bad\"\n", - "# os.makedirs(model_dir, exist_ok=True)\n", - "\n", - "# # Save model configuration and weights\n", - "# model_0.save_pretrained(model_dir)\n", - "# tokenizer.save_pretrained(model_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note:** Training the model on a CPU will be very time-consuming. The model has been pretrained using a GPU and saved for your convenience. You can skip the training part, proceed to the next block of code, and load the saved model. You can also uncomment the above training block of code to train the model yourself.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/8zCp__SHRSgGVlf5yP50Ag/ppo-bad-tar.gz\n", - "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/jMW99Z9mvxesgYR-H6y6Yw/ppo-bad.pkl" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import tarfile\n", - "# File name\n", - "file_name = \"ppo-bad-tar.gz\"\n", - "\n", - "# Open the tar.gz file\n", - "with tarfile.open(file_name, \"r:gz\") as tar:\n", - " # Extract all the contents into the current directory\n", - " tar.extractall()\n", - "\n", - "print(\"Extraction completed.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import tarfile\n", - "model_dir = \"ppov3new_bad1\"\n", - "model_0 = AutoModelForCausalLMWithValueHead.from_pretrained(model_dir)\n", - "tokenizer = AutoTokenizer.from_pretrained(model_dir)\n", - "\n", - 
"# Load training stats\n", - "file_name = \"ppo-bad.pkl\"\n", - "with open(file_name, 'rb') as f:\n", - " all_stats = pickle.load(f)\n", - "\n", - "model_0.to(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">Note: You can safely ignore the above warning.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Comparing models with negative sentiment\n", - "\n", - "The code below compares the performance of the PPO-trained model (`model_0`) and the reference model on the given dataset. The `compare_models_on_dataset` function generates responses from both models, computes their sentiment scores, and returns the results in a DataFrame (`df_results`). This comparison helps evaluate how far the model trained with `sentiment` set to NEGATIVE has shifted toward generating negative responses.\n", - "\n", - "Since the dataset is fairly large, we will only use a subset of the dataset for testing.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df_results = compare_models_on_dataset(model_0, ref_model, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler)\n", - "df_results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercise: Comparing PPO models\n", - "\n", - "In this exercise, you will compare the performance of two PPO-trained models (`model_0` and `model_1`) using the `compare_models_on_dataset` function and note the difference in performance between them.\n", - "\n", - "**Compare Models**:\n", - " - Use the `compare_models_on_dataset` function to compare `model_0` and `model_1`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - " Click here for Solution\n", - "\n", - "```python\n", - "df_results = compare_models_on_dataset(model_0, model_1, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler)\n", - "df_results\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Authors\n", - "\n", - "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n", - "\n", - "[Ashutosh Sagar](https://www.linkedin.com/in/ashutoshsagar/) is completing his MS in CS from Dalhousie University. He has previous experience working with Natural Language Processing and as a Data Scientist.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Contributors\n", - "\n", - "[Hailey Quach](https://author.skills.network/instructors/hailey_quach) is a Data Scientist at IBM. She's completing her Bsc, Honors in Computer Science at Concordia University, Montreal.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References\n", - "\n", - "\n", - "[TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)\n", - "\n", - "[Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)\n", - "\n", - "[Simple, Scalable Adaptation for Neural Machine Translation](https://arxiv.org/pdf/1909.08478)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```{## Change Log}\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```{|Date (YYYY-MM-DD)|Version|Changed By|Change Description||-|-|-|-||2024-06-27|0.1|Kang Wang|Create the lab|}\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "© Copyright IBM Corporation. 
All rights reserved.\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.19" - }, - "prev_pub_hash": "febcb0ff319ab930e46d30d4d1bc1329ad2f8aa613c9a5ec96659fa44d3daf95" - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/LLM_Specialization/LoRA_with_Pytorch.ipynb b/notebooks/LLM_Specialization/LoRA_with_Pytorch.ipynb new file mode 100644 index 0000000..7aa576d --- /dev/null +++ b/notebooks/LLM_Specialization/LoRA_with_Pytorch.ipynb @@ -0,0 +1,3421 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f157571e-3e1b-477c-a61f-c1e6c062ca6f", + "metadata": {}, + "source": [ + "


\n", + "\n", + "# LoRA with PyTorch\n", + "\n", + "Estimated time needed: **60** minutes\n", + "\n", + "As an AI engineer, you are tasked with fine-tuning a model for sentiment analysis on the IMDB dataset, starting with a model that is pretrained on the AG News dataset. The model is first trained on AG News, whose extensive labeled data and broad topic categories give it a solid foundation in language understanding.\n", + "\n", + "Low-Rank Adaptation (LoRA) is then used to fine-tune the model on the IMDB dataset, adapting its knowledge to the nuances of movie reviews for sentiment analysis. This two-phase process, starting with AG News and refining with IMDB data, aims to produce a model that is both well-rounded and specialized for sentiment analysis tasks.\n", + "\n", + "**Note: If you are already familiar with training a model on the IMDB dataset, you can run the cells and then jump to the Low-Rank Adaptation (LoRA) section.**\n", + "\n", + "\n", + "![Documents Overload](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0Y15EN/docs.png)\n", + "\n", + "```Efficiency in parameter updates:``` LoRA introduces only a small fraction of additional parameters compared to the total number of parameters in a large model. This makes the training process faster and less resource-intensive because fewer parameters need to be updated during backpropagation.\n", + "\n", + "```Preservation of pretrained knowledge:``` By keeping the majority of the model's weights fixed and only adjusting them through low-rank matrices, LoRA helps preserve the rich representations that the model learned during pretraining. 
This is particularly beneficial for tasks that do not require drastic deviations from the behavior learned during pretraining.\n", + "\n", + "```Customization to specific tasks:``` Despite the minimal updates, the changes introduced by LoRA are significant enough to adapt the model to specific tasks. This lets you fine-tune large models on specialized tasks without the need for extensive retraining.\n", + "\n", + "```Reduction in overfitting:``` Because only a limited number of parameters are adapted, the risk of overfitting is lower compared to full model fine-tuning, especially when adapting to smaller datasets.\n", + "\n", + "```Scalability:``` LoRA scales well with model size. As models become larger, the relative increase in the number of parameters introduced by LoRA becomes even smaller, making it a particularly attractive option for adapting very large models.\n", + "\n", + "```Compatibility and simplicity:``` The method can be easily applied to different types of neural networks, especially those based on the transformer architecture. It doesn't require major changes to the existing architecture, which simplifies integration into existing pipelines.\n" + ] + }, + { + "cell_type": "markdown", + "id": "29f9693c-5bf2-4e3b-9974-ed709cf69670", + "metadata": {}, + "source": [ + "# __Table of Contents__\n", + "\n", + "
1. Objectives\n", + "2. Setup\n", + "    1. Install required libraries\n", + "    2. Import required libraries\n", + "    3. Defining helper functions\n",
+ "3. Data pipeline\n", + "    1. Tokenizer\n",
+ "4. IMDB dataset\n", + "    1. Dataset composition\n", + "    2. Applications\n", + "    3. Challenges\n", + "    4. Train and validate\n", + "    5. Data loader\n", + "    6. Neural network\n",
+ "5. Train the model on the full dataset\n", + "    1. Train the model\n",
+ "6. Low-Rank Adaptation (LoRA)\n", + "    1. LoRA\n", + "    2. Rank\n", + "    3. Understanding LoRA in PyTorch\n", + "    4. Applying LoRA\n", + "    5. Loading the model\n",
+ "7. Exercise: Apply LoRA to a different network\n", + "
\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "80470861-da2d-41b1-a0ee-1048d18a0857", + "metadata": {}, + "source": [ + "## Objectives\n", + "\n", + "After completing this lab you are able to:\n", + "\n", + "- Construct and train a neural network from the ground up\n", + "- Fine-tune a neural network in the conventional manner by unfreezing specific layers\n", + "- Use LoRA to fine-tune a neural network\n", + "- Comprehend the functions of LoRA and the reasons behind its effectiveness\n", + "- Save and load models that employ LoRA efficiently\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "8b717c75-4b99-4895-b615-6ba25b33fa47", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "### Install required libraries\n", + "\n", + "The following required libraries are __not__ preinstalled in the Skills Network Labs environment. __You must run the following cell__ to install them. Note that it can take between __5 and 10 minutes__ to install the required libraries:\n", + "\n", + "```bash\n", + "!pip install numpy==1.24.1\n", + "!pip install -U portalocker==2.8.2\n", + "!pip install torch==2.0.1\n", + "!pip install torchtext==0.15.2\n", + "!pip install torchdata==0.6.1\n", + "!pip install -U plotly==5.22.0\n", + "!pip install pandas==2.2.2\n", + "!pip install matplotlib==3.9.0\n", + "!pip install scikit-learn==1.5.0\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "3068bedb-32c8-45f4-92c7-bbd77fc5754e", + "metadata": {}, + "source": [ + "### Import required libraries\n", + "\n", + "The following imports the required libraries:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "7759af5f-1618-4edb-955f-e424b5357054", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n", + "Tesla P40\n", + "Import Successfully!\n" + ] + } + ], + "source": [ + "# Standard library imports\n", + "import io\n", + "import math\n", + "import os\n", + "import 
pickle\n", + "import tarfile\n", + "import tempfile\n", + "from itertools import accumulate\n", + "from urllib.request import urlopen\n", + "\n", + "# Third-party imports\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import plotly.graph_objs as go\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "import torchtext # ; torchtext.disable_torchtext_deprecation_warning()\n", + "from IPython.display import Markdown as md\n", + "from sklearn.manifold import TSNE\n", + "from torch.utils.data import DataLoader\n", + "from torch.utils.data.dataset import Dataset, random_split\n", + "from torchtext.data.functional import to_map_style_dataset\n", + "from torchtext.datasets import AG_NEWS\n", + "from torchtext.vocab import GloVe, Vectors, build_vocab_from_iterator\n", + "from tqdm import tqdm\n", + "\n", + "# Suppress warnings\n", + "def warn(*args, **kwargs):\n", + "    pass\n", + "\n", + "import warnings\n", + "warnings.warn = warn\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "# CUDA-related checks\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", + "print(torch.cuda.is_available())\n", + "print(torch.cuda.get_device_name())\n", + "\n", + "print(\"Import Successfully!\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "65025fce-7345-4630-b541-71b5703b71f8", + "metadata": {}, + "source": [ + "### Defining helper functions\n", + "\n", + "The following helper functions handle plotting, saving, and loading files. They are not the main focus of this lab, so you do not have to dwell on them for long. 
However, do run the cells in this section to define these helper functions:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "aa667f44-288d-463f-9c22-2401e3277c88", + "metadata": {}, + "outputs": [], + "source": [ + "def plot(COST,ACC):\n", + "    fig, ax1 = plt.subplots()\n", + "    color = 'tab:red'\n", + "    ax1.plot(COST, color=color)\n", + "    ax1.set_xlabel('epoch', color=color)\n", + "    ax1.set_ylabel('total loss', color=color)\n", + "    ax1.tick_params(axis='y', color=color)\n", + "\n", + "    ax2 = ax1.twinx()\n", + "    color = 'tab:blue'\n", + "    ax2.set_ylabel('accuracy', color=color) # You already handled the x-label with ax1\n", + "    ax2.plot(ACC, color=color)\n", + "    ax2.tick_params(axis='y', color=color)\n", + "    fig.tight_layout() # otherwise the right y-label is slightly clipped\n", + "\n", + "    plt.show()\n", + "\n", + "\n", + "\n", + "def save_list_to_file(lst, filename):\n", + "    \"\"\"\n", + "    Save a list to a file using pickle serialization.\n", + "\n", + "    Parameters:\n", + "        lst (list): The list to be saved.\n", + "        filename (str): The name of the file to save the list to.\n", + "\n", + "    Returns:\n", + "        None\n", + "    \"\"\"\n", + "    with open(filename, 'wb') as file:\n", + "        pickle.dump(lst, file)\n", + "\n", + "\n", + "def load_list_from_file(filename):\n", + "    \"\"\"\n", + "    Load a list from a file using pickle deserialization.\n", + "\n", + "    Parameters:\n", + "        filename (str): The name of the file to load the list from.\n", + "\n", + "    Returns:\n", + "        list: The loaded list.\n", + "    \"\"\"\n", + "    with open(filename, 'rb') as file:\n", + "        loaded_list = pickle.load(file)\n", + "    return loaded_list" + ] + }, + { + "cell_type": "markdown", + "id": "2239f5cc-059f-44ee-9bd0-f26f8880ad69", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "ad8914c5-a862-4573-adbe-94efa43a10d2", + "metadata": {}, + "source": [ + "## Data pipeline\n" + ] + }, + { + "cell_type": "markdown", + "id": 
"dddf985a-c832-457f-bfd5-7ac48e1187ad", + "metadata": {}, + "source": [ + "### Tokenizer\n", + "\n", + "A tokenizer takes as input a document and breaks it up into individual tokens. Now, you might wonder, what's a token?\n", + "This example might help you understand it better.\n", + "\n", + "Imagine a token as a puzzle piece of a jigsaw puzzle. Each word, number, or small part of a word is a token. When you tokenize a document, you break it into these puzzle pieces so that a computer can understand and work with the text more easily, just like how you solve a puzzle by arranging its pieces.\n", + "\n", + "First, import the **```get_tokenizer```** function from **```torchtext.data.utils```**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "8a5f684b-e84f-4768-a152-bc51ff205ca9", + "metadata": {}, + "outputs": [], + "source": [ + "from torchtext.data.utils import get_tokenizer" + ] + }, + { + "cell_type": "markdown", + "id": "9c2b2a53-ed87-4110-b631-8a38f5be539b", + "metadata": {}, + "source": [ + "Next, we'll create the tokenizer. We'll set it to the \"basic_english\" tokenizer that is provided by `torchtext`. The \"basic_english\" tokenizer is designed to handle basic English text and splits the text into individual tokens based on spaces and punctuation marks.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "60bc3e17-5998-4cc0-9a86-10001ce78f4f", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = get_tokenizer(\"basic_english\")" + ] + }, + { + "cell_type": "markdown", + "id": "98b0055f-f35c-4d2e-a565-e52723a493b7", + "metadata": {}, + "source": [ + "Our dataset is going to be an iterable. Therefore, we'll use a generator function **```yield_tokens```** to apply **```tokenizer```**. The purpose of the generator function **```yield_tokens```** is to yield tokenized texts one at a time. 
Instead of processing the entire dataset and returning all of the tokenized texts in one go, the generator function processes and yields each tokenized text individually as it is requested. The tokenization process is performed lazily, which means the next tokenized text is generated only when needed, saving memory and computational resources.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "e25b8369-e601-4cbd-8345-b2d9bde1247d", + "metadata": {}, + "outputs": [], + "source": [ + "def yield_tokens(data_iter):\n", + "    for _,text in data_iter:\n", + "        yield tokenizer(text)" + ] + }, + { + "cell_type": "markdown", + "id": "2bddee85-4f64-4911-aa9c-e6145681f0d6", + "metadata": {}, + "source": [ + "The following loads a pretrained word embedding model called GloVe into a variable called `glove_embedding`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "25aabfc6-0ed7-4749-a2f3-3c4693b5d96b", + "metadata": {}, + "outputs": [], + "source": [ + "# Note that GloVe embeddings are typically downloaded using:\n", + "#glove_embedding = GloVe(name=\"6B\", dim=100)\n", + "# However, the GloVe server is frequently down. 
The code below offers a workaround\n", + "\n", + "\n", + "class GloVe_override(Vectors):\n", + "    url = {\n", + "        \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", + "    }\n", + "\n", + "    def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", + "        url = self.url[name]\n", + "        name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", + "        #name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", + "        super(GloVe_override, self).__init__(name, url=url, **kwargs)\n", + "\n", + "class GloVe_override2(Vectors):\n", + "    url = {\n", + "        \"6B\": \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/tQdezXocAJMBMPfUJx_iUg/glove-6B.zip\",\n", + "    }\n", + "\n", + "    def __init__(self, name=\"6B\", dim=100, **kwargs) -> None:\n", + "        url = self.url[name]\n", + "        #name = \"glove.{}.{}d.txt\".format(name, str(dim))\n", + "        name = \"glove.{}/glove.{}.{}d.txt\".format(name, name, str(dim))\n", + "        super(GloVe_override2, self).__init__(name, url=url, **kwargs)\n", + "\n", + "try:\n", + "    glove_embedding = GloVe_override(name=\"6B\", dim=100)\n", + "except Exception:\n", + "    try:\n", + "        glove_embedding = GloVe_override2(name=\"6B\", dim=100)\n", + "    except Exception:\n", + "        glove_embedding = GloVe(name=\"6B\", dim=100)" + ] + }, + { + "cell_type": "markdown", + "id": "eaf6bdf2-2052-4b45-8e90-0ee9ffefe01f", + "metadata": {}, + "source": [ + "The following builds a vocabulary object from a pretrained GloVe word embedding model and sets the default index to the `<unk>` token:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "98e18d96-f913-4af2-8b9d-40787ff47a67", + "metadata": {}, + "outputs": [], + "source": [ + "from torchtext.vocab import vocab\n", + "\n", + "vocab = vocab(glove_embedding.stoi, 0, specials=('<unk>', '<pad>'))\n", + "vocab.set_default_index(vocab[\"<unk>\"])" + ] + }, + { + "cell_type": "markdown", + "id": "793e527e-a76b-4acf-bb08-0d027bf8d391", + "metadata": {}, + "source": [ + "The 
following prepares the text processing pipeline with the tokenizer and vocabulary. The text pipeline will be used to process the raw data strings from the dataset iterators.\n", + "\n", + "The function **```text_pipeline```** first tokenizes the input text, after which **```vocab```** is applied to get the token indices.\n", + "\n", + "The function **```label_pipeline```** simply converts labels into their integer values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c2494315-f59a-4904-84b8-d00beee9e7c2", + "metadata": {}, + "outputs": [], + "source": [ + "def text_pipeline(x):\n", + "    return vocab(tokenizer(x))\n", + "\n", + "def label_pipeline(x):\n", + "    return int(x)" + ] + }, + { + "cell_type": "markdown", + "id": "f115c03b-5da3-4142-9ee4-33accb9ab0ba", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "5766072a-eed9-4dea-8aac-857c091613d9", + "metadata": {}, + "source": [ + "## IMDB dataset \n", + "\n", + "The following loads the IMDB dataset into a temporary folder. This might take some time, so please be patient.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "7a571d7b-69a1-4bed-b66c-06f98357e59c", + "metadata": {}, + "outputs": [], + "source": [ + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/35t-FeC-2uN1ozOwPs7wFg.gz')\n", + "tar = tarfile.open(fileobj=io.BytesIO(urlopened.read()))\n", + "tempdir = tempfile.TemporaryDirectory()\n", + "tar.extractall(tempdir.name)\n", + "tar.close()" + ] + }, + { + "cell_type": "markdown", + "id": "c93501ce-1ea0-4503-a2b4-aa5335121706", + "metadata": {}, + "source": [ + "The **IMDB dataset** contains movie reviews from the Internet Movie Database (IMDB) and is commonly used for binary sentiment classification tasks. 
It's a popular dataset for training and testing models in natural language processing (NLP), particularly in the context of sentiment analysis.\n", + "\n", + "### Dataset composition\n", + "\n", + "- **Reviews**: The dataset consists of 50,000 movie reviews, divided evenly into 25,000 training and 25,000 testing samples.\n", + "- **Sentiment labels**: Each review is labeled as either positive or negative, indicating the sentiment expressed in the review. The dataset is balanced, with an equal number of positive and negative reviews in both the training and testing sets.\n", + "- **Text content**: Reviews are presented as plain text and have been preprocessed to some extent. For example, HTML tags are removed, but the text retains its original punctuation and capitalization.\n", + "- **Usage**: The dataset is commonly used to train models for binary sentiment classification, where the goal is to predict whether a given review is positive or negative based on its text content.\n", + "\n", + "### Applications\n", + "\n", + "- **Sentiment analysis**: The primary application of the IMDB dataset is in sentiment analysis, where it serves as a benchmark for various text classification algorithms.\n", + "- **Natural language processing (NLP)**: The dataset is widely used in NLP research and applications, providing a basis for testing the effectiveness of different models and approaches in understanding human language.\n", + "\n", + "### Challenges\n", + "\n", + "The dataset is small, so it's hard to train a model from scratch.\n", + "\n", + "The following class is defined to traverse the IMDB dataset. 
The need to define this class arises from the fact that the IMDB dataset is split across a large number of files:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f0b1ac02-3b52-4e4d-ac61-a33e3e429c0f", + "metadata": {}, + "outputs": [], + "source": [ + "class IMDBDataset(Dataset):\n", + "    def __init__(self, root_dir, train=True):\n", + "        \"\"\"\n", + "        root_dir: The base directory of the IMDB dataset.\n", + "        train: A boolean flag indicating whether to use training or test data.\n", + "        \"\"\"\n", + "        self.root_dir = os.path.join(root_dir, \"train\" if train else \"test\")\n", + "        self.neg_files = [os.path.join(self.root_dir, \"neg\", f) for f in os.listdir(os.path.join(self.root_dir, \"neg\")) if f.endswith('.txt')]\n", + "        self.pos_files = [os.path.join(self.root_dir, \"pos\", f) for f in os.listdir(os.path.join(self.root_dir, \"pos\")) if f.endswith('.txt')]\n", + "        self.files = self.neg_files + self.pos_files\n", + "        self.labels = [0] * len(self.neg_files) + [1] * len(self.pos_files)\n", + "        self.pos_inx=len(self.pos_files)\n", + "\n", + "    def __len__(self):\n", + "        return len(self.files)\n", + "\n", + "    def __getitem__(self, idx):\n", + "        file_path = self.files[idx]\n", + "        label = self.labels[idx]\n", + "        with open(file_path, 'r', encoding='utf-8') as file:\n", + "            content = file.read()\n", + "\n", + "        return label, content" + ] + }, + { + "cell_type": "markdown", + "id": "b43c4995-df1e-4dcf-b40d-5c4a42e1899d", + "metadata": {}, + "source": [ + "The following uses the `IMDBDataset` class defined above to create iterators for the train and test datasets:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "075f1957-46e9-4ad0-a273-600bfe5b8336", + "metadata": {}, + "outputs": [], + "source": [ + "root_dir = tempdir.name + '/' + 'imdb_dataset'\n", + "train_iter = IMDBDataset(root_dir=root_dir, train=True) # For training data\n", + "test_iter = IMDBDataset(root_dir=root_dir, train=False) # For test 
data\n", + "start=train_iter.pos_inx\n" + ] + }, + { + "cell_type": "markdown", + "id": "6242e5c7-b354-4888-a996-e81a6618a6f0", + "metadata": {}, + "source": [ + "The following prints 20 samples from the training set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "fb923043-bde7-4289-bcb0-6c5c6ba051d2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(1, 'Let me start off by saying that after watching this episode for the first time on DVD at 10 o\\'clock P.M. one night, I could not fall asleep until about 3:00 A.M.

This brief review may contain spoilers.

I\\'m a long-time fan of The Sopranos and I can safely say this is the best episode I\\'ve seen. I\\'m not saying everyone should feel this way, but I do. This episode is identical to the weekend I spent with my family, watching over my own father, comatose in the ICU before he passed.

The episode begins with Tony in an alternate reality: he is a salesman who\\'s identity has been mistaken for that of a man named Kevin Finnerty.

By the time ten minutes had gone by, I knew either Tony was dreaming, or I was watching some other show. It wasn\\'t like the normal Sopranos and I loved it.

Option 1 is confirmed when Anthony (or \"Kevin\") looks into the sky at a \"helicopter spotlight\" and we see prodding through it, a doctor with a flashlight. We see this only for a moment and the sequence plays out until we go back to real life in a situation similar to the one I just stated.

Tony has come out of the coma for only a moment. His boys take A.J. home and Carmella, overcome by stress, breaks down in the hallway: a signature moment in the episode.

For the remainder of the episode, we cut in between the real world: the family dealing with the potential negative outcome of this coma, and Tony\\'s alternate reality, which parallels what\\'s going on both in his mind and in the real world around him.

Then comes the stellar point in the episode: after A.J. finishes telling his mother he\\'s flunked school, she walks in to see Meadow sitting at Anthony\\'s side.

She approaches Tony, and utters the best line of the episode: \"Anthony, can you hear us?\" In Tony\\'s world, he enters a dark hotel room and turns on a light. He takes off his shoes and goes to the phone. He tries to dial, but he cannot--as if he were trying to say something back to Carmella, but couldn\\'t physically bring himself to do so. Not yet.

He sits down and looks out his window. A shimmering light that has reoccurred throughout the episode now seems to call to him from the other side of the city.

\"When It\\'s Cold I\\'d Like To Die\" by Moby marries perfectly with these last images and helps in creating an emotional roller-coaster of an episode.

10 out of 10.

P.S.: Watch the next episode. You find out what the light is. It\\'s wonderful.')\n", + "(1, \"This is, in my opinion, a very good film, especially for Michael Jackson lovers. It contains a message on drugs, stunning special effects, and an awesome music video.

The main film is centered around the song and music video 'Smooth Criminal.' Unlike the four-minute music video, it is normal speed and, in my opinion, much easier to watch.

The plot is rather weird, however. Michael Jackson plays a magical 'gangster' that, when he sees a shooting star, he transforms into a piece of machinery. Throughout the film, he transforms into a race car, a giant robot, and a space ship.

The robot scene in particular is a bit drawn out and strange. I found it a little out-of-whack compared to the rest of the film.

A child is kidnapped, Michael tries to save her, is tortured and beaten, and suddenly turns into a giant robot that blows up all the bad guys. A little weird? Yeah.

But besides the bizarre robot scene, it's a very good movie, and any Michael Jackson fan will enjoy both the Smooth Criminal music video and the movie.\")\n", + "(1, \"This move actually had me jumping out of my chair in anticipation of what the actors were going to do! The acting was the best, Farrah should have gotten a Oscar for this she was fabulous. James Russo was so good I hated him he was the villain and played it wonderful. There aren't many movies that have riveted me as this one. The cast was great Alfie looking shocked with those big eyes Farrah looking like a victim and you re-lived her horror as she went through it. Farrah made you feel like you were there and feeling the same anger she felt you wanted her to hurt him, yet you also knew it was the wrong thing to do. The movie had you on a roller coaster ride and you went up and down with each scene.\")\n", + "(1, \"Surprisingly not terrible and well animated for one of Disney's straight to video throw away sequels. Like the previous sequel (The Lion King 2) I was glad that Disney brought back most of the original voice actors which makes a big difference and they kept a good level of traditional animation. The plot wanders around for a while but we are distracted by an unending string of jokes ranging from hilarious to dull. To break up the detached plot and jokes they gave us some silly musical sequences, which much like the jokes, range from entertaining to a quick trip to the fridge. For the most part the MST3K-like moments are bland and full of untapped potential and really don't add a whole lot to the movie other than to act as a vehicle for an hour-long flashback. The new characters are at least likable, and the old characters are out doing their thing so I can't fault them there. 
Overall this movie in not bad and it makes for a nice frivolous filler between the more serious Lion King titles.\")\n", + "(1, \"At one end of the Eighties Warren Beatty created and starred in the literate epic Reds about the founding of the Soviet Union as seen through the eyes of iconoclast radical John Reed. It was a profound film both entertaining and with a message presented by an all star cast. At the end of the decade Warren Beatty created another kind of epic in Dick Tracy that makes no pretense to being anything other than entertainment with a whole bunch of the best actors around just having a great old time hamming it up under tons of makeup.

That both Reds and Dick Tracy could come from the same individual speaks volumes about the range this man has as a player. In this film Beatty managed to get all the famous cartoon characters from the strip and put them in one original screenplay.

The city's top mobster Big Boy Caprice is making a move to really eliminate competition. The film opens with him rubbing out Lips Manlis's henchmen in a Valentine Massacre style shooting and then Lips himself being fitted for a cement overcoat. But Caprice's moves are making him a target for Tracy.

In the meantime a third mysterious and faceless individual is looking to topple Caprice himself. Will our hero sort out this thicket of crime?

The spirit of fun this film has is truly infectious. When people like Al Pacino, Dustin Hoffman, Paul Sorvino, William Forsythe, R.G. Armstrong get themselves outrageously made-up to look like the cartoon creations of strip author Chester Gould and then indulge in an exercise of carving the biggest slice of ham, you've got to love this film.

Al Pacino got a nomination for Best Supporting Actor, but any of these guys could have, it's only that Pacino as Big Boy Caprice gets the most screen time. Only Beatty plays it completely straight, the others all seem to play off of him. Dick Tracy won Oscars for Best Art&Set Design, Best Song written by Stephen Sondheim and introduced by Madonna, Sooner Or Later. The fact he was even able to get somebody like Sondheim to write a score for this film only shows Sondheim wanted to get in on the fun. As for Madonna, the Material Girl does more than hold her own with all these acting heavyweights as club torch singer Breathless Mahoney.

Before this film, Dick Tracy movies were consigned to the B pictures and worse as Saturday afternoon serials. The only thing that rivals this all star extravaganza is a radio broadcast done for Armed Forces Radio during World War II that got to vinyl. Can you believe a cast like Bing Crosby, Bob Hope, Frank Sinatra, Dinah Shore, Jimmy Durante, Judy Garland, Frank Morgan, and the Andrews Sisters? Try and find a recording of that gem.

Until then Warren Beatty's classic comic strip for the big screen will do nicely.\")\n", + "(1, \"This is one of the best Bollywood movies i have seen up until now. Family and friends feel the same way about it. This movie is really romantic and dramatic at the same time. In my opinion we need more films or movies like this to keep the south Asian culture alive. Shahid Kapoor and Amrita Rao acted extremely well in this, also their couple attracts a lot of people to the movies. This is a must see movie, it's a family and romantic movie. This movie is also from the makers of Hum Saath Saath Hain and Hum Apke Hain Kon. This movie is their best right now... the setting of the movie was beautiful which also is a huge attraction. This movie is must see... recommended to everyone!!\")\n", + "(1, 'I recently watched this, but when it started I had no idea what the concept was about, what the topic was.....in short - I had no idea what it was. Was it a documentary, was it a comedy routine.....Well, it was BOTH.

It started a little slow, but I think that\\'s because I had absolutely no idea what type of program I was viewing. But it quickly sucked me in. The episode I watched had Robert Wuhl discussing fact and fiction in history. Mainly how we (american\\'s) learn history that isn\\'t really true - and how we got to learn what we did. He did this in such a way as to keep the viewer completely entertained, and interested. I actually learned a few things and that is a true indicator of how effective this type of program can be.

I would love to see this picked up as a series for HBO. I believe it can be just as fun and effective with a variety of topics - especially if they are \"taught\" in the same type of manner as this episode.')\n", + "(1, \"Poor Basil Rathbone, an egotistical composer who's lost his muse. He's been faking it for some time, buying his lyrics and his music from various sources. Trouble is that two of the sources (Bing Crosby music) and (Mary Martin words) happen to meet and fall in love. And then they discover what they've been doing. Complications ensue, but all is righted at the end.

Crosby and Martin sing terrifically. Mary had signed a Paramount contract and also at the same time doubled as a regular on Crosby's Kraft Music Hall Radio Show. For reasons I don't understand, movie audiences didn't take to her, so she went back to Broadway and did One Touch of Venus in 1944 and stayed there.

Basil Rathbone in one of the few times he played comedy does it very well. His ego is constantly being deflated by sidekick Oscar Levant and again I'm surprised they didn't do more films together.

As in most of Crosby's Paramount vehicles, no big production numbers, but I agree with the previous reviewer about the title tune being done as an impromptu jam session in a pawn shop. Good job by all.

A surprisingly original plot and great entertainment.\")\n", + "(1, '\"Emma\" was a product of what might be called by the First Great Jane Austen Cycle of the mid-nineties, and it was recently shown on British television, doubtless because of the interest in the author created by the Second Great Jane Austen Cycle which started with \"Pride and Prejudice\" two years ago. We currently have in the cinemas the Austen biopic \"Becoming Jane\", and ITV have recently produced three TV movies based on Austen novels. These include \"Northanger Abbey\", the only one of the six major novels not to have been filmed previously, so the cycle should now be complete. No doubt, however, there will be more to come in the near future. (There is, after all, her juvenile \"Love and Freindship\" (sic), the short novella \"Lady Susan\", and someone, somewhere, has doubtless supplied endings to her two unfinished fragments \"The Watsons\" and \"Sanditon\". Then there are all those Austen sequels churned out by modern writers\\x85\\x85\\x85).

The main character is Emma Woodhouse, a young lady from an aristocratic family in Regency England. (Not, as some reviewers have assumed, Victorian England- Austen died before Queen Victoria was even born). Emma is, financially, considerably better off than most Austen heroines such as Elizabeth Bennett or Fanny Price, and has no need to find herself a wealthy husband. Instead, her main preoccupation seems to be finding husbands for her friends. She persuades her friend Harriet to turn down a proposal of marriage from a young farmer, Robert Martin, believing that Harriet should be setting her sights on the ambitious clergyman Mr Elton. This scheme goes disastrously wrong, however, as Elton has no interest in Harriet, but has fallen in love with Emma herself. The speed with which Emma rejects his proposal makes one wonder just why she was so keen to match her friend with a man she regards (with good reason) as an unsuitable marriage partner for herself. This being a Jane Austen plot, Emma turns out to be less of a committed spinster than she seems, and she too finds herself falling in love, leading to further complications.

Emma always insists that she will not marry without affection, and when she does find a partner, the handsome Mr Knightley, we feel that this will indeed be an affectionate marriage. It does not, however, seem likely to be a very passionate one (unlike, say, that of Elizabeth Bennett and Mr Darcy). Knightley, who is sixteen years older than Emma (she is 21, he 37), and related to her by marriage, is more like a father-figure than a lover. Much more of a father-figure, in fact, than her actual father, a querulous and selfish old hypochondriac who seems more like her grandfather. When Emma is rude to her unbearably garrulous and tedious friend Miss Bates, it is Knightley who chides her for her lack of manners. (His surname is probably meant to indicate his gentlemanly nature- nineteenth-century gentlemen liked to think of themselves as the modern equivalent of mediaeval knights with their elaborate codes of chivalry). Both Gwyneth Paltrow and Jeremy Northam play their parts very well, but this is not really one of the great screen romances.

Of the other characters, I liked Juliet Stephenson\\'s vulgar Mrs Elton and Toni Collette\\'s Harriet. I know that in the novel Harriet was a naïve young teenager, whereas here she is more like the character Collette played in \"Muriel\\'s Wedding\"- a gauche, slightly overweight twentysomething, fretting about her chances of finding a man. Nevertheless, I felt that this characterisation worked well in the context of the film and did not detract from Austen\\'s themes.

\"Emma\" is one of Austen\\'s more light-hearted works, without the darker overtones of \"Mansfield Park\" or even \"Pride and Prejudice\", and this is reflected on screen. We see a world of beauty and grace, full of stately homes and elegant costumes and fine manners. Apart from the ruffianly gypsies, who make a very brief appearance, the only \"poor\" people we see are Mrs Bates and her daughter, and, as they live in the sort of picturesque rose-strewn thatched cottage which today would change hands for over £500,000, we can be sure that their poverty is relative, not absolute. In Emma\\'s world, poverty is defined as not having your own stately home. This is, of course, not a comprehensive picture of early nineteenth-century life, but nobody has ever claimed Austen as the Regency equivalent of a kitchen-sink realist. Sophisticated romantic comedy, combined with a keen eye for analysing human character, was more in her line.

I would not rate this film quite as highly as the 1994 \"Sense and Sensibility\" or the recent \"Pride and Prejudice\"- it tends to drag a bit in the middle, although it has a strong beginning and strong ending- but it is, in the main, a highly enjoyable Austen adaptation. 7/10')\n", + "(1, \"A very strong movie. Bruce is good and Brad also.

As I think there are two cities missed in the receptionist list from the list Bruce remembered.

That means the woman was a real insurance and she did her job.

Well, Novikov property seems to me work in this movie. However, I do believe in Back to the future theory of worlds' multiplicity.

So Bruce could save the world, but not his world.

In the theory of parallel worlds the man can meet himself.

And I do believe there is no problem in that. Here I disagree with Dr. Brown from Back...

But the story pf 12 Monkeys has its own beauty. Inspite of all these theories of one world or many or continuum one can believe that he is really insane and the doctor - his girlfriend was just lost.

A sequence of events which may lead her to believe that he is from the future. The bullet - well it might be some mistake, some falsification.

Well I like this movie - has to buy a DVD.

Best.\")\n", + "(0, \"I have no clue as to what this was shot on but you can definitely tell that they had no budget. Bad acting, horrible cinematography, and lame plot and some decent special effects do not make a good movie. The WWF style cinemtography will make you cry...where's the tripod?! The filmakers aimed high, but sorely missed their mark.\")\n", + "(0, \"This film can be judged from three viewpoints: as history, as a profile of Amin, as a fictional thriller.

It fails as history, it mentions in passing the coup that threw out Obote, the expulsion of the Asians, and has the Entebbe hi-jack as background, but not in any chronologically consistent time frame.

As a profile of Amin it may have been interesting, because Forest Whitaker is incredibly good, and if this was a better film, he would get an Oscar. (He got it - which proves the Oscar voters don't watch the films they vote on.) It ignores relevant historical episodes in the novel, which observed Amin and the history of Uganda from the point of view of the doctor. It tells instead the fictitious story of the Scots doctor and his impossible love life from the point of view of Amin. But the story told is the one incident that Amin was probably innocent of.

As a fictional thriller, there is no plot to hold it together. The beginning is taut - it takes cinematic liberties with the novel, but sets up the story. The character of the doctor is well-defined, but becomes lost in the second half of the film which suffers as a result.

Why the doctor decides to stay in Kampala is badly explained - seduced by power? Why he befriends no-one is strange. The character of the friend in the novel has been lost because the Scotsman has the affair instead of the black doctor - a ludicrous entanglement which does not seem even faintly believable, but allows the writers of the film to show the ferocity of Amin close at hand. The Man called Horse bit at the end is risible.

Finally in 1971, Uganda drove on the left, not right, the number plates were three letters and two or three numbers - and where are the Equator tusks?!

In short - if you've never heard of Amin, you may want to spend two hours watching this film to appreciate Forest Whitaker's acting, but the last hour will bore you to confusion. If you know Uganda or have read the book - don't see the film - it will only depress you. And if you want to know why the doctor was so foolhardy - he wasn't.\")\n", + "(0, \"This show comes up with interesting locations as fast as the travel channel. It is billed as reality but in actuality it is pure prime time soap opera. It's tries to use exotic locales as a facade to bring people into a phony contest & then proceeds to hook viewers on the contestants soap opera style.

It also borrows from an early CBS game show pioneer- Beat The Clock- by inventing situations for its contestants to try & overcome. Then it rewards the winner money. If they can spice it up with a little interaction between the characters, even better. While the game format is in slow motion versus Beat The Clock- the real accomplishment of this series is to escape reality.

This show has elements of several types of successful past programs. Reality television, hardly, but if your hooked on the contestants, locale or contest, this is your cup of tea. If your not, this entire series is as I say, drivel dripping with gravy. It is another show hiding behind the reality label which is the trend it started in 2000.

It is slick & well produced, so it might last a while yet. After all, so do re-runs of Gilligan's Island, Green Acres, The Beverly Hillbillies & The Brady Bunch. This just doesn't employ professional actors. The intelligence level is about the same.\")\n", + "(0, \"Besides all of the technical mistakes ....

How about a female flight attendant who's able to kill, all by herself, 4 out of the 7 terrorists (including ex marines), 2 of whom without even using a gun. Then, she lands the plane perfectly. We're not talking about Sigourney Weaver or Linda Hamilton; we're talking about a regular, frightened, yet very well composed flight attendant. :D How about the leader in charge of the assault/rescue squad, having a full-proof (according to the logic of the script) plan of sleep-gassing everyone and having someone from his team fly the plane. Only he decides at the spur of the moment to change plans and instead lead an attack on the terrorists, guns blazing, not knowing where the terrorists are, or how many, and not securing a position of advantage, so that his whole team gets easily wiped out. Yeah, that's using the old noggin. Only later to decide to use the sleep gas anyway. And it turns out useless for all intensive purposes.

Bad as this movie was, though, I couldn't stop myself from watching and wondering, what next? :D I can't help but imagine all the excellent, unemployed script writers thinking to themselves, it's not fair. lol! :D\")\n", + "(0, 'We purchased this series on DVD because of all of the glowing reviews we had seen here. I gave it three stars because there can be little doubt that sometimes the acting, directing and writing are brilliant. In fact they are so brilliant we did not see the propaganda that was being transmitted so smoothly on the series. If one watches it with discernment, one will see the entire litany of the radical right wing beliefs being promulgated by the Fox (Faux) News Network. To avoid giving away any spoilers I will refrain from pointing out all of the dozens of specific instances. A brief look at the plots found here on IMDb will disclose that everything from torture to gun control to the right of a network to provide \"Infomercials\" and call them news is justified with cute plot twists and impassioned speeches given by some of the best actors in the world. We watched many shows and finally gave up in disgust when they justified torture using Attorney General Gonzales as a shining example of why all kinds of torture should be used in the name of protecting all of us. The series also manages to demean male and female gays in subtle ways by using them as plot devices depicting evil people. All in all the complete litany of the radical religious right wing.

No doubt the popularity of this program will be used by future historians as proof that America lost its way in the early part of the this century. As a student of history myself I would characterize this program as being in a league with the propaganda produced by Goebbels for Hitler and some of the propaganda produced by Hollywood for the American audience during WWII.

So if you want to use this as a teaching tool to help your students understand how subtle propaganda can be then by all means do so. Just be sure to purchase an inexpensive used copy so you can avoid enriching the ultra right wingers at Faux Network who produced this travesty.')\n", + "(0, 'It says that a girl named Susan Montford both wrote and directed this \"movie.\" No wonder she has no other credits to her name for writing or directing. She made a severe vocational error in choosing this as her career. This is one of the worst human creations of this millennium.

The fundamental thing wrong with this movie other than its ridiculous story of a woman running away from four weak thugs, is the blatant and complete lack of LOGIC.

**After she leaves the mall, she gets approached by four thugs as they surround her. Tell me, what woman would aggressively SHOVE a potential attacker while being surrounded, and insult them verbally? I don\\'t mean after an attack had already started, because then of course it\\'s completely normal for someone to fight back. But she shoved that guy and pretty much escalated it to the next level. No woman would do that unless she 1) had a weapon, 2) has the confidence of knowing that backup is very close, and so is relatively safe from harm, or 3) the attackers are so young, and weak looking that she\\'s pretty sure she can take them. None of that applied in this situation, so she was just acting like someone that\\'s asking to get raped or mugged. And by the way, when the security guard approached, as SOON as he came within viewing distance of Kim Basinger, why wouldn\\'t she immediately either run towards him for help, or scream??

**When she drives off after the security guard gets shot in the head, she drives into a deserted part of town, and crashes. She had a good three minute lead on the pursuers, instead of simply running off on foot in a diagonal direction behind houses and climbing fences and continuing, she gets out her Red Toolbox and starts messing around under her hood. I understand she was trying to fix her car, but she should\\'ve ran.

(I didn\\'t even mean this to be a chronological summary of the movie, because I loathe people who do that in their reviews, but it just so happens that every main sequence of this movie has something so blatantly stupid that I have to comment on it).

**Why would she carry a loud, Red Toolbox as she\\'s trying to sneak away in the dark? When she does get caught, one of the jokers demands for her to open the toolbox. First she resists, then eventually opens it. And takes out a wrench. This scene here is so rich in subtle overtones of the complete failure of dramatic effect I have to break it down, it\\'s one of the dumbest scenes in the entire movie. When asked to open the box, she\\'s resisting at first as if it were her plan to somehow get one of the thugs to open it themselves out of anger after she didn\\'t open it, in the same way that someone in some action movie might have some device that an enemy demands that person to touch/push/open/manipulate, and once that hero refuses to open it, the enemy grabs that device, only to have that device automatically dispense a chemical/shoot him in the face/render him unconscious, which was the hero\\'s plan all along. It feels like that\\'s what they tried to do with Kim Basinger here, as she opens the toolbox dramatically and quickly takes out a WRENCH and dispatches one of the thugs, and somehow GETS AWAY from him and the three other thugs.

**Throughout the rest of the movie, basically what you see is this suburban house wife, sneaking around the woods as she carries her Red Toolbox, taking out various tools used as weapons to KILL HER ATTACKERS.

**When she was running away, how did she end up moving BACK to where the thugs were? I think it was the scene where they had that radio playing loudly in tribute to the dead dude. She somehow crept up on them when I thought she was moving AWAY from them.

**Finally, this whole premise is so weak because the whole reason she\\'s being chased in the first place is because from the thugs\\' perspective, she was a witness to a murder they committed against the security officer earlier, and so they felt they had to kill her. How ridiculous. As one of the thugs even said, they could\\'ve just left town and returned back to whatever city they drove from, no one but her had seen them anyway, and she probably didn\\'t get the license plate. Even if these possibilities wouldn\\'t work in their favor, how is raising hell and hunting down someone to kill them improving your chances to get away with the original murder?')\n", + "(0, 'Well I just gave away 95 minutes and 47 seconds that I\\'ll never get back on this piece of trash. I heard someone online describe this movie\\'s villains as \"subhuman cannibals\", and I thought it was promising because I thought it would be like the Descent. WRONG! The Descent was a psychological thriller with dynamic characters and strong storyline. These villains are totally unrealistic and no part of their performance is enjoyable to watch. This movie isn\\'t so controversial, I\\'ve seen this level of gore in many films. This movie plain sucks. SYNOPSIS: A blonde who thinks she\\'s real hot (but she isn\\'t), her admirer, and her admirer\\'s friend (no, I don\\'t remember their names) go into the woods. Their car breaks down. They are warned to leave by a man named Mark. The blonde gets unreasonably hysterical and the next morning they can\\'t find the admirer\\'s friend. Admirer impales his foot (whoops!). Don\\'t worry, he is much more upset when his car won\\'t start than when he gets impaled by nails. After a nanosecond of coaxing, the blond leaves to find help. Events ensue that I cannot remember. During this and throughout the movie, we are shown grotesque torture scenes with no substance including one that made me gag. 
Blonde goes to save admirer from house of cannibals (even though all they are seen eating is intestines, which would logically be the last choice for real cannibals to eat since they contain actual food). Blonde finds admirer hurt and works very hard (unsuccessfully) to work up tears. Then you get a good laugh when the blonde is in the house and announces she can \"out think them\". Mark (the man who warned them to leave) has a remarkable change of character when he reveals the cannibals are his family. Then there is some shooting, they leave the house, the shooting continues, then a random guy shows up and says he\\'s been watching them. Before he is shot, we are shown an acid-trip inspired scene of more killing. The blonde or her admirer shoots him because he did not help them. There\\'s more killing, the admirer professes his love for the blonde. Then a mysterious hand covers the camera. What does that imply? I don\\'t know, hopefully not a sequel.')\n", + "(0, \"This was not enjoyable to watch. Frank puts all his dreams on the back burner and gets a normal (boring!) job just so his stepson can go to film school, but his stepson decides that he'll make a humiliating documentary about the man instead. A documentary filmmaker should point the camera and simply shoot, not manipulate and comment with snide captions. The bitterness and resentment of the filmmaker towards his stepfather is obvious. And sad. The goal seems to be to make Frank appear dumb and pathetic, instead he comes across as the most human of the 3 people featured.

Essentially a smear campaign all dressed up as something much smarter and edgier than it really is. It left me with an intense dislike for the filmmaker.\")\n", + "(0, 'The plot for Descent, if it actually can be called a plot, has two noteworthy events. One near the beginning - one at the end. Together these events make up maybe 5% of the total movie time. Everything (and I mean _everything_) in between is basically the director\\'s desperate effort to fill in the minutes. I like disturbing movies, I like dark movies and I don\\'t get troubled by gritty scenes - but if you expect me to sit through 60 minutes of hazy/dark (literally) scenes with NO storyline you have another thing coming. Rosario Dawson, one of my favorite actresses is completely wasted here. And no, she doesn\\'t get naked, not even in the NC-17 version, which I saw.

If you have a couple of hours to throw away and want to watch \"Descent\", take a nap instead - you\\'ll probably have more interesting dreams.')\n", + "(0, \"There's nothing new here. All the standard romantic-comedy scenes, even down to the taxi sprinting to the airport to stop the woman flying away. The only thing that saves this is the acting of Alison Eastwood & some of the minor characters (blink and you'll miss Gabrielle Anwar), who obviously had some fun.

Turn it off when the pair are in bliss, and you won't have to go through the inevitable plot pain.\")\n" + ] + } + ], + "source": [ + "start=train_iter.pos_inx\n", + "start=0\n", + "\n", + "for i in range(-10,10):\n", + " print(train_iter[start+i])" + ] + }, + { + "cell_type": "markdown", + "id": "50a6efb5-01fb-4783-9237-ec83b1f8a3b6", + "metadata": {}, + "source": [ + "The following defines the mapping of numeric labels to positive and negative reviews:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "8963e017-2696-42cd-a0ba-eb2fb001b8c8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'positive review'" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imdb_label = {0: \" negative review\", 1: \"positive review\"}\n", + "imdb_label[1]" + ] + }, + { + "cell_type": "markdown", + "id": "66729636-d9d6-45b6-a408-10a3636ec607", + "metadata": {}, + "source": [ + "The following checks to make sure that there are exactly 2 classes in the train dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "fea7d361-1299-4e1c-a8e3-ba86a482edab", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "num_class = len(set([label for (label, text) in train_iter ]))\n", + "num_class" + ] + }, + { + "cell_type": "markdown", + "id": "804dfa23-be45-47d2-b420-a60ed40fb420", + "metadata": {}, + "source": [ + "The following are some token indices:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "2a502a76-b4af-4681-bb64-dcce35f0d0de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[466, 13077]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vocab([\"age\",\"hello\"])" + ] + }, + { + "cell_type": "markdown", + "id": 
"c7ae8bd3-a8d1-4ebe-b69b-f3da3d9f05a5", + "metadata": {}, + "source": [ + "### Train and validate\n", + "\n", + "The following converts the dataset into map-style datasets and then performs a random split to create separate training and validation datasets. The training dataset will contain 95% of the samples in the original training set, while the validation dataset will contain the remaining 5%. These datasets can be used for training and evaluating a machine learning model for text classification on the IMDB dataset. The final performance of the model will be evaluated on the hold-out test set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a9283486-de17-4b3c-ae75-96c01ce83bb6", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Convert the training and testing iterators to map-style datasets.\n", + "train_dataset = to_map_style_dataset(train_iter)\n", + "test_dataset = to_map_style_dataset(test_iter)\n", + "\n", + "# Determine the number of samples to be used for training and validation (5% for validation).\n", + "num_train = int(len(train_dataset) * 0.95)\n", + "\n", + "# Randomly split the training dataset into training and validation datasets using `random_split`.\n", + "# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n", + "split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])" + ] + }, + { + "cell_type": "markdown", + "id": "8785376c-ec97-4bb9-94f1-8e3ccd0e3aea", + "metadata": {}, + "source": [ + "The following code checks if a CUDA-compatible GPU is available in the system using PyTorch, a popular deep learning framework. If a GPU is available, it assigns the device variable to \"cuda\" (which stands for CUDA, the parallel computing platform and application programming interface model developed by NVIDIA). 
If a GPU is not available, it assigns the device variable to \"cpu\" (which means the code will run on the CPU instead).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "52046e3f-5087-4b63-94cc-e8a7daf96407", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "device(type='cuda')" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "device" + ] + }, + { + "cell_type": "markdown", + "id": "d6ae581f-0cb5-4a81-8e84-e5244ce7eea7", + "metadata": {}, + "source": [ + "### Data loader\n", + "\n", + "In PyTorch, the **`collate_fn`** argument of a data loader customizes the way batches are created from individual samples. The provided code defines a `collate_batch` function for this purpose. It processes a batch of (label, text) pairs: it applies `label_pipeline` to each label and `text_pipeline` to each text, converts the results into PyTorch tensors, and pads the text sequences to a common length with `pad_sequence`. The result is returned as a tuple containing the label tensor and the padded text tensor. 
The function also ensures that the returned tensors are moved to the specified device (e.g., GPU) for efficient computation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "9428567c-19c9-4bf6-9f55-11423d70c8bf", + "metadata": {}, + "outputs": [], + "source": [ + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "def collate_batch(batch):\n", + " label_list, text_list = [], []\n", + " for _label, _text in batch:\n", + " label_list.append(label_pipeline(_label))\n", + " text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))\n", + "\n", + "\n", + " label_list = torch.tensor(label_list, dtype=torch.int64)\n", + " text_list = pad_sequence(text_list, batch_first=True)\n", + "\n", + "\n", + " return label_list.to(device), text_list.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "d59cb2e6-a0a2-43cd-8afd-55484bfed045", + "metadata": {}, + "source": [ + "You convert the dataset objects to data loaders, passing `collate_batch` as the collate function.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "a2020739-e68e-434e-a57a-e7db17162eab", + "metadata": {}, + "outputs": [], + "source": [ + "BATCH_SIZE = 64\n", + "\n", + "train_dataloader = DataLoader(\n", + " split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")\n", + "valid_dataloader = DataLoader(\n", + " split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")\n", + "test_dataloader = DataLoader(\n", + " test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "edb6cdd8-8f81-4530-9b3f-5abd5639cf62", + "metadata": {}, + "source": [ + "Let's check what these data loaders generate:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "e08fc3af-9ffe-4271-9d2b-c9e5507d6d16", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(tensor([0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 
1, 1, 0, 1, 0, 0, 0,\n", + " 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0,\n", + " 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0], device='cuda:0'),\n", + " tensor([[ 39, 16, 31, ..., 0, 0, 0],\n", + " [ 43, 59, 1995, ..., 0, 0, 0],\n", + " [ 2, 139, 5, ..., 0, 0, 0],\n", + " ...,\n", + " [ 2, 4182, 17, ..., 0, 0, 0],\n", + " [ 38, 9, 0, ..., 0, 0, 0],\n", + " [ 194, 4510, 33, ..., 0, 0, 0]], device='cuda:0'))" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "label,seqence=next(iter(valid_dataloader ))\n", + "label,seqence" + ] + }, + { + "cell_type": "markdown", + "id": "a1369331-f011-4df4-9f59-8ef962d7f3dc", + "metadata": {}, + "source": [ + "### Neural network\n", + "\n", + "This code defines a class called `TextClassifier`, a simple text classifier that uses an embedding layer, a hidden linear layer with a ReLU activation, and an output linear layer. The constructor takes the following arguments:\n", + "\n", + "- `num_classes`: The number of classes to classify.\n", + "- `freeze`: Whether to freeze the embedding layer.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "6e1f8a63-28ab-45dd-b1a4-c69664fe57d0", + "metadata": {}, + "outputs": [], + "source": [ + "from torch import nn\n", + "\n", + "class TextClassifier(nn.Module):\n", + " def __init__(self, num_classes,freeze=False):\n", + " super(TextClassifier, self).__init__()\n", + " self.embedding = nn.Embedding.from_pretrained(glove_embedding.vectors.to(device),freeze=freeze)\n", + " # An example of adding additional layers: A linear layer and a ReLU activation\n", + " self.fc1 = nn.Linear(in_features=100, out_features=128)\n", + " self.relu = nn.ReLU()\n", + " # The output layer that gives the final probabilities for the classes\n", + " self.fc2 = nn.Linear(in_features=128, out_features=num_classes)\n", + "\n", + " def forward(self, x):\n", + " # Pass the input through the embedding 
layer\n", + " x = self.embedding(x)\n", + " # Here you can use a simple mean pooling\n", + "\n", + " x = torch.mean(x, dim=1)\n", + " # Pass the pooled embeddings through the additional layers\n", + " x = self.fc1(x)\n", + " x = self.relu(x)\n", + " return self.fc2(x)\n" + ] + }, + { + "cell_type": "markdown", + "id": "67cd83d4-5789-48e6-929d-22287abee1d8", + "metadata": {}, + "source": [ + "## Train the model on the full dataset\n", + "\n", + "The model can then be trained on labeled data from the IMDB dataset with two classes.\n", + "\n", + "First, let's create the model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "3acca5e8-1a69-4589-82c7-46c6c2081dc2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model=TextClassifier(num_classes=2,freeze=True)\n", + "model.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "05531fdc-68fb-4b6e-ad6d-edeadbb81dea", + "metadata": {}, + "source": [ + "The code line `predicted_label=model(seqence)` obtains the model's output logits for the batch of padded token sequences loaded above.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "f96b774b-4435-4864-a264-576a28fed21b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "torch.Size([64, 2])" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.eval()\n", + "predicted_label=model(seqence)" + ] + }, + { + "cell_type": "markdown", + "id": "9f9410bc-bd8c-465f-9fcb-48ba5493ce60", + "metadata": {}, + "source": [ + "The following returns the shape of `predicted_label`. 
Because the data loaders batch 64 inputs, `predicted_label` should have 64 rows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "065e2be0-d613-4287-8227-016209a8b87c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "torch.Size([64, 2])\n" + ] + } + ], + "source": [ + "print(predicted_label.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "57ddc0ea-36d3-4f6f-9cb0-0e491a1844f1", + "metadata": {}, + "source": [ + "For each input, the model outputs two logits corresponding to the two classes in the classification task. If the value of the first logit is greater than the second, the predicted class is class 0, which maps to a negative review. If the second logit is greater than the first, the predicted class is class 1, which maps to a positive review:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "32e6eb56-18fa-4f03-a431-53180c744c57", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "tensor([[0.1659, 0.0446],\n", + " [0.1570, 0.0366],\n", + " [0.1655, 0.0466],\n", + " [0.1657, 0.0450],\n", + " [0.1656, 0.0384],\n", + " [0.1721, 0.0519],\n", + " [0.1721, 0.0525],\n", + " [0.1489, 0.0255],\n", + " [0.1621, 0.0394],\n", + " [0.1737, 0.0522]], device='cuda:0', grad_fn=)" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predicted_label[:10,]" + ] + }, + { + "cell_type": "markdown", + "id": "6eb1d187-d606-4da5-93cb-d79a8a4dd889", + "metadata": {}, + "source": [ + "The following **`predict`** function takes in a text, a model, and a text pipeline as inputs. 
It uses the model passed as a parameter to predict the sentiment label of the text for text classification on the IMDB dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e8f1aefb-3058-4b46-9d9b-98ef5f392dbc", + "metadata": {}, + "outputs": [], + "source": [ + "def predict(text, model, text_pipeline):\n", + " with torch.no_grad():\n", + " text = torch.unsqueeze(torch.tensor(text_pipeline(text)),0).to(device)\n", + "\n", + " output = model(text)\n", + " return imdb_label[output.argmax(1).item()]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "e5d08087-8aba-445d-86e1-5cdacabdcf92", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "' negative review'" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "predict(\"the is a good movie\",model,text_pipeline )" + ] + }, + { + "cell_type": "markdown", + "id": "b88a47f4-5326-487a-86cf-c74d1942fe5c", + "metadata": {}, + "source": [ + "You can create a function to evaluate the model's accuracy, as a percentage, on a dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "99a0391e-5770-4247-8475-3ee78fd8ee13", + "metadata": {}, + "outputs": [], + "source": [ + "def evaluate(dataloader, model, device):\n", + " model.eval()\n", + " correct = 0\n", + " total = 0\n", + " with torch.no_grad():\n", + " for label, text in dataloader:\n", + " label, text = label.to(device), text.to(device)\n", + " outputs = model(text)\n", + " _, predicted = torch.max(outputs.data, 1)\n", + " total += label.size(0)\n", + " correct += (predicted == label).sum().item()\n", + " accuracy = 100 * correct / total\n", + " return accuracy" + ] + }, + { + "cell_type": "markdown", + "id": "5838996b-d0fa-4f4d-ba39-e7dd7ea6fc26", + "metadata": {}, + "source": [ + "The following evaluates the performance of your model on the test set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": 
"eb37f6eb-84ab-4def-b607-391ae3fcb0c1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "50.0" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader , model, device)" + ] + }, + { + "cell_type": "markdown", + "id": "dac15043-ba2b-43ba-94bf-87528af585d5", + "metadata": {}, + "source": [ + "Note that the model currently performs no better than random guessing: 50% accuracy on a two-class task. This outcome is expected, considering that the model has not undergone any training yet.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a9e263a8-6689-4546-ac57-9b19fc60f509", + "metadata": {}, + "source": [ + "## Train the model\n", + "\n", + "The following defines the training function used to train the model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "31f35bc9-ee1d-439d-9ffd-905c821eda38", + "metadata": {}, + "outputs": [], + "source": [ + "def train_model(model, optimizer, criterion, train_dataloader, valid_dataloader, epochs=100, model_name=\"my_modeldrop\"):\n", + " cum_loss_list = []\n", + " acc_epoch = []\n", + " best_acc = 0\n", + " file_name = model_name\n", + " \n", + " for epoch in tqdm(range(1, epochs + 1)):\n", + " model.train()\n", + " cum_loss = 0\n", + " for _, (label, text) in enumerate(train_dataloader): \n", + " optimizer.zero_grad()\n", + " predicted_label = model(text)\n", + " loss = criterion(predicted_label, label)\n", + " loss.backward()\n", + " torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n", + " optimizer.step()\n", + " cum_loss += loss.item()\n", + " #print(\"Loss:\", cum_loss)\n", + " cum_loss_list.append(cum_loss)\n", + " acc_val = evaluate(valid_dataloader, model, device)\n", + " acc_epoch.append(acc_val)\n", + " \n", + " if acc_val > best_acc:\n", + " best_acc = acc_val\n", + " print(f\"New best accuracy: {acc_val:.4f}\")\n", + " torch.save(model.state_dict(), f\"{model_name}.pth\")\n", + " \n", + " 
save_list_to_file(cum_loss_list, f\"{model_name}_loss.pkl\")\n", + " save_list_to_file(acc_epoch, f\"{model_name}_acc.pkl\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9529fe4-f63d-4404-9563-f68d73b1b648", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "3d36ca3a-0c98-481a-9c61-8d3a931b8c1f", + "metadata": {}, + "source": [ + "The following sets the learning rate (LR) to 1, which determines the step size at which the optimizer updates the model's parameters during training. The CrossEntropyLoss criterion is used to calculate the loss between the model's predicted outputs and the ground truth labels. This loss function is commonly employed for multi-class classification tasks.\n", + "\n", + "The chosen optimizer is Stochastic Gradient Descent (SGD), which optimizes the model's parameters based on the computed gradients with respect to the loss function. The SGD optimizer uses the specified learning rate to control the size of the weight updates.\n", + "\n", + "Additionally, a learning rate scheduler is defined using StepLR. This scheduler adjusts the learning rate during training, reducing it by a factor (gamma) of 0.1 after every epoch (step) to improve convergence and fine-tune the model's performance. These components together form the essential setup for training a neural network using the specified learning rate, loss criterion, optimizer, and learning rate scheduler.\n", + "\n", + "For the sake of time efficiency, the number of epochs has been set to 2. This is to give you a practical demonstration of what the training process looks like. However, if you were to train this model in a real-world scenario, you would likely increase the number of epochs to a larger figure, such as 100 or more. 
Given the reduced training set defined earlier, it takes approximately 2 minutes to complete 2 epochs of training:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "ac97cc88-2c4c-4d81-926b-1a8b81e01cc6", + "metadata": {}, + "outputs": [], + "source": [ + "LR=1\n", + "\n", + "criterion = torch.nn.CrossEntropyLoss()\n", + "optimizer = torch.optim.SGD(model.parameters(), lr=LR)\n", + "scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)" + ] + }, + { + "cell_type": "markdown", + "id": "c51c4267-910b-4cb1-86f9-448922630c65", + "metadata": {}, + "source": [ + "You have pretrained the model for 300 epochs using a GPU and saved this model for your convenience. However, to demonstrate the training process, the following code has been included that trains the model for just two epochs. Please note that you have limited the number of epochs to two because training on a CPU can be time-consuming. Even with just two epochs, you can expect the following code to run for approximately one minute.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "a9120faf-0f9a-4c23-83d6-e67ac7474887", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/10 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "cum_loss_list=load_list_from_file(\"model_imdb_freeze_true2_loss.pkl\")\n", + "acc_epoch=load_list_from_file(\"model_imdb_freeze_true2_acc.pkl\")\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "9cc31ff0-ee7c-4f88-8005-01de11c81ff0", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "%%capture \n", + "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ZvhVWJU0flC7BmU1jjYxjg/model-imdb-freeze-true2.pth\n", + "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/2RdN-JG4Rm5Gx3UNtOP4NA/model-imdb-freeze-true2-acc.pkl\n", + "!wget 
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/8qoGvWk0BdXRGoFAOT-dAw/model-imdb-freeze-true2-loss.pkl\n" + ] + }, + { + "cell_type": "markdown", + "id": "8495d949-6aec-4330-bf71-0ce599265cbd", + "metadata": {}, + "source": [ + "Let's plot the cost and accuracy for each epoch for the pretrained model that was trained for 300 epochs. From the plot, it becomes evident that with just a few epochs, the accuracy exhibits significant volatility.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "8c51d7c9-8125-47d4-a43d-371b9ea9d9b4", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "cum_loss_list=load_list_from_file(model_name.replace('_','-') + \"-loss.pkl\")\n", + "acc_epoch=load_list_from_file(model_name.replace('_','-') + \"-acc.pkl\")\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "8779d43e-2ae8-4f27-b90a-d436904f4d9e", + "metadata": {}, + "source": [ + "Here, you load the model that has been trained for you. Please comment out these lines if you want to train the model yourself.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "cafb242d-1968-48f4-973f-b1894c138e8e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.load_state_dict(torch.load(model_name.replace('_','-') + \".pth\", map_location=device))\n", + "model.eval()" + ] + }, + { + "cell_type": "markdown", + "id": "0f52df3f-cac4-48ef-abec-7317fcc02dec", + "metadata": {}, + "source": [ + "The following evaluates the model on the test data. 
The pretrained model achieves an accuracy of about 66%.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "a80d646f-83df-4a88-b498-b3726935559c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "66.216" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader , model, device)" + ] + }, + { + "cell_type": "markdown", + "id": "c8cc8cd8-04f9-4089-be92-91927d136ce4", + "metadata": {}, + "source": [ + "## Low-Rank Adaptation (LoRA)\n", + "\n", + "PyTorch and the Hugging Face libraries provide robust tools for applying LoRA to models, but these tools are not intuitive. In this section, you build a LoRA (Low-Rank Adaptation) implementation from scratch using PyTorch. LoRA is a general method, but it is most commonly applied to attention layers. For the sake of simplicity, in this lab, you apply it to a vanilla neural network, because accessing the attention parameters in the PyTorch encoder module can be challenging.\n", + "\n", + "### LoRA\n", + "1) For any arbitrary layer of a network, you have a pretrained weight matrix $ W_0 \\in \\mathbb{R}^{m \\times n} $. If you only consider the attention parameters, there are at a minimum $ 4 \\times m \\times n $ parameters for each layer. For the largest models, the total can reach billions or even trillions of learnable parameters. Each time you fine-tune on a new dataset, you have to store another full copy of those parameters.\n", + "\n", + "2) $ \\Delta W $ is the product of two matrices $ B $ and $ A $, which are constrained such that $ B \\in \\mathbb{R}^{m \\times r} $, $ A \\in \\mathbb{R}^{r \\times n} $, and $ r \\leq \\min(m, n) $. 
The total number of parameters in $ A $ and $ B $ is much smaller than in $ W_1 $, so they are much easier to store.\n", + "\n", + "$ W_1\\approx W_0 + \\Delta W = W_0 + BA $\n", + "\n", + "\n", + "\n", + "3) To train and predict, the forward pass holds $W_0$ constant.\n", + "\n", + "$h = W_0x + \\Delta Wx = W_0x + BAx $\n", + "\n", + "\n", + "You scale $\\Delta W$ by $\\dfrac{\\alpha'}{r}$, where $\\alpha'$ is a constant that does not depend on $ r $. Adjusting $\\alpha'$ is similar to tuning the learning rate if the initialization is properly scaled. Therefore, you set $\\alpha'$ to the first $ r $ you try and do not tune it further. This scaling reduces the need to retune hyperparameters when you vary $ r $. Writing $\\alpha = \\dfrac{\\alpha'}{r}$, the final form is:\n", + "\n", + "$h = W_0x + \\dfrac{\\alpha'}{r} BAx= W_0x + \\alpha BAx $\n", + "\n", + "The following example illustrates the process for $ r = 1 $, where $ B $ is a single column and $ A $ is a single row.\n", + "\n", + "\n", + "$\n", + "W_0 + BA = \n", + "\\begin{bmatrix}\n", + "w_{11} & w_{12} & w_{13} & w_{14} \\\\\n", + "w_{21} & w_{22} & w_{23} & w_{24} \\\\\n", + "w_{31} & w_{32} & w_{33} & w_{34} \\\\\n", + "w_{41} & w_{42} & w_{43} & w_{44} \\\\\n", + "\\end{bmatrix} +\n", + "\\begin{bmatrix}\n", + "b_1 \\\\\n", + "b_2 \\\\\n", + "b_3 \\\\\n", + "b_4 \\\\\n", + "\\end{bmatrix}\n", + "\\begin{bmatrix}\n", + "a_1 & a_2 & a_3 & a_4 \\\\\n", + "\\end{bmatrix}\n", + "$\n", + "\n", + "This illustrates the product of matrices $ B $ and $ A $, denoted as $ BA $, which can be added to $ W_0 $. However, the update $ BA $ is limited: the matrices it can represent depend on the dimensions of $ B $ and $ A $. This limitation is due to the concept of rank.\n", + "\n", + "### Rank\n", + "The rank of a matrix is the dimension of the space spanned by its rows (equivalently, by its columns), that is, the number of dimensions the rows of the matrix \"live in.\" A square matrix is said to be **full rank** if its rank is equal to the number of its rows or columns. 
Let's make this idea more intuitive with an example.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "f5054a80-14ad-4f3a-86f6-4b2475dfd125", + "metadata": {}, + "outputs": [], + "source": [ + "from sympy import Matrix, init_printing,Symbol\n", + "from numpy.linalg import qr,eig,inv,matrix_rank,inv, norm\n", + "from scipy.linalg import null_space\n", + "from sympy import Matrix, init_printing,Symbol\n", + "init_printing()" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "0fa127f4-d7bd-44f2-a22b-5e4723f1ec42", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_matrix_and_subspace(F):\n", + " assert F.shape[0] == 3, \"Matrix F must have rows equal to 3 for 3D visualization.\"\n", + " \n", + " ax = plt.figure().add_subplot(projection='3d')\n", + " \n", + " # Plot each column vector of F as a point and line from the origin\n", + " for i in range(F.shape[1]):\n", + " ax.quiver(0, 0, 0, F[0, i], F[1, i], F[2, i], color='blue', arrow_length_ratio=0.1, label=f'Column {i+1}')\n", + "\n", + " if F.shape[1] == 2:\n", + " # Calculate the normal to the plane spanned by the columns of F if they are exactly two\n", + " normal_vector = np.cross(F[:, 0], F[:, 1])\n", + " # Plot the plane\n", + " xx, yy = np.meshgrid(np.linspace(-3, 3, 10), np.linspace(-3, 3, 10))\n", + " zz = (-normal_vector[0] * xx - normal_vector[1] * yy) / normal_vector[2] if normal_vector[2] != 0 else 0\n", + " ax.plot_surface(xx, yy, zz, alpha=0.5, color='green', label='Spanned Plane')\n", + "\n", + " # Set plot limits and labels\n", + " ax.set_xlim([-3, 3])\n", + " ax.set_ylim([-3, 3])\n", + " ax.set_zlim([-3, 3])\n", + " ax.set_xlabel('$x_{1}$')\n", + " ax.set_ylabel('$x_{2}$')\n", + " ax.set_zlabel('$x_{3}$')\n", + " #ax.legend()\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "0718ea57-d8af-4bec-9608-d425656912dd", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_matrix_and_subspace(F):\n", + 
" assert F.shape[0] == 3, \"Matrix F must have 3 rows to represent 3D space.\"\n", + "\n", + " ax = plt.figure().add_subplot(projection='3d')\n", + " \n", + " # Plot each column vector of F\n", + " for i in range(F.shape[1]):\n", + " ax.quiver(0, 0, 0, F[0, i], F[1, i], F[2, i], color='blue', arrow_length_ratio=0.1, label=f'Column {i+1}')\n", + "\n", + " # Calculate the null space of the transpose of F\n", + " normal_vector = null_space(F.T)\n", + " \n", + " # Check that the null space is 1-dimensional\n", + " if normal_vector.shape[1] == 1:\n", + " normal_vector = normal_vector[:, 0] # Simplify the array to 1D\n", + " # Create a meshgrid for the plane\n", + " xx, yy = np.meshgrid(np.linspace(-3, 3, 10), np.linspace(-3, 3, 10))\n", + " # Calculate corresponding z coordinates based on the plane equation ax + by + cz = 0\n", + " zz = (-normal_vector[0] * xx - normal_vector[1] * yy) / normal_vector[2] if normal_vector[2] != 0 else 0\n", + " ax.plot_surface(xx, yy, zz, alpha=0.5, color='green', label='Spanned Plane')\n", + " else:\n", + " print(\"The null space is not 1-dimensional, so a unique plane cannot be determined.\")\n", + "\n", + " # Set plot limits and labels\n", + " ax.set_xlim([-3, 3])\n", + " ax.set_ylim([-3, 3])\n", + " ax.set_zlim([-3, 3])\n", + " ax.set_xlabel('X axis')\n", + " ax.set_ylabel('Y axis')\n", + " ax.set_zlabel('Z axis')\n", + " #ax.legend()\n", + "\n", + " plt.show()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "82792cf9-222a-4d02-8b3b-cf4b124b0dc5", + "metadata": {}, + "source": [ + "In the context of Low-Rank Adaptation (LoRA), where $B \\in \\mathbb{R}^{d \\times r}$, the matrix $B$:\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "10fb63a2-c601-4911-a5ad-d495cb935e48", + "metadata": {}, + "outputs": [ + { + "data": { + "text/latex": [ + "$\\displaystyle \\left[\\begin{matrix}1 & 0\\\\0 & 1\\\\0 & 0\\end{matrix}\\right]$" + ], + "text/plain": [ + "⎡1 0⎤\n", + "⎢ ⎥\n", + "⎢0 
1⎥\n", + "⎢ ⎥\n", + "⎣0 0⎦" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "B=torch.tensor([[1,0],[0,1],[0,0]]).numpy()\n", + "\n", + "Matrix(B)\n" + ] + }, + { + "cell_type": "markdown", + "id": "6d7e2826-6a0d-4244-9e48-f1803873cc8c", + "metadata": {}, + "source": [ + "This $3 \\times 2$ matrix has columns that span a 2-dimensional subspace in $\\mathbb{R}^3$. Specifically, the columns of $B$ are:\n", + "\n", + "- $\\mathbf{b}_1 = \\begin{bmatrix} 1 \\\\ 0 \\\\ 0 \\end{bmatrix}$\n", + "- $\\mathbf{b}_2 = \\begin{bmatrix} 0 \\\\ 1 \\\\ 0 \\end{bmatrix}$\n", + "\n", + "These columns are standard basis vectors for the $xy$-plane in $\\mathbb{R}^3$, and, thus, they span the $xy$-plane shown in green in the following image. Any scalar multiple of a column vector (shown in blue), and any sum of such multiples, always falls in the plane.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "3c3e3709-0681-44c8-9301-36f3b52401f7", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plot_matrix_and_subspace(B)" + ] + }, + { + "cell_type": "markdown", + "id": "cf941efb-add8-470b-862f-14dd911cc860", + "metadata": {}, + "source": [ + "In this scenario, the vectors, despite each having three components, can reach any point on the two-dimensional green plane depicted in the image. These vectors span the green plane, which is a two-dimensional subspace of $\\mathbb{R}^3$. The dimension of this subspace is the rank of the matrix, which here is two, matching the dimensionality of the plane. If the rank were three, any point in the 3D space could be reached by some combination of the columns of $B$. The rank of a matrix can be determined by using the `matrix_rank` function provided by NumPy.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "894b2da5-9f86-4336-ab4f-b3a3f158b978", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "matrix_rank(B)" + ] + }, + { + "cell_type": "markdown", + "id": "62c90995-d697-4bf4-96aa-0c60d0b42698", + "metadata": {}, + "source": [ + "Here, you plot a different matrix whose columns span a different plane, but the rank remains two.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "c92466da-2d0e-485d-abda-5c7ed31cc4a2", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "rank of B 2\n" + ] + } + ], + "source": [ + "B_=torch.tensor([[1,0],[-2,1],[0,1]]).numpy()\n", + "plot_matrix_and_subspace(B_)\n", + "print(\"rank of B\",matrix_rank(B_))" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "17a1bce9-b00c-4516-a17c-94c91b5fa9e6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "matrix_rank" + ] + }, + { + "cell_type": "markdown", + "id": "e7205fe3-ee86-4bdf-8ff8-a80974009792", + "metadata": {}, + "source": [ + "Here, you present the matrix ```A```. The rank of this matrix is also two.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "16fcbaeb-94b8-4aee-af84-f4457cb31df1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/latex": [ + "$\\displaystyle \\left[\\begin{matrix}1 & 1 & -1 & 1 & 0\\\\-2 & 2 & 2 & 0 & 1\\end{matrix}\\right]$" + ], + "text/plain": [ + "⎡1 1 -1 1 0⎤\n", + "⎢ ⎥\n", + "⎣-2 2 2 0 1⎦" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "A=torch.tensor([[1,1,-1,1,0],[-2,2,2,0,1]]).numpy()\n", + "Matrix(A)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "b092ae8f-57aa-48ba-b68a-a9f271fe57ce", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "matrix_rank(A)" + ] + }, + { + "cell_type": "markdown", + "id": "6bea05a6-cb00-45e9-b99f-9e304da883a2", + "metadata": {}, + "source": [ + "For the matrices $ C = BA $, if $B $ and $ A $ both have a rank of $ r $:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "b9afc203-763c-4809-b3b7-03890312fe3d", + "metadata": {}, + "outputs": [ 
+ { + "data": { + "text/latex": [ + "$\\displaystyle \\left[\\begin{matrix}1 & 1 & -1 & 1 & 0\\\\-2 & 2 & 2 & 0 & 1\\\\0 & 0 & 0 & 0 & 0\\end{matrix}\\right]$" + ], + "text/plain": [ + "⎡1 1 -1 1 0⎤\n", + "⎢ ⎥\n", + "⎢-2 2 2 0 1⎥\n", + "⎢ ⎥\n", + "⎣0 0 0 0 0⎦" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "C=B@A\n", + "Matrix(C)\n" + ] + }, + { + "cell_type": "markdown", + "id": "5874ff64-dd94-45fe-b54e-585ff8d386cf", + "metadata": {}, + "source": [ + "The matrix $ C $ will also have rank $ r $, the same rank as $ B $. Furthermore, the span of the columns of $ C $ will be the same as the span of the columns of $ B $.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "ed566ee6-7f64-42db-8356-70e675c8e5e8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "rank of C 2\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "print(\"rank of C\",matrix_rank(C))\n", + "plot_matrix_and_subspace(C)" + ] + }, + { + "cell_type": "markdown", + "id": "32c5878b-07e6-4b9e-9048-f9a3f7d7e062", + "metadata": {}, + "source": [ + "### Understanding LoRA in PyTorch\n", + "\n", + "LoRA (Low-Rank Adaptation) is relatively simple to initialize in PyTorch. You initialize LoRA with the dimension of the input (`in_dim`), $ n $, the dimension of the output (`out_dim`), $ m $, a rank (`rank`), $ r $, and a scaling factor `alpha`. The parameters are initialized as follows:\n", + "\n", + "```\n", + "self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)\n", + "self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))\n", + "```\n", + "\n", + "The use of ```nn.Parameter``` makes these values learnable parameters.\n", + "\n", + "In the forward function, LoRA uses the notation $BAx$. In PyTorch, the input vector is a row, so the output becomes $x^TA^TB^T$; the transposes are dropped from now on, which is why `self.A` has shape `(in_dim, rank)` and `self.B` has shape `(rank, out_dim)`. 
The forward pass is implemented as:\n", + "```\n", + "x = self.alpha * (x @ self.A @ self.B)\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "32e2c20a-b272-4bf7-8985-296d2facc08c", + "metadata": {}, + "outputs": [], + "source": [ + "class LoRALayer(torch.nn.Module):\n", + " def __init__(self, in_dim, out_dim, rank, alpha):\n", + " super().__init__()\n", + " std_dev = 1 / torch.sqrt(torch.tensor(rank).float())\n", + " self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)\n", + " \n", + " self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))\n", + " self.alpha = alpha\n", + "\n", + " def forward(self, x):\n", + " x = self.alpha * (x @ self.A @ self.B)\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "id": "cccba2e5-91d6-41a4-af68-2b8404b587e9", + "metadata": {}, + "source": [ + "This class ```LinearWithLoRA``` wraps the original linear layer and creates a ```LoRALayer``` object. \n", + "\n", + "```python\n", + "self.linear = linear.to(device)\n", + " self.lora = LoRALayer(\n", + " linear.in_features, linear.out_features, rank, alpha\n", + " ).to(device)\n", + "```\n", + "\n", + "Then, the forward method applies both the original linear layer and the LoRA layer to the input `x` and adds the results together: ```self.linear(x) + self.lora(x)```. 
This corresponds to:\n", + "\n", + " $xW_0 + \\alpha xAB $\n" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "19dd8b64-7bb2-44eb-bfd2-7c788b78605f", + "metadata": {}, + "outputs": [], + "source": [ + "class LinearWithLoRA(torch.nn.Module):\n", + " def __init__(self, linear, rank, alpha):\n", + " super().__init__()\n", + " self.linear = linear.to(device)\n", + " self.lora = LoRALayer(\n", + " linear.in_features, linear.out_features, rank, alpha\n", + " ).to(device)\n", + "\n", + " def forward(self, x):\n", + " \n", + " return self.linear(x) + self.lora(x)" + ] + }, + { + "cell_type": "markdown", + "id": "6a5003e5-c77e-475a-8c75-11fbfea325ee", + "metadata": {}, + "source": [ + "### Applying LoRA\n", + "To fine-tune with LoRA, first create a TextClassifier model, load its pretrained state from a file, and then disable gradient updates for all of its parameters so that the pretrained weights stay fixed. Here, you will load a model that was pretrained on the AG News dataset, which has 4 classes. Note that when you initialize this model, you set `num_classes` to 4. Moreover, the pretrained AG News model was trained with the embedding layer unfrozen. Hence, you will initialize the model with `freeze=False`. 
Although you are initializing the model with layers unfrozen and the wrong number of classes for your task, you will make modifications to the model later on that correct this:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "5908508a-a521-4a39-8b41-ac8157ea8bcd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=4, bias=True)\n", + ")" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from urllib.request import urlopen\n", + "import io\n", + "\n", + "model_lora=TextClassifier(num_classes=4,freeze=False)\n", + "model_lora.to(device)\n", + "\n", + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/uGC04Pom651hQs1XrZ0NsQ/my-model-freeze-false.pth')\n", + "\n", + "stream = io.BytesIO(urlopened.read())\n", + "state_dict = torch.load(stream, map_location=device)\n", + "model_lora.load_state_dict(state_dict)\n", + "\n", + "# Here, you freeze all layers:\n", + "for parm in model_lora.parameters():\n", + " parm.requires_grad=False\n", + "model_lora" + ] + }, + { + "cell_type": "markdown", + "id": "08738e26-8def-49e8-9d79-4925651a3162", + "metadata": {}, + "source": [ + "Note that the `for` loop in the above code froze all of the layers in the neural network, including the embedding layer.\n", + "\n", + "Additionally, note that the original model was trained on a classification problem that had four classes, while the IMDB dataset has just 2 classes. 
To account for this, you will replace the final layer with a new linear layer where the number of outputs equals 2:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "8d6e2366-2c81-46c4-ba57-e27f446c4ebf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.fc2=nn.Linear(in_features=128, out_features=2, bias=True).to(device)\n", + "model_lora" + ] + }, + { + "cell_type": "markdown", + "id": "afd1344f-e3dc-4968-b45b-4c255f854224", + "metadata": {}, + "source": [ + "Let's view all of the modules in the object.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "202eebcf-d0dc-4d90-829b-3754d7658099", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora" + ] + }, + { + "cell_type": "markdown", + "id": "0229bf1d-195f-4d18-a99d-3080266db335", + "metadata": {}, + "source": [ + "Your task now is to replace the hidden layer with a LoRA layer. 
You can access the hidden layer as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "9e1a22b0-cca2-49c5-9bb7-5f7349db0cec", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Linear(in_features=100, out_features=128, bias=True)" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.fc1" + ] + }, + { + "cell_type": "markdown", + "id": "c7b71d41-bb47-4674-b989-727b273e48c8", + "metadata": {}, + "source": [ + "The following replaces this layer with a LoRA layer:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "3c7137ce-82c0-45c8-b337-8c0070e2e0ac", + "metadata": {}, + "outputs": [], + "source": [ + "model_lora.fc1=LinearWithLoRA(model_lora.fc1,rank=2, alpha=0.1).to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "62ad6238-94b8-4980-90e5-a5c2487c2e92", + "metadata": {}, + "source": [ + "Let's look at the hidden layer again to ensure that it is indeed converted to a LoRA layer.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "16da330f-38bc-4118-9fcb-f9b88abe4b59", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + ")" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.fc1" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "85e9a340-c580-4512-b345-b9710d6a5eca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 75, + "metadata": 
{}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora" + ] + }, + { + "cell_type": "markdown", + "id": "1172a78a-3d2f-4cd9-8ff0-af92088a6ba2", + "metadata": {}, + "source": [ + "At this point, training the model works the same way as before. The only difference is that the only parameters updated are those of the new output layer `fc2` and the LoRA matrices \n", + "```A``` and ```B```. The code used to select the values for `r` and `alpha` is not run here, but it is provided below for your convenience.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "aa919697-b5f0-4c5a-9e26-de86ee43bd2e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 76, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "14beb4cc-f276-4955-ab67-b5902123729f", + "metadata": {}, + "source": [ + "
\n", + "Click here to see code to select r and alpha\n", + " \n", + "```python \n", + "ranks = [1, 2, 5, 10]\n", + "alphas = [0.1, 0.5, 1.0, 2.0, 5.0]\n", + "\n", + "results=[]\n", + "accuracy_old=0\n", + "# Loop over each combination of 'r' and 'alpha'\n", + "for r in ranks:\n", + " for alpha in alphas:\n", + " print(f\"Testing with rank = {r} and alpha = {alpha}\")\n", + " model_name=f\"model_lora_rank{r}_alpha{alpha}_AGtoIBDM_final_adam_\"\n", + " \n", + " model_lora=TextClassifier(num_classes=4,freeze=False)\n", + " model_lora.to(device)\n", + " \n", + " urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/uGC04Pom651hQs1XrZ0NsQ/my-model-freeze-false.pth')\n", + " \n", + " stream = io.BytesIO(urlopened.read())\n", + " state_dict = torch.load(stream, map_location=device)\n", + " model_lora.load_state_dict(state_dict)\n", + " \n", + " for parm in model_lora.parameters():\n", + " parm.requires_grad=False\n", + " \n", + " model_lora.fc2=nn.Linear(in_features=128, out_features=2, bias=True)\n", + " model_lora.fc1=LinearWithLoRA(model_lora.fc1,rank=r, alpha=alpha )\n", + " optimizer = torch.optim.Adam(model_lora.parameters(), lr=LR)\n", + "\n", + " scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)\n", + " \n", + " model_lora.to(device)\n", + " \n", + " train_model(model_lora, optimizer, criterion, train_dataloader, valid_dataloader, epochs=300, model_name=model_name)\n", + " \n", + " accuracy=evaluate(valid_dataloader , model_lora, device)\n", + " result = {\n", + " 'rank': r,\n", + " 'alpha': alpha,\n", + " 'accuracy':accuracy\n", + " }\n", + "\n", + " # Append the dictionary to the results list\n", + " results.append(result)\n", + "\n", + " if accuracy>accuracy_old:\n", + " print(f\"Testing with rank = {r} and alpha = {alpha}\")\n", + " print(f\"accuracy: {accuracy} accuracy_old: {accuracy_old}\" )\n", + " accuracy_old=accuracy\n", + " torch.save(model_lora.state_dict(), f\"{model_name}.pth\")\n", + " 
save_list_to_file(cum_loss_list, f\"{model_name}_loss.pkl\")\n", + " save_list_to_file(acc_epoch, f\"{model_name}_acc.pkl\")\n", + " \n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "dbc3dc3e-70c2-4ae3-b3b7-7359cf42dd6b", + "metadata": {}, + "source": [ + "\n", + "Let's set up the training components for `model_lora`: a learning rate of 1, cross-entropy loss as the criterion, stochastic gradient descent (SGD) as the optimizer, and a scheduler that decays the learning rate by a factor of 0.1 at each epoch:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "a6588ce7-93ff-4658-8fde-b15e1a9dea06", + "metadata": {}, + "outputs": [], + "source": [ + "LR=1\n", + "criterion = torch.nn.CrossEntropyLoss()\n", + "optimizer = torch.optim.SGD(model_lora.parameters(), lr=LR)\n", + "scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)\n" + ] + }, + { + "cell_type": "markdown", + "id": "4e604097-52b3-4ac5-ace0-b749af99b5c6", + "metadata": {}, + "source": [ + "A model has already been trained with an identical procedure for 300 epochs and saved for your convenience. 
However, to give you a taste of how training works in practice, run the following code to train the model for just 2 epochs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "1285bd07-be04-4502-b844-141fae08d4b4", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/10 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "cum_loss_list=load_list_from_file(\"model_lora_final2_loss.pkl\")\n", + "acc_epoch=load_list_from_file(\"model_lora_final2_acc.pkl\")\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "54c3cb54-ff64-46ee-be40-dbad7383eed1", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAEYAAAAQCAYAAACr+QluAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/xnp5ZAAAACXBIWXMAABJ0AAASdAHeZh94AAADhklEQVR4nO3Xa6hVVRAH8N/NqxlFTyIjMs0gJKPHh4qil2YRYWAhRWgWFPRBRDIoDBsnqPyQlBBUYigZ1IeUIrQiUrIgCIKosIdUJj3tZWqZmt4+rH1k3+M55557kwjqD5t11qxZ/z0zZ2bW2j19fX3+x4HorU8ycxNOaaP7fUSM6kSWmdOxopreHhFLh2pYN1yZ2YPbqucM9OAjLMWSiNhX0z0OU3ENzsRJ2I0PsAzL6vr9AlPhVzzaQr5jAEdOxmOV3hGddAfCILiewU3YgmfxOybjcVyIm2u60yr5t1iHzTgB1ymBvDozp0VEH60DszUiFgzSkR4l6j9hFe4azP6hcGXmVCUoX+C8iPixko/ASszIzBciYlW15VNci9VNmTQP7+B6JUgr4ZChOtCE2ZiIW/HbP8Q1tRoXNYICEbEb86vprJp8bUS8VA9KJf8OT1TTyxryVhlzaFXfoyvD3sf6iNjbyrrMHI+FWBwR6zNzYgdnOmKQXI1+93mLtYbs4swcUQWrE/ZU458NQauMGaU0vQeUXrMWGzPz0mbFzOytdDdj3gAv74ghcDWyZGyLtVOrsbf2u9N7G73olYa8OTDLMEkJzuFK934SY/ByZp7VpH8fzsEtEbGzkwFdYLBcq6vxzsw8tiHMzOHImt4xA/AsxASsiYhXG8J+pRQR2bTpQ9yRmTswFwtUtZ2Z5yv/7KKIeLsLR9piiFzPYQauwobMfBF/4AqcqGTeaOxrR5CZsxW/Pq649qPb5ttoTpdUhL14Wun089tt6gZD5ap63hTcgx8ws3o2Kkf19kp1S5v3zsJibMDlEfFzfb2nm5tvZh6FrdgVESMz82j80qUPiyNiTgfug8ZV4xyp3Me2RcTxLdbn4BGlIiZFxAHBa3UqtcIF1djo9rvwVBvdc5Ve8RY+wUClcTC5GrgRI5RLXz9k5t1KX3kPk+tHfR37M6Y6KjdHRL+7Q2aOwWs4DfdGxIOdLMrMBQjtr/HjMByfRcSe5vVBch0ZEduaZGdX9g7DhIj4prY2H/fjXVzZXD511
DPmBszNzPX4UqnRccq3xUiswcOdHOkSryvfY2Ox6W9yvZaZO5WS2I7xir07MaUpKDOVoOzFm5id2XzW2BQRy+kfmHU4XUndi5TjequSxiuwovEd8S/C80rZTMdh+BpL8FBEfNWk27jvDMOcNnxvYDldNt//Iv4CzKFk7EtLSAoAAAAASUVORK5CYII=", + "text/latex": [ + "$\\displaystyle 54.492$" + ], + "text/plain": [ + "54.492" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader , model_lora, device)" + ] + }, + { + "cell_type": "markdown", + "id": "bb5c0973-b6cc-4024-9a2b-f9c406fb2503", + "metadata": {}, + "source": [ + "Instead of evaluating the model you just trained for 2 epochs, lets have a look at the LoRA model pretrained on 300 epochs:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "ccd15d85-9762-4bec-b5d5-665fe235b832", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "%%capture \n", + "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/JWPRb1RMhKLRMUWOKw9pxA/model-lora-final2.pth\n", + "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/_dm02rLyTrwsXEQh2r32sQ/model-lora-final2-acc.pkl\n", + "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/OZbVqKjoqOSIwnET8AB1KA/model-lora-final2-loss.pkl\n" + ] + }, + { + "cell_type": "markdown", + "id": "69dca0d8-c197-4d15-a9b2-a45265066a20", + "metadata": {}, + "source": [ + "The following shows the progression of the training of this model for 300 epochs:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "f7b34b56-6f20-44cf-b0fa-6b1bd88d8985", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "cum_loss_list=load_list_from_file(model_name.replace('_','-') + \"-loss.pkl\")\n", + "acc_epoch=load_list_from_file(model_name.replace('_','-') + \"-acc.pkl\")\n", + "plot(cum_loss_list,acc_epoch)" + ] + }, + { + "cell_type": "markdown", + "id": "1afaf400-babe-4c17-99d7-9ee167d19174", + "metadata": {}, + "source": [ + "Let's load actually load the model into model_lora:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "6fa85d86-2966-4d8f-b7fc-b59551f4d02e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.load_state_dict(torch.load(model_name.replace('_','-') + \".pth\", map_location=device))\n", + "model_lora.eval()" + ] + }, + { + "cell_type": "markdown", + "id": "c85bdd49-492a-4cc2-a22f-2dccec3cc8ec", + "metadata": {}, + "source": [ + "And, let's evaluate its performance on the test data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "db4a670a-958b-41b9-a0fb-168902a1d706", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAEYAAAAQCAYAAACr+QluAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/xnp5ZAAAACXBIWXMAABJ0AAASdAHeZh94AAAEG0lEQVR4nO3Ye8jfcxQH8NfDY0gUcku5JnemRUa2aS4Jy7BIQwopmnsknB0l91sps9QmRDG5ZIxYLqVWmtDMJds09wcP5tLYHn98Pr/t+/ye3/P0e2T5x6lv5/s9n/P5nHPe3/M5n/P99gwMDPifhlJvJ2FmTsYlGI+t8T3ex30RMb+h14Pz67U/evAhHsLsiFjbrSOZeTomYiwOxpZ4LCKmjzBnOXYdZvibiNixTX9bTMWJOBA7Y3WNbQ7mtHweAkxm3o6rsRLPoQ/bYRwmYX5D/VGchW/xOH7DsXgAR+Cc4YLqQNcrgKyqtvfpct5PuLeDfFUH2bTq21dYiM+xA05VXuYJmTktIgYGAZOZFyigPIwLI2J12/gmjfupCijLcFhE9FX5GMzD2Zn5TEQ83WWAlyuAfKpkzsIu5/VHxMwudT/GFLzQzObMvA6LcJoC0rzexuCmuFlBcQgoEBF/Nh6nVn5XC5Sqszozb8BJynbsCpiIWAdEZnYzZdQUEa8NI/86M2cp8U/SBEbZAtspabk2M0/EAfgDiyLi7bb1Wvv3sw62WrKjMnNMJ5D/Rdo0M6djF/yK9/BGRKwZ5Tqtl/4Xg2vMoZX/gcUKKOsoM9/A6RHxXRW1smT3Dkb2qLy33i8dpZOjoR3xSJtsWWaeFxGvd7NAZvZaXw9fgo0a49tXfjUGcJRyMhyElzEBTzb0X6j8iszcpmFkEzT3wtbdOPcPaQ4mK+BsoZw0D2I3vJiZB3e5zq1KIsyPiAUMzpgWSH9hSkQsr8/v10L7ESZm5vi6rZ7A2TgeSzLzWSXbjsFOSq3aBV0f2aOliGgvRh/gosxchSsx0/pa2JEyc0bVXarEg8EZ01/54gYoLQd+w4L6eFiVrcHJuBbf4dx6faIc1b9U/W9HDm+D0KzKJ4yklJmX4D4swdER8UNrrJkxH1XeP8w6P1a+eUtQT6nb6tU0uBn2Ql9ELBsxhA1DrTq4xXAKmXkZ7lGybHJEDHqBzYx5Vakt+2VmU96iVjHuJtAzMUZp+v4LOrzyTiemzLxGAeVdJVOGZPU6ACJiBZ5X6sKlbQsdp9SSfrVqV/lWHYyOxR1Kht3aYXzPzNyn2Sz+E8rMfTNzSEZk5m64vz4+2mH8hurXO0qm9LXrMPST4GIcgrtrH7NYOY5PwRqcHxE/NfRfyczflXT8Bfsq3yG/4+SI+LKDzVeV75vdsbzh8CnVDut7pPGZObfe90XEVY11zsCVtY1YUe3vWe1vpny63Nk0nJnn4qYay5uY0aGZXB4RcwcBExErM3McblRa5wn4WcmkWyJiUdsiTynbZrpSe77A7Kq7sgMoI9FYpXg3aQ/re6IVaAKzEHsrL/JIpZ704y2lr3kkItp/HbR6ro1x2TB+vI65Pf//duhMfwOaGG062kNIrAAAAABJRU5ErkJggg==", + "text/latex": [ + "$\\displaystyle 69.152$" + ], + "text/plain": [ + "69.152" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader , model_lora, device)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8eacb773-a876-4575-a595-3a7ef2a99f4c", + "metadata": {}, + "source": [ + "You get a 3% improvement over 
a model trained from scratch by using LoRA. Note that this occurs even though the model fine-tuned with LoRA updated fewer parameters than the model trained from scratch!\n", + "\n", + "The ```model_lora.fc1``` attribute is a ```LinearWithLoRA``` module, which contains both the standard ```Linear``` layer ```(linear)``` and an additional LoRA layer ```(lora)```, an instance of ```LoRALayer```.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "2684b92c-c477-4aac-ad75-35ba2def8dbc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + ")" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_lora.fc1" + ] + }, + { + "cell_type": "markdown", + "id": "01f51868-4cf9-4e04-b971-c94435b9e6b0", + "metadata": {}, + "source": [ + "From ```model_lora.fc1.lora```, you can obtain the learnable parameters A and B. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "8fa84b7f-7a98-4d0a-9a13-53aa8de6a081", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "B Parameter containing:\n", + "tensor([[-4.3192e-01, -1.1071e+00, 2.4456e-01, -3.1031e-02, -2.0012e-01,\n", + " -6.7811e-01, -1.3552e-01, -2.7458e-01, 3.2278e-02, 6.7592e-02,\n", + " 8.3020e-01, 1.1610e-05, -1.0894e-01, 7.7830e-05, -1.6789e-01,\n", + " -1.3309e-01, -5.1875e-01, 2.1928e-02, -6.5869e-02, -3.5834e-01,\n", + " -2.4473e-02, -1.1260e+00, -8.8752e-02, -7.0861e-03, -1.3263e-02,\n", + " 0.0000e+00, -6.9039e-01, -8.6471e-02, -3.9146e-01, -2.2644e-01,\n", + " -8.7611e-01, -7.9929e-01, 0.0000e+00, 3.9646e-01, 5.2164e-01,\n", + " -4.2730e-01, 2.3550e-01, 4.0447e-02, 2.3289e-01, -4.5217e-01,\n", + " 1.7721e-03, -4.7263e-01, -2.4343e-01, 6.3737e-01, 0.0000e+00,\n", + " 2.6904e-03, -7.8828e-01, 2.2559e-02, -4.3776e-02, 3.0909e-01,\n", + " -1.6914e-01, -2.0294e-01, -4.2175e-01, 7.8840e-01, -3.1771e-01,\n", + " -2.0639e-01, 1.1487e-02, -5.7238e-01, 1.4071e-01, -2.8561e-01,\n", + " 1.1753e-01, -1.6501e-04, -4.5406e-01, 0.0000e+00, 6.4464e-05,\n", + " -4.4552e-01, -1.4372e-01, 5.2899e-02, -9.7813e-01, -4.5834e-01,\n", + " -6.0968e-01, 5.5418e-02, -7.8054e-01, -1.2806e-01, -1.2291e-01,\n", + " -1.4927e-01, 1.1984e-01, -4.9344e-02, -4.0802e-01, 0.0000e+00,\n", + " 1.0085e-01, 0.0000e+00, -1.1553e+00, -1.1726e-01, 0.0000e+00,\n", + " -1.5980e-01, 1.0715e-01, -6.6047e-01, -4.0122e-01, -3.3328e-01,\n", + " 0.0000e+00, -8.4713e-05, -1.0361e+00, -2.0651e-01, 4.8110e-01,\n", + " -1.1648e+00, -1.3535e+00, 3.9261e-01, -5.9634e-04, 0.0000e+00,\n", + " -2.3484e-02, -2.9119e-02, 4.8014e-03, -9.4629e-02, -2.8612e-01,\n", + " -2.1896e-02, -3.7316e-01, 3.3309e-02, -1.2363e-01, -1.7977e-01,\n", + " 4.3949e-05, -2.0580e-01, 4.9675e-01, -3.3358e-03, 2.3369e-01,\n", + " -2.5254e-01, 8.1739e-03, 1.1209e-01, -1.7311e-01, -8.0624e-02,\n", + " -1.3280e+00, 7.7398e-01, -1.2380e+00, 
-4.4729e-01, -6.5231e-01,\n", + " -3.0026e-01, 8.8543e-02, -1.9287e-04],\n", + " [ 2.8719e-01, 4.5585e-01, -2.3756e-01, -4.2161e-02, 1.0720e-01,\n", + " -8.3839e-01, 2.5098e-01, -4.7875e-01, -6.7113e-02, -1.8940e-01,\n", + " -9.7494e-01, 1.5431e-05, -1.9587e-01, 2.8327e-04, 1.3258e-01,\n", + " -1.4904e-01, 7.6676e-01, -1.9980e-01, -5.4798e-02, -5.3408e-02,\n", + " -1.8818e-01, 7.9588e-01, 9.6087e-02, 2.5530e-02, -7.2462e-01,\n", + " 0.0000e+00, -1.4258e+00, -5.9680e-02, 1.7253e-03, 4.7575e-02,\n", + " 4.2872e-01, 1.7875e-02, 0.0000e+00, -1.6925e-01, 1.6071e+00,\n", + " -7.2869e-01, -2.5870e-01, 2.7839e-02, -3.0237e-01, 6.9624e-01,\n", + " -2.8908e-04, 2.9164e-01, 4.7389e-02, 2.8530e-01, 0.0000e+00,\n", + " -5.9235e-02, -6.8176e-02, -4.6494e-02, -8.1741e-02, -6.3252e-01,\n", + " 2.4469e-02, 7.2494e-02, 8.2158e-02, 6.3870e-01, 2.1799e-01,\n", + " 7.5177e-01, 3.1872e-02, 1.4100e-01, -2.3390e-01, 3.3391e-01,\n", + " -3.8685e-01, -4.5463e-03, 1.4543e-01, 0.0000e+00, 3.8082e-05,\n", + " 3.0410e-01, 4.7614e-02, -1.5299e-01, -3.4549e-01, -2.3156e-01,\n", + " 2.2711e-01, -1.0894e-01, 1.7479e+00, 5.1043e-03, 1.4021e-02,\n", + " 1.9368e-02, -1.0875e-01, 9.7929e-03, -7.4530e-02, 0.0000e+00,\n", + " 1.9233e-02, 0.0000e+00, 5.1489e-01, -2.0680e-01, 0.0000e+00,\n", + " 1.8348e-01, 5.8013e-01, 2.9426e-01, -8.2005e-01, 2.2163e-01,\n", + " 0.0000e+00, 1.8530e-04, 5.1081e-01, 1.1597e-01, -4.4384e-01,\n", + " 6.3387e-01, 9.7338e-01, -6.0469e-01, -6.5377e-02, 0.0000e+00,\n", + " -7.6416e-02, -4.1226e-02, 1.3478e-01, -2.0853e-01, 2.4558e-01,\n", + " -1.0312e-01, 4.7792e-01, 4.1285e-02, -4.4301e-02, 3.2861e-02,\n", + " 6.9632e-04, 4.0895e-02, 1.1403e-01, 4.4740e-03, -2.8703e-01,\n", + " 3.8271e-01, -2.0000e-03, -5.1493e-02, 2.4292e-01, -2.3110e-01,\n", + " 1.1481e+00, -1.6331e+00, 5.0157e-01, -1.3528e-01, 3.8709e-01,\n", + " 1.4637e-01, -1.2716e-01, -6.3597e-05]], device='cuda:0',\n", + " requires_grad=True)\n", + "\n", + " Number of elements in the tensor B 256\n" + ] + } + ], + 
"source": [ + "B=model_lora.fc1.lora.B\n", + "print(\"B\",B)\n", + "print(\"\\n Number of elements in the tensor B\",B.numel())\n", + "torch.save(B, 'B.pth')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "e184ffb3-ccff-4637-a0f1-85b4c7c83bb0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A Parameter containing:\n", + "tensor([[ 3.2952e-01, 9.3019e-01],\n", + " [ 2.3904e+00, -5.1022e+00],\n", + " [-2.4573e-01, 2.4733e+00],\n", + " [ 5.8014e-01, 3.9014e-01],\n", + " [ 1.6970e+00, -2.4614e+00],\n", + " [-1.2420e+00, 8.3014e-01],\n", + " [-2.0468e+00, 1.1629e+00],\n", + " [-5.9361e-01, 1.8099e-01],\n", + " [ 9.2466e-02, 1.1583e-01],\n", + " [-2.0841e-02, 1.5550e+00],\n", + " [ 1.3028e+00, -9.8381e-01],\n", + " [ 1.4320e+00, -3.3497e+00],\n", + " [-1.1637e+00, 1.9436e+00],\n", + " [ 2.4898e-01, -1.1353e+00],\n", + " [ 2.4423e+00, 1.0154e+00],\n", + " [ 2.4881e+00, -4.6765e+00],\n", + " [-1.8985e-01, 1.3426e+00],\n", + " [-1.1730e-01, -2.1925e+00],\n", + " [ 2.0193e+00, -8.5886e-01],\n", + " [-3.1268e+00, 3.5134e+00],\n", + " [ 1.0935e+00, -2.9263e+00],\n", + " [-1.0435e+00, 2.5428e-01],\n", + " [-8.6704e-01, 2.3570e+00],\n", + " [-5.2725e-02, -2.3731e-01],\n", + " [-4.3896e+00, 5.0369e+00],\n", + " [-1.1174e+00, 9.9728e-01],\n", + " [ 4.7710e-01, -1.2162e+00],\n", + " [ 3.3296e+00, -5.9627e+00],\n", + " [ 1.7393e+00, -2.0707e+00],\n", + " [-4.0115e-01, 2.2691e+00],\n", + " [ 4.8368e-04, 1.1773e+00],\n", + " [ 1.9858e+00, -1.6100e+00],\n", + " [ 6.9951e-01, -1.2869e+00],\n", + " [-2.6380e-01, 2.5743e+00],\n", + " [-3.9002e+00, 2.6775e+00],\n", + " [-2.1946e+00, 3.7126e+00],\n", + " [-3.2268e-01, 1.7346e-01],\n", + " [ 4.7590e+00, -2.9553e+00],\n", + " [-6.6704e-01, 8.1369e-01],\n", + " [ 1.0073e+00, 3.6606e-01],\n", + " [ 1.3186e-01, -3.4151e+00],\n", + " [-6.6832e-01, 3.7761e-01],\n", + " [-3.0825e-01, 1.1928e+00],\n", + " [ 1.8897e+00, -1.2788e+00],\n", + " [-1.4344e+00, -2.9294e-01],\n", + 
" [ 2.1678e+00, -2.5469e+00],\n", + " [-6.6993e-01, 2.2932e+00],\n", + " [-3.1326e+00, 4.3424e+00],\n", + " [ 4.2197e+00, -7.1153e+00],\n", + " [ 7.1775e-01, -4.0832e-01],\n", + " [-1.6389e+00, 1.4520e+00],\n", + " [-1.6150e+00, 2.4683e+00],\n", + " [ 1.5745e+00, -4.6085e+00],\n", + " [ 6.0900e-01, -1.0851e+00],\n", + " [-1.2270e+00, 4.0171e-01],\n", + " [-2.6879e-01, 1.5706e+00],\n", + " [ 2.0390e+00, -1.8152e+00],\n", + " [-1.2317e+00, 6.0501e-01],\n", + " [ 4.1927e-02, -2.4078e+00],\n", + " [ 1.4265e+00, -9.2525e-01],\n", + " [ 5.9394e-01, -7.4587e-01],\n", + " [ 6.1463e-01, -2.6220e-01],\n", + " [ 6.7605e-01, -2.1740e-01],\n", + " [-5.9436e-01, 9.3632e-01],\n", + " [ 2.8933e-01, -1.3208e-02],\n", + " [-2.0206e+00, 3.9333e+00],\n", + " [ 2.4903e+00, -1.1771e+00],\n", + " [ 9.5112e-01, -1.7163e+00],\n", + " [ 1.7097e+00, -8.8835e-01],\n", + " [-3.4182e+00, 2.9024e+00],\n", + " [ 1.3908e+00, -1.7813e+00],\n", + " [-1.9825e+00, 2.1889e+00],\n", + " [ 3.2187e-02, -1.7187e-01],\n", + " [ 2.3456e+00, -3.3756e+00],\n", + " [-1.3550e+00, 3.8261e+00],\n", + " [-2.2111e+00, 1.7149e+00],\n", + " [-1.6745e+00, 2.1577e+00],\n", + " [-2.0270e+00, 3.8298e+00],\n", + " [ 1.6872e+00, -1.8343e+00],\n", + " [-1.3611e+00, 8.4912e-01],\n", + " [-1.5436e+00, 2.8195e+00],\n", + " [-8.4964e-01, 1.7983e+00],\n", + " [ 1.7100e+00, -6.1945e-01],\n", + " [ 1.7025e+00, -3.0842e+00],\n", + " [ 1.5122e+00, 7.2826e-01],\n", + " [ 9.1957e-01, -2.4107e+00],\n", + " [-2.5545e+00, 4.1697e+00],\n", + " [ 1.7832e-01, -2.9886e+00],\n", + " [-4.0698e+00, 5.4074e+00],\n", + " [ 1.3872e+00, -2.1134e+00],\n", + " [-3.7505e+00, 4.7124e+00],\n", + " [-1.6772e+00, 3.0610e+00],\n", + " [-7.6127e-01, 7.7868e-01],\n", + " [-5.1529e-01, 7.3124e-02],\n", + " [ 2.5028e+00, -5.3530e-01],\n", + " [ 2.1893e+00, -2.8006e+00],\n", + " [-1.7899e+00, 1.7446e+00],\n", + " [-1.1793e+00, 4.6499e+00],\n", + " [ 1.1547e+00, 1.7930e+00],\n", + " [-1.7421e-01, 2.7170e+00]], device='cuda:0', requires_grad=True)\n", + "\n", + " 
Number of elements in the tensor A 200\n" + ] + } + ], + "source": [ + "A=model_lora.fc1.lora.A\n", + "print(\"A\",A)\n", + "print(\"\\n Number of elements in the tensor A\",A.numel())\n", + "torch.save(A, 'A.pth')" + ] + }, + { + "cell_type": "markdown", + "id": "ca386b8c-f866-4f71-810b-324f96cb7a61", + "metadata": {}, + "source": [ + "A and B together have 456 parameters (200 + 256). Storing the weight matrix of the full linear layer would instead take 12,800 parameters, which is around 28 times more. Remember, this is possibly the simplest model that you can have.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "2873e500-1b1b-4302-a782-fa3ab59b374e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " Number of elements in the fc1 weight matrix 12800\n" + ] + } + ], + "source": [ + "\n", + "print(\"\\n Number of elements in the fc1 weight matrix\",model_lora.fc1.linear.weight.numel())\n" + ] + }, + { + "cell_type": "markdown", + "id": "7a88cfa5-73d4-458f-8c66-3229e4a5d49b", + "metadata": {}, + "source": [ + "The value of alpha and the output layer are also saved.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "6b203f60-2fdd-4c3e-9f22-55d0ba34827a", + "metadata": {}, + "outputs": [], + "source": [ + "alfa_=model_lora.fc1.lora.alpha\n", + "\n", + "torch.save(alfa_, 'alfa_.pth')\n", + "\n", + "torch.save(model_lora.fc2.state_dict(), 'out_layer.pth')" + ] + }, + { + "cell_type": "markdown", + "id": "8ec5a028-c58f-4ad6-8fcd-9fc94369c5ee", + "metadata": {}, + "source": [ + "## Loading the model\n", + "\n", + "The main advantage of LoRA is that, after fine-tuning, you only need to save the learnable parameters A and B, alpha, and, in this classification example, the output layer.\n", + "\n", + "The saved files are loaded back as tensors (A, B, and alpha) and as a linear layer (the output layer), respectively.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 
91, + "id": "eff5f4e2-3b93-486d-8353-90657c44a97f", + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "A: torch.Size([100, 2])\n" + ] + } + ], + "source": [ + "A = torch.load('A.pth')\n", + "print(\"A:\",A.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "dd2b464e-5b25-4ce5-8534-e1c58b40a708", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "B: torch.Size([2, 128])\n" + ] + } + ], + "source": [ + "B = torch.load('B.pth')\n", + "print(\"B:\",B.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "042798fa-c836-4a1c-bda5-6df0a12c24d2", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAACEAAAAQCAYAAACYwhZnAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/xnp5ZAAAACXBIWXMAABJ0AAASdAHeZh94AAABlElEQVR4nM3VT4iNURjH8c+IjWkaNSUL5c/Nn51ZKCMLSQ3LWcySZCdEEzvq6VG2hMx6SvbWU5qVPylrmUEWZENIY+la3HN5O7ncidt46u33vk+/97zf85xznneo3W5b7VhbJzJzM67gKMbwDveQEfGx34EzcxoHMY49GMHdiDhWe9dUL7bwFCfxBNfxCufxKDPH+oXAZZwtEG9/Z6wrMYuNOBcRtxpw1zCDqzjVJ8QM3uCFTkUWehl/VKJUYRKvcbvyBZZxPDOH+yGIiIWIWIqIP2665nIcKjofEd+qAb/gAdZjoh+IlUQTYlfRxR7epaI7BwkxWvRzD283v2GQEKsWTYjuTEd/ZWzkPw0S4nnRXmu+o2ivPfNPILrneDIz6yY2ggP4iscDg4iIl5jHVpypfIlh3ImI5QZcKzN3Z+a6v4GoO+ZpPMTNzDyMZ9in00MWcany38cWbNNpcj+pM6cwVR43Fd2fmXPl/n1EXKQ6HaUaezFXPn4BLdzARER8WMEEx3GiXEdKbnsjN901Dv0Pv/LvvxltuHPwrH4AAAAASUVORK5CYII=", + "text/latex": [ + "$\\displaystyle 0.1$" + ], + "text/plain": [ + "0.1" + ] + }, + "execution_count": 93, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "alfa_ = torch.load('alfa_.pth')\n", + "alfa_ \n" + ] + }, + { + "cell_type": "markdown", + "id": "5f2f40f8-ebc8-4f6f-ad13-35c66f05b2e7", + "metadata": {}, + "source": [ + "The output layer:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "b5ef67e4-d89d-4525-91c0-79c128af2c6b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 94, + 
"metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_layer=nn.Linear(in_features=128, out_features=2, bias=True)\n", + "output_layer.load_state_dict(torch.load('out_layer.pth'))" + ] + }, + { + "cell_type": "markdown", + "id": "c2ce6ab9-6d2d-4274-9cbe-96f83c175159", + "metadata": {}, + "source": [ + "The model object is created and the pretrained parameters are loaded:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "a1950bfb-dd37-421c-a288-502d89e8b8d6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): Linear(in_features=100, out_features=128, bias=True)\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=4, bias=True)\n", + ")" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\n", + "model_load_lora = TextClassifier(num_classes=4,freeze=False)\n", + "model_load_lora.to(device)\n", + "\n", + "urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/uGC04Pom651hQs1XrZ0NsQ/my-model-freeze-false.pth')\n", + "\n", + "stream = io.BytesIO(urlopened.read())\n", + "state_dict = torch.load(stream, map_location=device)\n", + "model_load_lora.load_state_dict(state_dict)\n", + "\n", + "model_load_lora" + ] + }, + { + "cell_type": "markdown", + "id": "6710f633-1588-47e0-9a76-da4aeb062784", + "metadata": {}, + "source": [ + "The LoRA layer object is added to the original hidden layer.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "5c62b5ea-bc49-40f1-a488-9e107058547f", + "metadata": {}, + "outputs": [], + "source": [ + "model_load_lora.fc1=LinearWithLoRA(model_load_lora.fc1,rank=2, alpha=0.1)\n", + "model_load_lora.fc2=nn.Linear(in_features=128, out_features=2, bias=True).to(device)" + ] + }, + { + "cell_type": "markdown", + "id": "47867680-839c-43f2-8907-b11fce97f1aa", + "metadata": {}, + 
"source": [ + "The parameters from fine-tuning are added.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "68cd3b53-fda5-42b4-965c-fbc98ef93999", + "metadata": {}, + "outputs": [], + "source": [ + "model_load_lora.fc1.lora.A=A\n", + "model_load_lora.fc1.lora.B=B\n", + "model_load_lora.fc1.lora.alpha=alfa_ \n", + "model_load_lora.fc2=output_layer" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "4fee5c83-2b3f-449e-bbad-1a612d600c18", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TextClassifier(\n", + " (embedding): Embedding(400000, 100)\n", + " (fc1): LinearWithLoRA(\n", + " (linear): Linear(in_features=100, out_features=128, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (relu): ReLU()\n", + " (fc2): Linear(in_features=128, out_features=2, bias=True)\n", + ")" + ] + }, + "execution_count": 98, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_load_lora.to(device)\n", + "model_load_lora.eval()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "8cb0716c-1dcd-42df-96ca-740384ef79fe", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAEYAAAAQCAYAAACr+QluAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/xnp5ZAAAACXBIWXMAABJ0AAASdAHeZh94AAAD0ElEQVR4nO3YaaiWRRQH8N81tYUKKooi2imwjVbFwBbbPkiSLRChWWB9SUwqKQQ7nSC0wkoIiizKCoo2KlHaDLJAEsLAsI3KoiJSy7TSzKUPMy88vj7Xe837sQMP8848//95zjlz5szM27Nt2zb/y44yuG0wMy/EZIzEAViD5ZgTEQsbuB5Mqs9J6MFneAKPR8TWvgzIzIMwDmNwCg7Hpvq9p/BUt57/wunl2+PxbO3eGBFPdN4NagHfj3dxFt7AbCzAwTi/C/4cHsfReF4JyD54FE/3ZViVqzEXI/ARHsYrOLnqe7FOwO5yuv08Ao/gj7b3g7vAN2Ia5uGmiNjU9X5I4/c4XItvMTwiVtfxodXICZn5WkS8ujMD8SXGYkFzljNzOpbiSlxRde4Op+lHj5JZa/Aqbu/GDGqA98S9+F5LUCAi/ml0x9V2dicoFbMJM2p3cpthXTrfi4j53akfET/jsdo9f3c5XTIFo3ED/mwDNDPmYmW5PIytmTlGSc2NWBoRS7q4h9b2mxa9nbFRmTm0Lcj9lM5EbB4oTmYOwyylXi7OzNFtuGZgzq7tRixTgtJUuBhXRcSqOtTJkmNa9B7b0H8sPu/dj3bJzMG4rnbfHAhOff+ssiqm70xXs/geUttp2IZR2A+n4m2ci5ca+AW1vTUzD2x8fAiygTtg5+70KrOUyVkYEW8NEOcunI7rI2LDzhQ1M6YTpM0YGxEra395LbRf4LzMHFmX1QuYgEuxIjNfV7LtIhymzMqR6HPb7JbMnILblEybMBCczByhZMnslrKwgzQzZm1tlzWCAiLiL3RmYHgd24LLcCdWYWJ9vsI5WF/xv/Tt1nYOTMYcrMAFEfHr7nLqEnpG2c1m7KhhR2lmzBe1XdsL9rfa7t0ZqLvUffVpGrIXjsfqiPi2P4ZU3lQ8hE9xYUT0GdR+cvbFCfX3xsxsgZibmXOVojy1GZhFSm05MTMHtZwcO8W4P45eg6HKoa9fkpl3KDXiE1zcPAIMAOdvPNnLuzOUuvOhkhxLaGRMRHyXmfOVg9Mtyix0DLhEqSVrNap9Zu4fEeu6jD0NDygZNqvFmeMwBF93zkWZOQP34GNc0s/l029OLbSTetFztxKYec0rQfdd6eYKerCeY5Yp2/Hl2IJJEfF7A/9OZm5Q0ng9hin3lw24LCJ+arFlEY6qeldm5sTq4BZ8gCktqb4yIp5uOLPLnF2V7QITET9k5pnKtjZW2aLXYT5mRsTSLv7LyrIZr9SeH5W708yI+KGfNnTOQXtgai+Y921/9/ovnF2Snv//dmiXfwHLqKZB8S0rTwAAAABJRU5ErkJggg==", + "text/latex": [ + "$\\displaystyle 69.224$" + ], + "text/plain": [ + "69.224" + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "evaluate(test_dataloader , model_load_lora, device)" + ] + }, + { + "cell_type": "markdown", + "id": "aa14425e-297e-4fce-ab41-51686ebcbe9b", + "metadata": {}, + "source": [ + "This confirms that the model loaded correctly. 
You still get a 3% improvement in accuracy!\n", + "\n", + "Finally, here is how you can make a prediction on a sample review using the function **`predict`**:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "839bfe9f-fe50-4106-99f7-ede8dbdc2310", + "metadata": {}, + "outputs": [], + "source": [ + "article=\"\"\"This was a lacklustre movie with very little going for it. I was not impressed.\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "468d8dc4-0c6c-4e9f-9433-d8b819f8afab", + "metadata": {}, + "source": [ + "The markdown content below generates a styled box with a light gray background and padding. It contains an `

` header displaying the content of the `article` variable, and an `

` header indicating the predicted category of the review, which is provided by the `result` variable. The placeholders `{article}` and `{result}` will be dynamically replaced with actual values when this markdown is rendered.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "6264a056-7192-47fa-b02b-5f32d6e0e404", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "\n", + "
\n", + "

This was a lacklustre movie with very little going for it. I was not impressed.

\n", + "

The predicted category of the review: negative review

\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 101, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result = predict(article, model_load_lora, text_pipeline)\n", + "\n", + "markdown_content = f'''\n", + "
\n", + "

{article}

\n", + "

The predicted category of the review: {result}

\n", + "
\n", + "'''\n", + "\n", + "md(markdown_content)" + ] + }, + { + "cell_type": "markdown", + "id": "d62be84d-bbc9-4021-bc17-762b265614b4", + "metadata": {}, + "source": [ + "---\n", + "## Exercise: Apply LoRA to a different network\n", + "\n", + "The following code defines a neural network called `NNet`. \n", + "\n", + "`NNet` is a neural network that was originally written to identify hand-written digits from 32x32 images. Your task is to fine-tune this network to perform letter recognition using LoRA by replacing the section labeled `### REPLACE THIS ###` in the code block below. To enhance your understanding, apply LoRA to just the second linear layer, and replace the last layer with a layer that has 26 outputs, one for each letter in the English alphabet.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "b0a24c15-bcb2-4171-850b-4a47c1380244", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "This is what the model looked like before applying LoRA:\n", + "NNet(\n", + " (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))\n", + " (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))\n", + " (fc1): Linear(in_features=400, out_features=120, bias=True)\n", + " (fc2): Linear(in_features=120, out_features=84, bias=True)\n", + " (fc3): Linear(in_features=84, out_features=10, bias=True)\n", + ")\n", + "\n", + "###############\n", + "\n", + "This is what the model looked like after applying LoRA:\n", + "NNet(\n", + " (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))\n", + " (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))\n", + " (fc1): Linear(in_features=400, out_features=120, bias=True)\n", + " (fc2): LinearWithLoRA(\n", + " (linear): Linear(in_features=120, out_features=84, bias=True)\n", + " (lora): LoRALayer()\n", + " )\n", + " (fc3): Linear(in_features=84, out_features=26, bias=True)\n", + ")\n" + ] + } + ], + "source": [ + "#TODO\n", + "class NNet(nn.Module):\n", + "\n", 
+ " def __init__(self):\n", + " super(NNet, self).__init__()\n", + " # 1 input image channel, 6 output channels, 5x5 square convolution\n", + " # kernel\n", + " self.conv1 = nn.Conv2d(1, 6, 5)\n", + " self.conv2 = nn.Conv2d(6, 16, 5)\n", + " # an affine operation: y = Wx + b\n", + " self.fc1 = nn.Linear(16 * 5 * 5, 120) # 5*5 from image dimension\n", + " self.fc2 = nn.Linear(120, 84)\n", + " self.fc3 = nn.Linear(84, 10)\n", + "\n", + " def forward(self, input):\n", + " # Convolution layer C1: 1 input image channel, 6 output channels,\n", + " # 5x5 square convolution, it uses RELU activation function, and\n", + " # outputs a Tensor with size (N, 6, 28, 28), where N is the size of the batch\n", + " c1 = F.relu(self.conv1(input))\n", + " # Subsampling layer S2: 2x2 grid, purely functional,\n", + " # this layer does not have any parameter, and outputs a (N, 6, 14, 14) Tensor\n", + " s2 = F.max_pool2d(c1, (2, 2))\n", + " # Convolution layer C3: 6 input channels, 16 output channels,\n", + " # 5x5 square convolution, it uses RELU activation function, and\n", + " # outputs a (N, 16, 10, 10) Tensor\n", + " c3 = F.relu(self.conv2(s2))\n", + " # Subsampling layer S4: 2x2 grid, purely functional,\n", + " # this layer does not have any parameter, and outputs a (N, 16, 5, 5) Tensor\n", + " s4 = F.max_pool2d(c3, 2)\n", + " # Flatten operation: purely functional, outputs a (N, 400) Tensor\n", + " s4 = torch.flatten(s4, 1)\n", + " # Fully connected layer F5: (N, 400) Tensor input,\n", + " # and outputs a (N, 120) Tensor, it uses RELU activation function\n", + " f5 = F.relu(self.fc1(s4))\n", + " # Fully connected layer F6: (N, 120) Tensor input,\n", + " # and outputs a (N, 84) Tensor, it uses RELU activation function\n", + " f6 = F.relu(self.fc2(f5))\n", + " # Gaussian layer OUTPUT: (N, 84) Tensor input, and\n", + " # outputs a (N, 10) Tensor\n", + " output = self.fc3(f6)\n", + " return output\n", + "\n", + "model_exercise = NNet()\n", + "model_exercise.to(device)\n", + "\n", + 
"print('This is what the model looked like before applying LoRA:')\n", + "print(model_exercise)\n", + "print(\"\\n###############\\n\")\n", + "\n", + "# Freeze all parameters:\n", + "for parm in model_exercise.parameters():\n", + " parm.requires_grad=False\n", + "\n", + "# Change final layer for one with 26 outputs:\n", + "model_exercise.fc3=nn.Linear(in_features=84, out_features=26, bias=True).to(device)\n", + "\n", + "# Apply LoRA to the second linear layer\n", + "model_exercise.fc2=LinearWithLoRA(model_exercise.fc2,rank=2, alpha=0.1).to(device)\n", + "\n", + "print('This is what the model looked like after applying LoRA:')\n", + "print(model_exercise)" + ] + }, + { + "cell_type": "markdown", + "id": "7f523e91-700c-4d28-998c-be37badf0c63", + "metadata": {}, + "source": [ + "
\n", + " Click here for the solution\n", + "\n", + "```python\n", + "class NNet(nn.Module):\n", + "\n", + " def __init__(self):\n", + " super(NNet, self).__init__()\n", + " # 1 input image channel, 6 output channels, 5x5 square convolution\n", + " # kernel\n", + " self.conv1 = nn.Conv2d(1, 6, 5)\n", + " self.conv2 = nn.Conv2d(6, 16, 5)\n", + " # an affine operation: y = Wx + b\n", + " self.fc1 = nn.Linear(16 * 5 * 5, 120) # 5*5 from image dimension\n", + " self.fc2 = nn.Linear(120, 84)\n", + " self.fc3 = nn.Linear(84, 10)\n", + "\n", + " def forward(self, input):\n", + " # Convolution layer C1: 1 input image channel, 6 output channels,\n", + " # 5x5 square convolution, it uses RELU activation function, and\n", + " # outputs a Tensor with size (N, 6, 28, 28), where N is the size of the batch\n", + " c1 = F.relu(self.conv1(input))\n", + " # Subsampling layer S2: 2x2 grid, purely functional,\n", + " # this layer does not have any parameter, and outputs a (N, 6, 14, 14) Tensor\n", + " s2 = F.max_pool2d(c1, (2, 2))\n", + " # Convolution layer C3: 6 input channels, 16 output channels,\n", + " # 5x5 square convolution, it uses RELU activation function, and\n", + " # outputs a (N, 16, 10, 10) Tensor\n", + " c3 = F.relu(self.conv2(s2))\n", + " # Subsampling layer S4: 2x2 grid, purely functional,\n", + " # this layer does not have any parameter, and outputs a (N, 16, 5, 5) Tensor\n", + " s4 = F.max_pool2d(c3, 2)\n", + " # Flatten operation: purely functional, outputs a (N, 400) Tensor\n", + " s4 = torch.flatten(s4, 1)\n", + " # Fully connected layer F5: (N, 400) Tensor input,\n", + " # and outputs a (N, 120) Tensor, it uses RELU activation function\n", + " f5 = F.relu(self.fc1(s4))\n", + " # Fully connected layer F6: (N, 120) Tensor input,\n", + " # and outputs a (N, 84) Tensor, it uses RELU activation function\n", + " f6 = F.relu(self.fc2(f5))\n", + " # Gaussian layer OUTPUT: (N, 84) Tensor input, and\n", + " # outputs a (N, 10) Tensor\n", + " output = self.fc3(f6)\n", + " 
return output\n", + "\n", + "model_exercise = NNet()\n", + "model_exercise.to(device)\n", + "\n", + "print('This is what the model looked like before applying LoRA:')\n", + "print(model_exercise)\n", + "print(\"\\n###############\\n\")\n", + "\n", + "# Freeze all parameters:\n", + "for parm in model_exercise.parameters():\n", + " parm.requires_grad=False\n", + "\n", + "# Change final layer for one with 26 outputs:\n", + "model_exercise.fc3=nn.Linear(in_features=84, out_features=26, bias=True).to(device)\n", + "\n", + "# Apply LoRA to the second linear layer\n", + "model_exercise.fc2=LinearWithLoRA(model_exercise.fc2,rank=2, alpha=0.1).to(device)\n", + "\n", + "print('This is what the model looked like after applying LoRA:')\n", + "print(model_exercise)\n", + "```\n", + "\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "61f9dd98-d9f8-4017-bbdd-976a69eddd24", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "7635b116-9be8-4a51-b687-9ddd98bc905e", + "metadata": {}, + "source": [ + "## Congratulations! You have completed the lab\n", + "\n", + "## Authors\n", + "\n", + "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his Ph.D.\n", + "\n", + "[Wojciech \"Victor\" Fulmyk](https://www.linkedin.com/in/wfulmyk) is a Data Scientist at IBM and a Ph.D. candidate in Economics at the University of Calgary.\n", + "\n", + "[Ashutosh Sagar](https://www.linkedin.com/in/ashutoshsagar/) is completing his MS in CS at Dalhousie University. He has previous experience working with Natural Language Processing and as a Data Scientist.\n", + "\n", + "## References\n", + "\n", + "[Finetuning with LoRA -- A Hands-On Example](https://lightning.ai/lightning-ai/studios/code-lora-from-scratch)\n", + "\n", + "[TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)\n", + "\n", + "\n", + "© Copyright IBM Corporation. 
All rights reserved.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.20" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/LLM_Specialization/QLoRA_with_Hugging_Face.ipynb b/notebooks/LLM_Specialization/QLoRA_with_Hugging_Face.ipynb new file mode 100644 index 0000000..f8ed8b0 --- /dev/null +++ b/notebooks/LLM_Specialization/QLoRA_with_Hugging_Face.ipynb @@ -0,0 +1,1312 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "


\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# QLoRA with HuggingFace\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "QLoRA is an extension of LoRA that leverages quantization. Quantization is the process of mapping values from a large, often continuous, range onto a smaller, finite set of discrete values. Effectively, the model's parameters are stored in 2, 3, 4, or 8 bits rather than the usual 32 bits, lowering the number of bits needed to store information. Quantization offers two benefits:\n", + "\n", + "1. It reduces the memory footprint. By using a finite set of discrete levels, the values can be represented with fewer bits, reducing the memory required to store them; and\n", + "2. It allows for efficient computation. Quantized values can be represented and processed more efficiently on hardware with limited numerical precision, such as low-power microcontrollers or specialized AI/ML accelerators.\n", + "\n", + "Choosing QLoRA over LoRA involves several trade-offs. QLoRA offers the following advantages over LoRA:\n", + "\n", + "1. Substantially smaller GPU memory usage than LoRA.\n", + "2. Higher maximum sequence lengths resulting from the smaller GPU memory usage.\n", + "3. Higher batch sizes resulting from the smaller GPU memory usage.\n", + "\n", + "The main disadvantage of QLoRA is slower fine-tuning speed.\n", + "\n", + "Interestingly enough, the accuracies of QLoRA and LoRA are comparable, even though QLoRA produces substantially smaller models with lower GPU memory footprints than LoRA.\n", + "\n", + "The original QLoRA paper is available [here](https://arxiv.org/pdf/2305.14314).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note that the following uses the popular `BitsAndBytes` library to implement QLoRA, which only supports quantization using a CUDA-enabled GPU. 
You will not be able to run this notebook without a compatible GPU!**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# __Table of Contents__\n", + "\n", + "
1. Objectives\n", + "
2. Setup\n", + "
    1. Install required libraries\n", + "
    2. Import required libraries\n", + "
    3. Define helper functions\n", + "
3. IMDB dataset\n", + "
4. Tokenizer\n", + "
5. Configure BitsAndBytes\n", + "
6. Load a quantized version of a pretrained model\n", + "
7. Train\n", + "
8. Results
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Objectives\n", + "\n", + "After completing this lab you will be able to:\n", + "\n", + "- Load and predict using models from HuggingFace\n", + "- Fine-tune language models using QLoRA\n", + "- Understand the advantages and disadvantages of QLoRA\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setup\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install required libraries\n", + "\n", + "For this lab, you use the following libraries, which are __not__ preinstalled in the Skills Network Labs environment. __You must run the code in the following cell__ to install them.\n", + "\n", + "\n", + "```bash\n", + "!pip install -U datasets==2.20.0 huggingface_hub==0.23.4 transformers==4.41.2 peft==0.11.1 bitsandbytes==0.43.1 matplotlib==3.9.0 scikit-learn==1.5.0\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import required libraries\n", + "\n", + "The following code imports required libraries.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n", + "Tesla P40\n", + "Import Successfully!\n" + ] + } + ], + "source": [ + "import os\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"\n", + "\n", + "import torch\n", + "print(torch.cuda.is_available())\n", + "print(torch.cuda.get_device_name())\n", + "\n", + "import matplotlib.pyplot as plt\n", + "# You can also use this section to suppress warnings generated by your code:\n", + "def warn(*args, **kwargs):\n", + " pass\n", + "import warnings\n", + "warnings.warn = warn\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "import json\n", + "\n", + "import 
numpy as np\n", + "\n", + "from datasets import load_dataset #load_metric\n", + "\n", + "from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, BitsAndBytesConfig\n", + "\n", + "from peft import LoraConfig, get_peft_model, TaskType, replace_lora_weights_loftq, prepare_model_for_kbit_training\n", + "\n", + "print(\"Import Successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check whether a compatible GPU is available:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "device(type='cuda')" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Select the GPU if one is available, otherwise fall back to the CPU\n", + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "device" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define helper functions\n", + "\n", + "Here are some helper functions. 
We will use these later to save and load from JSON:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def save_to_json(data, file_path):\n", + "    \"\"\"\n", + "    Save a dictionary to a JSON file.\n", + "\n", + "    Args:\n", + "        data (dict): The dictionary to save.\n", + "        file_path (str): The path to the JSON file.\n", + "    \"\"\"\n", + "    with open(file_path, 'w') as json_file:\n", + "        json.dump(data, json_file, indent=4)\n", + "    print(f\"Data successfully saved to {file_path}\")\n", + "\n", + "\n", + "def load_from_json(file_path):\n", + "    \"\"\"\n", + "    Load data from a JSON file.\n", + "\n", + "    Args:\n", + "        file_path (str): The path to the JSON file.\n", + "\n", + "    Returns:\n", + "        dict: The data loaded from the JSON file.\n", + "    \"\"\"\n", + "    with open(file_path, 'r') as json_file:\n", + "        data = json.load(json_file)\n", + "    return data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# IMDB dataset \n", + "\n", + "The IMDB dataset is a large movie review dataset, consisting of 25,000 labeled movie reviews for training and 25,000 labeled movie reviews for testing, plus 50,000 unlabeled reviews in an `unsupervised` split. The labeled reviews are marked as either positive or negative, and each review is a variable-length sequence of words. The IMDB dataset is a popular benchmark for text classification tasks, and it has been used to train a variety of natural language processing models. 
The following line loads the IMDB dataset:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "imdb = load_dataset(\"imdb\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's display the structure of the IMDB dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset structure:\n", + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['text', 'label'],\n", + " num_rows: 25000\n", + " })\n", + " test: Dataset({\n", + " features: ['text', 'label'],\n", + " num_rows: 25000\n", + " })\n", + " unsupervised: Dataset({\n", + " features: ['text', 'label'],\n", + " num_rows: 50000\n", + " })\n", + "})\n" + ] + } + ], + "source": [ + "# Display the structure of the dataset\n", + "print(\"Dataset structure:\")\n", + "print(imdb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following displays the available splits in the dataset (train, test, unsupervised)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['train', 'test', 'unsupervised'])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imdb.keys()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's explore and print a sample from the training set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Sample from the training set:\n", + "{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. 
I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered \"controversial\" I really had to see this for myself.

The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.

What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.

I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\\'t have much of a plot.', 'label': 0}\n" + ] + } + ], + "source": [ + "# Explore and print a sample from the training set\n", + "print(\"\\nSample from the training set:\")\n", + "print(imdb['train'][0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following code displays the unique class labels in the dataset. For the IMDB dataset, the labels are integers representing sentiment, where 0 stands for “negative” and 1 stands for “positive”. Here’s how you can extract this information:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Unique labels in the dataset (class information):\n", + "{0, 1}\n" + ] + } + ], + "source": [ + "train_labels = imdb['train']['label']\n", + "unique_labels = set(train_labels)\n", + "print(\"\\nUnique labels in the dataset (class information):\")\n", + "print(unique_labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following dictionary maps class values to class names:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 'negative', 1: 'positive'}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "class_names = {0: \"negative\", 1: \"positive\"}\n", + "class_names" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the IMDB dataset is quite large, we’ll create smaller subsets to facilitate quicker training and testing.\n", + "\n", + "In 
this notebook, we simulate training and testing using the `small_` datasets due to time constraints. However, it's important to recognize that these smaller datasets are insufficient for proper fine-tuning of the DistilBERT model. For more accurate results, a larger subsample, such as the `medium_train_dataset`, would be necessary.\n", + "\n", + "Consequently, all results presented here pertain to models trained with the `medium_train_dataset` and evaluated on the test set from `medium_test_dataset`. However, the notebook, as is, does NOT train models on these datasets; rather, it trains models using the `small_` datasets, as training on the `medium_` datasets would be too time-consuming.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "small_train_dataset = imdb[\"train\"].shuffle(seed=42).select([i for i in list(range(50))])\n", + "small_test_dataset = imdb[\"test\"].shuffle(seed=42).select([i for i in list(range(50))])\n", + "medium_train_dataset = imdb[\"train\"].shuffle(seed=42).select([i for i in list(range(3000))])\n", + "medium_test_dataset = imdb[\"test\"].shuffle(seed=42).select([i for i in list(range(300))])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Tokenizer\n", + "\n", + "The following loads the DistilBERT tokenizer: \n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The tokenizer provides tokenized input IDs and an attention mask for a given input text:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input IDs: [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 
2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 
2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102]\n", + "Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n" + ] + } + ], + "source": [ + "my_tokens=tokenizer(imdb['train'][0]['text'])\n", + "\n", + "# Print the tokenized input IDs\n", + "print(\"Input IDs:\", my_tokens['input_ids'])\n", + "\n", + "# Print the attention mask\n", + "print(\"Attention Mask:\", my_tokens['attention_mask'])\n", + "\n", + "# If token_type_ids is present, print it\n", + "if 'token_type_ids' in my_tokens:\n", + " print(\"Token Type IDs:\", my_tokens['token_type_ids'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following preprocessing function tokenizes a text input. 
We apply this function to all texts in our datasets using the `.map()` method:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0893c8e8a7d3452abbcabc48a4e8604d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/50 [00:00\n", + " \n", + " \n", + " [1880/1880 20:38, Epoch 10/10]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
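| Epoch | Training Loss | Validation Loss | Accuracy |
|-|-|-|-|
| 1 | No log | 0.629451 | 0.786667 |
| 2 | No log | 0.379623 | 0.823333 |
| 3 | 0.518200 | 0.368352 | 0.826667 |
| 4 | 0.518200 | 0.360306 | 0.826667 |
| 5 | 0.518200 | 0.358013 | 0.830000 |
| 6 | 0.308600 | 0.357641 | 0.830000 |
| 7 | 0.308600 | 0.353580 | 0.830000 |
| 8 | 0.284100 | 0.354911 | 0.830000 |
| 9 | 0.284100 | 0.353685 | 0.840000 |
| 10 | 0.284100 | 0.354808 | 0.843333 |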

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0428b91092944c3b980428bdd8415084", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading builder script: 0%| | 0.00/4.20k [00:00" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "

" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "eval_accuracy_qlora=get_metric_qlora('eval_accuracy',log_history_qlora)\n", + "eval_loss_qlora=get_metric_qlora('eval_loss',log_history_qlora)\n", + "plt.plot(eval_accuracy_qlora,label='eval_accuracy')\n", + "plt.plot(eval_loss_qlora,label='eval_loss')\n", + "plt.xlabel(\"epoch\")\n", + "plt.legend()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above code results in the following plot:\n", + "\n", + "![qlora_training_plot](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/wzMMj73IuM6fKmPZtKtQNA/qlora-training-plot.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above code indicates that, in this particular instance, the bulk of the benefits from fine-tuning were gained within the first 3 epochs.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Congratulations! You have completed the lab\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authors\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Wojciech \"Victor\" Fulmyk](https://www.linkedin.com/in/wfulmyk) is a Data Scientist and a PhD Candidate in Economics at the University of Calgary.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a Ph.D. candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. 
in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his Ph.D.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "[Finetuning with LoRA -- A Hands-On Example](https://lightning.ai/lightning-ai/studios/code-lora-from-scratch)\n", + "\n", + "[QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314)\n", + "\n", + "[Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Change Log\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "|Date (YYYY-MM-DD)|Version|Changed By|Change Description|\n", + "|-|-|-|-|\n", + "|2024-07-09|0.99|Victor|Lab written|\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright © 2024 IBM Corporation. 
All rights reserved.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.20" + }, + "prev_pub_hash": "856043195c4f7d96c767c084082c1b6e5e7de2222404f63f3e209d4961071444" + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/requirements.txt b/requirements.txt index 07150c1..8130d39 100644 --- a/requirements.txt +++ b/requirements.txt @@ -8,11 +8,15 @@ transformers==4.46.3 #huggingface-hub==0.27.0 datasets==3.1.0 accelerate==1.0.1 +evaluate trl -jupyter==1.1.1 +peft +bitsandbytes sentencepiece==0.2.0 nltk==3.9.1 gradio==4.44.1 spacy==3.7.2 # conda install #flash_attn==2.6.3 # /home/loc/Works/flash-attention (need gcc to build) -scikit-learn \ No newline at end of file +scikit-learn +plotly +jupyter==1.1.1
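
The `peft` and `bitsandbytes` requirements added above support the QLoRA notebook in this diff, whose introduction describes quantization as mapping values onto a smaller set of discrete levels. As a rough illustration of that idea only, the sketch below applies uniform absmax quantization to a short list of weights in plain Python. The function names `quantize_absmax` and `dequantize` and the sample weights are hypothetical, and this is not the 4-bit NormalFloat (NF4) data type introduced in the QLoRA paper; it is a simplified stand-in that runs without a GPU.

```python
def quantize_absmax(values, bits=4):
    """Map floats onto signed integer levels using a single absmax scale."""
    # Largest magnitude a signed integer of this width can hold symmetrically,
    # e.g. 7 for 4 bits (levels -7..7).
    max_level = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = absmax / max_level if absmax > 0 else 1.0
    levels = [round(v / scale) for v in values]
    return levels, scale


def dequantize(levels, scale):
    """Recover approximate floats from the integer levels."""
    return [level * scale for level in levels]


# Hypothetical weights chosen for illustration.
weights = [0.70, -0.33, 0.12, 0.0, -0.70]
levels, scale = quantize_absmax(weights, bits=4)
restored = dequantize(levels, scale)

print("levels:  ", levels)
print("restored:", [round(r, 3) for r in restored])
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

With 4 bits, each weight is stored as one of 15 integer levels, so it needs roughly an eighth of the storage of a 32-bit float (ignoring the shared scale), at the cost of the rounding error printed on the last line.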