Commit 392b4b0 ("update readme", parent a9b29db)

2 files changed: +253 −81 lines

CONTRIBUTING.md (new file, +224 lines)
<!---
Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# How to contribute to 🤗 Nanotron?

Everyone is welcome to contribute, and we value everybody's contribution. Code
is thus not the only way to help the community. Answering questions, helping
others, reaching out, and improving the documentation are immensely valuable to
the community.

It also helps us if you spread the word: reference the library from blog posts
on the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply star the repo to say "thank you".

Whichever way you choose to contribute, please be mindful to respect our
[code of conduct](CODE_OF_CONDUCT.md).
## You can contribute in so many ways!

Some of the ways you can contribute to nanotron:
* Fixing outstanding issues with the existing code;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
## Submitting a new issue or feature request

Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.
### Did you find a bug?

The 🤗 Nanotron library is robust and reliable thanks to the users who notify us of
the problems they encounter. So thank you for reporting an issue.

First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on GitHub under Issues).

Did not find it? :( So we can act quickly on it, please follow these steps:

* Include your **OS type and version** and the versions of **Python** and **PyTorch**;
* Provide a short, self-contained code snippet that allows us to reproduce the bug in
  less than 30s;
* Provide the Nanotron configuration used for the run;
* Describe the expected behavior and the actual behavior.
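As a sketch of the first bullet, the following commands gather those environment details in one go (the `torch` line assumes PyTorch is importable by `python3`; the fallback just prints a hint):

```shell
# Collect the OS, Python, and PyTorch versions for a bug report.
uname -sr
python3 --version
python3 -c "import torch; print('PyTorch', torch.__version__)" 2>/dev/null \
  || echo "PyTorch: not installed"
```

Paste the output at the top of the issue, followed by the reproduction snippet and your config.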
### Do you want a new feature?

A good feature request addresses the following points:

1. Motivation first:
   * Is it related to a problem/frustration with the library? If so, please explain
     why. Providing a code snippet that demonstrates the problem is best.
   * Is it related to something you would need for a project? We'd love to hear
     about it!
   * Is it something you worked on and think could benefit the community?
     Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.

If your issue is well written, we're already 80% of the way there by the time you
post it.
## Submitting a pull request (PR)

Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.

You will need basic `git` proficiency to be able to contribute to
🤗 Nanotron. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.

Follow these steps to start contributing:
1. Fork the [repository](https://github.com/huggingface/nanotron) by
   clicking on the 'Fork' button on the repository's page. This creates a copy of the code
   under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote. The following commands
   assume you have your public SSH key uploaded to GitHub. See the GitHub guide on
   [cloning a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) for more information.

   ```bash
   $ git clone git@github.com:<your GitHub handle>/nanotron.git
   $ cd nanotron
   $ git remote add upstream https://github.com/huggingface/nanotron.git
   ```
3. Create a new branch to hold your development changes, and do this for every new PR you work on.

   Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):

   ```bash
   $ git checkout main
   $ git fetch upstream
   $ git merge upstream/main
   ```

   Once your `main` branch is synchronized, create a new branch from it:

   ```bash
   $ git checkout -b a-descriptive-name-for-my-changes
   ```

   **Do not** work on the `main` branch.
4. Set up a development environment by running the following commands in a conda or virtual environment you've created for working on this library:

   ```bash
   $ pip install -e ".[dev]"
   $ pip install -e ".[test]"
   $ pre-commit install
   ```

   (If nanotron was already installed in the virtual environment, remove
   it with `pip uninstall nanotron` before reinstalling it in editable
   mode with the `-e` flag.)

   Alternatively, if you are using [Visual Studio Code](https://code.visualstudio.com/Download), the fastest way to get set up is to use
   the provided Dev Container. Documentation on how to get started with dev containers is available [here](https://code.visualstudio.com/docs/remote/containers).
5. Develop the features on your branch.

   As you work on the features, you should make sure that the test suite
   passes. Run the tests impacted by your changes like this:

   ```bash
   $ pytest tests/<TEST_TO_RUN>.py
   ```

   `nanotron` relies on `ruff` to format its source code consistently. After you
   make changes, apply automatic style corrections and code verifications with
   `pre-commit`. The hooks installed earlier run automatically on the files each
   commit touches; to run all checks over the whole codebase in one go, use:

   ```bash
   $ pre-commit run --all-files
   ```
   Once you're happy with your changes, add the changed files using `git add` and
   make a commit with `git commit` to record your changes locally:

   ```bash
   $ git add modified_file.py
   $ git commit
   ```

   Please write [good commit messages](https://chris.beams.io/posts/git-commit/).

   It is a good idea to sync your copy of the code with the original
   repository regularly. This way you can quickly account for changes:

   ```bash
   $ git fetch upstream
   $ git rebase upstream/main
   ```

   Push the changes to your account using:

   ```bash
   $ git push -u origin a-descriptive-name-for-my-changes
   ```
6. Once you are satisfied (**and the checklist below is happy too**), go to the
   webpage of your fork on GitHub. Click on 'Pull request' to send your changes
   to the project maintainers for review.

7. It's OK if maintainers ask you for changes. It happens to core contributors
   too! So that everyone can see the changes in the pull request, work in your
   local branch and push the changes to your fork. They will automatically appear
   in the pull request.
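Before opening the PR, a quick sanity pass over the setup from steps 2 and 4 can save a review round-trip. This sketch only reads state, so it is safe to run from the repo root; the fallbacks merely print a hint when something is missing:

```shell
# Verify the remotes from step 2 and the editable install from step 4.
git remote -v 2>/dev/null || echo "not inside a git repository"
python3 -c "import nanotron; print('nanotron at', nanotron.__file__)" 2>/dev/null \
  || echo "nanotron is not importable - re-run: pip install -e ."
pre-commit --version 2>/dev/null || echo "pre-commit is not on PATH"
```

You should see both `origin` (your fork) and `upstream` (huggingface/nanotron) in the remote list.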
### Checklist

1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
   the pull request description to make sure they are linked (and people
   consulting the issue know you are working on it);
3. To indicate a work in progress, please prefix the title with `[WIP]` or mark
   the PR as a draft. This helps avoid duplicated work and distinguishes it from
   PRs that are ready to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality testing = no merge.

See an example of a good PR here: https://github.com/huggingface/nanotron/pull/155
### Tests

An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
the [tests folder](https://github.com/huggingface/nanotron/tree/main/tests).

We use `pytest` to run the tests. From the root of the
repository, here's how to run them:

```bash
# Run all tests, 12 at a time in parallel
$ pytest -n 12 tests
```

You can specify a smaller set of tests in order to test only the feature
you're working on.
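As an illustration of the selection flags, here is a self-contained sketch: it creates a throwaway test file via `mktemp`, so nothing in the repo is touched (inside Nanotron you would point at real files under `tests/` instead):

```shell
# Hypothetical throwaway tests, not part of the nanotron suite.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test_demo.py" <<'EOF'
def test_addition():
    assert 1 + 1 == 2

def test_subtraction():
    assert 2 - 1 == 1
EOF
python3 -m pytest -q "$tmpdir/test_demo.py"                 # the whole file
python3 -m pytest -q "$tmpdir/test_demo.py::test_addition"  # a single test
python3 -m pytest -q -k "add" "$tmpdir/test_demo.py"        # keyword filter
```

The `file::test_name` node-id form and the `-k` keyword filter compose with `-n` for parallel runs as well.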

README.md (+29 −81 lines)
````diff
@@ -23,99 +23,47 @@
 <h3 align="center">
 <a href="https://huggingface.co/nanotron"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" /></a>
 </h3>
+<h3 align="center">
+<p>Pretraining models made easy
+</h3>
 
 
+Nanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets. Nanotron is designed to be easy to use, fast, and scalable. It is built with the following principles in mind:
 
-#
-
-The objective of this library is to provide easy distributed primitives in order to train a variety of models efficiently using 3D parallelism. For more information about the internal design of the library or 3D parallelism in general, please check out [[docs.md]](./docs/docs.md) and [[3d_parallelism.md]](./docs/3d_parallelism.md).
-
-
-# Philosophy
-
-- Make it fast. At least as fast as other open source versions.
-- Make it minimal. We don't actually need to support all techniques and all versions of 3D parallelism. What matters is that we can efficiently use the "best" ones.
-- Make everything explicit instead of transparent. As we move forward, making things transparent works well when it works well but is a horrible debugging experience if one doesn't understand the implications of techniques used. In order to mitigate this, we choose to be explicit in the way it does things
-
-# Core Features
-
-We support the following:
-- 3D parallelism, including one-forward-one-backward pipeline engine
-- ZeRO-1 optimizer
-- FP32 gradient accumulation
-- Parameter tying/sharding
-- Spectral µTransfer parametrization for scaling up neural networks
-
-# Installation
+- **Simplicity**: Nanotron is designed to be easy to use. It provides a simple and flexible API to pretrain models on custom datasets.
+- **Performance**: Optimized for speed and scalability, Nanotron uses the latest techniques to train models faster and more efficiently.
 
-Requirements:
-- Python >= 3.10
-- PyTorch >= 2.0.0
-- Flash-Attention >= 2.5.0
+## Installation
 
-To install (in a new env):
 ```bash
-pip install torch
-pip install packaging; pip install "flash-attn>=2.5.0" --no-build-isolation
-pip install nanotron
+# Requirements: Python>=3.10
+git clone https://github.com/huggingface/nanotron
+cd nanotron
+pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
+pip install -e .
+
+# Install dependencies if you want to use the example scripts
+pip install datasets transformers
+pip install "flash-attn>=2.5.0" --no-build-isolation
 ```
+> [!NOTE]
+> If you get `undefined symbol: ncclCommRegister` error you should install torch 2.1.2 instead: `pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121`
 
-Also nice to have: `pip install transformers datasets python-etcd tensorboardX`
-
-We also support a set of flavors that you can install using `pip install -e [$FLAVOR]`:
-- `dev`: Used is you are developping in `nanotron`. It installs in particular our linter mechanism. On top of that you have to run `pre-commit install` afterwards.
-- `test`: We use `pytest` in order to run out testing suite. In order to run tests in parallel, it will install `pytest-xdist`, which you can leverage by running `pytest -n 12 tests` (12 is the number of parallel test)
-
-
-# Quick examples
-
-In the `/examples` directory, you can find a few example configuration file, and a script to run it.
-
-You can run a sample training using:
-```bash
-torchrun --nproc_per_node=8 run_train.py --config-file examples/train_tiny_llama.sh
-```
+> [!TIP]
+> We log to wandb automatically if it's installed. For that you can use `pip install wandb`. If you don't want to use wandb, you can run `wandb disabled`.
 
-And run a sample generation using:
+## Quick Start
+### Training a tiny Llama model
+The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the `checkpoints` directory as specified in the config file.
 ```bash
-torchrun --nproc_per_node=8 run_generation.py --ckpt-path checkpoints/text/4
+CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
 ```
 
-# Development guidelines
-
-If you plan on developing on `nanotron`, we suggest you install the `dev` flavor: `pip install -e ".[dev]"`
-
-We use pre-commit to run a bunch of callbacks on each commit, mostly normalization code in order for the codebase to stay consistent. Please do run `pre-commit install`.
-
-For the linting:
+### Run generation from your checkpoint
 ```bash
-pre-commit install
-pre-commit run --config .pre-commit-config.yaml --all-files
+torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --pp 1 --tp 1
 ```
+> [!TIP]
+> We could set a larger TP for faster generation, and a larger PP in case of very large models.
 
-*As a part of making sure we aren't slowed down as the codebase grows, we will not merge a PR if the features it introduces do not have test coverage.*
-
-We have extensions built on top of Nanotron, with their tests located in the `/examples` folder. Since VSCode defaults to discovering tests only in the `/tests` folder, please run tests from both `/examples` and `/tests` to ensure your PR does not break these extensions. Please run `make tests` to execute all the nanotron tests and the tests in the `/examples` directory that you need to pass.
-
-Features we would like to add:
-- [ ] Support `torch.compile`
-- [ ] More optimized kernels
-- [ ] Support Zero3
-- [ ] Other PP schedules (such as Interleaved 1f1b...)
-- [ ] Ring attention / Sequence Parallelism
-- [ ] 3D Parallel MoEs
-- [ ] Supporting more architectures (Mamba..)
-- [ ] ...
-
-
-# Useful scripts
-- `scripts/log_lighteval_to_wandb.py`: logs the evaluation results of LightEval to wandb, including summary statistics.
-
-
-# Environment Variables
-- `NANOTRON_BENCHMARK=1`: if you want to log the throughput during training
-
-
-# Credits
-
-We would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, HazyResearch for `flash-attn`
+## Config file description
````
