Commit
First release readme (pytorch#227)
Reworked readme to highlight the first release and feature set.
Question: use our logo? (I think it adds some spark.)

Visual preview:
<img width="898" alt="Screenshot 2024-04-14 at 7 02 39 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/60b4b6a8-c4f3-41a8-8d8d-27b924f8de15">
lessw2020 authored Apr 16, 2024
1 parent f86bfb2 commit a10262a
Showing 3 changed files with 39 additions and 11 deletions.
45 changes: 36 additions & 9 deletions README.md
@@ -1,18 +1,45 @@
# torchtitan
<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://github.com/lessw2020/TorchTitan/blob/1ab9828ae6aa0e6508d9a7002d743d96d85e8599/assets/images/TorchTitan_logo_main.jpg">
<img alt="TorchTitan_Logo" width=35%>
</picture>
</p>

Note: This repository is currently under heavy development.
## torchtitan is still in pre-release!
`torchtitan` is currently in a pre-release state and under extensive development.

`torchtitan` is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to and not a replacement for any of the great large-scale LLM training codebases such as Megatron, Megablocks, LLM Foundry, Deepspeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.
`torchtitan` is a native PyTorch reference architecture showcasing some of the latest PyTorch techniques for large-scale model training.
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D, 2D, or (soon) 3D Parallel.
* Modular components instead of monolithic codebase.
* Get started in minutes, not hours!

## Design Principles
Please note: `torchtitan` is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to and not a replacement for any of the great large-scale LLM training codebases such as Megatron, Megablocks, LLM Foundry, Deepspeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.

While torchtitan utilizes the PyTorch ecosystem for things like data loading (e.g., HuggingFace datasets), the core functionality is written in PyTorch.
## Pre-Release Updates:
#### (4/16/2024): TorchTitan is now public, but in a pre-release state and under development. Currently, we showcase pre-training Llama2 models (LLMs) of various sizes from scratch.

Key features available:<br>
1 - [FSDP2 (per-param sharding)](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md) <br>
2 - Tensor Parallel (FSDP + Tensor Parallel) <br>
3 - Selective layer and op activation checkpointing <br>
4 - Distributed checkpointing (async pending) <br>
5 - 3 datasets pre-configured (47K - 144M) <br>
6 - GPU usage, MFU, tokens per second, and other metrics are all reported and displayed via TensorBoard. <br>
7 - Optional fused RMSNorm, learning rate scheduler, meta init, and more. <br>
8 - All options easily configured via TOML files. <br>


## Coming soon:
1 - Async checkpointing <br>
2 - FP8 support <br>
3 - Context Parallel <br>
4 - 3D parallelism (Pipeline Parallel) <br>
5 - `torch.compile` support <br>

* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code, when applying 1D/2D or 3D Parallelisms.
* Modular components instead of monolithic codebase

# Installation
## Installation

Install PyTorch from source or install the latest PyTorch nightly, then install the requirements.
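As a rough illustration (the CUDA wheel index URL and the `requirements.txt` path below are assumptions, not taken from this commit), the sequence might look like:

```bash
# install a recent PyTorch nightly build (CUDA 12.1 wheel used here only as an example)
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# install the repo's remaining Python dependencies (assumes a requirements.txt at the repo root)
pip3 install -r requirements.txt
```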

@@ -31,7 +58,7 @@ run the llama debug model locally to verify the setup is correct:
./run_llama_train.sh
```

# TensorBoard
## TensorBoard

To visualize TensorBoard metrics of models trained on a remote server via a local web browser:
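One common pattern (the user, host, and log directory below are illustrative placeholders, not values from this repo) is to forward the TensorBoard port over SSH and open it in a local browser:

```bash
# on the local machine: forward local port 6006 to port 6006 on the training host
ssh -L 6006:127.0.0.1:6006 <user>@<remote-host>

# on the remote host: serve the TensorBoard log directory written during training
tensorboard --logdir <path-to-tb-logs> --port 6006

# then open http://localhost:6006 in the local browser
```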

1 change: 1 addition & 0 deletions assets/images/readme.md
@@ -0,0 +1 @@
images folder for main repo
4 changes: 2 additions & 2 deletions train.py
@@ -390,8 +390,8 @@ def loss_fn(pred, labels):
)

if torch.distributed.get_rank() == 0:
    logger.info("Sleeping 1 second for other ranks to complete")
    time.sleep(1)
    logger.info("Sleeping for 2 seconds for other ranks to complete")
    time.sleep(2)

metric_logger.close()
logger.info("Training completed")
