Add CI/CD for unit tests #41

Merged on Feb 16, 2024 (107 commits)
1c79951
add CI/CD for unit tests
xrsrke Jan 19, 2024
04491d3
fix
xrsrke Jan 19, 2024
fdd5d1e
fix syntax
xrsrke Jan 19, 2024
91208dd
fix
xrsrke Jan 19, 2024
8da087d
fix
xrsrke Jan 19, 2024
00875c0
update actions/checkout
xrsrke Jan 19, 2024
cca7e56
new runner label
glegendre01 Jan 19, 2024
338c042
fix typo
glegendre01 Jan 19, 2024
0c6433c
add workflow dispatch
glegendre01 Jan 19, 2024
6de2472
remove path filter for triggering
glegendre01 Jan 19, 2024
79b22d8
test ci
xrsrke Jan 23, 2024
c73623b
update python version
xrsrke Jan 23, 2024
5efc135
add code quality
xrsrke Jan 23, 2024
4fb80a4
refactor
xrsrke Jan 23, 2024
ceb21c2
only check src
xrsrke Jan 23, 2024
05aa557
fix
xrsrke Jan 23, 2024
0010cfa
use docker image
xrsrke Jan 23, 2024
dba1eed
fix
xrsrke Jan 23, 2024
b2af5d0
use python 10
xrsrke Jan 23, 2024
8914de7
change docker image
xrsrke Jan 24, 2024
368beba
fix pip install
xrsrke Jan 24, 2024
565e081
add fa2-related tests
xrsrke Jan 24, 2024
7b38326
fix
xrsrke Jan 24, 2024
906477b
update FA2 version
xrsrke Jan 24, 2024
4491ce7
add on push
xrsrke Jan 24, 2024
5b22ede
update FA2 to flash-attn>=2.5.0
xrsrke Jan 24, 2024
5f3ce67
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 29, 2024
9a03a04
add searching for free ports in unit tests
xrsrke Jan 29, 2024
1cf4da2
remove searching port
xrsrke Jan 29, 2024
f6d9847
move searching ports to distributed
xrsrke Jan 29, 2024
f675daf
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
0908b74
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
df7cb9d
Update distributed.py
xrsrke Jan 29, 2024
839677a
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 29, 2024
b631186
Update 3d_parallelism_unit_tests.yaml
xrsrke Jan 30, 2024
128eea5
Update distributed.py
xrsrke Jan 30, 2024
f96808a
Refactor test_clip_grads_with_tp parameters
NouamaneTazi Jan 31, 2024
d123d1b
Skip test cases for ALL_REDUCE mode with async communication
NouamaneTazi Jan 31, 2024
b899564
Update init_method to use env://localhost:port
NouamaneTazi Jan 31, 2024
ff32ddb
tests run for all PRs
NouamaneTazi Jan 31, 2024
abe42c6
Update branch filter in GitHub workflows
NouamaneTazi Jan 31, 2024
0a754a1
skip ALL_REDUCE with async comm
NouamaneTazi Jan 31, 2024
5d822bb
make sure total_norm in clip grad is a scalar
NouamaneTazi Jan 31, 2024
e5e2045
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Jan 31, 2024
5d9652a
refactor
xrsrke Jan 31, 2024
063020a
zeros([]
NouamaneTazi Feb 1, 2024
741966b
Merge pull request #52 from huggingface/nouamane/fix_ci
NouamaneTazi Feb 1, 2024
e2ed85f
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
91234fa
exclude sanity_checks.py from CoL
xrsrke Feb 1, 2024
a57cb9b
Merge branch 'main' of github.com:huggingface/nanotron into xrsrke/se…
xrsrke Feb 10, 2024
8a98cfc
fix expectation
xrsrke Feb 10, 2024
29672db
remove empty context manager in tp tests
xrsrke Feb 10, 2024
0a34e65
add reruning a tests if a port is in used
xrsrke Feb 10, 2024
e3c3d11
fix checking total_norm should be a scalar
xrsrke Feb 10, 2024
63ca0d2
fix
xrsrke Feb 10, 2024
44c0e05
add more retrying
xrsrke Feb 10, 2024
b8eeb1e
fix clip grads
xrsrke Feb 10, 2024
b553c4e
remove testing dim in clip grads
xrsrke Feb 10, 2024
0b97c38
fuk
xrsrke Feb 10, 2024
8c7355e
refactor
xrsrke Feb 10, 2024
2a4e735
run tests in parallel
xrsrke Feb 10, 2024
d47555e
not run fa2
xrsrke Feb 10, 2024
3b70271
only run 5 tests in parallel
xrsrke Feb 10, 2024
30b8004
only run a test at a time
xrsrke Feb 10, 2024
51a804c
add forking RNG
xrsrke Feb 10, 2024
cec0c04
fix circular import
xrsrke Feb 10, 2024
f42a43e
fix rng
xrsrke Feb 10, 2024
5b375f5
remove parallel tests
xrsrke Feb 10, 2024
081b17d
add python random seed
xrsrke Feb 11, 2024
4dce881
remove dist test, and add destroying process group after running a test
xrsrke Feb 11, 2024
00bb0bf
fix
xrsrke Feb 11, 2024
957826e
edit
xrsrke Feb 11, 2024
dc65581
fix
xrsrke Feb 11, 2024
0fe7bdd
fix
xrsrke Feb 11, 2024
de52fc6
removing destroy pg
xrsrke Feb 11, 2024
f2afea3
add destroying parallel_context in unit tests
xrsrke Feb 11, 2024
97ebff4
ignore layer norm
xrsrke Feb 11, 2024
6a5fd81
wtf is going on
xrsrke Feb 11, 2024
9c7e1a7
add small run
xrsrke Feb 13, 2024
b2c71b0
run small with dist test
xrsrke Feb 13, 2024
0d21bba
debug missing destroy
xrsrke Feb 13, 2024
6bb69ff
fuck
xrsrke Feb 13, 2024
b39c831
f
xrsrke Feb 13, 2024
3bd346d
.
NouamaneTazi Feb 13, 2024
dd0079e
.
NouamaneTazi Feb 13, 2024
91cf7e3
try timeout-minutes and --rm
NouamaneTazi Feb 13, 2024
7e0fcce
try -v
NouamaneTazi Feb 13, 2024
6dcb73d
try
NouamaneTazi Feb 13, 2024
b64f04f
bring back parallel_context.destroy()
NouamaneTazi Feb 13, 2024
2d44ec7
add 3d tests
xrsrke Feb 14, 2024
5d03579
add all cicd
xrsrke Feb 14, 2024
ab09576
run parallel tests
xrsrke Feb 14, 2024
77e0764
only run 1 test
xrsrke Feb 14, 2024
f43687f
add directly spawning processes
xrsrke Feb 15, 2024
004e7f4
refactor spawn function as init_distributed
xrsrke Feb 15, 2024
558b341
please work
xrsrke Feb 15, 2024
98046f8
catch overlaping port from find_free_port
xrsrke Feb 15, 2024
d96c7fa
clean up
xrsrke Feb 15, 2024
f56f8a7
fix circular import
xrsrke Feb 15, 2024
a48b7bf
skip fp8 tests in FA2
xrsrke Feb 15, 2024
033aca9
update code quality
xrsrke Feb 15, 2024
d4c27e7
fix
xrsrke Feb 15, 2024
39e5846
fix
xrsrke Feb 15, 2024
6f7e4b2
remove uncessary files
xrsrke Feb 15, 2024
cd51bd9
fix search free poorts
xrsrke Feb 15, 2024
6c30d2c
set ParallelContext in wrapper
xrsrke Feb 16, 2024
c705f4d
remove uncessary comments
xrsrke Feb 16, 2024
removing destroy pg
xrsrke committed Feb 11, 2024
commit de52fc6fd7d765ea3f4b77258c016bed482eb148
6 changes: 0 additions & 6 deletions tests/helpers/utils.py
@@ -8,7 +8,6 @@
 from typing import Any, Callable, Dict, List, Optional, Tuple
 
 import torch.cuda
-import torch.distributed as dist
 from nanotron.parallel import ParallelContext
 from packaging import version
 from torch.distributed.launcher import elastic_launch
@@ -89,11 +88,6 @@ def __call__(self):
 
         self.func(*self.args, **self.kwargs)
 
-        # NOTE: after running the test, we free the port
-        if dist.is_initialized():
-            dist.barrier()
-            dist.destroy_process_group()
-
 
 def init_distributed(tp: int, dp: int, pp: int):
     def _init_distributed(func):
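
The comment in this diff talks about freeing the port after a test, and several commits in this PR deal with searching for free ports (one commit title names `find_free_port`). As a rough illustration only, here is a minimal sketch of the common way to ask the OS for an unused TCP port using the standard library. It is an assumption about the general technique, not nanotron's actual helper; the function bodies and the `is_port_free` name are hypothetical.

```python
import socket
from contextlib import closing


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is unused right now (illustrative helper)."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        # Binding to port 0 lets the OS pick any free ephemeral port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to the port succeeds at this moment (hypothetical helper)."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another socket already holds the port.
            return False
        return True
```

Note that the port is only known to be free at the moment of the check. Another process can claim it before the test binds to it, so callers that run tests concurrently would likely still need to retry on a bind failure.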