[Feature] Add loading different datasets based on training stages #80

xrsrke · 2024-02-25T09:14:29Z

Reproduce

Step 1: Modify your config:

Use a single dataset for the entire training

data:
  dataset:
    dataset_overwrite_cache: false
    dataset_processing_num_proc_per_process: 1
    hf_dataset_config_name: null
    hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
    hf_dataset_splits: train
    text_column_name: completion

Use different datasets based on training stages

# NOTE: use this form to load a different dataset for each stage of training
data:
  dataset_stages:
    - name: Stable Training Stage
      training_steps: 1
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
        hf_dataset_splits: train
        text_column_name: completion
    - name: Annealing Phase
      training_steps: 10
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: stas/c4-en-10k
        hf_dataset_splits: train
        text_column_name: text
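
To illustrate how a trainer could pick the active dataset from a `dataset_stages` list like the one above, here is a minimal sketch. It is not the code from this PR. It assumes that `training_steps` marks the step at which a stage begins, which the description does not confirm, and the names `DatasetStage`, `start_step`, and `stage_for_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetStage:
    """One entry of `dataset_stages` (hypothetical, simplified shape)."""
    name: str
    start_step: int  # ASSUMPTION: copied from `training_steps` in the config
    dataset_name: str  # copied from `hf_dataset_or_datasets`


def stage_for_step(stages: List[DatasetStage], step: int) -> DatasetStage:
    """Return the latest stage whose start_step is <= step."""
    active = None
    for stage in sorted(stages, key=lambda s: s.start_step):
        if stage.start_step <= step:
            active = stage
    if active is None:
        raise ValueError(f"no dataset stage covers training step {step}")
    return active


# Values taken from the example config in this PR description.
stages = [
    DatasetStage("Stable Training Stage", 1, "HuggingFaceH4/testing_alpaca_small"),
    DatasetStage("Annealing Phase", 10, "stas/c4-en-10k"),
]

if __name__ == "__main__":
    for step in (1, 9, 10, 11):
        stage = stage_for_step(stages, step)
        print(step, stage.name, stage.dataset_name)
```

Under that assumption, steps 1 through 9 would read from `HuggingFaceH4/testing_alpaca_small` and step 10 onward from `stas/c4-en-10k`. If `training_steps` is instead a per-stage duration, the start steps would have to be accumulated before the lookup.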

Step 2: Launch training:

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml

NouamaneTazi

Great job! Left some questions.

examples/config_tiny_llama.yaml

src/nanotron/config/config.py

run_train.py

src/nanotron/trainer.py

xrsrke added 2 commits February 25, 2024 08:24

add loading dataset by training stages

4e43b82

add loading different datasets based on training stages

e401fb6

xrsrke marked this pull request as ready for review February 25, 2024 09:14

xrsrke requested a review from NouamaneTazi February 25, 2024 09:14

NouamaneTazi reviewed Feb 26, 2024

src/nanotron/trainer.py (outdated, resolved)

xrsrke added 2 commits March 20, 2024 12:50

clear dataloader from memory and refactor

18a05d3

Merge branch 'main' into xrsrke/training_stages

e34efa4

xrsrke closed this Mar 21, 2024
