* added peft req
* added a dev-requirements file and fixed formatting
* added functionality to reset dataloader after epoch finishes
* Mistral training now works (padding token was incorrectly set)
* added config ignore
* reverting gitignore
* fixed two bugs: (1) labels for bos/eos tokens weren't added properly if there was a sequence separator; (2) the attention mask held the eos/bos token id instead of a [1] for those tokens specifically
* added gotcha for separator in data preprocessing
* fixed small bug in truncation where eos token should be added after truncation
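
The BOS/EOS, attention-mask, and truncation fixes listed above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from this commit: the function name `truncate_then_add_bos_eos`, the choice to give `BOS`/`EOS` their own token ids as labels, and the `-100` ignore value are all hypothetical.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumption: the usual ignore value for cross-entropy loss


def truncate_then_add_bos_eos(
    input_ids: List[int],
    labels: List[int],
    max_seq_len: int,
    bos_id: int,
    eos_id: int,
) -> Dict[str, List[int]]:
    """Hypothetical helper: truncate first, then wrap with BOS/EOS."""
    if max_seq_len < 2:
        raise ValueError("max_seq_len must leave room for BOS and EOS")

    # Truncate the body first so that EOS is appended after truncation
    # and can never be cut off.
    body_len = max_seq_len - 2
    ids = input_ids[:body_len]
    lab = labels[:body_len]

    return {
        "input_ids": [bos_id] + ids + [eos_id],
        # Assumption: BOS/EOS are trained on, so their labels are their ids,
        # even when earlier positions are masked with IGNORE_INDEX.
        "labels": [bos_id] + lab + [eos_id],
        # The mask is 1 for every real token, including BOS/EOS
        # (not the token id itself).
        "attention_mask": [1] * (len(ids) + 2),
    }
```

For example, calling it with five body tokens and `max_seq_len=5` keeps the first three body tokens and still ends the sequence with `EOS`.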
docs/config.md: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ Similar to the wandb config above, these keyword parameters are fed directly int
 * `overlap`: When we chunk a data point during packing, we can choose to have some overlap between the current chunk and the next chunk. This might help the model understand surrounding context during training (although this isn't something we have empirically investigated, we keep this option available to users).
 * `add_bos_eos_tokens`: Whether to add `BOS` and `EOS` tokens as defined by the respective HuggingFace tokenizer. If using packing, these will be added after packing is done, so that each chunk of size `max_seq_len` has these tokens.
 * `from_disk`: Whether we are going to be loading the dataset to preprocess from disk (the other option is to download straight from HuggingFace).
-* `seperator`: If using conditional finetuning (i.e. in a given data point, everything before `separator` will not be used for calculating the loss and its labels will be `ignore_index`).
+* `seperator`: If using conditional finetuning (i.e. in a given data point, everything before `separator` will not be used for calculating the loss and its labels will be `ignore_index`). **Note:** if `separator` is not found in a given sequence, the default behavior is that datapoint will be skipped and not be a part of the final set.
 * `load_path`: The directory containing the HuggingFace dataset we are loading to preprocess.
 * `split`: If `load_path` is a dataset dictionary, `split` specifies which key in this dictionary contains the dataset we are preprocessing.
 * `save_path`: The directory we will be saving the processed dataset to.
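
The separator behavior described in the changed line can be illustrated with a minimal sketch. It is not the repository's implementation: `find_subsequence` and `mask_before_separator` are hypothetical names, the separator is assumed to be already tokenized, `-100` is assumed as `ignore_index`, and the sketch also masks the separator tokens themselves, which the docs do not specify.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # assumption: value used for `ignore_index`


def find_subsequence(haystack: List[int], needle: List[int]) -> int:
    """Return the start index of the first match of needle, or -1."""
    if not needle:
        return -1
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


def mask_before_separator(
    input_ids: List[int],
    separator_ids: List[int],
    ignore_index: int = IGNORE_INDEX,
) -> Optional[List[int]]:
    """Build labels for conditional finetuning.

    Returns None when the separator is missing, signalling that the
    data point should be skipped (the default behavior noted above).
    """
    start = find_subsequence(input_ids, separator_ids)
    if start == -1:
        return None
    # Assumption: the separator itself is masked along with the prefix.
    cut = start + len(separator_ids)
    return [ignore_index] * cut + input_ids[cut:]


def build_labels(dataset: List[List[int]], separator_ids: List[int]) -> List[List[int]]:
    """Keep only data points that contain the separator."""
    all_labels = (mask_before_separator(ids, separator_ids) for ids in dataset)
    return [labels for labels in all_labels if labels is not None]
```

A data point without the separator yields `None` and is dropped by `build_labels`, so the processed set can be smaller than the input set.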