Commit 27cb51f

Merge branch 'main' into add_dataset_sample_num

seanzhang-zhichen authored May 24, 2024
2 parents d956041 + 3a023bc
Showing 29 changed files with 757 additions and 514 deletions.
17 changes: 10 additions & 7 deletions README.md
@@ -69,12 +69,12 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/

## Changelog

[24/05/20] We supported fine-tuning the **PaliGemma** series models. Note that PaliGemma models are pre-trained; you need to fine-tune them with the `gemma` template to enable chat completion.
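
A minimal sketch of what "fine-tune with the `gemma` template" means at the API level, using `get_template_and_fix_tokenizer` and `encode_oneturn` exactly as they appear in the `hf_engine.py` diff below; the call signature and checkpoint id are assumptions for illustration, not a documented entry point.

```python
from transformers import AutoTokenizer

from llamafactory.data import get_template_and_fix_tokenizer

# Checkpoint id is illustrative (the hub repo is gated); any Gemma-family
# tokenizer exercises the same template.
tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
template = get_template_and_fix_tokenizer(tokenizer, "gemma")  # signature assumed

# encode_oneturn pairs the user turn with an empty assistant turn, as the
# engines in this PR do internally.
messages = [
    {"role": "user", "content": "What is shown in the image?"},
    {"role": "assistant", "content": ""},
]
prompt_ids, _ = template.encode_oneturn(
    tokenizer=tokenizer, messages=messages, system=None, tools=None
)
print(len(prompt_ids))
```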

[24/05/18] We supported the **[KTO](https://arxiv.org/abs/2402.01306)** algorithm for preference learning. See [examples](examples/README.md) for usage.

[24/05/14] We supported training and inference on Ascend NPU devices. Check the [installation](#installation) section for details.

[24/05/13] We supported fine-tuning the **Yi-1.5** series models.

<details><summary>Full Changelog</summary>

[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See [examples](examples/README.md) for usage.
@@ -160,6 +160,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
| [LLaVA-1.5](https://huggingface.co/llava-hf) | 7B/13B | q_proj,v_proj | vicuna |
| [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B/8x22B | q_proj,v_proj | mistral |
| [OLMo](https://huggingface.co/allenai) | 1B/7B | q_proj,v_proj | - |
+| [PaliGemma](https://huggingface.co/google) | 3B | q_proj,v_proj | gemma |
| [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - |
| [Phi-3](https://huggingface.co/microsoft) | 3.8B | qkv_proj | phi |
| [Qwen](https://huggingface.co/Qwen) | 1.8B/7B/14B/72B | c_attn | qwen |
@@ -284,11 +285,11 @@ huggingface-cli login
| ------------ | ------- | --------- |
| python | 3.8 | 3.10 |
| torch | 1.13.1 | 2.2.0 |
-| transformers | 4.37.2 | 4.40.1 |
+| transformers | 4.37.2 | 4.41.0 |
| datasets | 2.14.3 | 2.19.1 |
-| accelerate | 0.27.2 | 0.30.0 |
-| peft | 0.9.0 | 0.10.0 |
-| trl | 0.8.1 | 0.8.6 |
+| accelerate | 0.27.2 | 0.30.1 |
+| peft | 0.9.0 | 0.11.1 |
+| trl | 0.8.2 | 0.8.6 |

| Optional | Minimum | Recommend |
| ------------ | ------- | --------- |
@@ -344,6 +345,8 @@ To enable FlashAttention-2 on the Windows platform, you need to install the prec

<details><summary>For Ascend NPU users</summary>

+Join [NPU user group](assets/wechat_npu.jpg).

To utilize Ascend NPU devices for (distributed) training and inference, you need to install the **[torch-npu](https://gitee.com/ascend/pytorch)** library and the **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**.

| Requirement | Minimum | Recommend |
@@ -356,7 +359,7 @@ To utilize Ascend NPU devices for (distributed) training and inference, you need
Docker image:

- 32GB: [Download page](http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html)
-- 64GB: Coming soon
+- 64GB: [Download page](http://mirrors.cn-central-221.ovaijisuan.com/detail/131.html)

Remember to use `ASCEND_RT_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES` to specify the device to use.
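
A sketch of the same device-selection rule from Python, for readers who export the variable in code rather than the shell; `torch_npu` is the library linked above, and the `torch.npu.device_count()` call is an assumption about its patched namespace.

```python
import os

# The Ascend runtime reads ASCEND_RT_VISIBLE_DEVICES (not CUDA_VISIBLE_DEVICES),
# so the mask must be set before torch_npu initializes any device.
os.environ.setdefault("ASCEND_RT_VISIBLE_DEVICES", "0,1")

import torch
import torch_npu  # noqa: F401  registers the "npu" device type with torch

print(torch.npu.device_count())  # expected to report 2 with the mask above
```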

17 changes: 10 additions & 7 deletions README_zh.md
@@ -69,12 +69,12 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd

## Changelog

[24/05/20] We supported fine-tuning the **PaliGemma** series models. Note that PaliGemma is a pre-trained model; you need to fine-tune it with the `gemma` template to equip it with chat capability.

[24/05/18] We supported the **[KTO](https://arxiv.org/abs/2402.01306)** preference alignment algorithm. See [examples](examples/README_zh.md) for usage.

[24/05/14] We supported training and inference on Ascend NPU devices. For details, see the [installation](#安装-llama-factory) section.

[24/05/13] We supported fine-tuning the Yi-1.5 series models.

<details><summary>Full Changelog</summary>

[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLM. See [examples](examples/README_zh.md) for usage.
@@ -160,6 +160,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
| [LLaVA-1.5](https://huggingface.co/llava-hf) | 7B/13B | q_proj,v_proj | vicuna |
| [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B/8x22B | q_proj,v_proj | mistral |
| [OLMo](https://huggingface.co/allenai) | 1B/7B | q_proj,v_proj | - |
+| [PaliGemma](https://huggingface.co/google) | 3B | q_proj,v_proj | gemma |
| [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - |
| [Phi-3](https://huggingface.co/microsoft) | 3.8B | qkv_proj | phi |
| [Qwen](https://huggingface.co/Qwen) | 1.8B/7B/14B/72B | c_attn | qwen |
@@ -284,11 +285,11 @@ huggingface-cli login
| ------------ | ------- | --------- |
| python | 3.8 | 3.10 |
| torch | 1.13.1 | 2.2.0 |
-| transformers | 4.37.2 | 4.40.1 |
+| transformers | 4.37.2 | 4.41.0 |
| datasets | 2.14.3 | 2.19.1 |
-| accelerate | 0.27.2 | 0.30.0 |
-| peft | 0.9.0 | 0.10.0 |
-| trl | 0.8.1 | 0.8.6 |
+| accelerate | 0.27.2 | 0.30.1 |
+| peft | 0.9.0 | 0.11.1 |
+| trl | 0.8.2 | 0.8.6 |

| Optional | Minimum | Recommend |
| ------------ | ------- | --------- |
@@ -344,6 +345,8 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

<details><summary>For Ascend NPU users</summary>

+Join the [NPU user group](assets/wechat_npu.jpg).

To use Ascend NPU devices for (distributed) training and inference, you need to install the **[torch-npu](https://gitee.com/ascend/pytorch)** library and the **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**.

| Requirement | Minimum | Recommend |
@@ -356,7 +359,7 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl
Docker image:

- 32GB: [Download page](http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html)
-- 64GB: Coming soon
+- 64GB: [Download page](http://mirrors.cn-central-221.ovaijisuan.com/detail/131.html)

Remember to use `ASCEND_RT_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES` to specify the device to use.

Binary file modified assets/wechat.jpg
Binary file added assets/wechat_npu.jpg
2 changes: 1 addition & 1 deletion data/README_zh.md
@@ -7,7 +7,7 @@
"hf_hub_url": "Hugging Face 的数据集仓库地址(若指定,则忽略 script_url 和 file_name)",
"ms_hub_url": "ModelScope 的数据集仓库地址(若指定,则忽略 script_url 和 file_name)",
"script_url": "包含数据加载脚本的本地文件夹名称(若指定,则忽略 file_name)",
"file_name": "该目录下数据集文件的名称(若上述参数未指定,则此项必需)",
"file_name": "该目录下数据集文件夹或文件的名称(若上述参数未指定,则此项必需)",
"formatting": "数据集格式(可选,默认:alpaca,可以为 alpaca 或 sharegpt)",
"ranking": "是否为偏好数据集(可选,默认:False)",
"subset": "数据集子集的名称(可选,默认:None)",
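
A hypothetical `dataset_info.json` entry assembled from the fields documented above; the dataset name and file name are placeholders, not files shipped with the repo.

```python
import json

dataset_info = {
    "my_dataset": {                      # placeholder dataset name
        "file_name": "my_dataset.json",  # folder or file under data/
        "formatting": "alpaca",          # alpaca or sharegpt
        "ranking": False,                # True only for preference data
        "subset": None,
    }
}
with open("dataset_info.json", "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=2)
```
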
3 changes: 2 additions & 1 deletion data/hh_rlhf_en/hh_rlhf_en.py
@@ -34,7 +34,8 @@ def _info(self) -> datasets.DatasetInfo:
        features = datasets.Features(
            {
                "instruction": datasets.Value("string"),
-                "output": datasets.Sequence(datasets.Value("string")),
+                "chosen": datasets.Value("string"),
+                "rejected": datasets.Value("string"),
                "history": datasets.Sequence(datasets.Sequence(datasets.Value("string"))),
            }
        )
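
The schema change above replaces the single `output` sequence with explicit `chosen`/`rejected` strings. A self-contained sketch of the new layout (the sample record is made up):

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "chosen": datasets.Value("string"),
        "rejected": datasets.Value("string"),
        "history": datasets.Sequence(datasets.Sequence(datasets.Value("string"))),
    }
)
example = {
    "instruction": "How do I bake bread?",
    "chosen": "Start with flour, water, salt and yeast...",
    "rejected": "I don't know.",
    "history": [],
}
ds = datasets.Dataset.from_list([example], features=features)
print(ds[0]["rejected"])
```
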
25 changes: 19 additions & 6 deletions src/llamafactory/chat/hf_engine.py
@@ -8,6 +8,7 @@
from transformers import GenerationConfig, TextIteratorStreamer

from ..data import get_template_and_fix_tokenizer
+from ..extras.constants import IMAGE_TOKEN
from ..extras.misc import get_logits_processor
from ..model import load_model, load_tokenizer
from .base_engine import BaseEngine, Response
@@ -55,14 +56,28 @@ def _process_args(
        image: Optional["NDArray"] = None,
        input_kwargs: Optional[Dict[str, Any]] = {},
    ) -> Tuple[Dict[str, Any], int]:
-        if processor is not None and image is not None and "<image>" not in messages[0]["content"]:
-            messages[0]["content"] = "<image>" + messages[0]["content"]
+        if (
+            processor is not None
+            and image is not None
+            and not hasattr(processor, "image_seq_length")
+            and IMAGE_TOKEN not in messages[0]["content"]
+        ):  # llava-like models
+            messages[0]["content"] = IMAGE_TOKEN + messages[0]["content"]

        paired_messages = messages + [{"role": "assistant", "content": ""}]
        system = system or generating_args["default_system"]
+        pixel_values = None
        prompt_ids, _ = template.encode_oneturn(
            tokenizer=tokenizer, messages=paired_messages, system=system, tools=tools
        )
+        if processor is not None and image is not None:  # add image features
+            image_processor: "BaseImageProcessor" = getattr(processor, "image_processor")
+            batch_feature = image_processor(image, return_tensors="pt")
+            pixel_values = batch_feature.to(model.device)["pixel_values"]  # shape (B, C, H, W)
+            if hasattr(processor, "image_seq_length"):  # paligemma models
+                image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
+                prompt_ids = [image_token_id] * getattr(processor, "image_seq_length") + prompt_ids

        prompt_length = len(prompt_ids)
        inputs = torch.tensor([prompt_ids], device=model.device)
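
The two branches above encode different image-token conventions; here is a compact restatement (names follow the diff, and the `<image>` literal is an assumption about the value of `IMAGE_TOKEN` in `extras.constants`):

```python
from typing import Any, Dict, List, Tuple

IMAGE_TOKEN = "<image>"  # assumed value of ..extras.constants.IMAGE_TOKEN


def prepend_image_tokens(
    messages: List[Dict[str, str]],
    prompt_ids: List[int],
    processor: Any,
    tokenizer: Any,
) -> Tuple[List[Dict[str, str]], List[int]]:
    if hasattr(processor, "image_seq_length"):  # paligemma-like
        # Prepend image placeholder ids directly to the encoded prompt.
        image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
        prompt_ids = [image_token_id] * processor.image_seq_length + prompt_ids
    elif IMAGE_TOKEN not in messages[0]["content"]:  # llava-like
        # Prepend the literal token to the text; in the real code this
        # happens before template encoding, as the diff above shows.
        messages[0]["content"] = IMAGE_TOKEN + messages[0]["content"]
    return messages, prompt_ids
```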

@@ -122,10 +137,8 @@
            logits_processor=get_logits_processor(),
        )

-        if processor is not None and image is not None:
-            image_processor: "BaseImageProcessor" = getattr(processor, "image_processor")
-            pixel_values: "torch.Tensor" = image_processor(image, return_tensors="pt")["pixel_values"]
-            gen_kwargs["pixel_values"] = pixel_values.to(model.device)
+        if pixel_values is not None:
+            gen_kwargs["pixel_values"] = pixel_values

        return gen_kwargs, prompt_length

29 changes: 18 additions & 11 deletions src/llamafactory/chat/vllm_engine.py
@@ -2,6 +2,7 @@
from typing import TYPE_CHECKING, AsyncGenerator, AsyncIterator, Dict, List, Optional, Sequence, Union

from ..data import get_template_and_fix_tokenizer
+from ..extras.constants import IMAGE_TOKEN
from ..extras.logging import get_logger
from ..extras.misc import get_device_count, infer_optim_dtype
from ..extras.packages import is_vllm_available
@@ -17,7 +18,6 @@


if TYPE_CHECKING:
-    import torch
    from numpy.typing import NDArray
    from transformers.image_processing_utils import BaseImageProcessor

@@ -67,7 +67,7 @@ def __init__(
            patch_size = config.vision_config.patch_size
            self.image_feature_size = (image_size // patch_size) ** 2
            engine_args["image_input_type"] = "pixel_values"
-            engine_args["image_token_id"] = self.tokenizer.convert_tokens_to_ids("<image>")
+            engine_args["image_token_id"] = self.tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
            engine_args["image_input_shape"] = "1,3,{},{}".format(image_size, image_size)
            engine_args["image_feature_size"] = self.image_feature_size
        if getattr(config, "is_yi_vl_derived_model", None):
@@ -92,14 +92,28 @@ async def _generate(
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]:
        request_id = "chatcmpl-{}".format(uuid.uuid4().hex)
-        if self.processor is not None and image is not None and "<image>" not in messages[0]["content"]:
-            messages[0]["content"] = "<image>" * self.image_feature_size + messages[0]["content"]
+
+        if (
+            self.processor is not None
+            and image is not None
+            and not hasattr(self.processor, "image_seq_length")
+            and IMAGE_TOKEN not in messages[0]["content"]
+        ):  # llava-like models
+            messages[0]["content"] = IMAGE_TOKEN * self.image_feature_size + messages[0]["content"]

        paired_messages = messages + [{"role": "assistant", "content": ""}]
        system = system or self.generating_args["default_system"]
        prompt_ids, _ = self.template.encode_oneturn(
            tokenizer=self.tokenizer, messages=paired_messages, system=system, tools=tools
        )

+        if self.processor is not None and image is not None:  # add image features
+            image_processor: "BaseImageProcessor" = getattr(self.processor, "image_processor")
+            pixel_values = image_processor(image, return_tensors="pt")["pixel_values"]
+            multi_modal_data = MultiModalData(type=MultiModalData.Type.IMAGE, data=pixel_values)
+        else:
+            multi_modal_data = None
+
        prompt_length = len(prompt_ids)

        use_beam_search: bool = self.generating_args["num_beams"] > 1
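
For context, a hedged sketch of how these pieces reach vLLM (the `MultiModalData` import path and the `generate()` keyword arguments are assumptions based on the vLLM 0.4.x-era calls visible in this diff):

```python
from typing import Any, AsyncIterator, List

from vllm.sequence import MultiModalData  # import path assumed for vLLM 0.4.x


async def generate_with_image(
    engine: Any,            # vllm.AsyncLLMEngine
    prompt_ids: List[int],  # encoded prompt, incl. any image placeholder ids
    sampling_params: Any,   # vllm.SamplingParams
    request_id: str,
    pixel_values: Any,      # tensor from the image processor, shape (B, C, H, W)
) -> AsyncIterator[Any]:
    # Wrap pixel values the way the diff above does, then stream outputs.
    mm_data = MultiModalData(type=MultiModalData.Type.IMAGE, data=pixel_values)
    async for output in engine.generate(
        prompt=None,
        sampling_params=sampling_params,
        request_id=request_id,
        prompt_token_ids=prompt_ids,
        multi_modal_data=mm_data,
    ):
        yield output
```
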
@@ -144,13 +158,6 @@
            skip_special_tokens=True,
        )

-        if self.processor is not None and image is not None:
-            image_processor: "BaseImageProcessor" = getattr(self.processor, "image_processor")
-            pixel_values: "torch.Tensor" = image_processor(image, return_tensors="pt")["pixel_values"]
-            multi_modal_data = MultiModalData(type=MultiModalData.Type.IMAGE, data=pixel_values)
-        else:
-            multi_modal_data = None

        result_generator = self.model.generate(
            prompt=None,
            sampling_params=sampling_params,
80 changes: 37 additions & 43 deletions src/llamafactory/data/collator.py
@@ -1,5 +1,5 @@
from dataclasses import dataclass
-from typing import Any, Dict, List, Sequence, Tuple
+from typing import Any, Dict, Sequence

import torch
from transformers import DataCollatorForSeq2Seq
@@ -11,21 +11,6 @@ class PairwiseDataCollatorWithPadding(DataCollatorForSeq2Seq):
    Data collator for pairwise data.
    """

-    def _pad_labels(self, batch: torch.Tensor, positions: List[Tuple[int, int]]) -> torch.Tensor:
-        r"""
-        Masks out the input ids except for the responses.
-        """
-        padded_labels = []
-        for feature, (prompt_len, answer_len) in zip(batch, positions):
-            if self.tokenizer.padding_side == "left":
-                start, end = feature.size(0) - answer_len, feature.size(0)
-            else:
-                start, end = prompt_len, prompt_len + answer_len
-            padded_tensor = self.label_pad_token_id * torch.ones_like(feature)
-            padded_tensor[start:end] = feature[start:end]
-            padded_labels.append(padded_tensor)
-        return torch.stack(padded_labels, dim=0).contiguous()  # in contiguous memory
-
    def __call__(self, features: Sequence[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        r"""
        Pads batched data to the longest sequence in the batch.
@@ -34,21 +19,22 @@ def __call__(self, features: Sequence[Dict[str, Any]]) -> Dict[str, torch.Tensor
        the last n examples represent rejected examples.
        """
        concatenated_features = []
-        label_positions = []
-        for key in ("chosen_ids", "rejected_ids"):
+        for key in ("chosen", "rejected"):
            for feature in features:
-                prompt_len, answer_len = len(feature["prompt_ids"]), len(feature[key])
-                concatenated_features.append(
-                    {
-                        "input_ids": feature["prompt_ids"] + feature[key],
-                        "attention_mask": [1] * (prompt_len + answer_len),
-                    }
-                )
-                label_positions.append((prompt_len, answer_len))
-
-        batch = super().__call__(concatenated_features)
-        batch["labels"] = self._pad_labels(batch["input_ids"], label_positions)
-        return batch
+                target_feature = {
+                    "input_ids": feature["{}_input_ids".format(key)],
+                    "attention_mask": feature["{}_attention_mask".format(key)],
+                    "labels": feature["{}_labels".format(key)],
+                }
+                if "pixel_values" in feature:
+                    target_feature["pixel_values"] = feature["pixel_values"]
+
+                if "{}_token_type_ids".format(key) in feature:
+                    target_feature["token_type_ids"] = feature["{}_token_type_ids".format(key)]
+
+                concatenated_features.append(target_feature)
+
+        return super().__call__(concatenated_features)
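
Toy usage of the rewritten collator, assuming `PairwiseDataCollatorWithPadding` above is importable; the key names follow the new `chosen_*`/`rejected_*` schema and the token ids are made up:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
collator = PairwiseDataCollatorWithPadding(tokenizer=tokenizer, label_pad_token_id=-100)

features = [
    {
        "chosen_input_ids": [1, 2, 3, 4],
        "chosen_attention_mask": [1, 1, 1, 1],
        "chosen_labels": [-100, -100, 3, 4],
        "rejected_input_ids": [1, 2, 5],
        "rejected_attention_mask": [1, 1, 1],
        "rejected_labels": [-100, -100, 5],
    }
]
batch = collator(features)
print(batch["input_ids"].shape)  # (2, 4): row 0 = chosen, row 1 = rejected (padded)
```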


@dataclass
Expand All @@ -62,26 +48,34 @@ def __call__(self, features: Sequence[Dict[str, Any]]) -> Dict[str, torch.Tensor
        kl_features = []
        kto_tags = []
        for feature in features:
-            target_features.append(
-                {
-                    "input_ids": feature["input_ids"],
-                    "attention_mask": feature["attention_mask"],
-                    "labels": feature["labels"],
-                }
-            )
-            kl_features.append(
-                {
-                    "input_ids": feature["kl_input_ids"],
-                    "attention_mask": feature["kl_attention_mask"],
-                    "labels": feature["kl_labels"],
-                }
-            )
+            target_feature = {
+                "input_ids": feature["input_ids"],
+                "attention_mask": feature["attention_mask"],
+                "labels": feature["labels"],
+            }
+            kl_feature = {
+                "input_ids": feature["kl_input_ids"],
+                "attention_mask": feature["kl_attention_mask"],
+                "labels": feature["kl_labels"],
+            }
+            if "pixel_values" in feature:
+                target_feature["pixel_values"] = feature["pixel_values"]
+
+            if "token_type_ids" in feature:
+                target_feature["token_type_ids"] = feature["token_type_ids"]
+                kl_feature["token_type_ids"] = feature["kl_token_type_ids"]
+
+            target_features.append(target_feature)
+            kl_features.append(kl_feature)
            kto_tags.append(feature["kto_tags"])

        batch = super().__call__(target_features)
        kl_batch = super().__call__(kl_features)
        batch["kl_input_ids"] = kl_batch["input_ids"]
        batch["kl_attention_mask"] = kl_batch["attention_mask"]
        batch["kl_labels"] = kl_batch["labels"]
+        if "token_type_ids" in batch:
+            batch["kl_token_type_ids"] = kl_batch["token_type_ids"]
+
        batch["kto_tags"] = torch.tensor(kto_tags)
        return batch
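
Toy usage of the KTO collator, assuming the class above is the repo's `KTODataCollatorWithPadding` (its name sits outside the visible hunk) and reusing the tokenizer setup from the previous sketch; the `kl_*` sequence pairs the prompt with an unrelated completion, and `kto_tags` marks whether the target is desirable:

```python
kto_collator = KTODataCollatorWithPadding(tokenizer=tokenizer, label_pad_token_id=-100)

features = [
    {
        "input_ids": [1, 2, 3],
        "attention_mask": [1, 1, 1],
        "labels": [-100, 2, 3],
        "kl_input_ids": [1, 7, 8, 9],
        "kl_attention_mask": [1, 1, 1, 1],
        "kl_labels": [-100, 7, 8, 9],
        "kto_tags": True,
    }
]
batch = kto_collator(features)
print(sorted(batch.keys()))
# ['attention_mask', 'input_ids', 'kl_attention_mask', 'kl_input_ids',
#  'kl_labels', 'kto_tags', 'labels']
```
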
8 changes: 6 additions & 2 deletions src/llamafactory/data/loader.py
@@ -2,6 +2,7 @@
import os
import numpy as np
from numpy.random import RandomState
+import sys
from typing import TYPE_CHECKING, Literal, Optional, Union

from datasets import load_dataset, load_from_disk
@@ -180,12 +181,15 @@ def get_dataset(
logger.info("Tokenized dataset saved at {}.".format(data_args.tokenized_path))
logger.info("Please restart the training with `--tokenized_path {}`.".format(data_args.tokenized_path))

exit(0)
sys.exit(0)

if training_args.should_log:
try:
print_function(next(iter(dataset)))
except StopIteration:
raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
if stage == "pt":
raise RuntimeError("Cannot find sufficient samples, consider increasing dataset size.")
else:
raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")

return dataset
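
The guard above, restated in isolation: peek at the first tokenized example and fail fast with a stage-specific message when the dataset came out empty (messages copied from the diff):

```python
from typing import Any, Iterable


def peek_first_sample(dataset: Iterable[Any], stage: str) -> Any:
    try:
        return next(iter(dataset))
    except StopIteration:
        if stage == "pt":  # pre-training: likely too little raw text
            raise RuntimeError("Cannot find sufficient samples, consider increasing dataset size.")
        raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
```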