
Commit

fix conflict
15797939668 committed May 20, 2024
2 parents 449e2aa + d52fae2 commit d956041
Showing 254 changed files with 55,061 additions and 2,445,708 deletions.
2 changes: 1 addition & 1 deletion .github/SECURITY.md
@@ -1,6 +1,6 @@
# Reporting Security Issues

To report a security issue, please use the GitHub Security Advisory ["Report a Vulnerability"](https://github.com/hiyouga/LLaMA-Factory/security/advisories/new) tab.

We will send a response indicating the next steps in handling your report. After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.

4 changes: 2 additions & 2 deletions Dockerfile
@@ -6,9 +6,9 @@
COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[metrics,bitsandbytes,qwen]

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "python", "src/train_web.py" ]
CMD [ "llamafactory-cli", "webui" ]
621 changes: 207 additions & 414 deletions README.md

Large diffs are not rendered by default.

596 changes: 208 additions & 388 deletions README_zh.md

Large diffs are not rendered by default.

Binary file modified assets/wechat.jpg
289 changes: 255 additions & 34 deletions data/README.md
@@ -1,24 +1,29 @@
The [dataset_info.json](dataset_info.json) contains all available datasets. If you are using a custom dataset, please **make sure** to add a *dataset description* in `dataset_info.json` and specify `dataset: dataset_name` before training to use it.

Currently we support datasets in **alpaca** and **sharegpt** format.

```json
"dataset_name": {
"hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
"ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
"ms_hub_url": "the name of the dataset repository on the Model Scope hub. (if specified, ignore script_url and file_name)",
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
"file_name": "the name of the dataset file in this directory. (required if above are not specified)",
"file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
"file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
"formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
"ranking": "whether the dataset is a preference dataset or not. (default: False)",
"subset": "the name of the subset. (optional, default: None)",
"folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
"ranking": "whether the dataset is a preference dataset or not. (default: false)",
"formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
"columns (optional)": {
"prompt": "the column name in the dataset containing the prompts. (default: instruction)",
"query": "the column name in the dataset containing the queries. (default: input)",
"response": "the column name in the dataset containing the responses. (default: output)",
"history": "the column name in the dataset containing the histories. (default: None)",
"messages": "the column name in the dataset containing the messages. (default: conversations)",
"system": "the column name in the dataset containing the system prompts. (default: None)",
"tools": "the column name in the dataset containing the tool description. (default: None)"
"tools": "the column name in the dataset containing the tool description. (default: None)",
"images": "the column name in the dataset containing the image inputs. (default: None)",
"chosen": "the column name in the dataset containing the chosen answers. (default: None)",
"rejected": "the column name in the dataset containing the rejected answers. (default: None)",
"kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
},
"tags (optional, used for the sharegpt format)": {
"role_tag": "the key in the message represents the identity. (default: from)",
"content_tag": "the key in the message that represents the content. (default: value)",
"user_tag": "the value of the role_tag that represents the user. (default: human)",
"assistant_tag": "the value of the role_tag that represents the assistant. (default: gpt)",
"observation_tag": "the value of the role_tag that represents the tool results. (default: observation)",
"function_tag": "the value of the role_tag that represents the function call. (default: function_call)",
"system_tag": "the value of the role_tag that represents the system prompt. (default: system)"
}
}
```
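
To sanity-check a custom entry against the fields above, a small script can load `dataset_info.json` and verify the required keys. This is a minimal sketch, not part of LLaMA-Factory; the file path and the exact checks are illustrative assumptions.

```python
import json

# one of these keys must tell the loader where the data lives
SOURCE_KEYS = ("hf_hub_url", "ms_hub_url", "script_url", "file_name")

with open("data/dataset_info.json", encoding="utf-8") as f:
    dataset_info = json.load(f)

for name, desc in dataset_info.items():
    if not any(key in desc for key in SOURCE_KEYS):
        print(f"{name}: missing one of {SOURCE_KEYS}")
    # formatting defaults to alpaca and must be one of the two supported values
    if desc.get("formatting", "alpaca") not in ("alpaca", "sharegpt"):
        print(f"{name}: unknown formatting {desc['formatting']!r}")
```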

## Alpaca Format

### Supervised Fine-Tuning Dataset

- [Example dataset](alpaca_en_demo.json)

In supervised fine-tuning, the `instruction` column is concatenated with the `input` column and used as the human prompt, so the prompt becomes `instruction\ninput`. The `output` column holds the model response.

The `system` column will be used as the system prompt if specified.

The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.

```json
[
{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"instruction": "human instruction (required)",
"input": "human input (optional)",
"output": "model response (required)",
"system": "system prompt (optional)",
"history": [
["user instruction in the first round (optional)", "model response in the first round (optional)"],
["user instruction in the second round (optional)", "model response in the second round (optional)"]
["human instruction in the first round (optional)", "model response in the first round (optional)"],
["human instruction in the second round (optional)", "model response in the second round (optional)"]
]
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"system": "system",
"history": "history"
}
}
```
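
As an illustration of the concatenation rule above, the sketch below flattens one alpaca record into (prompt, response) pairs, including the history turns. It is a simplified stand-in for the actual preprocessing, and the sample record is made up.

```python
def to_pairs(example: dict) -> list[tuple[str, str]]:
    # history turns come first and are also learned during SFT
    pairs = [(p, r) for p, r in example.get("history", [])]
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]  # the human prompt becomes "instruction\ninput"
    pairs.append((prompt, example["output"]))
    return pairs

record = {
    "instruction": "Translate to French",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Translate to German", "Guten Morgen"]],
}
print(to_pairs(record))
# [('Translate to German', 'Guten Morgen'), ('Translate to French\nGood morning', 'Bonjour')]
```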

### Pre-training Dataset

- [Example dataset](c4_demo.json)

In pre-training, only the `text` column will be used for model learning.

```json
[
{"text": "document"},
{"text": "document"}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "text"
}
}
```
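
As a hypothetical helper for producing this format, the snippet below packs plain-text files into a list of `{"text": ...}` objects; the `corpus` directory and the output path are assumptions.

```python
import json
import pathlib

# read every .txt document and wrap it in the pre-training record shape
docs = [{"text": path.read_text(encoding="utf-8")} for path in pathlib.Path("corpus").glob("*.txt")]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(docs, f, ensure_ascii=False, indent=2)
```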

### Preference Dataset

Preference datasets are used for reward modeling, DPO training and ORPO training.

They require a better response in the `chosen` column and a worse response in the `rejected` column.

```json
[
{
"instruction": "human instruction (required)",
"input": "human input (optional)",
"chosen": "chosen answer (required)",
"rejected": "rejected answer (required)"
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"ranking": true,
"columns": {
"prompt": "instruction",
"query": "input",
"chosen": "chosen",
"rejected": "rejected"
}
}
```
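
For illustration, pairwise comparison data can be rewritten into this layout with a few lines; the `comparisons` input structure below is an assumption, not anything LLaMA-Factory prescribes.

```python
import json

# assumed input: one better/worse answer pair per prompt
comparisons = [
    {"question": "Name a prime number.", "better": "2 is a prime number.", "worse": "4 is a prime number."},
]

records = [
    {"instruction": c["question"], "input": "", "chosen": c["better"], "rejected": c["worse"]}
    for c in comparisons
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```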

### KTO Dataset

- [Example dataset](kto_en_demo.json)

KTO datasets require an extra `kto_tag` column containing boolean human feedback.

```json
[
{
"instruction": "human instruction (required)",
"input": "human input (optional)",
"output": "model response (required)",
"kto_tag": "human feedback [true/false] (required)"
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"kto_tag": "kto_tag"
}
}
```
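
As a sketch, logged thumbs-up/down feedback can be converted into KTO records like so; the `feedback_log` structure is an assumption for illustration.

```python
import json

# assumed input: logged responses with boolean thumbs-up feedback
feedback_log = [
    {"prompt": "What is 2 + 2?", "response": "4", "thumbs_up": True},
    {"prompt": "What is 2 + 2?", "response": "5", "thumbs_up": False},
]

records = [
    {"instruction": fb["prompt"], "input": "", "output": fb["response"], "kto_tag": fb["thumbs_up"]}
    for fb in feedback_log
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```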

### Multimodal Dataset

- [Example dataset](mllm_demo.json)

Multimodal datasets require an `images` column containing the paths to the input images. Currently only one image per record is supported.

```json
[
{
"instruction": "human instruction (required)",
"input": "human input (optional)",
"output": "model response (required)",
"images": [
"image path (required)"
]
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"images": "images"
}
}
```
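
A quick validation pass (illustrative, not part of the library) can confirm each record carries exactly one existing image path:

```python
import json
import os

with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

for i, record in enumerate(data):
    images = record.get("images", [])
    if len(images) != 1:
        print(f"record {i}: expected exactly one image, found {len(images)}")
    elif not os.path.exists(images[0]):
        print(f"record {i}: image file not found: {images[0]}")
```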

## Sharegpt Format

### Supervised Fine-Tuning Dataset

- [Example dataset](glaive_toolcall_en_demo.json)

Compared to the alpaca format, the sharegpt format allows datasets to have **more roles**, such as human, gpt, observation and function. They are presented as a list of objects in the `conversations` column.

Note that messages from human and observation must appear at odd positions, while messages from gpt and function_call must appear at even positions.

```json
[
{
"conversations": [
{
"from": "human",
"value": "user instruction"
"value": "human instruction"
},
{
"from": "function_call",
"value": "tool arguments"
},
{
"from": "observation",
"value": "tool result"
},
{
"from": "gpt",
"value": "model response"
}
],
"system": "system prompt (optional)",
"tools": "tool description (optional)"
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"system": "system",
"tools": "tools"
}
}
```
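
The odd/even rule above is easy to check mechanically. The sketch below is illustrative, taking positions as 1-based as in the text (list index 0 is the first position):

```python
import json

ODD_ROLES = {"human", "observation"}    # 1st, 3rd, 5th, ... messages
EVEN_ROLES = {"gpt", "function_call"}   # 2nd, 4th, 6th, ... messages

with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

for i, record in enumerate(data):
    for index, message in enumerate(record["conversations"]):
        expected = ODD_ROLES if index % 2 == 0 else EVEN_ROLES
        if message["from"] not in expected:
            print(f"record {i}, message {index + 1}: unexpected role {message['from']!r}")
```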

### Preference Dataset

- [Example dataset](dpo_en_demo.json)

Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.

```json
[
{
"conversations": [
{
"from": "human",
"value": "human instruction"
},
{
"from": "gpt",
"value": "model response"
},
{
"from": "human",
"value": "human instruction"
}
],
"chosen": {
"from": "gpt",
"value": "chosen answer (required)"
},
"rejected": {
"from": "gpt",
"value": "rejected answer (required)"
}
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"ranking": true,
"columns": {
"messages": "conversations",
"chosen": "chosen",
"rejected": "rejected"
}
}
```
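
For illustration, such a record can be assembled from a shared conversation prefix and two candidate answers; all values below are made-up examples.

```python
import json

# a shared multi-turn prefix; chosen/rejected are alternative final replies
prefix = [
    {"from": "human", "value": "Suggest a name for a bakery."},
    {"from": "gpt", "value": "How about 'Rise and Shine'?"},
    {"from": "human", "value": "Something shorter, please."},
]

record = {
    "conversations": prefix,
    "chosen": {"from": "gpt", "value": "'Crumb'."},
    "rejected": {"from": "gpt", "value": "'Rise and Shine Artisanal Bread Company'."},
}

with open("data.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```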

### OpenAI Format

The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.

```json
[
{
"messages": [
{
"role": "system",
"content": "system prompt (optional)"
},
{
"role": "user",
"content": "human instruction"
},
{
"role": "assistant",
"content": "model response"
}
]
}
]
```

Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt"
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
```
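
Since the openai format only renames the sharegpt tags, converting between the two is essentially a key-renaming pass. The sketch below is an illustration under assumptions: the mapping covers only the roles shown above, and it treats a system turn as an inline message even though the sharegpt examples store the system prompt in a separate column.

```python
# assumed mapping, covering only the roles shown above
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_openai(conversations: list[dict]) -> list[dict]:
    # rename from/value keys to role/content and translate the role names
    return [{"role": ROLE_MAP[m["from"]], "content": m["value"]} for m in conversations]

print(sharegpt_to_openai([
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Hi there."},
    {"from": "gpt", "value": "Hello! How can I help?"},
]))
```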

KTO datasets and multimodal datasets in sharegpt format are organized similarly to their alpaca counterparts.

Pre-training datasets are **incompatible** with the sharegpt format.
