Chikuma_10.7B - V2 (Enhanced with DPO) [For Experiments]
This model is the DPO fine-tuned version of Chikuma_10.7B, which was a depth-upscaled merge of:
The name "Chikuma" is inspired by the Chikuma River, the longest in Japan, known for its continuous flow and meandering path. This metaphorically represents the model's depth, fluidity, and adaptability in processing and understanding language.
Dataset used for Fine Tuning
Dataset: argilla/distilabel-intel-orca-dpo-pairs
The dataset used was small, roughly 3,000 samples, but of high quality according to chosen_score.
The following filters were applied to the original dataset:
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
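To make the effect of the three conditions concrete, here is a minimal sketch that applies the same predicate to a few hand-written records. The records and their values are hypothetical and invented for illustration; only the field names come from the filter above.

```python
# Hypothetical records, invented for illustration. Only the field names
# (status, chosen_score, in_gsm8k_train) are taken from the filter above.
records = [
    {"id": "a", "status": "unchanged", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": "b", "status": "tie", "chosen_score": 9.0, "in_gsm8k_train": False},
    {"id": "c", "status": "unchanged", "chosen_score": 7.5, "in_gsm8k_train": False},
    {"id": "d", "status": "unchanged", "chosen_score": 8.0, "in_gsm8k_train": True},
]


def keep(r):
    """Same predicate as the dataset filter: no ties, score >= 8, not in GSM8K train."""
    return (
        r["status"] != "tie"
        and r["chosen_score"] >= 8
        and not r["in_gsm8k_train"]
    )


kept = [r["id"] for r in records if keep(r)]
print(kept)
```

Record "b" is dropped as a tie, "c" for a chosen_score below 8, and "d" because it overlaps with the GSM8K training set, so only "a" survives.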
Chat Template
The chat template for Chikuma_10.7B - V2 is a modified version of ChatML, optimized for improved interaction and engagement:
<|im_start|>GPT4 Correct system:
{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>
<|im_start|>GPT4 Correct user:
{user}<|im_end|>
<|im_start|>GPT4 Correct Assistant:
{assistant}<|im_end|>
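As a rough illustration of how this template turns into a prompt string, the sketch below fills in the system and user slots by hand and leaves the assistant turn open for generation. The function name build_prompt is hypothetical, and the exact whitespace is an assumption read off the layout above; the template shipped with the tokenizer remains the authoritative version.

```python
def build_prompt(system: str, user: str) -> str:
    """Render the template by hand (hypothetical helper, for illustration only).

    The assistant turn is left open so the model can continue from it.
    Newline placement is an assumption based on the layout shown above.
    """
    return (
        "<|im_start|>GPT4 Correct system:\n"
        f"{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>\n"
        "<|im_start|>GPT4 Correct user:\n"
        f"{user}<|im_end|>\n"
        "<|im_start|>GPT4 Correct Assistant:\n"
    )


prompt = build_prompt("You are a helpful assistant chatbot.", "Who invented LLMs?")
print(prompt)
```

In practice, tokenizer.apply_chat_template is the safer route, since it uses whatever template is stored with the model.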
Nous Benchmark Evaluation
Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
---|---|---|---|---|---|
SynthIQ-7b | 42.67 | 73.71 | 56.51 | 44.59 | 54.37 |
openchat/openchat-3.5-0106 | 44.17 | 73.72 | 52.53 | 44.4 | 53.71 |
Chikuma_10.7B | 42.41 | 73.41 | 56.69 | 43.5 | 54.00 |
Chikuma_10.7B_v2 | 42.77 | 73.81 | 58.83 | 44.83 | 55.06 |
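The Average column appears to be the plain arithmetic mean of the four benchmark scores. A short sketch that recomputes it from the rows above (the dictionary layout is just for illustration):

```python
# Scores copied from the table above, in the order:
# AGIEval, GPT4All, TruthfulQA, Bigbench.
scores = {
    "SynthIQ-7b": [42.67, 73.71, 56.51, 44.59],
    "openchat/openchat-3.5-0106": [44.17, 73.72, 52.53, 44.4],
    "Chikuma_10.7B": [42.41, 73.41, 56.69, 43.5],
    "Chikuma_10.7B_v2": [42.77, 73.81, 58.83, 44.83],
}

# Unweighted mean over the four benchmarks.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")

best = max(averages, key=averages.get)
print("Highest average:", best)
```

The recomputed means agree with the table to within rounding.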
OpenLLM Leaderboard
Benchmark Name | Performance |
---|---|
ARC | 66.38 |
HellaSwag | 85 |
MMLU | 65.27 |
TruthfulQA | 58.83 |
Winogrande | 78.77 |
GSM8K | 63.68 |
Average | 69.65 |
Training Environment
- Hardware: Single A100 80GB GPU on RunPod, used for approximately 1.5 hours.
- Training Script: Accessible via Google Colab Notebook. Special thanks to mlabonne for providing the template.
Usage
import transformers
from transformers import AutoTokenizer

# Model repository on the Hugging Face Hub
new_model = "sethuiyer/Chikuma_10.7B_v2"
tokenizer = AutoTokenizer.from_pretrained(new_model)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer,
    device="cuda"
)

# Format prompt using the chat template stored with the tokenizer
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "Who invented LLMs?"}
]
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Generate text
sequences = pipeline(
    prompt,
    max_new_tokens=512
)
print(sequences[0]['generated_text'])
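By default, the text-generation pipeline returns the prompt followed by the completion in generated_text. The sketch below shows one way to keep only the assistant's reply. The helper name extract_reply and the choice of stop markers are assumptions based on the chat template above, not part of the original usage code, and the example strings are made up.

```python
def extract_reply(generated_text, prompt, stop_markers=("<|im_end|>", "<|end_of_turn|>")):
    """Strip the prompt and cut at the first stop marker (hypothetical helper)."""
    # Drop the echoed prompt if the output starts with it.
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text

    # Cut at whichever stop marker appears first, if any.
    cut = len(reply)
    for marker in stop_markers:
        idx = reply.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return reply[:cut].strip()


# Made-up strings standing in for a real prompt and model output.
prompt = "<|im_start|>GPT4 Correct user:\nHi<|im_end|>\n<|im_start|>GPT4 Correct Assistant:\n"
generated = prompt + "Hello there!<|end_of_turn|> trailing text"
reply = extract_reply(generated, prompt)
print(reply)
```

Passing return_full_text=False to the pipeline call is an alternative that avoids the prompt echo in the first place; the marker trimming may still be useful if the model emits text after its end-of-turn token.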
Acknowledgements
A heartfelt appreciation goes to the vibrant open-source community, particularly:
- The Intel team for publishing a great open dataset and showing how well it worked in the first place.
- Teknium and NousResearch for their awesome work and models.
- Maxime for sharing such great resources.
- Argilla for publishing argilla/distilabel-intel-orca-dpo-pairs
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 68.87 |
AI2 Reasoning Challenge (25-Shot) | 66.38 |
HellaSwag (10-Shot) | 85.14 |
MMLU (5-Shot) | 64.70 |
TruthfulQA (0-shot) | 59.20 |
Winogrande (5-shot) | 79.40 |
GSM8k (5-shot) | 58.38 |
Evaluation results
- normalized accuracy on AI2 Reasoning Challenge (25-Shot)test set Open LLM Leaderboard66.380
- normalized accuracy on HellaSwag (10-Shot)validation set Open LLM Leaderboard85.140
- accuracy on MMLU (5-Shot)test set Open LLM Leaderboard64.700
- mc2 on TruthfulQA (0-shot)validation set Open LLM Leaderboard59.200
- accuracy on Winogrande (5-shot)validation set Open LLM Leaderboard79.400
- accuracy on GSM8k (5-shot)test set Open LLM Leaderboard58.380