NovaSky-AI/SkyThought


SkyThought

News

  • [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
  • [2025/01/19] 🎉 The chat demo for Sky-T1-32B-Preview is live! Please check it out!
  • [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!

Getting Started

We open-source the code and scripts we used for data curation, training, and evaluation of Sky-T1-32B-Preview. You can find more details in each directory.

  • /data: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the STILL-2 model.
  • skythought/tools: Training data curation and evaluation for Sky-T1. To generate our training data, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and we apply a rejection sampling procedure to improve data quality.
  • skythought/train: Training scripts for Sky-T1. We use Llama-Factory to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Training completed in 19 hours on 8 H100 GPUs using DeepSpeed ZeRO-3 offload, costing approximately $450 at Lambda Cloud pricing.
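
The rejection sampling step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the `generate` callable, the `question`/`answer`/`response` field names, the `max_attempts` parameter, and the exact-match acceptance test are all assumptions made for the example. In practice `generate` would wrap a call to a reasoning model such as QwQ-32B-Preview, and the acceptance check would depend on the domain.

```python
from typing import Callable, Iterable


def normalize(answer: str) -> str:
    """Crude normalization (assumption): trim whitespace and lowercase."""
    return answer.strip().lower()


def rejection_sample(
    problems: Iterable[dict],
    generate: Callable[[str], tuple[str, str]],
    max_attempts: int = 4,
) -> list[dict]:
    """Keep the first trace whose final answer matches the reference.

    `generate` is a hypothetical callable returning (reasoning_trace,
    final_answer) for a question. Problems with no accepted trace after
    `max_attempts` tries are dropped from the curated set.
    """
    kept = []
    for problem in problems:
        for _ in range(max_attempts):
            trace, answer = generate(problem["question"])
            if normalize(answer) == normalize(problem["answer"]):
                kept.append({"question": problem["question"], "response": trace})
                break  # one verified trace per problem is enough
    return kept


def make_toy_generator() -> Callable[[str], tuple[str, str]]:
    """Toy stand-in for a model call: answers "4" on every second call."""
    calls = {"n": 0}

    def generate(question: str) -> tuple[str, str]:
        calls["n"] += 1
        answer = "4" if calls["n"] % 2 == 0 else "5"
        return f"reasoning about {question}", answer

    return generate
```

Filtering on a verifiable final answer trades dataset size for correctness: problems the generator never solves within the attempt budget are simply excluded.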

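As a rough sanity check on the training figures quoted above, the arithmetic can be worked through in a few lines. The dataset size of exactly 17,000 is an assumption (the README only says "17k"), and the per-GPU figure is the product of per-device batch size and gradient accumulation steps, which the README does not break down.

```python
import math

# Figures quoted in the README; NUM_EXAMPLES is an approximation of "17k".
NUM_EXAMPLES = 17_000
EPOCHS = 3
GLOBAL_BATCH_SIZE = 96
NUM_GPUS = 8
WALL_CLOCK_HOURS = 19
TOTAL_COST_USD = 450

# Optimizer steps implied by the dataset size and global batch size.
steps_per_epoch = math.ceil(NUM_EXAMPLES / GLOBAL_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Samples each GPU processes per optimizer step
# (per-device batch size times gradient accumulation steps).
samples_per_gpu_per_step = GLOBAL_BATCH_SIZE // NUM_GPUS

# Compute budget and the hourly GPU price implied by the quoted cost.
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS
implied_usd_per_gpu_hour = TOTAL_COST_USD / gpu_hours

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples per GPU per step: {samples_per_gpu_per_step}")
print(f"GPU-hours: {gpu_hours}, implied $/GPU-hour: {implied_usd_per_gpu_hour:.2f}")
```

Under these assumptions the run is on the order of a few hundred optimizer steps and 152 H100 GPU-hours, which implies a price of roughly $3 per GPU-hour.
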
Evaluation

Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

Evaluation results

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|---|---|---|---|---|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | - |
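
To make the comparison with the Qwen-2.5-32B-Instruct column easier to read, the per-benchmark gains can be computed directly from the scores reported above. This is only a reading aid over the published numbers, not part of the evaluation code; the variable names are illustrative.

```python
# (Sky-T1-32B-Preview, Qwen-2.5-32B-Instruct) scores copied from the results above.
scores = {
    "Math500": (86.4, 81.4),
    "AIME2024": (43.3, 16.7),
    "LiveCodeBench-Easy": (86.3, 84.6),
    "LiveCodeBench-Medium": (56.8, 40.8),
    "LiveCodeBench-Hard": (17.9, 9.8),
    "GPQA-Diamond": (56.8, 45.5),
    "OlympiadBench (Math, EN)": (59.79, 46.74),
}

# Absolute gain in points of Sky-T1-32B-Preview over Qwen-2.5-32B-Instruct.
gains = {name: round(sky - qwen, 2) for name, (sky, qwen) in scores.items()}

# Benchmarks with the largest and smallest absolute gain.
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:28s} +{gain:.2f}")
```

On these numbers the gain is positive on every listed benchmark, largest on AIME2024 and smallest on LiveCodeBench-Easy.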

Results on non-reasoning benchmarks

We also evaluate on non-reasoning benchmarks (instruction following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---|---|---|---|---|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
| IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
| Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |

For more details, refer here.

Fully Open-source: Driving Progress Together

We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, and model weights) so the community can easily replicate and improve on our results:

Models compared: Sky-T1-32B-Preview, STILL-2, Journey, QwQ, o1
Criteria compared: Data, Code, Report, Math domain, Coding domain, Model Weights

Citation

The code in this repository is mostly described in the post below. Please consider citing this work if you find the repository helpful.

@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within \$450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-09},
  year         = {2025}
}

Acknowledgement

This work was done at the Berkeley Sky Computing Lab, with amazing compute support from Lambda Labs and Anyscale. We would like to express our gratitude for the valuable academic feedback and support from the STILL-2 Team and from Junyang Lin of the Qwen Team.
