This repository provides the official open-source implementation of the paper:
Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models

Guangyu Xie#, Yice Zhang#, Jianzhu Bao, Qianlong Wang, Yang Sun, Bingbing Wang, Ruifeng Xu

[EMNLP 2025]
We propose CompEffDist, a comprehensive and efficient distillation framework for sentiment analysis. It consists of two core modules:
- Attribute-based Automatic Instruction Construction: automatically builds diverse instruction sets from extracted attributes.
- Difficulty-based Data Filtering: prioritizes and samples training data by difficulty to improve efficiency.
Across multiple model families (Llama-3, Qwen-3, Gemma-3), our 3B student models can match the performance of teacher models that are ~20× larger on most tasks.
Dataset: zhang-yice/sentiment-distillation-v2
Hugging Face: https://huggingface.co/datasets/zhang-yice/sentiment-distillation-v2
Included files:
- `Meta-Llama-3.1-70B-Instruct_50k.jsonl`: training data generated by Llama
- `Qwen3-32B_50k.jsonl`: training data generated by Qwen
- `gemma-3-27b-it_50k.jsonl`: training data generated by Gemma
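For a quick look at the data, the JSONL files can be loaded with the `datasets` library. The snippet below is a minimal sketch; the record schema (field names) is not assumed here and should be checked against the dataset card.

```python
from datasets import load_dataset

# Load one of the teacher-generated training files directly from the Hub.
ds = load_dataset(
    "zhang-yice/sentiment-distillation-v2",
    data_files="Meta-Llama-3.1-70B-Instruct_50k.jsonl",
    split="train",
)
print(ds)     # number of rows and column names
print(ds[0])  # first training example; field names depend on the dataset card
```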
| Model | Base Model | Hugging Face |
|---|---|---|
| llama-3-3B-sentiment-distillation-v2 | Llama-3-3B-Instruct | https://huggingface.co/zhang-yice/llama-3-3B-sentiment-distillation-v2 |
| gemma-3-4b-sentiment-distillation-v2 | Gemma-3-4B-it | https://huggingface.co/zhang-yice/gemma-3-4b-sentiment-distillation-v2 |
| Qwen3-4B-sentiment-distillation-v2 | Qwen3-4B-Instruct | https://huggingface.co/zhang-yice/Qwen3-4B-sentiment-distillation-v2 |
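The distilled students are standard causal LMs on the Hugging Face Hub, so they can be loaded with `transformers`. The snippet below is a usage sketch only; the prompt is a placeholder, and real inputs should follow the instruction formats used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zhang-yice/llama-3-3B-sentiment-distillation-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder request; real prompts should follow the paper's instruction formats.
messages = [{"role": "user", "content": "What is the sentiment of: 'The battery dies within two hours.'?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```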
.
├── 0_generate_attribute.py # Steps 0–8: Attribute-based Automatic Instruction Construction
├── 1_extract_attributes.py
├── 2_generate_analysis_prompt.py
├── 3_extract_analysis_prompts.py
├── 4_task_specific_prompt_stage1.py
├── 5_task_specific_prompt_stage2.py
├── 6_extract_task_prompts.py
├── 7_task_specific_prompt_stage3.py
├── 8_task_specific_prompt_with_demos.py
├── clustering # Attribute clustering
│ ├── clustering.py
│ ├── collect_attr.py
│ ├── data
│ │ ├── clustering # Final clustering results
│ └── get_embedding.py
├── eval_sentibench # Sentibench evaluation code
├── machine_generated_instr # Generated intermediate artifacts
│ └── 11_final_clusterid_2all_prompts.json # All instructions generated from attribute clusters
├── post_train # Training scripts/configs
│ ├── bash
│ │ ├── llama_training.sh
│ │ └── model_name.json
│ ├── configs
│ │ └── 3b_full_config.yaml
│ └── training_data
├── prompts # User texts used for attribute extraction
├── requirements.txt # Python dependencies
├── ranking_based_difficulty # Difficulty-based data filtering
│ ├── difficulty_prioritized_sampling.py
│ └── rank_score_compute.py
└── utils.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python 0_generate_attribute.py \
-m /data/Meta-Llama-3.1-70B-Instruct \
-z 100 -c 0.7 -g 4 \
-i ./prompts/input_samples/sampled_amazon_input_4k.json__./prompts/input_samples/sampled_yelp_input_4k.json__./prompts/input_samples/sampled_movie_input_4k.json__./prompts/input_samples/sampled_tweet_input_4k.json__./prompts/input_samples/sampled_tweet_politics_input_4k.json \
-o ./output_data/

python 1_extract_attributes.py

Run the following scripts in order:
- `collect_attr.py`: assign a unique ID to each attribute
- `get_embedding.py`: compute/visualize embeddings
- `clustering.py`: perform clustering
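As a rough illustration of what this stage does (not the repository's exact implementation; the embedding model and cluster count below are placeholders), attribute strings are embedded and then grouped into clusters:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder attributes; in the pipeline these come from collect_attr.py.
attributes = ["product durability", "delivery speed", "customer service tone", "shipping delays"]

# Embed each attribute string (model choice here is illustrative only).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(attributes)

# Group the attributes; the actual cluster count is set in clustering.py.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
for attribute, label in zip(attributes, labels):
    print(label, attribute)
```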
python 2_generate_analysis_prompt.py
python 3_extract_analysis_prompts.py
python 4_task_specific_prompt_stage1.py
python 5_task_specific_prompt_stage2.py
python 6_extract_task_prompts.py
python 7_task_specific_prompt_stage3.py
python 8_task_specific_prompt_with_demos.py

Final prompts for each attribute cluster are saved at:
./machine_generated_instr/11_final_clusterid_2all_prompts.json
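To inspect the generated instructions, the JSON file can be loaded directly. The structure assumed below (cluster ID mapped to a list of prompts) is inferred from the filename; verify it against your local copy.

```python
import json

with open("./machine_generated_instr/11_final_clusterid_2all_prompts.json", encoding="utf-8") as f:
    cluster_to_prompts = json.load(f)

# Assumed structure: {cluster_id: [prompt, ...]}; adjust if your file differs.
for cluster_id, prompts in list(cluster_to_prompts.items())[:3]:
    print(cluster_id, len(prompts), prompts[0][:80])
```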
- `./ranking_based_difficulty/rank_score_compute.py`: compute difficulty metrics
- `./ranking_based_difficulty/difficulty_prioritized_sampling.py`: stratified sampling based on difficulty
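Conceptually, difficulty-prioritized sampling stratifies examples by their difficulty scores and samples unevenly across the strata. The sketch below illustrates the idea only; the bin count and per-bin weights are placeholders, not the values used in `difficulty_prioritized_sampling.py`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder difficulty scores, one per training example (higher = harder).
scores = rng.random(1000)

# Stratify the examples into quartiles by difficulty.
n_bins = 4
edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
bins = np.digitize(scores, edges[1:-1])  # bin index 0..n_bins-1 per example

# Sample a fixed budget, weighting harder strata more (weights are illustrative).
weights = [0.1, 0.2, 0.3, 0.4]
budget = 200
selected = []
for b in range(n_bins):
    idx = np.flatnonzero(bins == b)
    k = min(len(idx), int(budget * weights[b]))
    selected.extend(rng.choice(idx, size=k, replace=False).tolist())
print(f"selected {len(selected)} of {len(scores)} examples")
```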
- Download the training data from Hugging Face and place it under:
./post_train/training_data
- Start training:
cd post_train
bash llama_training.sh

Notes:

- Please edit `llama_training.sh` to match your environment (e.g., data paths, input/output column names).
- Also check `configs/3b_full_config.yaml` for training hyperparameters and runtime settings.
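Before editing `llama_training.sh`, it can help to confirm the column names in the downloaded JSONL so the input/output fields in the script match. The file name below is one of the three training files; use whichever you downloaded.

```python
import json

# Print the field names of the first record; use them for the
# input/output column names expected by llama_training.sh.
path = "./post_train/training_data/Meta-Llama-3.1-70B-Instruct_50k.jsonl"
with open(path, encoding="utf-8") as f:
    first_record = json.loads(f.readline())
print(list(first_record.keys()))
```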
@inproceedings{xie-etal-2025-comprehensive,
title = "Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models",
author = "Xie, Guangyu and
Zhang, Yice and
Bao, Jianzhu and
Wang, Qianlong and
Sun, Yang and
Wang, Bingbing and
Xu, Ruifeng",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1122/",
doi = "10.18653/v1/2025.emnlp-main.1122",
pages = "22081--22102",
ISBN = "979-8-89176-332-6",
}