Qwen2-7B-ReLU
Qwen2-7B-ReLU is a variant of Qwen2-7B that replaces the SiLU/Swish activation function with dReLU, achieving much higher activation sparsity while maintaining the performance of the original model.
Key Features
- Replaces SiLU/Swish activation function with dReLU
- Matches or exceeds the performance of the original Qwen2-7B
- Significantly increases activation sparsity, enabling further optimization and compression
Benchmarks
The model has been evaluated on standard benchmarks to verify its performance:
- MMLU: 69.19% (5-shot)
- IFEval: 73.2% (Prompt Strict-Accuracy)
- LiveBench:
  - Average: 32.1%
  - Coding: 39.8%
  - Data Analysis: 45.3%
  - Instruction Following: 58.1%
  - Language: 9.0%
  - Math: 22.0%
  - Reasoning: 18.7%
These results demonstrate that the ReLU modification maintains competitive performance while achieving higher sparsity compared to the original model.
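As a hedged reproduction sketch (the exact harness and settings behind the reported numbers are not specified here), the 5-shot MMLU score can be approximated with EleutherAI's lm-evaluation-harness, assuming lm-eval >= 0.4 is installed and the dReLU FFN modification described in the Quick Start is already applied:

import lm_eval

# 5-shot MMLU with the Hugging Face backend; results may differ slightly
# depending on the harness version, generation settings, and hardware.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=PowerInfer/SparseQwen2-7B",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])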
Technical Details
The key modification in this version is the application of ReLU activation to both branches in the MLP block. The implementation modifies the original Qwen2MLP class as follows:
import torch.nn as nn
from transformers.activations import ACT2FN


class Qwen2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        # With hidden_act set to "relu" in the config, this resolves to ReLU.
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # dReLU: the activation is applied to both the gate and up branches
        # before the elementwise product, instead of the gate branch only.
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
        return down_proj
The key change is in the forward pass: the activation function is applied to both the gate projection and the up projection outputs before their elementwise product, whereas the original Qwen2MLP applies it only to the gate branch. Combined with switching the activation itself to ReLU, this yields the dReLU formulation and the increased activation sparsity of the model.
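Because ReLU maps every non-positive pre-activation to an exact zero, the resulting sparsity can be inspected directly. The snippet below is a minimal sketch rather than part of the released code: it assumes the standard transformers module layout (model.model.layers[i].mlp) and a checkpoint whose hidden_act resolves to ReLU, and it hooks the gate projections to report the fraction of intermediate activations zeroed by the activation function.

import torch

def measure_gate_sparsity(model, inputs):
    """Mean fraction of gate-branch pre-activations that ReLU maps to zero."""
    ratios = []

    def hook(module, hook_inputs, output):
        # Non-positive pre-activations become exact zeros after ReLU.
        ratios.append((output <= 0).float().mean().item())

    handles = [layer.mlp.gate_proj.register_forward_hook(hook)
               for layer in model.model.layers]
    with torch.no_grad():
        model(**inputs)
    for handle in handles:
        handle.remove()
    return sum(ratios) / len(ratios)

The same hook can be attached to up_proj to inspect the second branch.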
Intended Usage
This release primarily targets the research community for:
- Studying sparsity in large language models
- Model compression and optimization research
- Understanding the impact of activation functions on model behavior
Model Limitations
- The model may exhibit biases present in the training data
- May generate incorrect, inappropriate, or harmful content
- Performance may vary across different domains and tasks
- Not suitable for production deployment without proper evaluation
Quick Start
First, replace the FFN implementation in the original modeling_qwen code with the dReLU version shown above.
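One hedged way to do this without editing the installed library, assuming a recent transformers release that ships Qwen2, is to monkey-patch Qwen2MLP.forward before the checkpoint is loaded; note that this only produces dReLU if the checkpoint's config sets hidden_act to "relu".

from transformers.models.qwen2 import modeling_qwen2

def _drelu_forward(self, x):
    # Apply the activation to both branches (dReLU) instead of the gate only.
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))

modeling_qwen2.Qwen2MLP.forward = _drelu_forward

With the patch in place, the model can be loaded and used as usual: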
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the sparsified checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("PowerInfer/SparseQwen2-7B")
tokenizer = AutoTokenizer.from_pretrained("PowerInfer/SparseQwen2-7B")

# Generate a short completion for a simple prompt.
prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
response = tokenizer.decode(outputs[0])
Citation
If you use this model in your research, please cite:
@article{song2024turbo,
  title={Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters},
  author={Song, Yixin and Xie, Haotong and Zhang, Zhengyan and Wen, Bo and Ma, Li and Mi, Zeyu and Chen, Haibo},
  journal={arXiv preprint arXiv:2406.05955},
  year={2024}
}