A comprehensive evaluation framework for language models with support for multiple datasets, GPU specification, and offline evaluation capabilities.
- 🚀 Multi-dataset Support: Evaluate on code, algebra, analysis, and other domains
- 🎯 GPU Specification: Run evaluations on specific GPUs
- 📊 Offline Mode: Evaluate models without internet connection
- 🔧 Data Slicing: Evaluate on specific data subsets using indices
- 💾 Caching: Intelligent caching for faster repeated evaluations
- 📈 Comprehensive Metrics: Cross-entropy loss, token-level analysis
```
.
├── scripts/
│   └── eval.py               # Main evaluation script
├── src/merge/
│   └── main_merging.py       # Model merging script
├── data/
│   ├── eval_partial/         # Evaluation datasets
│   │   ├── code.json         # Code generation tasks
│   │   ├── algebra.json      # Mathematical problems
│   │   └── analysis.json     # Data analysis tasks
│   └── train_partial/        # Training datasets
├── test_result/              # Evaluation results and merged models
└── cache/                    # Cached tokenized data
```
Activate the required Python environment:

```bash
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate
```

Set the required environment variables:

```bash
export TRANSFORMERS_NO_TORCHVISION=1
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```

Merge multiple models using Task Arithmetic:
```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
python3 src/merge/main_merging.py \
    --merge_method task_arithmetic \
    --output_dir /path/to/output \
    --base_model /path/to/base/model \
    --models_to_merge "/path/to/model1,/path/to/model2,/path/to/model3" \
    --scaling_coefficient 0.2 \
    --use_gpu
```

Run model merging on a specific GPU:
```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
export CUDA_VISIBLE_DEVICES=7
python3 src/merge/main_merging.py \
    --merge_method task_arithmetic \
    --output_dir /zju_0038/test_merge/Merging-EVAL/test_result/scheme1_16models_task_arithmetic \
    --base_model /zju_0038/wyy/mergebench/models/Llama-3.2-3B \
    --models_to_merge "/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-algebra,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-analysis,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-number_theory,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-physics" \
    --scaling_coefficient 0.2 \
    --use_gpu
```

Evaluate a model on a specific dataset:
```bash
python3 scripts/eval.py \
    --model /path/to/model \
    --tokenizer /path/to/tokenizer \
    --file /path/to/dataset.json \
    --output ./results \
    --batch_size 1 \
    --max_length 2048
```

Run evaluation on a specific GPU:
```bash
export CUDA_VISIBLE_DEVICES=7
python3 scripts/eval.py \
    --model /path/to/model \
    --tokenizer /path/to/tokenizer \
    --file /path/to/dataset.json \
    --output ./results \
    --batch_size 1 \
    --max_length 2048 \
    --gpu_id 0 \
    --offline
```

Full example for math problems (recommended max_length = 2048):
```bash
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate && \
export TRANSFORMERS_NO_TORCHVISION=1 && \
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && \
export CUDA_VISIBLE_DEVICES=7 && \
python3 /zju_0038/test_merge/Merging-EVAL/scripts/eval.py \
    --model /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
    --tokenizer /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
    --file /zju_0038/test_merge/Merging-EVAL/data/eval_partial/algebra.json \
    --output /zju_0038/test_merge/Merging-EVAL/test_result \
    --batch_size 1 \
    --max_length 2048 \
    --gpu_id 0 \
    --offline
```

Full example for code problems (recommended max_length = 4096):
```bash
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate && \
export TRANSFORMERS_NO_TORCHVISION=1 && \
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && \
export CUDA_VISIBLE_DEVICES=7 && \
python3 /zju_0038/test_merge/Merging-EVAL/scripts/eval.py \
    --model /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
    --tokenizer /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
    --file /zju_0038/test_merge/Merging-EVAL/data/eval_partial/code.json \
    --output /zju_0038/test_merge/Merging-EVAL/test_result \
    --batch_size 1 \
    --max_length 4096 \
    --gpu_id 0 \
    --offline
```

Merging parameters:

- `--merge_method`: Merging method (e.g., "task_arithmetic", "average_merging")
- `--base_model`: Path to the base model directory
- `--models_to_merge`: Comma-separated list of model paths to merge
- `--output_dir`: Output directory for the merged model
- `--scaling_coefficient`: Scaling coefficient for merging (default: 1.0)
- `--use_gpu`: Use GPU for merging (default: CPU)
- `--exclude_param_names_regex`: Regex patterns for parameters to exclude
- `--param_value_mask_rate`: Parameter value mask rate (default: 0.8)
- `--mask_apply_method`: Method for applying masks (default: "average_merging")
- `--weight_mask_rates`: Comma-separated weight mask rates
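Conceptually, task arithmetic adds the scaled sum of task vectors (fine-tuned weights minus base weights) to the base model. A minimal sketch of that arithmetic on plain Python dicts (the hypothetical helper `task_arithmetic_merge` is illustrative only; the real script operates on PyTorch state dicts):

```python
def task_arithmetic_merge(base, finetuned_models, scaling_coefficient=0.2):
    """Merge models by adding the scaled sum of task vectors to the base.

    Each model is a dict mapping parameter names to lists of floats;
    the real implementation applies the same arithmetic to tensors.
    """
    merged = {}
    for name, base_vals in base.items():
        # Task vector for each model: fine-tuned weights minus base weights.
        task_sum = [0.0] * len(base_vals)
        for model in finetuned_models:
            for i, v in enumerate(model[name]):
                task_sum[i] += v - base_vals[i]
        # merged = base + scaling_coefficient * sum of task vectors
        merged[name] = [b + scaling_coefficient * t
                        for b, t in zip(base_vals, task_sum)]
    return merged


base = {"w": [1.0, 1.0]}
models = [{"w": [2.0, 1.0]}, {"w": [1.0, 3.0]}]
print(task_arithmetic_merge(base, models, scaling_coefficient=0.5))
```

A larger `--scaling_coefficient` amplifies each model's contribution; 0.2, as used in the examples above, keeps the merged weights close to the base.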
Evaluation parameters:

- `--model`: Path to the model directory
- `--tokenizer`: Path to the tokenizer directory
- `--file`: Path to the evaluation dataset JSON file
- `--output`: Output directory for results (default: ./output)
- `--batch_size`: Batch size for evaluation (default: 10)
- `--max_length`: Maximum sequence length (default: 2048)
- `--gpu_id`: Specific GPU ID to use (e.g., 0, 1, 2)
- `--offline`: Run in offline mode (no internet connection required)
- `--indices`: Evaluate specific data indices (e.g., "1-10,15,20-22")
- `--run_name`: Custom name for the output folder
- `--no_cache`: Disable the caching mechanism
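The `--indices` spec mixes inclusive ranges and single values. A sketch of how such a spec might be expanded (hypothetical helper `parse_indices`; the actual parsing inside `eval.py` may differ):

```python
def parse_indices(spec):
    """Expand an indices spec like "1-10,15,20-22" into a sorted list of ints."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            # Ranges are treated as inclusive on both ends.
            indices.update(range(int(start), int(end) + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


print(parse_indices("1-3,15,20-22"))  # [1, 2, 3, 15, 20, 21, 22]
```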
Code datasets:

- Recommended max_length: 4096-8192
- Reason: Code samples are typically longer (average: 1,899 tokens)
- Coverage: 8192 covers 98% of samples

Math datasets:

- Recommended max_length: 2048
- Reason: Math problems are typically shorter
- Coverage: 2048 is sufficient for most algebra problems
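Coverage figures like these can be checked by tokenizing each sample and measuring what fraction fits within a candidate `max_length`. A sketch over precomputed token lengths (the lengths below are illustrative, not the real dataset statistics):

```python
def coverage(token_lengths, max_length):
    """Fraction of samples whose token count fits within max_length."""
    fit = sum(1 for n in token_lengths if n <= max_length)
    return fit / len(token_lengths)


# Illustrative per-sample token counts; in practice, tokenize the
# dataset once with the model's tokenizer and record len(input_ids).
lengths = [1200, 1899, 2500, 4100, 7900, 9000]
for candidate in (2048, 4096, 8192):
    print(candidate, f"{coverage(lengths, candidate):.0%}")
```

Samples longer than `max_length` are truncated, which skews the CE loss, so pick the smallest limit with acceptable coverage.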
Results are saved as CSV files in the output directory:

```
test_result/
└── model_name/
    └── all/
        └── results.csv
```

CSV format:

```
problem,CE Loss,class
algebra,0.6289,algebra
Avg.,0.6289,average
Overall,0.6289,overall
```
Merging issues:

- Protobuf Version Conflicts:
  - Solution: Set `export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`
- GPU Memory Issues:
  - Use CPU mode (remove the `--use_gpu` flag)
  - Reduce the number of models merged simultaneously
  - Use a specific GPU with `export CUDA_VISIBLE_DEVICES=X`
- PEFT Configuration Errors:
  - Some models may have incompatible PEFT configurations
  - Solution: Exclude problematic models or use CPU mode
- RoPE Configuration Warnings:
  - Warning: `rope_scaling` configuration issues
  - Solution: These are usually non-fatal warnings

Evaluation issues:

- NaN Loss Values:
  - Usually caused by very long sequences or all-masked labels
  - Solution: Increase `max_length` or check the data format
- GPU Memory Issues:
  - Reduce `batch_size` to 1
  - Decrease `max_length`
  - Use a specific GPU with `--gpu_id`
- Network Connection Issues:
  - Use the `--offline` flag for local model evaluation
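One way to guard against the all-masked-labels source of NaN losses is to skip any batch in which every label equals the ignore index (-100, the Hugging Face convention for masked positions). A sketch, assuming labels arrive as plain integer lists:

```python
IGNORE_INDEX = -100  # Hugging Face convention for masked label positions


def has_valid_labels(labels):
    """Return True if at least one label position contributes to the loss."""
    return any(label != IGNORE_INDEX for label in labels)


batch_labels = [-100, -100, -100]
if not has_valid_labels(batch_labels):
    print("skipping batch: all labels are masked, loss would be NaN")
```

With zero unmasked positions the mean CE loss divides by zero, which is why such batches must be skipped rather than averaged in.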
- Use GPU mode for faster merging when memory allows
- Start with smaller model subsets to test compatibility
- Use CPU mode for large model collections to avoid memory issues
- Set environment variables before running to avoid conflicts
- Use caching for repeated evaluations on the same dataset
- Specify GPU ID for better resource management
- Adjust `max_length` based on dataset characteristics
- Use offline mode when the internet connection is unstable
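Caching of tokenized data typically keys on everything that affects tokenization, so a stale cache is never reused after changing the dataset, tokenizer, or sequence length. A sketch of one possible cache-key scheme (hypothetical; the script's actual cache layout under `cache/` may differ):

```python
import hashlib


def cache_key(dataset_path, tokenizer_path, max_length):
    """Derive a stable cache filename from the inputs that affect tokenization."""
    raw = f"{dataset_path}|{tokenizer_path}|{max_length}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


key = cache_key("data/eval_partial/algebra.json", "models/Llama-3.2-3B", 2048)
print(f"cache/{key}.pt")
```

Because `max_length` is part of the key, re-running with a different limit tokenizes from scratch instead of serving truncated cached tensors, which is the behavior `--no_cache` forces unconditionally.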