📄 Paper: https://arxiv.org/abs/2509.06806
🤗 Hugging Face: https://huggingface.co/MachineLearningLM
## Model

The model is available on Hugging Face.
## Pretraining Dataset

All datasets have been open-sourced on Hugging Face. Due to the large file size, the dataset has been split into multiple parts. Complete copies are also hosted on Google Drive.
A comprehensive framework for evaluating Large Language Models on machine learning tasks, supporting both traditional machine learning models and deep learning approaches with automated pipeline processing.
This framework provides end-to-end evaluation capabilities for LLMs on machine learning tasks, featuring automated data preprocessing, prompt generation, model inference, and comprehensive evaluation metrics.
- Special Character Handling: CSV filenames in TALENT datasets may contain shell-reserved characters (for example, parentheses). We recommend renaming these files to use only letters, numbers, and underscores before processing.
- Text Data Processing: Text data is supported, but since commas (`,`) are used as feature separators, please replace any commas in your dataset text to avoid confusing the model. In our evaluation, we replace commas with spaces (see the sketch below this list).
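As a rough sketch of this comma-to-space replacement, assuming pandas and treating all object-typed columns as text (the file paths and column handling here are illustrative, not part of the framework):

```python
import pandas as pd

# Hypothetical input/output paths; adjust to your dataset layout.
df = pd.read_csv("datahub_inputs/data_raw/my_dataset.csv")

# Replace commas with spaces in every text (object-typed) column so that
# the comma used as the feature separator stays unambiguous.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.replace(",", " ", regex=False)

df.to_csv("datahub_inputs/data_raw/my_dataset_clean.csv", index=False)
```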
```bash
# Install Python dependencies
pip install -r requirements.txt
```

For batch processing, you need to provide input path and output path parameters. The framework supports three execution modes. First, load the shared configuration:

```bash
source ./scripts/evaluate_parameters.sh
```

See the "Execution Options" section below for detailed commands based on your preferred processing mode (sequential, parallel, or end-to-end pipeline).
Use the scripts in the `single_process/` directory to run the steps sequentially:

```bash
./scripts/single_process/data_prep.sh
./scripts/single_process/prompt_gen.sh   # For deep learning only
./scripts/single_process/model_pred.sh
./scripts/single_process/evaluation.sh
./scripts/single_process/report.sh
```

Use the scripts in the `multi_process/` directory for accelerated parallel execution:
```bash
./scripts/multi_process/data_prep_mp.sh
./scripts/multi_process/prompt_gen_mp.sh   # For deep learning only
./scripts/multi_process/model_pred.sh
./scripts/multi_process/evaluation_mp.sh
./scripts/multi_process/report_mp.sh
```

Run the complete pipeline with parallelization optimizations:

```bash
./scripts/evaluate_pipeline.sh
```

For direct inference on individual JSONL files, we support a single-file processing mode.
**Important**: The input file must have a `.jsonl` extension; the code uses this suffix for file type identification.
The file should contain prompts in LLaMA Factory's Alpaca format with the following fields:

- `instruction`: the task instruction
- `input`: the input data
- `output`: the expected output
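For reference, a minimal sketch of writing one record in this format (the prompt content below is made up purely for illustration):

```python
import json

# Purely illustrative record in LLaMA Factory's Alpaca format.
record = {
    "instruction": "Predict the class label for the last row given the labeled rows above.",
    "input": "f1,f2,f3,label\n12,873,441,0\n55,102,910,1\n33,654,120,?",
    "output": "0",
}

# JSONL: one JSON object per line.
with open("demo_input.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```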
```bash
python ./src/evaluation/model_pred/dl_model_pred.py \
    --input_dir ./demo_input.jsonl \
    --output_dir ./demo_output.jsonl \
    --model_name MachineLearningLM/MachineLearningLM-7B-v1
```

For cloud model calls, the model path must start with `openai::` so that it is parsed correctly and executed through the OpenAI SDK:
```bash
python3 ./src/evaluation/model_pred/dl_model_pred.py \
    --input_dir ./input_demo.jsonl \
    --output_dir ./output_demo.jsonl \
    --model_name openai::gpt-4o-mini \
    --api_key your_own_api_key \
    --base_url your_own_base_url \
    --max_samples 5
```

You can also perform evaluation on individual files directly:
```bash
python ./src/evaluation/result_proc/evaluator.py \
    --input_dir ./demo_response.jsonl \
    --output_dir ./output_demo.txt   # Can also be .jsonl
```

Note: Our evaluation framework is specifically designed for results generated by our `dl_model_pred` inference pipeline. Please use outputs from our inference module as input for evaluation to ensure compatibility.
All parameters are managed through `./scripts/evaluate_parameters.sh`. Modify this file to customize:
- Input/output paths
- Model configurations
- Processing parameters
- Evaluation settings
- Dual Model Support: Traditional ML and Deep Learning models
- Flexible Processing: Single-process or multi-process execution
- Automated Pipeline: End-to-end workflow automation
- Single File Support: Direct inference on individual JSONL files
- Comprehensive Evaluation: Multi-metric evaluation framework
- Parallel Optimization: Built-in parallelization for performance
This part of the code requires an environment with the `tabicl` and `openpyxl` libraries installed.
The evaluation code for tabicl is kept separately in `./src/evaluation/tabicl_evaluate.py`. Use `./scripts/tabicl_evaluate.sh` to obtain the evaluation results for tabicl.
Use `--datasets` to specify the datasets to evaluate and `--sample_sizes` to specify the number of shots.
If multiple datasets need to be evaluated, separate them with spaces. To evaluate all CSV files in the input folder, use `all`.
MachineLearningLM uses the code from tabicl to generate prior data.
Use `./scripts/generate_data.sh` to generate the prior data. It produces the corresponding `.pt` and `.csv` files and normalizes the feature values in the CSV files to the range 0–999, as we did in the paper.
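The normalization itself happens inside the generation code; as a rough sketch of what mapping feature values to the 0–999 range can look like (assuming simple per-column min-max scaling and a hypothetical label column name, both illustrative only):

```python
import numpy as np
import pandas as pd

def normalize_features_0_999(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Min-max scale each numeric feature column to integers in [0, 999]."""
    out = df.copy()
    for col in out.columns:
        if col == label_col:
            continue  # leave the target column untouched
        values = out[col].astype(float)
        span = values.max() - values.min()
        if span == 0:
            out[col] = 0  # constant column maps to 0
        else:
            out[col] = np.round((values - values.min()) / span * 999).astype(int)
    return out
```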
### Data Scale & Structure

| Parameter | Type | Description |
|---|---|---|
| `min_features` | int | Minimum number of features per dataset |
| `max_features` | int | Maximum number of features per dataset |
| `max_classes` | int | Maximum number of target classes |
| `min_seq_len` | int | Minimum samples per dataset; uses `max_seq_len` if None |
| `max_seq_len` | int | Maximum samples per dataset (exclusive) |
### Batch Configuration

| Parameter | Type | Description |
|---|---|---|
| `batch_size` | int | Total number of datasets to generate per batch |
| `batch_size_per_gp` | int | Number of datasets per group (shared characteristics) |
| `batch_size_per_subgp` | int | Number of datasets per subgroup (similar causal structures); defaults to `batch_size_per_gp` if None |
### Sequence Length Control

| Parameter | Type | Description |
|---|---|---|
| `log_seq_len` | bool | Sample sequence length from a log-uniform distribution if True |
| `seq_len_per_gp` | bool | Sample sequence length per group (enables variable-sized datasets) |
| `replay_small` | bool | Occasionally sample smaller sequences for model robustness |
### Train-Test Split

| Parameter | Type | Description |
|---|---|---|
| `min_train_size` | int/float | Start position/ratio for the train split (int: absolute, float: fractional) |
| `max_train_size` | int/float | End position/ratio for the train split (int: absolute, float: fractional) |
### Generation Method

| Parameter | Type | Description |
|---|---|---|
| `prior_type` | str | Prior type: `'mlp_scm'`, `'tree_scm'`, or `'mix_scm'` (random selection) |
| `fixed_hp` | dict | Fixed structural configuration parameters |
| `sampled_hp` | dict | Parameters sampled during generation |
### Computation Settings

| Parameter | Type | Description |
|---|---|---|
| `n_jobs` | int | Number of parallel jobs (-1 = use all processors) |
| `num_threads_per_generate` | int | Number of threads per generation job |
| `device` | str | Computation device (`'cpu'` or `'cuda'`) |
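For orientation only, the parameters above could be grouped as in the following Python dictionary; the values are hypothetical and only meant to illustrate plausible types, while the real settings are configured via `./scripts/generate_data.sh`:

```python
# Hypothetical values for illustration only; see ./scripts/generate_data.sh
# for where the prior-generation parameters are actually configured.
prior_config = {
    # Data scale & structure
    "min_features": 2,
    "max_features": 20,
    "max_classes": 10,
    "min_seq_len": None,           # falls back to max_seq_len
    "max_seq_len": 1024,           # exclusive upper bound
    # Batch configuration
    "batch_size": 64,
    "batch_size_per_gp": 8,
    "batch_size_per_subgp": None,  # defaults to batch_size_per_gp
    # Sequence length control
    "log_seq_len": True,
    "seq_len_per_gp": True,
    "replay_small": True,
    # Train-test split
    "min_train_size": 0.1,
    "max_train_size": 0.9,
    # Generation method
    "prior_type": "mix_scm",
    # Computation
    "n_jobs": -1,
    "num_threads_per_generate": 1,
    "device": "cuda",
}
```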
MachineLearningLM uses the LLaMA-Factory framework for training.
```bash
cd ./third_party/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install wandb
```

Use `./scripts/train.sh` for training.
```
MachineLearningLM/
├── src/
│   ├── evaluation/
│   │   ├── data_prep/           # Data preprocessing and chunking utilities
│   │   ├── prompt_gen/          # Prompt generation for deep learning models
│   │   ├── model_pred/          # Model inference (ML and DL prediction engines)
│   │   ├── result_proc/         # 5-layer evaluation architecture and metrics processing
│   │   ├── zero_summary/        # Result summarization and report generation
│   │   └── tabicl_evaluate.py
│   └── prior_data/
│       └── pt_to_csv.py
├── scripts/
│   ├── single_process/          # Sequential execution shell scripts
│   ├── multi_process/           # Parallel execution shell scripts (with _mp suffix)
│   ├── evaluate_parameters.sh   # Global parameter configuration
│   ├── evaluate_pipeline.sh     # Automated end-to-end pipeline
│   ├── generate_data.sh
│   ├── tabicl_evaluate.sh
│   └── train.sh
├── datahub_inputs/
│   ├── data_demo/               # Demo datasets for testing
│   └── data_raw/                # Raw input datasets
├── third_party/
│   ├── tabicl/
│   └── LLaMA-Factory/
├── requirements.txt             # Python dependencies for the evaluation framework
├── README.md
├── README_zh.md
├── THIRD_PARTY_NOTICES.md
└── LICENSE
```
We thank LLaMA-Factory and TabICL for their open-source code.
```bibtex
@article{dong2025machinelearninglm,
  title={MachineLearningLM: Scaling Many-shot In-Context Learning via Continued Pretraining},
  author={Dong, Haoyu and Zhang, Pengkun and Lu, Mingzhe and Shen, Yanzhen and Ke, Guolin},
  journal={arXiv preprint arXiv:2509.06806},
  year={2025}
}
```