📄 Paper: https://arxiv.org/abs/2509.06806
🤗 Hugging Face: https://huggingface.co/MachineLearningLM
## Model

The model is available on Hugging Face.
## Pretraining Dataset

All datasets have been open-sourced on Hugging Face. Due to the large file size, the dataset has been split into multiple parts. Complete copies are also hosted on Google Drive.
A comprehensive framework for evaluating Large Language Models on machine learning tasks, supporting both traditional machine learning models and deep learning approaches with automated pipeline processing.
This framework provides end-to-end evaluation capabilities for LLMs on machine learning tasks, featuring automated data preprocessing, prompt generation, model inference, and comprehensive evaluation metrics.
- Special Character Handling: CSV filenames in TALENT datasets may contain shell-reserved characters (for example, parentheses). We recommend renaming these files to use only letters, numbers, and underscores before processing.
- Text Data Processing: Text data is supported, but since commas (`,`) are used as feature separators, please replace any commas in your dataset text to avoid confusing the model. In our evaluation, we replace commas with spaces (see the sketch below this list).
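As a rough sketch of this comma-to-space replacement, assuming pandas and treating all object-typed columns as text (the file paths and column handling here are illustrative, not part of the framework):

```python
import pandas as pd

# Hypothetical input/output paths; adjust to your dataset layout.
df = pd.read_csv("datahub_inputs/data_raw/my_dataset.csv")

# Replace commas with spaces in every text (object-typed) column so that
# the comma used as the feature separator stays unambiguous.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.replace(",", " ", regex=False)

df.to_csv("datahub_inputs/data_raw/my_dataset_clean.csv", index=False)
```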
```bash
# Install Python dependencies
pip install -r requirements.txt
```

For batch processing, you need to provide input path and output path parameters. The framework supports three execution modes. First, load the shared configuration:

```bash
source ./scripts/evaluate_parameters.sh
```

See the "Execution Options" section below for detailed commands based on your preferred processing mode (sequential, parallel, or end-to-end pipeline).
Use the scripts in the `single_process/` directory to run the steps sequentially:

```bash
./scripts/single_process/data_prep.sh
./scripts/single_process/prompt_gen.sh   # For deep learning only
./scripts/single_process/model_pred.sh
./scripts/single_process/evaluation.sh
./scripts/single_process/report.sh
```

Use the scripts in the `multi_process/` directory for accelerated parallel execution:
```bash
./scripts/multi_process/data_prep_mp.sh
./scripts/multi_process/prompt_gen_mp.sh   # For deep learning only
./scripts/multi_process/model_pred.sh
./scripts/multi_process/evaluation_mp.sh
./scripts/multi_process/report_mp.sh
```

Run the complete pipeline with parallelization optimizations:

```bash
./scripts/evaluate_pipeline.sh
```

For direct inference on individual JSONL files, we support a single-file processing mode.
**Important**: The input file must have a `.jsonl` extension; the code uses this suffix for file type identification.
The file should contain prompts in LLaMA Factory's Alpaca format with the following fields:

- `instruction`: the task instruction
- `input`: the input data
- `output`: the expected output
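For reference, a minimal sketch of writing one record in this format (the prompt content below is made up purely for illustration):

```python
import json

# Purely illustrative record in LLaMA Factory's Alpaca format.
record = {
    "instruction": "Predict the class label for the last row given the labeled rows above.",
    "input": "f1,f2,f3,label\n12,873,441,0\n55,102,910,1\n33,654,120,?",
    "output": "0",
}

# JSONL: one JSON object per line.
with open("demo_input.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```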
```bash
python ./src/evaluation/model_pred/dl_model_pred.py \
    --input_dir ./demo_input.jsonl \
    --output_dir ./demo_output.jsonl \
    --model_name MachineLearningLM/MachineLearningLM-7B-v1
```

For cloud model calls, the model path must start with `openai::` so that it is parsed correctly and executed through the OpenAI SDK:
```bash
python3 ./src/evaluation/model_pred/dl_model_pred.py \
    --input_dir ./input_demo.jsonl \
    --output_dir ./output_demo.jsonl \
    --model_name openai::gpt-4o-mini \
    --api_key your_own_api_key \
    --base_url your_own_base_url \
    --max_samples 5
```

You can also perform evaluation on individual files directly:
```bash
python ./src/evaluation/result_proc/evaluator.py \
    --input_dir ./demo_response.jsonl \
    --output_dir ./output_demo.txt   # Can also be .jsonl
```

Note: Our evaluation framework is specifically designed for results generated by our `dl_model_pred` inference pipeline. Please use outputs from our inference module as input for evaluation to ensure compatibility.
All parameters are managed through `./scripts/evaluate_parameters.sh`. Modify this file to customize:
- Input/output paths
- Model configurations
- Processing parameters
- Evaluation settings
- Dual Model Support: Traditional ML and Deep Learning models
- Flexible Processing: Single-process or multi-process execution
- Automated Pipeline: End-to-end workflow automation
- Single File Support: Direct inference on individual JSONL files
- Comprehensive Evaluation: Multi-metric evaluation framework
- Parallel Optimization: Built-in parallelization for performance
This part of the code requires an environment with the `tabicl` and `openpyxl` libraries installed.
The evaluation code for tabicl is kept separately in `./src/evaluation/tabicl_evaluate.py`. Use `./scripts/tabicl_evaluate.sh` to obtain the evaluation results for tabicl.
Use `--datasets` to specify the datasets to evaluate and `--sample_sizes` to specify the number of shots.
If multiple datasets need to be evaluated, separate them with spaces. To evaluate all CSV files in the input folder, use `all`.
MachineLearningLM uses the code from tabicl to generate prior data.
Use `./scripts/generate_data.sh` to generate the prior data. It produces the corresponding `.pt` and `.csv` files and normalizes the feature values in the CSV files to the range 0–999, as we did in the paper.
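The normalization itself happens inside the generation code; as a rough sketch of what mapping feature values to the 0–999 range can look like (assuming simple per-column min-max scaling and a hypothetical label column name, both illustrative only):

```python
import numpy as np
import pandas as pd

def normalize_features_0_999(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Min-max scale each numeric feature column to integers in [0, 999]."""
    out = df.copy()
    for col in out.columns:
        if col == label_col:
            continue  # leave the target column untouched
        values = out[col].astype(float)
        span = values.max() - values.min()
        if span == 0:
            out[col] = 0  # constant column maps to 0
        else:
            out[col] = np.round((values - values.min()) / span * 999).astype(int)
    return out
```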
### Data Scale & Structure

| Parameter | Type | Description |
|---|---|---|
| `min_features` | int | Minimum number of features per dataset |
| `max_features` | int | Maximum number of features per dataset |
| `max_classes` | int | Maximum number of target classes |
| `min_seq_len` | int | Minimum samples per dataset; uses `max_seq_len` if None |
| `max_seq_len` | int | Maximum samples per dataset (exclusive) |
### Batch Configuration

| Parameter | Type | Description |
|---|---|---|
| `batch_size` | int | Total number of datasets to generate per batch |
| `batch_size_per_gp` | int | Number of datasets per group (shared characteristics) |
| `batch_size_per_subgp` | int | Number of datasets per subgroup (similar causal structures); defaults to `batch_size_per_gp` if None |
### Sequence Length Control

| Parameter | Type | Description |
|---|---|---|
| `log_seq_len` | bool | Sample sequence length from a log-uniform distribution if True |
| `seq_len_per_gp` | bool | Sample sequence length per group (enables variable-sized datasets) |
| `replay_small` | bool | Occasionally sample smaller sequences for model robustness |
### Train-Test Split

| Parameter | Type | Description |
|---|---|---|
| `min_train_size` | int/float | Start position/ratio for the train split (int: absolute, float: fractional) |
| `max_train_size` | int/float | End position/ratio for the train split (int: absolute, float: fractional) |
### Generation Method

| Parameter | Type | Description |
|---|---|---|
| `prior_type` | str | Prior type: `'mlp_scm'`, `'tree_scm'`, or `'mix_scm'` (random selection) |
| `fixed_hp` | dict | Fixed structural configuration parameters |
| `sampled_hp` | dict | Parameters sampled during generation |
### Computation Settings

| Parameter | Type | Description |
|---|---|---|
| `n_jobs` | int | Number of parallel jobs (-1 = use all processors) |
| `num_threads_per_generate` | int | Number of threads per generation job |
| `device` | str | Computation device (`'cpu'` or `'cuda'`) |
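For orientation only, the parameters above could be grouped as in the following Python dictionary; the values are hypothetical and only meant to illustrate plausible types, while the real settings are configured via `./scripts/generate_data.sh`:

```python
# Hypothetical values for illustration only; see ./scripts/generate_data.sh
# for where the prior-generation parameters are actually configured.
prior_config = {
    # Data scale & structure
    "min_features": 2,
    "max_features": 20,
    "max_classes": 10,
    "min_seq_len": None,           # falls back to max_seq_len
    "max_seq_len": 1024,           # exclusive upper bound
    # Batch configuration
    "batch_size": 64,
    "batch_size_per_gp": 8,
    "batch_size_per_subgp": None,  # defaults to batch_size_per_gp
    # Sequence length control
    "log_seq_len": True,
    "seq_len_per_gp": True,
    "replay_small": True,
    # Train-test split
    "min_train_size": 0.1,
    "max_train_size": 0.9,
    # Generation method
    "prior_type": "mix_scm",
    # Computation
    "n_jobs": -1,
    "num_threads_per_generate": 1,
    "device": "cuda",
}
```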
MachineLearningLM uses the LLaMA-Factory framework for training.
```bash
cd ./third_party/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install wandb
```

Use `./scripts/train.sh` for training.
```
MachineLearningLM/
├── src/
│   ├── evaluation/
│   │   ├── data_prep/           # Data preprocessing and chunking utilities
│   │   ├── prompt_gen/          # Prompt generation for deep learning models
│   │   ├── model_pred/          # Model inference (ML and DL prediction engines)
│   │   ├── result_proc/         # 5-layer evaluation architecture and metrics processing
│   │   ├── zero_summary/        # Result summarization and report generation
│   │   └── tabicl_evaluate.py
│   └── prior_data/
│       └── pt_to_csv.py
├── scripts/
│   ├── single_process/          # Sequential execution shell scripts
│   ├── multi_process/           # Parallel execution shell scripts (with _mp suffix)
│   ├── evaluate_parameters.sh   # Global parameter configuration
│   ├── evaluate_pipeline.sh     # Automated end-to-end pipeline
│   ├── generate_data.sh
│   ├── tabicl_evaluate.sh
│   └── train.sh
├── datahub_inputs/
│   ├── data_demo/               # Demo datasets for testing
│   └── data_raw/                # Raw input datasets
├── third_party/
│   ├── tabicl/
│   └── LLaMA-Factory/
├── requirements.txt             # Python dependencies for the evaluation framework
├── README.md
├── README_zh.md
├── THIRD_PARTY_NOTICES.md
└── LICENSE
```
We thank LLaMA-Factory and TabICL for their open-source code.
```bibtex
@article{dong2025machinelearninglm,
  title={MachineLearningLM: Scaling Many-shot In-Context Learning via Continued Pretraining},
  author={Dong, Haoyu and Zhang, Pengkun and Lu, Mingzhe and Shen, Yanzhen and Ke, Guolin},
  journal={arXiv preprint arXiv:2509.06806},
  year={2025}
}
```