This repository contains the code for the NeurIPS 2025 paper "Understanding and Enhancing Mask-Based Pretraining towards Universal Representations".
The paper presents a working theory of mask-based pretraining schemes (e.g., MIM and MAE) using high-dimensional linear regression, and proposes an embarrassingly simple improvement guided by the theory. The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. The proposed improvement, termed R²MAE, is implemented in vision, language, DNA sequence, and single-cell models, where it consistently outperforms both standard and more complicated masking schemes.
@article{dong2025understanding,
title={Understanding and Enhancing Mask-Based Pretraining towards Universal Representations},
author={Dong, Mingze and Wang, Leda and Kluger, Yuval},
journal={The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)},
year={2025}
}

Table of Contents:
- `notebooks` contains Jupyter notebooks for reproducing the plots and results in the paper.
- `src` contains R²MAE implementations for vision MAE, RoBERTa, GPN-MSA, and single-cell models. Note: most modifications are made in place, so the original repositories may also need to be downloaded.
  - `mae` is a modification of facebookresearch/mae that implements R²MAE.
    - `engine_pretrain_customized.py` is the modified version of the original `engine_pretrain.py` that implements R²MAE.
    - `main_pretrain_customized.py` is the modified version of the original `main_pretrain.py`.
    - Several files were slightly modified to resolve dependency issues.
  - `roberta` is a modification of the RoBERTa model implementation in huggingface/transformers/examples/pytorch/language-modeling (v4.52.4) that implements R²MAE.
    - `run_mlm_customized.py` is the modified version of the original `run_mlm.py`.
    - `datacollator_customized.py` should be placed in the same folder, `transformers/examples/pytorch/language-modeling`, to enable the override.
  - `gpn-msa` is a modification of gpn-msa in songlab-cal/gpn/gpn (v0.6) that implements R²MAE and alternative baselines. Most of the modifications are there to enable CL-MAE (named "map" in the repository); an R²MAE-only version would be much simpler.
    - `model_map.py` is the modified version of the original `model.py`.
    - `msa_map` contains the modified version of `train.py`.
  - `single_cell` is an R²MAE implementation for single-cell MAE models built on scvi-tools (v0.16.0).
    - The `scVIMaskModel` class supports R²MAE, Dynamic MR, etc.
- `scripts` contains model training scripts for vision MAE, RoBERTa, GPN-MSA, and single-cell models.
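
This README does not spell out the masking rule itself, so the following is only a rough sketch of the general idea of drawing a fresh masking ratio at every training step instead of fixing it. It assumes the ratio is sampled uniformly from a range; the function name `sample_mask` and the bounds `min_ratio` and `max_ratio` are hypothetical and are not taken from this repository. Refer to the files under `src` for the actual implementations.

```python
import random


def sample_mask(num_tokens, min_ratio=0.1, max_ratio=0.9, rng=None):
    """Return (mask, ratio) for one training step.

    Illustrative assumption: the masking ratio is drawn uniformly from
    [min_ratio, max_ratio] on every call. The bounds are placeholders,
    not values from the paper or the repository.
    """
    if num_tokens < 2:
        raise ValueError("need at least two tokens to mask one and keep one")
    rng = rng or random.Random()

    # Draw a new masking ratio for this step.
    ratio = rng.uniform(min_ratio, max_ratio)

    # Convert the ratio to a token count, keeping at least one token
    # masked and at least one token visible.
    num_masked = max(1, min(num_tokens - 1, round(ratio * num_tokens)))

    # Choose which positions to mask, uniformly without replacement.
    masked_positions = set(rng.sample(range(num_tokens), num_masked))
    mask = [i in masked_positions for i in range(num_tokens)]
    return mask, ratio


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        # 196 is the number of 16x16 patches in a 224x224 image.
        mask, ratio = sample_mask(196, rng=rng)
        print(f"step {step}: ratio={ratio:.2f}, masked={sum(mask)} of {len(mask)}")
```

In a real training loop the resulting mask would be applied to patches or tokens before the encoder, in the same place where a fixed-ratio mask is normally generated.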