Published at ICLR-2026 (14th International Conference on Learning Representations), 23-27 April 2026, Rio de Janeiro, Brazil
Pre-print and appendix available at: https://arxiv.org/abs/2510.15681
module load gcc arrow/15.0.1 opencv/4.11.0(do this only once; afterward you just source ~/lean_env/bin/activate)
cd ./LEAN_interaction
python -m venv ~/lean_env
source ~/lean_env/bin/activatepip install --upgrade pip
pip install pandas tqdm datasets lockfile rapidfuzz
pip install poetry
pip install vllm
pip install trl
pip install wandb
poetry lock
poetry installpoetry shellcurl https://elan.lean-lang.org/elan-init.sh -sSf | sh
source $HOME/.elan/envelan default leanprover/lean4:4.15.0
lean --version # should print Lean (version 4.15.0)
lake --version # should work too and print version 4.15.0module load gcc arrow/15.0.1
source ~/lean_env/bin/activatepython ./datasets_training/NuminaMath-LEAN-PF/getDataset.pyThis'll create a data structure called pairs, that's a list of tuples (informal_theoremproof_pair, formal_theoremproof_pair). This code also prints the first two elements from pairs for convenience.
NOTE: You need to do this only once.
./LEAN_interaction/setupFolders_beforeLEANtypecheck.shIf you are on a slurm-based system, run the following commands to ease space constraint:
# Move the folder to your scratch directory
mv ../autoformalization-jtemb ~/scratch/
# Recreate a symlink at the old location pointing to the new one
ln -s ~/scratch/autoformalization-jtemb ../autoformalization-jtembAfter this, you can call the check_repl(...) function in ./LEAN_interaction/checkLEAN.py for LEAN type-checking. See an example of how to use it, in ./datasets_training/NuminaMath-LEAN-PF/dataset_construction/createDataset_step2_leanTypeCheck.py.
srun --account=def-ACCOUNT --gpus-per-node=h100:1 --cpus-per-task=8 --mem=128000 --time=0-01:15 --job-name=vllm --pty bashIf you find the paper or this repository useful, please cite it with:
@inproceedings{jana2026proofbridge,
title = {{ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings}},
author = {Jana, Prithwish and Kale, Kaan and Tanriverdi, Ahmet Ege and Song, Cruise and Vishwanath, Sriram and Ganesh, Vijay},
booktitle = {Proceedings of the 14th International Conference on Learning Representations (ICLR)},
year = {2026},
note = {arXiv:2510.15681},
}