This repo contains all the data and code related to our paper *Foundational Large Language Models for Materials Research*.
We performed domain adaptation of the LLaMA-2 and LLaMA-3 models for use in materials science, via continued pretraining followed by instruction finetuning on materials science and chemistry datasets.
For detailed results, please see our paper *Foundational Large Language Models for Materials Research*. The models can be downloaded from https://huggingface.co/m3rg-iitd. The codebase makes use of the Megatron-LLM library for efficient training of LLMs; go through its documentation to understand the basics. The environment for using our codebase is the same as the one for Megatron-LLM.
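Since the released checkpoints are hosted on Hugging Face, they can be loaded with the standard transformers API. The snippet below is a minimal sketch, not code from this repo; the model id is a placeholder, so check the organization page for the actual repository names.

```python
# Minimal sketch (not part of this repo): loading a released checkpoint with
# Hugging Face transformers. "m3rg-iitd/<model-name>" is a placeholder -- see
# https://huggingface.co/m3rg-iitd for the actual repository names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m3rg-iitd/<model-name>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the glass transition temperature of soda-lime silicate glass?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```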
- src: Contains code to pretrain and fine-tune LLMs with the LLaMA-2 or LLaMA-3 architecture.
- preprocess: Contains code used to extract text from Elsevier and Springer research papers for the corpus.
- plots: Contains code used for creating the plots in the paper.
- evaluation_codes: Contains code for running the benchmark evaluations.
Pretraining was performed on a text corpus of 30B tokens in total, interleaved in the following way:
- 10M research-paper tokens taken from Elsevier and Springer publications, followed by 0.1M Red Pajama tokens
- 30M Matsci community discourse tokens included in the last 3B (10%) of the dataset, in a 100:1 ratio

The list of journals and DOIs of the research papers used can be accessed from Zenodo.
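As an illustration only (this is not the actual corpus-building code), the interleaving described above can be sketched as follows: blocks of paper tokens alternating with Red Pajama blocks in a 100:1 token ratio, with the community-discourse tokens folded into the final 10% of the 30B-token stream.

```python
# Illustrative sketch of the interleaving scheme -- not the real preprocessing code.
# paper_blocks ~ 10M tokens each, redpajama_blocks ~ 0.1M tokens each (100:1),
# discourse_blocks hold the ~30M Matsci community discourse tokens.
def interleave_corpus(paper_blocks, redpajama_blocks, discourse_blocks,
                      total_tokens=30_000_000_000):
    """Return token blocks in (conceptually) the order used for pretraining."""
    corpus, seen = [], 0
    tail_start = int(0.9 * total_tokens)  # last 3B tokens (10%) of the corpus
    papers = iter(paper_blocks)
    redpajama = iter(redpajama_blocks)
    discourse = iter(discourse_blocks)

    while seen < total_tokens:
        for source in (papers, redpajama):
            block = next(source, None)
            if block is None:              # a source ran out: stop here
                return corpus
            corpus.append(block)
            seen += len(block)
        if seen >= tail_start:             # discourse only in the final ~10%
            extra = next(discourse, None)
            if extra is not None:
                corpus.append(extra)
                seen += len(extra)
    return corpus
```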
For running the benchmark evaluations on our datasets, first open the evaluation_codes directory and follow the instructions given there. The environment for inference on the matNLP tasks requires the vLLM library (a minimal usage sketch is given after the command below).
bash ft_eval_downstream.sh <Checkpoint_path> <GPU_number> <output_name1> <output_name2>
The output and error files will be stored in the same directory; their exact names can be found in the ft_eval_downstream.sh file.
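Because the matNLP inference environment is built on vLLM, the snippet below sketches how a HuggingFace-format checkpoint can be queried with vLLM. The checkpoint path and prompt are placeholders; the actual prompting and parsing logic lives in the scripts inside evaluation_codes.

```python
# Minimal vLLM sketch with placeholder paths and prompts; the real evaluation
# prompts and output parsing are implemented in the evaluation_codes scripts.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/hf_checkpoint")            # HF-format checkpoint directory
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Classify the following abstract as glass / non-glass: ..."]  # placeholder prompt
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```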
python3 {doping, mof1, mof2, discomat}_run.py <CUDA_GPU_NUMBER> <MODEL_PATH> <SAVE_NAME_PREFIX>
The output will be stored as <SAVE_NAME_PREFIX>_{doping, mof1, mof2, discomat}_test.pkl in the same folder.
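The exact structure of these pickle files is determined by the corresponding *_run.py script; if you want to inspect one before running the evaluation step, a generic sketch is:

```python
# Generic sketch for peeking at a saved predictions file. The pickle's structure
# is defined by the corresponding *_run.py script, so this only inspects the
# top-level object; the file name below is an example.
import pickle

with open("<SAVE_NAME_PREFIX>_doping_test.pkl", "rb") as f:
    predictions = pickle.load(f)

print(type(predictions))
if isinstance(predictions, dict):
    print(list(predictions.keys())[:10])
elif isinstance(predictions, (list, tuple)):
    print(len(predictions), predictions[:2])
```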
python3 {doping, mof1, mof2, discomat}_eval.py <SAVE_NAME_PREFIX>
This will print the results to the screen, along with the metrics discussed in the paper.
sh ft_pipeline.sh <load_model_path> <save_model_path> <model_iteration_to_finetune> <train_path>\
<val_path> <epochs> <number of docs in train set> <log_file_name> <llama2/llama3> <port number>
The files responsible for instruction finetuning (IFT):
- ft_pipeline.sh
- finetune.sh
- ft_sft.py
- ft_sft.sh
Arguments flow from top to bottom in the above list. The instruction finetuning process was performed on 8 NVIDIA A100 80GB GPUs via IIT Delhi's High Performance Computing facility.
The weights of the input model must be stored in the Megatron format. To convert model weights from the HuggingFace format to the Megatron format, use wt_fromhf.sh; for the reverse conversion, use wt_tohf.sh. The model weights resulting from IFT are stored in the HF format to facilitate inference.
We used the codebase of Megatron-LLM for training our models on NVIDIA A100 GPUs. We thank the High-Performance Computing (HPC) facility at IIT Delhi for computational and storage resources. This work was partially supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh. The EIDF provided access to Cerebras CS-2 clusters, which were used for pretraining our models.