This repo contains all the data and code related to our paper *Foundational Large Language Models for Materials Research*.
We performed domain adaptation of the LLaMA-2 and LLaMA-3 models for use in materials science, via continued pretraining followed by instruction finetuning on materials science and chemistry datasets.
For detailed results, please see our paper *Foundational Large Language Models for Materials Research*. The models can be downloaded from https://huggingface.co/m3rg-iitd. The codebase makes use of the Megatron-LLM library for efficient training of LLMs; go through its documentation to understand the basics. The environment for using our codebase is the same as the one for Megatron-LLM.
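Since the released checkpoints are hosted on Hugging Face, they can be loaded with the standard transformers API. The snippet below is a minimal sketch, not code from this repo; the model id is a placeholder, so check the organization page for the actual repository names.

```python
# Minimal sketch (not part of this repo): loading a released checkpoint with
# Hugging Face transformers. "m3rg-iitd/<model-name>" is a placeholder -- see
# https://huggingface.co/m3rg-iitd for the actual repository names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m3rg-iitd/<model-name>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the glass transition temperature of soda-lime silicate glass?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```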
- src: Contains code to pretrain and fine-tune LLMs with the LLaMA-2 or LLaMA-3 architecture.
- preprocess: Contains code used to extract text from Elsevier and Springer research papers for the corpus.
- plots: Contains code used for creating the plots in the paper.
- evaluation_codes: Contains code for running the benchmark evaluations.
Pretraining was performed on a text corpus of 30B tokens in total, interleaved in the following way:
- 10M research-paper tokens taken from Elsevier and Springer publications, followed by 0.1M Red Pajama tokens
- 30M Matsci community discourse tokens included in the last 3B (10%) of the dataset, in a 100:1 ratio

The list of journals and DOIs of the research papers used can be accessed from Zenodo.
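As an illustration only (this is not the actual corpus-building code), the interleaving described above can be sketched as follows: blocks of paper tokens alternating with Red Pajama blocks in a 100:1 token ratio, with the community-discourse tokens folded into the final 10% of the 30B-token stream.

```python
# Illustrative sketch of the interleaving scheme -- not the real preprocessing code.
# paper_blocks ~ 10M tokens each, redpajama_blocks ~ 0.1M tokens each (100:1),
# discourse_blocks hold the ~30M Matsci community discourse tokens.
def interleave_corpus(paper_blocks, redpajama_blocks, discourse_blocks,
                      total_tokens=30_000_000_000):
    """Return token blocks in (conceptually) the order used for pretraining."""
    corpus, seen = [], 0
    tail_start = int(0.9 * total_tokens)  # last 3B tokens (10%) of the corpus
    papers = iter(paper_blocks)
    redpajama = iter(redpajama_blocks)
    discourse = iter(discourse_blocks)

    while seen < total_tokens:
        for source in (papers, redpajama):
            block = next(source, None)
            if block is None:              # a source ran out: stop here
                return corpus
            corpus.append(block)
            seen += len(block)
        if seen >= tail_start:             # discourse only in the final ~10%
            extra = next(discourse, None)
            if extra is not None:
                corpus.append(extra)
                seen += len(extra)
    return corpus
```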
For running the benchmark evaluations on our datasets, first open the evaluation_codes directory and follow the instructions given there. The environment for inference on the matNLP tasks requires the vLLM library (a minimal usage sketch is given after the command below).
bash ft_eval_downstream.sh <Checkpoint_path> <GPU_number> <output_name1> <output_name2>
The output and error files will be stored in the same directory; their exact names can be found in the ft_eval_downstream.sh file.
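Because the matNLP inference environment is built on vLLM, the snippet below sketches how a HuggingFace-format checkpoint can be queried with vLLM. The checkpoint path and prompt are placeholders; the actual prompting and parsing logic lives in the scripts inside evaluation_codes.

```python
# Minimal vLLM sketch with placeholder paths and prompts; the real evaluation
# prompts and output parsing are implemented in the evaluation_codes scripts.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/hf_checkpoint")            # HF-format checkpoint directory
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Classify the following abstract as glass / non-glass: ..."]  # placeholder prompt
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```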
python3 {doping, mof1, mof2, discomat}_run.py <CUDA_GPU_NUMBER> <MODEL_PATH> <SAVE_NAME_PREFIX>
The output will be stored as <SAVE_NAME_PREFIX>_{doping, mof1, mof2, discomat}_test.pkl in the same folder.
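The exact structure of these pickle files is determined by the corresponding *_run.py script; if you want to inspect one before running the evaluation step, a generic sketch is:

```python
# Generic sketch for peeking at a saved predictions file. The pickle's structure
# is defined by the corresponding *_run.py script, so this only inspects the
# top-level object; the file name below is an example.
import pickle

with open("<SAVE_NAME_PREFIX>_doping_test.pkl", "rb") as f:
    predictions = pickle.load(f)

print(type(predictions))
if isinstance(predictions, dict):
    print(list(predictions.keys())[:10])
elif isinstance(predictions, (list, tuple)):
    print(len(predictions), predictions[:2])
```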
python3 {doping, mof1, mof2, discomat}_eval.py <SAVE_NAME_PREFIX>
This will print the results to the screen, along with the metrics discussed in the paper.
sh ft_pipeline.sh <load_model_path> <save_model_path> <model_iteration_to_finetune> <train_path>\
<val_path> <epochs> <number of docs in train set> <log_file_name> <llama2/llama3> <port number>
The files responsible for instruction finetuning (IFT):
- ft_pipeline.sh
- finetune.sh
- ft_sft.py
- ft_sft.sh
Arguments flow from top to bottom in the above list. The instruction finetuning process was performed on 8 NVIDIA A100 80GB GPUs via IIT Delhi's High Performance Computing facility.
The weights of the input model must be stored in the Megatron format. To convert model weights from the HuggingFace format to the Megatron format, use wt_fromhf.sh; for the reverse conversion, use wt_tohf.sh. The model weights resulting from IFT are stored in the HF format to facilitate inference.
We used the codebase of Megatron-LLM for training our models on NVIDIA A100 GPUs. We thank the High-Performance Computing (HPC) facility at IIT Delhi for computational and storage resources. This work was partially supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh. The EIDF provided access to Cerebras CS-2 clusters, which were used for pretraining our models.