Prerequisites: Python 3.8.6 and the following packages:
scipy==1.6.3
numpy==1.20.2
pandas==1.2.4
nltk==3.6.2
matplotlib==3.4.2
tqdm==4.49.0
transformers==4.6.0.dev0
torch==1.4.0
nlp==0.4.0
activations==0.1.0
brokenaxes==0.4.2
easydict==1.9
file_utils==0.0.1
scikit_learn==0.24.2
utils==1.0.1
xgboost==1.4.2
To ease the installation of dependencies, we suggest using pip with the provided requirements.txt:
$ pip install -r requirements.txt
Optionally, you can first create a Python virtual environment and install all the dependencies inside it:
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
The vulnerability dataset is obtained from the National Vulnerability Database (NVD), a United States government repository of standards-based vulnerability management data. We obtain the information through their Application Programming Interface (API), from index 0 to 152,000, representing data collected until April 2021. We filter the data to consider only descriptions related to version 3 of CVSS. We divide them into train and test sets, composed of 63,848 and 15,962 instances, respectively, both found in the data folder.
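The filtering step above can be sketched as follows. This is a minimal illustration, not the repository's actual collection code: the field names follow the NVD 1.1 JSON feed schema, and the record shape shown is an assumption.

```python
# Sketch: given CVE records in the (legacy) NVD 1.1 JSON shape, keep only
# entries carrying CVSS v3 metrics and pull out the English description.

def extract_v3_entries(cve_items):
    """Return (description, baseMetricV3) pairs for CVSS v3 entries."""
    out = []
    for item in cve_items:
        impact = item.get("impact", {})
        if "baseMetricV3" not in impact:
            continue  # drop records with no CVSS v3 score (e.g. v2-only)
        descs = item["cve"]["description"]["description_data"]
        text = next((d["value"] for d in descs if d["lang"] == "en"), None)
        if text:
            out.append((text, impact["baseMetricV3"]))
    return out

# Minimal example record in the assumed NVD 1.1 shape
sample = [{
    "cve": {"description": {"description_data": [
        {"lang": "en",
         "value": "Buffer overflow in ExampleApp allows remote code execution."}
    ]}},
    "impact": {"baseMetricV3": {"cvssV3": {
        "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"}}}
}]
pairs = extract_v3_entries(sample)
print(len(pairs))  # 1
```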
We process the collected data to retrieve vulnerability descriptions and the classes for each of the eight categories analyzed: Attack Vector, Attack Complexity, Privileges Required, User Interaction, Scope, Confidentiality, Integrity, and Availability. A visual representation of the class proportions for each category of the dataset is displayed in the following figure:
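The eight category labels can be read directly off a CVSS v3 vector string. The small parser below, a sketch rather than the repository's processing code, maps the abbreviated metric values to their class names as defined in the CVSS v3.1 specification:

```python
# CVSS v3 base metrics and their possible values (per the CVSS v3.1 spec).
METRICS = {
    "AV": ("Attack Vector", {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"}),
    "AC": ("Attack Complexity", {"L": "Low", "H": "High"}),
    "PR": ("Privileges Required", {"N": "None", "L": "Low", "H": "High"}),
    "UI": ("User Interaction", {"N": "None", "R": "Required"}),
    "S":  ("Scope", {"U": "Unchanged", "C": "Changed"}),
    "C":  ("Confidentiality", {"N": "None", "L": "Low", "H": "High"}),
    "I":  ("Integrity", {"N": "None", "L": "Low", "H": "High"}),
    "A":  ("Availability", {"N": "None", "L": "Low", "H": "High"}),
}

def parse_vector(vector):
    """Map a CVSS v3 vector string to labels for the eight categories."""
    labels = {}
    for part in vector.split("/")[1:]:   # skip the "CVSS:3.x" prefix
        key, value = part.split(":")
        name, values = METRICS[key]
        labels[name] = values[value]
    return labels

print(parse_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```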
We compare the performance of BERT, RoBERTa, ALBERT, BART, DeBERTa, and DistilBERT models on the created dataset. The following table displays the hyperparameters used for fine-tuning, based on the original authors' methodology for each model. All pretrained models are obtained from the HuggingFace repository.
| Model | Learning Rate | Training Epochs | Batch Size | Weight Decay |
|---|---|---|---|---|
| BERT | 3e-05 | 3 | 4 | 0 |
| RoBERTa | 1.5e-05 | 2 | 4 | 0.01 |
| ALBERT | 3e-05 | 3 | 8 | 0 |
| DeBERTa | 3e-05 | 10 | 4 | 0 |
| DistilBERT | 5e-05 | 3 | 8 | 0 |
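The table above can be expressed as the keyword arguments one would pass to HuggingFace `TrainingArguments` when fine-tuning each model. This is a hypothetical sketch; the repository's train.py may organize its configuration differently.

```python
# Per-model fine-tuning hyperparameters from the table above, keyed by the
# argument names used by transformers.TrainingArguments.
FINETUNE_CONFIG = {
    "bert":       dict(learning_rate=3e-5,   num_train_epochs=3,  per_device_train_batch_size=4, weight_decay=0.0),
    "roberta":    dict(learning_rate=1.5e-5, num_train_epochs=2,  per_device_train_batch_size=4, weight_decay=0.01),
    "albert":     dict(learning_rate=3e-5,   num_train_epochs=3,  per_device_train_batch_size=8, weight_decay=0.0),
    "deberta":    dict(learning_rate=3e-5,   num_train_epochs=10, per_device_train_batch_size=4, weight_decay=0.0),
    "distilbert": dict(learning_rate=5e-5,   num_train_epochs=3,  per_device_train_batch_size=8, weight_decay=0.0),
}

# Usage (requires transformers):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **FINETUNE_CONFIG["distilbert"])
```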
Our experiments use the scripts train.sh, train_specific_model.sh, and infer.sh. train_specific_model.sh trains a specific model on the different categories with varying hyperparameters (learning rate, epochs, batch size, and weight decay), while train.sh trains DistilBERT (the default in train.py, line 161) considering different text pre-processing approaches.
If our work or code helped you in your research, please use the following BibTeX entry.
@ARTICLE{9786831,
author={Costa, Joana Cabral and Roxo, Tiago and Sequeiros, João B. F. and Proença, Hugo and Inácio, Pedro R. M.},
journal={IEEE Access},
title={Predicting CVSS Metric via Description Interpretation},
year={2022},
volume={10},
number={},
pages={59125-59134},
doi={10.1109/ACCESS.2022.3179692}}