Hello,
I have been exploring your code and datasets as described in the README, and I downloaded the dataset used to train BeLLM, SeanLee97/all_nli_angle_format_b, from https://huggingface.co/datasets/SeanLee97/all_nli_angle_format_b/tree/main.
After analyzing the data, I noticed that this dataset contains 480,862 rows and 3 columns. In comparison, prior work such as PromptEOL used the simcse_nli dataset (composed of SNLI and MNLI), which totals 275,602 rows × 3 columns.
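For reference, here is roughly how I inspected the dataset, using the Hugging Face `datasets` library (the `train` split name is my assumption; adjust if the repo uses a different split):

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
# Split name "train" is assumed here, not confirmed from the repo.
ds = load_dataset("SeanLee97/all_nli_angle_format_b", split="train")

print(ds.num_rows)       # observed: 480862
print(ds.column_names)   # observed: 3 columns
```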
I am curious about the composition of the all_nli_angle_format_b dataset and why the amount of data increased so significantly. Could you please share some insights into how this dataset was compiled and what makes up the additional data?
Additionally, have you or your team tested the performance of PromptEOL at this larger dataset size of 480,862 rows × 3 columns? I am interested in understanding how the increase in dataset size might influence the model's performance.
Thank you for your time and assistance. I look forward to your response!