Hello,
I have been exploring your code and datasets as described in the README, and I downloaded the dataset used to train BeLLM, SeanLee97/all_nli_angle_format_b, from https://huggingface.co/datasets/SeanLee97/all_nli_angle_format_b/tree/main.
After analyzing the data, I noticed that this dataset contains 480,862 rows and 3 columns. In comparison, prior work such as PromptEOL used the simcse_nli dataset (composed of SNLI and MNLI), which totals 275,602 rows × 3 columns.
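For reference, here is roughly how I inspected the dataset, using the Hugging Face `datasets` library (the `train` split name is my assumption; adjust if the repo uses a different split):

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
# Split name "train" is assumed here, not confirmed from the repo.
ds = load_dataset("SeanLee97/all_nli_angle_format_b", split="train")

print(ds.num_rows)       # observed: 480862
print(ds.column_names)   # observed: 3 columns
```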
I am curious about the composition of the all_nli_angle_format_b dataset and why the amount of data increased so significantly. Could you please share some insights into how this dataset was compiled and what makes up the additional data?
Additionally, have you or your team tested the performance of PromptEOL at this larger dataset size of 480,862 rows × 3 columns? I am interested in understanding how the increase in dataset size might influence the model's performance.
Thank you for your time and assistance. I look forward to your response!