This GitHub repository contains an improved split of the PharmKG8k dataset. The original split, found in the dataset authors' GitHub repository, has several issues.
These issues consist of:
- Duplicates within each individual set (train, test and valid).
- (head, tail) combinations that exist in more than one set, e.g. in both train and test.
- (tail, head) combinations in one set that exist as (head, tail) in another set, e.g. reversed between train and test.
- The split is not transductive, meaning there are entities in test and valid that do not appear in train.
Proof of these issues can be found in proof_of_data_leakage_etc_in_PharmKG8k.py which gives this output.
To investigate the faulty data further, dataframes containing the faulty entries can be generated with dataframes_with_more_info.py.
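The four checks listed above can be sketched in a few lines of plain Python. This is only an illustration and not the code in proof_of_data_leakage_etc_in_PharmKG8k.py: the function name `audit_split` and the assumption that each set is a list of `(head, relation, tail)` tuples are hypothetical.

```python
from collections import Counter


def audit_split(train, valid, test):
    """Count the four kinds of problems in a split.

    Each argument is assumed to be a list of (head, relation, tail) tuples.
    """
    splits = {"train": train, "valid": valid, "test": test}
    report = {}

    # 1. Exact duplicate triples inside each individual set.
    for name, triples in splits.items():
        counts = Counter(triples)
        report[f"duplicates_{name}"] = sum(c - 1 for c in counts.values())

    # 2 and 3. (head, tail) pairs shared between two sets, in the same
    # direction or reversed.
    names = list(splits)
    for i, a in enumerate(names):
        pairs_a = {(h, t) for h, _, t in splits[a]}
        for b in names[i + 1:]:
            pairs_b = {(h, t) for h, _, t in splits[b]}
            reversed_b = {(t, h) for h, t in pairs_b}
            report[f"same_pair_{a}_{b}"] = len(pairs_a & pairs_b)
            report[f"reversed_pair_{a}_{b}"] = len(pairs_a & reversed_b)

    # 4. Entities in valid or test that never appear in train
    # (a non-zero count means the split is not transductive).
    train_entities = {e for h, _, t in train for e in (h, t)}
    for name in ("valid", "test"):
        entities = {e for h, _, t in splits[name] for e in (h, t)}
        report[f"unseen_entities_{name}"] = len(entities - train_entities)

    return report
```

A clean split would give a report in which every count is zero.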
The new split was created with several goals in mind. The first goal is that the split should be transductive. The remaining goals can be expressed mathematically. We use the following notation:
Then the rules can be expressed as such:
and
and
In other words, the new split is created such that, for every edge in the training set, there is no edge in the test set or the valid set whose head and tail equal the head and tail of the training edge. Nor is there an edge whose head and tail equal the tail and head of the training edge. The same holds, vice versa, for the test and valid sets.
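The rule described in words above can be written compactly, for instance as follows. The symbols here are assumptions chosen for illustration and may differ from the notation used in the repository: $E_{\text{train}}$, $E_{\text{valid}}$ and $E_{\text{test}}$ denote the three edge sets, an edge is a triple $(h, r, t)$ of head, relation and tail, and $V(E)$ is the set of entities occurring in $E$.

```latex
% Transductive: every entity in valid and test also occurs in train.
V(E_{\text{valid}}) \cup V(E_{\text{test}}) \subseteq V(E_{\text{train}})

% No shared (head, tail) pair, in either direction, between two
% different sets A and B.
\forall A, B \in \{E_{\text{train}}, E_{\text{valid}}, E_{\text{test}}\},\ A \neq B:
\quad
\forall (h, r, t) \in A\ \ \nexists\, (h', r', t') \in B :
\ (h', t') = (h, t) \ \lor\ (h', t') = (t, h)
```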
The split can be found in improved_split, and the code that generates it can be found in making_the_split_files.py. Note that this code is not optimized, so it may take several hours to run.
The code starts by making a new split that only takes the edge constraints into consideration. It then moves some edges from valid and test into train to make the split transductive. Finally, it moves edges back into test and valid, choosing only edges that do not cause data leakage and that keep the split transductive, so that the split sizes stay intact and we end up with an 80%/10%/10% split.
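The first two stages described above can be illustrated with a minimal sketch. It is not the code in making_the_split_files.py: the function name `leakage_free_split`, its parameters, and the strategy of grouping triples by unordered entity pair are assumptions made for this example. The third stage, which moves edges back to restore the exact 80%/10%/10% sizes, is left out.

```python
import random
from collections import defaultdict


def leakage_free_split(triples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (head, relation, tail) tuples without pair leakage.

    Hypothetical sketch: sizes only approximate `ratios` after the repair step.
    """
    # Group triples by unordered entity pair, so that (h, t) and (t, h)
    # always land in the same set. set() drops exact duplicates.
    groups = defaultdict(list)
    for h, r, t in sorted(set(triples)):
        groups[frozenset((h, t))].append((h, r, t))

    # Shuffle the groups reproducibly.
    keys = sorted(groups, key=sorted)
    random.Random(seed).shuffle(keys)

    # Stage 1: assign whole groups to train, then valid, then test.
    total = sum(len(g) for g in groups.values())
    buckets = {"train": [], "valid": [], "test": []}
    filled = 0
    for key in keys:
        if filled < ratios[0] * total:
            name = "train"
        elif filled < (ratios[0] + ratios[1]) * total:
            name = "valid"
        else:
            name = "test"
        buckets[name].append(key)
        filled += len(groups[key])

    # Stage 2: move every held-out group that contains an entity unseen in
    # train into train, which makes the split transductive.
    seen = set().union(*buckets["train"])
    for name in ("valid", "test"):
        kept = []
        for key in buckets[name]:
            if key <= seen:
                kept.append(key)
            else:
                buckets["train"].append(key)
                seen |= key
        buckets[name] = kept

    return tuple(
        [triple for key in buckets[name] for triple in groups[key]]
        for name in ("train", "valid", "test")
    )
```

Because whole groups are moved rather than single triples, the repair step cannot reintroduce shared (head, tail) pairs between sets. The price is that train grows beyond 80%, which is why a rebalancing stage like the one described above is needed afterwards.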