A better split of PharmKG8k

This GitHub repository contains an improved split of the PharmKG8k dataset. The original split, found in the original authors' GitHub repository, has several issues.

Issues in the original split

The issues are:

Duplicates within each individual set (train, test and valid).
(head, tail) combinations that exist in more than one set, e.g. in both train and test.
(tail, head) combinations in one set, e.g. train, that exist as (head, tail) in another set, e.g. test.
The split is not transductive, meaning that there are entities in test and valid that do not appear in train.
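As an illustration, the four checks listed above can be sketched in a few lines of plain Python. This is a minimal sketch on made-up toy triples, not the code from this repository; the function name `split_issues` and the report keys are hypothetical.

```python
from collections import Counter

def split_issues(train, test, valid):
    """Count the four issue types for lists of (head, relation, tail) triples."""
    splits = {"train": train, "test": test, "valid": valid}
    report = {}

    # 1. Exact duplicate triples inside each individual set.
    for name, triples in splits.items():
        counts = Counter(triples)
        report[f"duplicates_{name}"] = sum(c - 1 for c in counts.values())

    # 2. and 3. Entity pairs shared by two sets, in the same or reversed direction.
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs_a = {(h, t) for h, _, t in splits[a]}
            pairs_b = {(h, t) for h, _, t in splits[b]}
            reversed_b = {(t, h) for h, t in pairs_b}
            report[f"same_direction_{a}_{b}"] = len(pairs_a & pairs_b)
            report[f"reversed_{a}_{b}"] = len(pairs_a & reversed_b)

    # 4. Entities in test or valid that never appear in train (not transductive).
    train_entities = {e for h, _, t in train for e in (h, t)}
    for name in ("test", "valid"):
        entities = {e for h, _, t in splits[name] for e in (h, t)}
        report[f"unseen_entities_{name}"] = len(entities - train_entities)
    return report

# Toy data (made up for illustration only).
train = [("a", "r1", "b"), ("a", "r1", "b"), ("c", "r2", "d")]
test = [("a", "r2", "b"), ("d", "r1", "c"), ("e", "r1", "a")]
valid = [("c", "r1", "a")]
report = split_issues(train, test, valid)
```

In the toy data, train contains one duplicate, shares the pair (a, b) with test, has (c, d) reversed in test, and test introduces the unseen entity e.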

Proof of these issues can be found in proof_of_data_leakage_etc_in_PharmKG8k.py, which produces the output in output_proof_of_data_leakage_etc.txt.

To investigate the faulty data further, dataframes containing it can be generated with dataframes_with_more_info.py.

The better split

The new split was created with several goals in mind. The first goal is that the split should be transductive. The remaining goals can be expressed mathematically, using the following notation:

$$(h,r,t) = (head,relation,tail)$$

$$P_{tr} = train$$

$$P_{ts} = test$$

$$P_{va} = valid$$

Then the rules can be expressed as such:

$$\forall (h,r,t) \in P_{tr} : \neg(\exists (h',r',t') \in (P_{ts} \cup P_{va}) : \\ ((h = h' \land t = t') \lor (h = t' \land t = h')))$$

and

$$\forall (h,r,t) \in P_{ts} : \neg(\exists (h',r',t') \in (P_{tr} \cup P_{va}) : \\ ((h = h' \land t = t') \lor (h = t' \land t = h')))$$

and

$$\forall (h,r,t) \in P_{va} : \neg(\exists (h',r',t') \in (P_{tr} \cup P_{ts}) : \\ ((h = h' \land t = t') \lor (h = t' \land t = h')))$$

In other words, the new split is constructed so that, for every edge in the training set, neither the test set nor the valid set contains an edge with the same head and tail. Nor do they contain an edge whose head and tail are the tail and head of the training edge. The same holds, vice versa, for the test and valid sets.
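Since the rules ignore both the relation and the direction of an edge, they amount to saying that the sets of unordered entity pairs in train, test and valid are pairwise disjoint. A minimal sketch of such a check, assuming the triples are held as Python tuples (the helper names `pair_keys` and `satisfies_pair_rules` are hypothetical and not part of this repository):

```python
def pair_keys(triples):
    """Unordered {head, tail} keys, so (h, t) and (t, h) map to the same key."""
    return {frozenset((h, t)) for h, _, t in triples}

def satisfies_pair_rules(train, test, valid):
    """True if no entity pair, in either direction, occurs in more than one set."""
    tr, ts, va = pair_keys(train), pair_keys(test), pair_keys(valid)
    return tr.isdisjoint(ts) and tr.isdisjoint(va) and ts.isdisjoint(va)

# Toy example: the test edge is the reverse of a train edge, so the rules fail.
leaky = satisfies_pair_rules([("a", "r1", "b")], [("b", "r2", "a")], [])
clean = satisfies_pair_rules([("a", "r1", "b")], [("a", "r1", "c")], [("b", "r1", "c")])
```

Using `frozenset` as the key folds the two cases of the rule, same direction and reversed direction, into a single set comparison.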

The split can be found in improved_split, and the code that generates it can be found in making_the_split_files.py. Note that this code is not optimized, so it may take several hours to run.

The code starts by making a new split that only takes the edge constraints into consideration. Once this is done, some edges are moved from valid and test into train to make the split transductive. Edges that neither cause data leakage nor break transductivity are then moved back into test and valid, so that the split sizes are kept intact and we end up with an 80%/10%/10% split.
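The three steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in making_the_split_files.py: it assumes that edges are grouped by unordered entity pair and that whole groups are moved at once, and the function name `make_split` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

def make_split(triples, eval_fraction=0.1, seed=0):
    """Sketch: split (head, relation, tail) triples into train/test/valid."""
    # Group edges by unordered entity pair, so (h, t) and (t, h) stay together.
    groups = defaultdict(list)
    for h, r, t in triples:
        groups[frozenset((h, t))].append((h, r, t))
    keys = list(groups)
    random.Random(seed).shuffle(keys)

    target = round(eval_fraction * len(triples))  # desired test/valid size
    assigned = {"train": [], "test": [], "valid": []}
    sizes = Counter()

    # Step 1: only the edge constraints. Fill test and valid, the rest is train.
    for key in keys:
        n = len(groups[key])
        name = next((s for s in ("test", "valid") if sizes[s] + n <= target), "train")
        assigned[name].append(key)
        sizes[name] += n

    # Step 2: move groups with entities unseen in train into train.
    entity_count = Counter(e for key in assigned["train"] for e in key)
    for name in ("test", "valid"):
        kept = []
        for key in assigned[name]:
            if all(entity_count[e] > 0 for e in key):
                kept.append(key)
            else:
                assigned["train"].append(key)
                entity_count.update(key)
                sizes[name] -= len(groups[key])
        assigned[name] = kept

    # Step 3: move groups back while every entity still remains in train.
    for name in ("test", "valid"):
        remaining = []
        for key in assigned["train"]:
            n = len(groups[key])
            removable = all(entity_count[e] > 1 for e in key)
            if removable and sizes[name] + n <= target:
                assigned[name].append(key)
                entity_count.subtract(key)
                sizes[name] += n
            else:
                remaining.append(key)
        assigned["train"] = remaining

    return {name: [triple for key in ks for triple in groups[key]]
            for name, ks in assigned.items()}

# Toy graph (made up): a ring of 20 entities, chords, and some reversed edges.
triples = [(f"e{i}", "r", f"e{(i + 1) % 20}") for i in range(20)]
triples += [(f"e{i}", "r2", f"e{(i + 5) % 20}") for i in range(20)]
triples += [(f"e{(i + 1) % 20}", "r3", f"e{i}") for i in range(0, 20, 2)]
split = make_split(triples)
print({name: len(part) for name, part in split.items()})
```

Moving whole groups keeps the edge constraints satisfied at every step, and a group is only taken out of train when each of its entities occurs in at least one other train group, so the split stays transductive. On small or sparse graphs the test and valid sets may end up below their target size.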
