Datasets Index
- Open Targets gene-disease associations per datatype: gene_disease_associations_datatypes_with_expression.csv
- Open Targets gene-disease associations per datasource: gene_disease_associations_datasources_with_expression.csv
- Gene information: gene_info_qtq.csv
- GTEx gene expression and disease relevance: gene_disease_gtex_tissue_expression.csv.gz
- Disease tissue location (EFO to Uberon): disease_uberon_location.csv
- Open Targets similar diseases (based on targets in common): relation-shared-target.json.gz, relation-shared-target.tsv.gz
- Similar targets (based on diseases in common)
Filename: gene_disease_associations_datatypes_with_expression.csv
Number of rows x columns: 2,304,670 rows x 20 columns
This file is a dump of the Open Targets database summarising all gene-disease associations per type of evidence (genetics, somatic mutations, transcriptomic studies, clinical trials, affected pathways, disease-relevant animal models, text mining).
For convenience, we have added the GTEx specific expression when known.
Column name | Description |
---|---|
target_indication | target-disease pair |
entrez_id | Entrez gene identifier |
ensembl_gene_id | Ensembl gene identifier |
symbol | gene symbol |
disease_id | disease identifier |
disease_label | disease/GWAS trait/phenotype name |
therapeutic_area | therapeutic area for this disease/trait/phenotype, e.g., metabolic disease; genetic disorder |
is_direct | Is the association drawn from direct evidence, or propagated through the disease classification? |
overall_score | overall score of the association (aggregate of the other scores) |
genetic_association | genetic association score |
somatic_mutation | somatic mutation score (from all somatic datasources) |
known_drug | Clinical trial score based on ChEMBL evidence |
rna_expression | mRNA differential expression score |
affected_pathway | Affected pathways score (combines Reactome and SlapEnrich) |
animal_model | Animal model score based on Phenodigm |
literature | Europe PMC score |
tissue_label | Relevant tissue name when known |
source | GTEx v6 |
max_fold_change | gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue) |
expression_score | normalised gene expression score for max_fold_change |
head -10 gene_disease_associations_datatypes_with_expression.csv
target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,therapeutic_area,is_direct,overall_score,genetic_association,somatic_mutation,known_drug,rna_expression,affected_pathway,animal_model,literature,tissue_label,source,max_fold_change,expression_score
ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,Alzheimers disease,nervous system disease,True,0.014184,0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0
ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,bone measurement,measurement,False,0.0241172844,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-Orphanet_101987,22820.0,ENSG00000181789,COPG1,Orphanet_101987,Constitutional neutropenia,immune system disease; genetic disorder,False,0.14474,0.0,0.0,0.0,0.0,0.0,0.14474,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,Aicardi syndrome,eye disease; genetic disorder,True,0.1155,0.0,0.0,0.0,0.0,0.0,0.1155,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_71862,51138.0,ENSG00000138663,COPS4,Orphanet_71862,Retinal dystrophy,eye disease; genetic disorder,False,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.132368888889,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000121022-HP_0000118,10987.0,ENSG00000121022,COPS5,HP_0000118,Phenotypic abnormality,phenotype,False,0.0478056991922,0.0,0.0,0.0,0.0,0.0,0.0,0.0478056991922,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-EFO_0000400,22820.0,ENSG00000181789,COPG1,EFO_0000400,diabetes mellitus,metabolic disease,False,0.08116,0.0,0.0,0.0,0.0,0.0,0.08116,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_217607,51138.0,ENSG00000138663,COPS4,Orphanet_217607,Familial dilated cardiomyopathy,genetic disorder; cardiovascular disease,False,0.09196,0.0,0.0,0.0,0.0,0.0,0.09196,0.0,Unspecified,Unspecified,0.0,0.0
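As a minimal sketch of how this file might be filtered, the snippet below keeps only direct associations above a score threshold. It uses the standard library only and three rows copied from the `head` output above; the function name `direct_associations` and the 0.1 threshold are illustrative assumptions, not part of the dataset.

```python
import csv
import io

# Header and three rows copied from the `head` output above. In practice you
# would pass an open handle on
# gene_disease_associations_datatypes_with_expression.csv instead.
SAMPLE = (
    "target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,"
    "disease_label,therapeutic_area,is_direct,overall_score,"
    "genetic_association,somatic_mutation,known_drug,rna_expression,"
    "affected_pathway,animal_model,literature,tissue_label,source,"
    "max_fold_change,expression_score\n"
    "ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,"
    "Alzheimers disease,nervous system disease,True,0.014184,"
    "0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0\n"
    "ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,"
    "Aicardi syndrome,eye disease; genetic disorder,True,0.1155,"
    "0.0,0.0,0.0,0.0,0.0,0.1155,0.0,Unspecified,Unspecified,0.0,0.0\n"
    "ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,"
    "bone measurement,measurement,False,0.0241172844,"
    "0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0\n"
)


def direct_associations(handle, min_score=0.0):
    """Yield (symbol, disease_label, overall_score) for direct associations."""
    for row in csv.DictReader(handle):
        score = float(row["overall_score"])
        # is_direct is stored as the text "True" / "False" in the CSV.
        if row["is_direct"] == "True" and score >= min_score:
            yield row["symbol"], row["disease_label"], score


hits = list(direct_associations(io.StringIO(SAMPLE), min_score=0.1))
print(hits)
```

Streaming with `csv.DictReader` avoids loading all 2,304,670 rows into memory at once; a dataframe library would work equally well if memory allows.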
Filename: gene_disease_associations_datasources_with_expression.csv
Number of rows x columns: 2,304,670 rows x 30 columns
Column name | Description |
---|---|
target_indication | target-disease pair |
entrez_id | Entrez gene identifier |
ensembl_gene_id | Ensembl gene identifier |
symbol | gene symbol |
disease_id | disease identifier |
disease_label | disease/GWAS trait/phenotype |
therapeutic_area | therapeutic area for this disease/trait/phenotype, e.g., metabolic disease; genetic disorder |
is_direct | Is the association drawn from direct evidence, or propagated through the disease classification? |
overall_score | overall score of the association (aggregate of the other scores) |
expression_atlas | Expression Atlas association score |
uniprot | UniProt genetic score |
gwas_catalog | GWAS Catalog genetic score |
phewas_catalog | PheWAS Catalog genetic score |
eva | EVA (ClinVar) genetic score |
uniprot_literature | UniProt literature curated genetic score |
genomics_england | Genomics England PanelApp genetic score |
gene2phenotype | Gene2Phenotype genetic score |
reactome | Reactome affected pathways score |
slapenrich | SlapEnrich cancer affected pathways score |
phenodigm | Phenodigm (Animal model) score |
cancer_gene_census | Cancer Gene Census score |
eva_somatic | EVA (ClinVar) somatic mutations score |
uniprot_somatic | UniProt somatic mutations score |
intogen | IntOGen cancer driver gene score |
chembl | ChEMBL clinical trial score |
europepmc | Europe PMC literature score |
tissue_label | Relevant tissue name when known |
source | GTEx v6 |
max_fold_change | gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue) |
expression_score | normalised gene expression score for max_fold_change |
head -10 gene_disease_associations_datasources_with_expression.csv
target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,therapeutic_area,is_direct,overall_score,expression_atlas,uniprot,gwas_catalog,phewas_catalog,eva,uniprot_literature,genomics_england,gene2phenotype,reactome,slapenrich,phenodigm,cancer_gene_census,eva_somatic,uniprot_somatic,intogen,chembl,europepmc,tissue_label,source,max_fold_change,expression_score
ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,Alzheimers disease,nervous system disease,True,0.014184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0
ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,bone measurement,measurement,False,0.0241172844,0.0,0.0,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-Orphanet_101987,22820.0,ENSG00000181789,COPG1,Orphanet_101987,Constitutional neutropenia,immune system disease; genetic disorder,False,0.14474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14474,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,Aicardi syndrome,eye disease; genetic disorder,True,0.1155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1155,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_71862,51138.0,ENSG00000138663,COPS4,Orphanet_71862,Retinal dystrophy,eye disease; genetic disorder,False,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000121022-HP_0000118,10987.0,ENSG00000121022,COPS5,HP_0000118,Phenotypic abnormality,phenotype,False,0.0478056991922,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0478056991922,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-EFO_0000400,22820.0,ENSG00000181789,COPG1,EFO_0000400,diabetes mellitus,metabolic disease,False,0.08116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08116,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_217607,51138.0,ENSG00000138663,COPS4,Orphanet_217607,Familial dilated cardiomyopathy,genetic disorder; cardiovascular disease,False,0.09196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09196,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
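One possible use of the per-datasource columns is to find which datasource contributes the highest score to each association. The sketch below does this for two rows copied from the `head` output above; the helper `top_datasource` and the `NON_SCORE` set are illustrative names, and treating every remaining column as a datasource score is an assumption based on the column table.

```python
import csv
import io

# Columns that are not datasource scores, taken from the column table above.
NON_SCORE = {
    "target_indication", "entrez_id", "ensembl_gene_id", "symbol",
    "disease_id", "disease_label", "therapeutic_area", "is_direct",
    "overall_score", "tissue_label", "source", "max_fold_change",
    "expression_score",
}

# Header and two rows copied from the `head` output above.
SAMPLE = (
    "target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,"
    "disease_label,therapeutic_area,is_direct,overall_score,"
    "expression_atlas,uniprot,gwas_catalog,phewas_catalog,eva,"
    "uniprot_literature,genomics_england,gene2phenotype,reactome,slapenrich,"
    "phenodigm,cancer_gene_census,eva_somatic,uniprot_somatic,intogen,chembl,"
    "europepmc,tissue_label,source,max_fold_change,expression_score\n"
    "ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,"
    "Orphanet_183616,Genetic neuro-ophthalmological disease,"
    "eye disease; genetic disorder,False,1.0,"
    "0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"
    "Unspecified,Unspecified,0.0,0.0\n"
    "ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,"
    "bone measurement,measurement,False,0.0241172844,"
    "0.0,0.0,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"
    "Unspecified,Unspecified,0.0,0.0\n"
)


def top_datasource(row):
    """Return (datasource_name, score) for the highest-scoring datasource."""
    scores = {k: float(v) for k, v in row.items() if k not in NON_SCORE}
    name = max(scores, key=scores.get)  # first maximum wins on ties
    return name, scores[name]


top = {
    row["target_indication"]: top_datasource(row)
    for row in csv.DictReader(io.StringIO(SAMPLE))
}
print(top)
```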
Filename: gene_info_qtq.csv
All the gene features you might need to build your prediction model are stored in this file. It contains the symbol, identifiers (gene locus and protein), type (e.g. protein-coding or not), cellular location and protein class of the genes, as well as their Gene Ontology annotations.
Column name | Description |
---|---|
symbol | gene symbol |
hgnc_id | HGNC official identifier |
entrez_id | Entrez gene identifier |
ensembl_gene_id | Ensembl gene identifier |
uniprot_id | UniProt protein identifier (note that there can be several proteins for one gene) |
locus_type | type of this genomic locus |
locus_group | group classification for this genomic locus |
go_id | Gene Ontology (GO) term identifier |
go_label | Gene Ontology (GO) term name |
evidence_type | evidence type according to GO (please refer to the Gene Ontology evidence code documentation) |
reported_count | how many times this type of evidence has been reported (useful for replicability) |
protein_class | ChEMBL druggable genome classification of the protein |
target_class | target class |
topology_type | topology information |
target_location | Cellular location |
ExAC_LoF | Resilient to Loss of Function according to ExAC |
pc_mouse_gene_identity | mouse ortholog |
GTEX_median_all_tissues | median expression across all GTEx tissues |
description | gene description |
Filename: gene_disease_gtex_tissue_expression.csv.gz
This compressed file contains the relation between 2 important pieces of information to build the prediction model:
- A) The relevant tissue for a disease from a systematic mining of the scientific literature (see this scientific report by Vinod Kumar and colleagues at GSK)
- B) The genes specifically expressed in the disease-affected tissue
Hence, it's possible to combine the tissue and expression information in your model to assess whether successful drug targets are also specifically expressed (at the mRNA level, as measured in GTEx) in the disease-affected tissue.
Column name | Description |
---|---|
entrez_id | Entrez gene identifier |
ensembl_gene_id | Ensembl gene identifier |
symbol | gene symbol |
disease_id | disease identifier |
disease_label | disease name |
tissue_label | tissue name as described in GTEx |
source | GTEx version 6 |
max_fold_change | gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue) |
expression_score | normalised gene expression score for max_fold_change |
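The `max_fold_change` rule described in the table can be read as two thresholds applied together. The sketch below is one interpretation of that wording, not the pipeline actually used to build the file: it assumes "fold change" means expression divided by the median across tissues, that "within 5-fold of the highest" means at least one fifth of the maximum, and that non-qualifying tissues get 0.0. The expression values are made up for illustration.

```python
from statistics import median


def max_fold_change(expression_by_tissue, tissue, fold=5.0):
    """Fold change over the median tissue if both thresholds pass, else 0.0.

    Assumed reading of the rule:
      1. expression in `tissue` is at least `fold` times the median tissue;
      2. expression in `tissue` is within `fold` of the top tissue.
    """
    values = list(expression_by_tissue.values())
    med = median(values)
    top = max(values)
    value = expression_by_tissue[tissue]
    if med <= 0:
        return 0.0  # assumption: no fold change is defined for a zero median
    above_median = value >= fold * med
    near_top = value * fold >= top
    return value / med if above_median and near_top else 0.0


# Hypothetical expression values (not taken from GTEx).
expression = {
    "Minor Salivary Gland": 600.0,
    "Lung": 2.0,
    "Liver": 1.0,
    "Brain": 3.0,
    "Heart": 2.0,
}

print(max_fold_change(expression, "Minor Salivary Gland"))  # 300.0
print(max_fold_change(expression, "Lung"))                  # 0.0
```

Here the median across the five tissues is 2.0, so the salivary gland value passes both thresholds with a fold change of 300, while lung fails the first threshold.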
In the example below, the gene MUC7 is specifically expressed in the Salivary Gland.
gunzip -c gene_disease_gtex_tissue_expression.csv.gz | head -5
0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,tissue_label,source,max_fold_change,expression_score
0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,Minor Salivary Gland,GTExv6,57385.21,0.99
1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,Minor Salivary Gland,GTExv6,57385.21,0.99
2,4589,ENSG00000171195,MUC7,EFO_0003826,salivary gland neoplasm,Minor Salivary Gland,GTExv6,57385.21,0.99
3,4589,ENSG00000171195,MUC7,EFO_1000344,Major Salivary Gland Carcinoma,Minor Salivary Gland,GTExv6,57385.21,0.99
We can double-check that on the Open Targets portal.
Filename: disease_uberon_location.csv
Open Targets integrates disease, phenotype and anatomical information using controlled vocabularies organised in ontologies. The Experimental Factor Ontology (EFO) developed and maintained at EMBL-EBI is used extensively in Open Targets to annotate the evidence linking a gene to a disease, GWAS trait or phenotype. EFO contains information about the relevant tissue for a disease too. In this case, tissues are annotated using the Uberon cross-species ontology, which covers anatomical structures. The following file is an index of diseases and the 'location' of each disease in a tissue.
Column name | Description |
---|---|
disease_id | disease identifier |
disease_location_id | UBERON location identifier |
disease_location_label | UBERON location corresponding name |
For instance, the following rows:
41,EFO_0003760,UBERON_0000955,brain
42,EFO_1000310,UBERON_0000160,intestine
correspond to:
- central nervous system cyst (EFO_0003760) located in the brain
- Juvenile Polyp (EFO_1000310) located in the intestine
You can use the anatomical location feature in your drug repurposing prediction method.
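A minimal sketch of turning this file into a lookup table follows, using the two example rows above. It assumes the leading number on each row is a row index that can be ignored, and it stores a list per disease in case a disease has several locations; `load_locations` is an illustrative name. The real file presumably starts with a header line, which would need to be skipped first.

```python
import csv
import io

# The two example rows shown above (index, disease_id, location id, label).
ROWS = (
    "41,EFO_0003760,UBERON_0000955,brain\n"
    "42,EFO_1000310,UBERON_0000160,intestine\n"
)


def load_locations(handle):
    """Map disease_id -> list of (disease_location_id, disease_location_label)."""
    locations = {}
    for _index, disease_id, location_id, label in csv.reader(handle):
        locations.setdefault(disease_id, []).append((location_id, label))
    return locations


locations = load_locations(io.StringIO(ROWS))
print(locations["EFO_0003760"])
```

The resulting dictionary can then be joined on `disease_id` with any of the association files described earlier.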
Filename (JSON format): relation-shared-target.json.gz
Number of records: 557,122
Filename (TSV format): relation-shared-target.tsv.gz
Number of rows x columns: 557,122 rows x 10 columns
Open Targets introduced the concept of similarity between diseases based on the number of targets they share.
For instance, on the Asthma disease profile page, the section "similar diseases (based on targets in common)" lists diseases that share targets with asthma.
Clicking on "Chronic Obstructive Pulmonary Disease" in that section displays the list of shared targets.
The data in relation-shared-target.json.gz is a dump of this information and 2 examples of similar diseases are given below:
{
"sort": [
"relation-shared-target#DOID_0050890-EFO_0000584"
],
"_type": "relation-shared-target",
"_routing": "DOID_0050890",
"_source": {
"shared_targets": [
"ENSG00000164400"
],
"object": {
"id": "EFO_0000584",
"links": {
"targets_count": 18
},
"label": "infectious meningitis"
},
"scores": {
"overlap": 0.22265818869331525
},
"counts": {
"shared_count": 1,
"union_count": 20
},
"id": "DOID_0050890-EFO_0000584",
"subject": {
"id": "DOID_0050890",
"links": {
"targets_count": 3
},
"label": "synucleinopathy"
}
},
"_score": null,
"_index": "18.06_relation-data",
"_id": "DOID_0050890-EFO_0000584"
}
{
"sort": [
"relation-shared-target#DOID_0050890-EFO_0003030"
],
"_type": "relation-shared-target",
"_routing": "DOID_0050890",
"_source": {
"shared_targets": [
"ENSG00000164400"
],
"object": {
"id": "EFO_0003030",
"links": {
"targets_count": 7
},
"label": "abscess"
},
"scores": {
"overlap": 0.33106914536924065
},
"counts": {
"shared_count": 1,
"union_count": 9
},
"id": "DOID_0050890-EFO_0003030",
"subject": {
"id": "DOID_0050890",
"links": {
"targets_count": 3
},
"label": "synucleinopathy"
}
},
"_score": null,
"_index": "18.06_relation-data",
"_id": "DOID_0050890-EFO_0003030"
}
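To work with these records in a model, it can help to flatten each one into a simple row. The sketch below parses a trimmed copy of the first record above (only the `_source` part is kept) and checks that `union_count` follows from the two `targets_count` values and `shared_count`. The `flatten` helper and the `jaccard` field are illustrative additions; note that the plain ratio `shared_count / union_count` is not the same number as the `overlap` score, whose formula is not described here.

```python
import json

# Trimmed copy of the first example record above (only "_source" is kept).
RECORD = """
{"_source": {
  "shared_targets": ["ENSG00000164400"],
  "object": {"id": "EFO_0000584", "links": {"targets_count": 18},
             "label": "infectious meningitis"},
  "scores": {"overlap": 0.22265818869331525},
  "counts": {"shared_count": 1, "union_count": 20},
  "id": "DOID_0050890-EFO_0000584",
  "subject": {"id": "DOID_0050890", "links": {"targets_count": 3},
              "label": "synucleinopathy"}
}}
"""


def flatten(record):
    """Flatten one relation record into a dict of scalar fields."""
    src = record["_source"]
    subject, obj, counts = src["subject"], src["object"], src["counts"]
    return {
        "subject_id": subject["id"],
        "subject_label": subject["label"],
        "subject_targets": subject["links"]["targets_count"],
        "object_id": obj["id"],
        "object_label": obj["label"],
        "object_targets": obj["links"]["targets_count"],
        "shared_count": counts["shared_count"],
        "union_count": counts["union_count"],
        "overlap": src["scores"]["overlap"],
        # Illustrative extra field: plain shared/union ratio.
        "jaccard": counts["shared_count"] / counts["union_count"],
    }


row = flatten(json.loads(RECORD))

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|, here 3 + 18 - 1 = 20.
expected_union = row["subject_targets"] + row["object_targets"] - row["shared_count"]
print(row["subject_label"], "/", row["object_label"], expected_union, row["jaccard"])
```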
Files in json format:
- relation-shared-disease-0.json.gz
- relation-shared-disease-1.json.gz
- relation-shared-disease-2.json.gz
- relation-shared-disease-3.json.gz
- relation-shared-disease-4.json.gz
- relation-shared-disease-5.json.gz
- relation-shared-disease-6.json.gz
- relation-shared-disease-7.json.gz
Files in tsv format: