This repository was archived by the owner on May 5, 2021. It is now read-only.

Datasets Index

Gautier Koscielny edited this page Jun 29, 2018 · 9 revisions

Work In Progress: Some of the information might be incorrect.

Index of datasets

Open Targets gene-disease associations per datatype: gene_disease_associations_datatypes_with_expression.csv

Open Targets gene-disease associations per datatype

Filename: gene_disease_associations_datatypes_with_expression.csv

Number of rows x columns: 2,304,670 rows x 20 columns

This file is a dump of the Open Targets database summarising all gene-disease associations per type of evidence (genetics, somatic mutations, transcriptomic studies, clinical trials, affected pathways, disease-relevant animal models, text mining).

For convenience, we have added the GTEx tissue-specific expression when known.

Column name           Description
target_indication     target-disease pair
entrez_id             Entrez gene identifier
ensembl_gene_id       Ensembl gene identifier
symbol                gene symbol
disease_id            disease identifier
disease_label         disease/GWAS trait/phenotype name
therapeutic_area      therapeutic area for this disease/trait/phenotype, e.g., metabolic disease; genetic disorder
is_direct             True if the association is drawn from direct evidence, False if it is propagated through the disease classification
overall_score         overall score of the association (aggregate of the other scores)
genetic_association   genetic association score
somatic_mutation      somatic mutation score (from all somatic datasources)
known_drug            Clinical trial score based on ChEMBL evidence
rna_expression        mRNA differential expression score
affected_pathway      Affected pathways score (combines Reactome and SlapEnrich)
animal_model          Animal model score based on Phenodigm
literature            Europe PMC score
tissue_label          Relevant tissue name when known
source                source of the expression data (GTEx v6)
max_fold_change       gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue)
expression_score      normalised gene expression score for max_fold_change
head -10 gene_disease_associations_datatypes_with_expression.csv
target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,therapeutic_area,is_direct,overall_score,genetic_association,somatic_mutation,known_drug,rna_expression,affected_pathway,animal_model,literature,tissue_label,source,max_fold_change,expression_score
ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,Alzheimers disease,nervous system disease,True,0.014184,0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0
ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,bone measurement,measurement,False,0.0241172844,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-Orphanet_101987,22820.0,ENSG00000181789,COPG1,Orphanet_101987,Constitutional neutropenia,immune system disease; genetic disorder,False,0.14474,0.0,0.0,0.0,0.0,0.0,0.14474,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,Aicardi syndrome,eye disease; genetic disorder,True,0.1155,0.0,0.0,0.0,0.0,0.0,0.1155,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_71862,51138.0,ENSG00000138663,COPS4,Orphanet_71862,Retinal dystrophy,eye disease; genetic disorder,False,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.132368888889,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000121022-HP_0000118,10987.0,ENSG00000121022,COPS5,HP_0000118,Phenotypic abnormality,phenotype,False,0.0478056991922,0.0,0.0,0.0,0.0,0.0,0.0,0.0478056991922,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-EFO_0000400,22820.0,ENSG00000181789,COPG1,EFO_0000400,diabetes mellitus,metabolic disease,False,0.08116,0.0,0.0,0.0,0.0,0.0,0.08116,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_217607,51138.0,ENSG00000138663,COPS4,Orphanet_217607,Familial dilated cardiomyopathy,genetic disorder; cardiovascular disease,False,0.09196,0.0,0.0,0.0,0.0,0.0,0.09196,0.0,Unspecified,Unspecified,0.0,0.0
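
As a minimal sketch of how this file might be filtered, the snippet below parses rows with Python's standard csv module, converts is_direct and the score columns, and keeps direct associations above a threshold. The three data rows are copied from the output above; the function names and the 0.1 threshold are illustrative assumptions and not part of any Open Targets tooling.

```python
import csv
import io

# Header and three rows copied from the sample output above.
# For the real file, replace io.StringIO(SAMPLE) with open(path, newline="").
SAMPLE = "\n".join([
    "target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,"
    "therapeutic_area,is_direct,overall_score,genetic_association,somatic_mutation,"
    "known_drug,rna_expression,affected_pathway,animal_model,literature,"
    "tissue_label,source,max_fold_change,expression_score",
    "ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,"
    "Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,"
    "1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0",
    "ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,"
    "Alzheimers disease,nervous system disease,True,"
    "0.014184,0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0",
    "ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,"
    "Aicardi syndrome,eye disease; genetic disorder,True,"
    "0.1155,0.0,0.0,0.0,0.0,0.0,0.1155,0.0,Unspecified,Unspecified,0.0,0.0",
]) + "\n"

SCORE_COLUMNS = [
    "overall_score", "genetic_association", "somatic_mutation", "known_drug",
    "rna_expression", "affected_pathway", "animal_model", "literature",
    "max_fold_change", "expression_score",
]


def load_associations(handle):
    """Yield one dict per row, with is_direct as bool and scores as float."""
    for row in csv.DictReader(handle):
        row["is_direct"] = row["is_direct"] == "True"
        for column in SCORE_COLUMNS:
            row[column] = float(row[column])
        yield row


def direct_associations(rows, min_score=0.0):
    """Keep direct associations whose overall score reaches min_score."""
    return [r for r in rows if r["is_direct"] and r["overall_score"] >= min_score]


rows = list(load_associations(io.StringIO(SAMPLE)))
direct = direct_associations(rows, min_score=0.1)  # threshold is arbitrary
for r in direct:
    print(r["symbol"], r["disease_label"], r["overall_score"])
```

Reading the file row by row keeps memory use low, which may matter given the 2,304,670 rows reported above.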

Gene disease associations per datasource

Filename: gene_disease_associations_datasources_with_expression.csv

Number of rows x columns: 2,304,670 rows x 30 columns

Column name         Description
target_indication   target-disease pair
entrez_id           Entrez gene identifier
ensembl_gene_id     Ensembl gene identifier
symbol              gene symbol
disease_id          disease identifier
disease_label       disease/GWAS trait/phenotype name
therapeutic_area    therapeutic area for this disease/trait/phenotype, e.g., metabolic disease; genetic disorder
is_direct           True if the association is drawn from direct evidence, False if it is propagated through the disease classification
overall_score       overall score of the association (aggregate of the other scores)
expression_atlas    Expression Atlas association score
uniprot             UniProt genetic score
gwas_catalog        GWAS Catalog genetic score
phewas_catalog      PheWAS Catalog genetic score
eva                 EVA (ClinVar) genetic score
uniprot_literature  UniProt literature curated genetic score
genomics_england    Genomics England PanelApp genetic score
gene2phenotype      Gene2Phenotype genetic score
reactome            Reactome affected pathways score
slapenrich          SlapEnrich cancer affected pathways score
phenodigm           Phenodigm (Animal model) score
cancer_gene_census  Cancer Gene Census score
eva_somatic         EVA (ClinVar) somatic mutations score
uniprot_somatic     UniProt somatic mutations score
intogen             IntOGen cancer driver gene score
chembl              ChEMBL clinical trial score
europepmc           Europe PMC literature score
tissue_label        Relevant tissue name when known
source              source of the expression data (GTEx v6)
max_fold_change     gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue)
expression_score    normalised gene expression score for max_fold_change
head -10 gene_disease_associations_datasources_with_expression.csv
target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,therapeutic_area,is_direct,overall_score,expression_atlas,uniprot,gwas_catalog,phewas_catalog,eva,uniprot_literature,genomics_england,gene2phenotype,reactome,slapenrich,phenodigm,cancer_gene_census,eva_somatic,uniprot_somatic,intogen,chembl,europepmc,tissue_label,source,max_fold_change,expression_score
ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000173085-EFO_0000249,27235.0,ENSG00000173085,COQ2,EFO_0000249,Alzheimers disease,nervous system disease,True,0.014184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014184,Unspecified,Unspecified,0.0,0.0
ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,bone measurement,measurement,False,0.0241172844,0.0,0.0,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-Orphanet_101987,22820.0,ENSG00000181789,COPG1,Orphanet_101987,Constitutional neutropenia,immune system disease; genetic disorder,False,0.14474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14474,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000119723-Orphanet_50,51004.0,ENSG00000119723,COQ6,Orphanet_50,Aicardi syndrome,eye disease; genetic disorder,True,0.1155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1155,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_71862,51138.0,ENSG00000138663,COPS4,Orphanet_71862,Retinal dystrophy,eye disease; genetic disorder,False,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132368888889,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000121022-HP_0000118,10987.0,ENSG00000121022,COPS5,HP_0000118,Phenotypic abnormality,phenotype,False,0.0478056991922,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0478056991922,Unspecified,Unspecified,0.0,0.0
ENSG00000181789-EFO_0000400,22820.0,ENSG00000181789,COPG1,EFO_0000400,diabetes mellitus,metabolic disease,False,0.08116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08116,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
ENSG00000138663-Orphanet_217607,51138.0,ENSG00000138663,COPS4,Orphanet_217607,Familial dilated cardiomyopathy,genetic disorder; cardiovascular disease,False,0.09196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09196,0.0,0.0,0.0,0.0,0.0,0.0,Unspecified,Unspecified,0.0,0.0
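
With one column per datasource, a common first step is to find which datasources actually contribute to each association. The sketch below does this for two rows copied from the output above. It assumes the datasource columns are exactly those between overall_score and tissue_label in the header, as in the sample; the function name is hypothetical.

```python
import csv
import io

# Header and two rows copied from the sample output above.
SAMPLE = "\n".join([
    "target_indication,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,"
    "therapeutic_area,is_direct,overall_score,expression_atlas,uniprot,gwas_catalog,"
    "phewas_catalog,eva,uniprot_literature,genomics_england,gene2phenotype,reactome,"
    "slapenrich,phenodigm,cancer_gene_census,eva_somatic,uniprot_somatic,intogen,"
    "chembl,europepmc,tissue_label,source,max_fold_change,expression_score",
    "ENSG00000167113-Orphanet_183616,51117.0,ENSG00000167113,COQ4,Orphanet_183616,"
    "Genetic neuro-ophthalmological disease,eye disease; genetic disorder,False,1.0,"
    "0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"
    "Unspecified,Unspecified,0.0,0.0",
    "ENSG00000198612-EFO_0004512,10920.0,ENSG00000198612,COPS8,EFO_0004512,"
    "bone measurement,measurement,False,0.0241172844,"
    "0.0,0.0,0.0241172844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"
    "Unspecified,Unspecified,0.0,0.0",
]) + "\n"


def contributing_sources(handle):
    """Map each target-disease pair to its datasources with a non-zero score."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames
    # Assumption: datasource columns sit between overall_score and tissue_label.
    first = columns.index("overall_score") + 1
    last = columns.index("tissue_label")
    source_columns = columns[first:last]
    result = {}
    for row in reader:
        scores = {name: float(row[name]) for name in source_columns}
        result[row["target_indication"]] = {
            name: score for name, score in scores.items() if score > 0.0
        }
    return result


sources = contributing_sources(io.StringIO(SAMPLE))
for pair, scores in sources.items():
    print(pair, scores)
```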

Supplementary files for prediction methods

Gene information

Filename: gene_info_qtq.csv

All the gene features you might need to build your prediction model are stored in this file. It contains the symbol, identifiers (gene locus and protein), type (e.g. protein-coding or not), cellular location and protein class of the genes, as well as their Gene Ontology annotations.

Column name               Description
symbol                    gene symbol
hgnc_id                   HGNC official identifier
entrez_id                 Entrez gene identifier
ensembl_gene_id           ENSEMBL gene identifier
uniprot_id                UniProt protein identifier (note that there can be several proteins for one gene)
locus_type                type of this genomic locus
locus_group               group classification for this genomic locus
go_id                     Gene Ontology (GO) term identifier
go_label                  Gene Ontology (GO) term name
evidence_type             evidence type according to GO (please refer to the GO evidence code documentation)
reported_count            how many times this type of evidence has been reported (useful for replicability)
protein_class             ChEMBL druggable genome classification of the protein
target_class              target class
topology_type             topology information
target_location           Cellular location
ExAC_LoF                  Resilience to loss of function according to ExAC
pc_mouse_gene_identity    mouse ortholog
GTEX_median_all_tissues   median expression across all GTEx tissues
description               gene description

GTEx gene expression and disease relevance

Filename: gene_disease_gtex_tissue_expression.csv.gz

This compressed file relates two pieces of information that are useful for building the prediction model: the tissue relevant to a disease and the expression of a gene in that tissue.

Hence, it's possible to combine the tissue and expression in your model to assess if successful drug targets are also expressed at the protein level.

Column name        Description
entrez_id          Entrez gene identifier
ensembl_gene_id    ENSEMBL gene identifier
symbol             gene symbol
disease_id         disease identifier
disease_label      disease name
tissue_label       tissue name as described in GTEx
source             GTEx version 6
max_fold_change    gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue)
expression_score   normalised gene expression score for max_fold_change
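
The max_fold_change rule can be read as two conditions on a gene's expression in one tissue. The sketch below is only one interpretation of that description, not the code used to build the file. It assumes the fold change is measured against the median across tissues, that "within 5-fold of the highest expression tissue" means at least one fifth of the highest value, and that 0.0 is reported otherwise. All names are hypothetical.

```python
def is_tissue_specific(tissue_expr, median_expr, highest_expr, fold=5.0):
    """Check both conditions from the max_fold_change description.

    tissue_expr  -- expression of the gene in the tissue of interest
    median_expr  -- median expression of the gene across tissues
    highest_expr -- expression of the gene in its highest-expressing tissue
    """
    above_median = tissue_expr >= fold * median_expr
    near_highest = tissue_expr * fold >= highest_expr
    return above_median and near_highest


def fold_change_or_zero(tissue_expr, median_expr, highest_expr, fold=5.0):
    """Return the fold change over the median if both conditions hold, else 0.0."""
    if median_expr > 0 and is_tissue_specific(
        tissue_expr, median_expr, highest_expr, fold
    ):
        return tissue_expr / median_expr
    return 0.0


# 10-fold above the median and close to the highest tissue: reported.
specific = fold_change_or_zero(100.0, 10.0, 120.0)
# 10-fold above the median but far below the highest tissue: not reported.
overshadowed = fold_change_or_zero(100.0, 10.0, 600.0)
# Only 3-fold above the median: not reported.
weak = fold_change_or_zero(30.0, 10.0, 30.0)
print(specific, overshadowed, weak)
```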

In the example below, the gene MUC7 is specifically expressed in the Minor Salivary Gland.

gunzip -c gene_disease_gtex_tissue_expression.csv.gz | head -5
0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,tissue_label,source,max_fold_change,expression_score
0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,Minor Salivary Gland,GTExv6,57385.21,0.99
1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,Minor Salivary Gland,GTExv6,57385.21,0.99
2,4589,ENSG00000171195,MUC7,EFO_0003826,salivary gland neoplasm,Minor Salivary Gland,GTExv6,57385.21,0.99
3,4589,ENSG00000171195,MUC7,EFO_1000344,Major Salivary Gland Carcinoma,Minor Salivary Gland,GTExv6,57385.21,0.99

We can double-check that on the Open Targets portal:

[Figure: MUC7 RNA expression]
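
The same check can be scripted. The sketch below reads the gzip-compressed CSV with Python's standard gzip and csv modules without unpacking it to disk. To stay self-contained it compresses two rows copied from the gunzip output above in memory; for the real file, pass the path to read_expression instead. The function name is hypothetical.

```python
import csv
import gzip
import io

# Header and two rows copied from the gunzip output above.
SAMPLE = (
    b"0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,"
    b"tissue_label,source,max_fold_change,expression_score\n"
    b"0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,"
    b"Minor Salivary Gland,GTExv6,57385.21,0.99\n"
    b"1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,"
    b"Minor Salivary Gland,GTExv6,57385.21,0.99\n"
)


def read_expression(path_or_fileobj):
    """Yield rows of the gzip-compressed CSV with numeric columns as float."""
    with gzip.open(path_or_fileobj, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            row["max_fold_change"] = float(row["max_fold_change"])
            row["expression_score"] = float(row["expression_score"])
            yield row


# In-memory stand-in for gene_disease_gtex_tissue_expression.csv.gz
compressed = io.BytesIO(gzip.compress(SAMPLE))
rows = list(read_expression(compressed))

# Collect the distinct (gene, tissue) pairs seen in the sample.
gene_tissues = {(r["symbol"], r["tissue_label"]) for r in rows}
print(gene_tissues)
```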

Disease tissue location (EFO to Uberon)

Filename: disease_uberon_location.csv

Open Targets integrates disease, phenotype and anatomical information using controlled vocabularies organised in ontologies. The Experimental Factor Ontology (EFO), developed and maintained at EMBL-EBI, is used extensively in Open Targets to annotate the evidence linking a gene to a disease, GWAS trait or phenotype. EFO also contains information about the relevant tissue for a disease. In this case, tissues are annotated using Uberon, a cross-species ontology which covers anatomical structures. The following file is an index of diseases and the 'location' of each disease in a tissue.

Column name              Description
disease_id               disease identifier
disease_location_id      UBERON location identifier
disease_location_label   UBERON location corresponding name

For instance, the following rows:

41,EFO_0003760,UBERON_0000955,brain
42,EFO_1000310,UBERON_0000160,intestine

correspond to:

  • central nervous system cyst (EFO_0003760) located in the brain
  • Juvenile Polyp (EFO_1000310) located in the intestine

You can use the anatomical location feature in your drug repurposing prediction method.
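
As a minimal sketch of turning this index into a feature, the snippet below builds a lookup from disease identifier to anatomical locations using the two example rows above. It assumes each row carries a leading row number followed by the three documented columns, as in those examples, and it skips a header row if one is present. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The two example rows shown above (leading field is a row number).
SAMPLE = (
    "41,EFO_0003760,UBERON_0000955,brain\n"
    "42,EFO_1000310,UBERON_0000160,intestine\n"
)


def load_locations(handle):
    """Map disease_id to a list of (location_id, location_label) tuples."""
    locations = defaultdict(list)
    for record in csv.reader(handle):
        if len(record) < 3:
            continue  # skip blank or malformed lines
        # Take the last three fields so a leading row number is ignored.
        disease_id, location_id, label = record[-3:]
        if disease_id == "disease_id":
            continue  # header row, if the file has one
        locations[disease_id].append((location_id, label))
    return dict(locations)


locations = load_locations(io.StringIO(SAMPLE))
print(locations["EFO_0003760"])
```

A disease can map to several locations, which is why each value is a list.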

Open Targets similar diseases (based on targets in common)

Filename (JSON format): relation-shared-target.json.gz

Number of records: 557,122

Filename (TSV format): relation-shared-target.tsv.gz

Number of rows x columns: 557,122 rows x 10 columns

Open Targets introduced the concept of disease similarity based on the number of shared targets. For instance, on the Asthma disease profile page, the section "similar diseases (based on targets in common)" lists diseases that share targets with Asthma.

[Figure: Similar diseases for Asthma]

For instance, clicking on "Chronic Obstructive Pulmonary Disease" will display a list of shared targets.

[Figure: Shared targets between Asthma and Chronic Obstructive Pulmonary Disease]

The data in relation-shared-target.json.gz is a dump of this information. Two examples of similar-disease records are given below:

{
  "sort": [
    "relation-shared-target#DOID_0050890-EFO_0000584"
  ],
  "_type": "relation-shared-target",
  "_routing": "DOID_0050890",
  "_source": {
    "shared_targets": [
      "ENSG00000164400"
    ],
    "object": {
      "id": "EFO_0000584",
      "links": {
        "targets_count": 18
      },
      "label": "infectious meningitis"
    },
    "scores": {
      "overlap": 0.22265818869331525
    },
    "counts": {
      "shared_count": 1,
      "union_count": 20
    },
    "id": "DOID_0050890-EFO_0000584",
    "subject": {
      "id": "DOID_0050890",
      "links": {
        "targets_count": 3
      },
      "label": "synucleinopathy"
    }
  },
  "_score": null,
  "_index": "18.06_relation-data",
  "_id": "DOID_0050890-EFO_0000584"
}
{
  "sort": [
    "relation-shared-target#DOID_0050890-EFO_0003030"
  ],
  "_type": "relation-shared-target",
  "_routing": "DOID_0050890",
  "_source": {
    "shared_targets": [
      "ENSG00000164400"
    ],
    "object": {
      "id": "EFO_0003030",
      "links": {
        "targets_count": 7
      },
      "label": "abscess"
    },
    "scores": {
      "overlap": 0.33106914536924065
    },
    "counts": {
      "shared_count": 1,
      "union_count": 9
    },
    "id": "DOID_0050890-EFO_0003030",
    "subject": {
      "id": "DOID_0050890",
      "links": {
        "targets_count": 3
      },
      "label": "synucleinopathy"
    }
  },
  "_score": null,
  "_index": "18.06_relation-data",
  "_id": "DOID_0050890-EFO_0003030"
}
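
The records above can be summarised with Python's standard json module. The sketch below uses the two example records, trimmed to the fields it reads, and assumes the dump stores one JSON object per line. It also checks that union_count equals the subject and object target counts minus shared_count, which holds for both examples. How the overlap score itself is computed is not documented here, so the sketch only reports it. The function name is hypothetical.

```python
import json

# The two example records above, trimmed to the fields used below.
RECORDS = [
    '{"_id": "DOID_0050890-EFO_0000584", "_source": {'
    '"shared_targets": ["ENSG00000164400"], '
    '"subject": {"id": "DOID_0050890", "label": "synucleinopathy", '
    '"links": {"targets_count": 3}}, '
    '"object": {"id": "EFO_0000584", "label": "infectious meningitis", '
    '"links": {"targets_count": 18}}, '
    '"scores": {"overlap": 0.22265818869331525}, '
    '"counts": {"shared_count": 1, "union_count": 20}}}',
    '{"_id": "DOID_0050890-EFO_0003030", "_source": {'
    '"shared_targets": ["ENSG00000164400"], '
    '"subject": {"id": "DOID_0050890", "label": "synucleinopathy", '
    '"links": {"targets_count": 3}}, '
    '"object": {"id": "EFO_0003030", "label": "abscess", '
    '"links": {"targets_count": 7}}, '
    '"scores": {"overlap": 0.33106914536924065}, '
    '"counts": {"shared_count": 1, "union_count": 9}}}',
]


def summarise(line):
    """Extract the disease pair, shared targets and counts from one record."""
    source = json.loads(line)["_source"]
    subject, obj, counts = source["subject"], source["object"], source["counts"]
    expected_union = (
        subject["links"]["targets_count"]
        + obj["links"]["targets_count"]
        - counts["shared_count"]
    )
    return {
        "subject": subject["label"],
        "object": obj["label"],
        "shared_targets": source["shared_targets"],
        "overlap": source["scores"]["overlap"],
        "union_matches": expected_union == counts["union_count"],
    }


summaries = [summarise(line) for line in RECORDS]
for item in summaries:
    print(item["subject"], "/", item["object"], item["shared_targets"])
```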

Similar targets (based on diseases in common)

Files in json format:

Files in tsv format:
