DataLayout
The data/ folder collects all gene and interaction data input files. Its top-level organization is:
data/
├── attributes
├── functions
├── identifiers
├── networks
├── organism.cfg
└── metadata_fixes.txt
organism.cfg is the organism description configuration file.
Example:
name = Saccharomyces cerevisiae
short_name = S. cerevisiae
common_name = baker's yeast
gm_organism_id = 6
ncbi_taxonomy_id = 4932
default_genes = MRE11, RAD54, RAD52, RAD10, XRS2, CDC27, APC4, APC2, APC5, APC11
The field gm_organism_id is an internal numeric identifier and will be assigned automatically if not given. It can be specified explicitly to keep the identifier stable between builds.
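The key = value layout above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: parse_cfg is a hypothetical helper, and skipping lines that start with '#' and stripping surrounding double quotes are assumptions.

```python
def parse_cfg(text):
    """Parse 'key = value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


example = """\
name = Saccharomyces cerevisiae
short_name = S. cerevisiae
gm_organism_id = 6
ncbi_taxonomy_id = 4932
default_genes = MRE11, RAD54, RAD52
"""

organism = parse_cfg(example)
# default_genes is a comma-separated list of gene symbols.
default_genes = [gene.strip() for gene in organism["default_genes"].split(",")]
# gm_organism_id is optional; None would mean "assign automatically".
organism_id = organism.get("gm_organism_id")
```

Values are kept as strings; converting gm_organism_id or ncbi_taxonomy_id to integers is left to the caller.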
metadata_fixes.txt is an optional input file used to specify and record changes to network metadata. It should not normally be needed, since metadata edits can be made directly in the individual network .cfg files. It is intended for the following use case:
- A bulk import creates many networks with similar parameters, e.g. 100 networks from GEO, all classified as co-expression.
- One particular network needs to be reclassified as co-localization, and the import script doesn't lend itself to such customization. The change can be specified in metadata_fixes.txt.
- The .cfg files are expected to be overwritten by a subsequent import, but the reclassification is not lost because it is contained in a separate file.
The file has three tab-delimited columns and no header. The first column is the path, relative to the data folder, of the network .cfg file to be changed. The second and third columns are the name of the variable to change and its new value.
Example: change the group of a network to co-localization
networks/direct/geo/GSE123.cfg group coloc
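As an illustration of how such fixes could be applied on top of already-loaded network settings, here is a rough sketch. The function names, the in-memory configs dictionary, and the starting group value "coexp" are all hypothetical; only the three-column layout comes from the description above.

```python
import csv
import io


def read_fixes(handle):
    """Yield (cfg_path, variable, value) from a headerless, tab-delimited file."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # assumption: blank or malformed lines are skipped
        yield tuple(field.strip() for field in row)


def apply_fixes(configs, fixes):
    """Overwrite settings in configs, a dict of cfg path -> dict of settings."""
    for cfg_path, variable, value in fixes:
        configs.setdefault(cfg_path, {})[variable] = value
    return configs


fixes_text = "networks/direct/geo/GSE123.cfg\tgroup\tcoloc\n"
# "coexp" is a placeholder for whatever the bulk import wrote.
configs = {"networks/direct/geo/GSE123.cfg": {"group": "coexp", "source": "GEO"}}
apply_fixes(configs, read_fixes(io.StringIO(fixes_text)))
```

Because the fixes live in their own file, the same call can be repeated after every re-import without editing the regenerated .cfg files by hand.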
Text files containing gene identifiers and their descriptions, organized into subfolders.
data/identifiers/
├── descriptions
├── mixed_table
└── symbols
Gene identifiers in id/symbol/source triplets, one per line. Multiple files may be provided; the symbols will be aggregated.
There can be multiple records for each 'id'. The 'id' is used to group records together and can be any string; it will not be used to identify the genes externally. 'symbol' and 'source' are text strings such as 'HRA1' and 'Gene Name' respectively.
No header, tab-delimited records.
Example:
1334149 Q0060 Ensembl Gene ID
1334149 854595 Entrez Gene ID
1334149 AI3 Gene Name
1334149 NP_009308 RefSeq Protein ID
1334149 NM_001184355 RefSeq mRNA ID
1334149 I-SceIII Synonym
1334149 P03877 Uniprot ID
1334149 SCE3_YEAST Uniprot ID
1334150 Q0065 Ensembl Gene ID
1334150 854596 Entrez Gene ID
1334150 AI4 Gene Name
1334150 NP_009307 RefSeq Protein ID
1334150 NM_001184356 RefSeq mRNA ID
1334150 I-SceII Synonym
1334150 P03878 Uniprot ID
1334150 SCE2_YEAST Uniprot ID
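The grouping by 'id' and the aggregation across several files can be sketched as follows. group_symbols is a hypothetical helper, and the two short lists stand in for two separate symbols files.

```python
from collections import defaultdict


def group_symbols(lines):
    """Group (symbol, source) pairs under their shared id."""
    genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, symbol, source = line.split("\t")
        genes[gene_id].append((symbol, source))
    return genes


# Two in-memory "files"; records from both are aggregated by id.
file_a = ["1334149\tQ0060\tEnsembl Gene ID", "1334149\tAI3\tGene Name"]
file_b = ["1334149\tP03877\tUniprot ID", "1334150\tAI4\tGene Name"]
genes = group_symbols(file_a + file_b)
```

Splitting on the tab character rather than on any whitespace matters here, because source names such as 'Ensembl Gene ID' contain spaces.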
Gene descriptions in id/description pairs, one per line.
No header, tab-delimited records.
Example:
1334149 Endonuclease I-SceIII, encoded by a mobile group I intron within the mitochondrial COX1 gene [Source:SGD;Acc:S000007263]
1334150 Endonuclease I-SceII, encoded by a mobile group I intron within the mitochondrial COX1 gene; intron is normally spliced by the BI4p maturase
The mixed_table format exists for compatibility with identifier files from a previous system (don't use it if you can help it). This file contains multiple identifier sources, multiple identifiers per source, gene descriptions, and gene biotype information, organized into one row per gene.
Tabular, tab-delimited, with a header row. Some fields contain multiple values, delimited by semicolons.
Example:
GMID Ensembl Gene ID Protein Coding Gene Name Ensembl Transcript ID Ensembl Protein ID Uniprot ID Entrez Gene ID RefSeq mRNA ID RefSeq Protein ID Synonyms Definition
1334136 15S_rRNA rRNA 15S_RRNA 15S_rRNA N/A N/A N/A N/A 15S_RRNA_2;14s rRNA Ribosomal RNA of the small mitochondrial ribosomal subunit; MSU1 allele suppresses ochre stop mutations in mitochondrial protein-coding genes [Source:SGD;Acc:S000007287]
1334137 21S_rRNA rRNA 21S_RRNA 21S_rRNA N/A N/A N/A N/A 21S_rRNA_4;21S_rRNA_3 Mitochondrial 21S rRNA; intron encodes the I-SceI DNA endonuclease [Source:SGD;Acc:S000007288]
1334138 HRA1 ncRNA HRA1 HRA1 N/A N/A N/A N/A N/A Non-protein-coding RNA, substrate of RNase P, possibly involved in rRNA processing, specifically maturation of 20S precursor into the mature 18S rRNA [Source:SGD;Acc:S000119380]
Interaction networks are specified by data files and optional configuration files, organized by the type of processing required to convert the data into networks.
data/networks/
├── direct
│   ├── {collection2}
│   └── {collection3}
├── profile
│   └── {collection3}
└── sharedneighbour
    └── {collection4}
Collections are subfolders that organize networks for ease of management, e.g. by data source, and can be given any name by the user. Note that a collection is different from the Network Group displayed for a network in the application: collection names are for internal organization and are not shown to the user.
Direct networks are given in text files where each record contains a gene-symbol/gene-symbol/interaction-weight triplet.
EAF1 YPI1 1
EAF1 BNI4 1
EAF1 GIP2 1
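A direct network file of this shape can be loaded into a weighted edge dictionary as sketched below. Treating interactions as undirected and letting a later duplicate record overwrite an earlier one are assumptions, not documented behavior.

```python
def read_direct_network(lines):
    """Return {(gene_a, gene_b): weight} with each pair stored in sorted order."""
    edges = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: skip blank or malformed records
        gene_a, gene_b, weight = fields
        # Assumption: interactions are undirected, so A-B and B-A share one key.
        edges[tuple(sorted((gene_a, gene_b)))] = float(weight)
    return edges


records = ["EAF1\tYPI1\t1", "EAF1\tBNI4\t1", "EAF1\tGIP2\t1"]
edges = read_direct_network(records)
```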
Each network can have a configuration file specifying the network name, group, and providing reference and other metadata. The file should have the exact same name as the corresponding network data file, but ending in '.cfg' instead of '.txt'.
Example:
group = gi
default_selected = 1
name = ""
description = ""
pubmed_id = 21984913
source = BIOGRID
source_id = ""
Network names and descriptions are optional. When a PubMed ID is available, they are generated automatically from the publication record retrieved from PubMed.
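Since a configuration file shares its name with the data file apart from the extension, the pairing can be computed rather than listed by hand. The sketch below uses hypothetical file names and simply reports data files that have no .cfg next to them.

```python
from pathlib import PurePosixPath


def cfg_path_for(data_file):
    """Return the .cfg path that sits next to a network data file."""
    return PurePosixPath(data_file).with_suffix(".cfg")


def networks_without_cfg(paths):
    """List .txt data files that have no matching .cfg in the same listing."""
    known = {PurePosixPath(p) for p in paths}
    return sorted(
        str(p) for p in known
        if p.suffix == ".txt" and cfg_path_for(p) not in known
    )


# Hypothetical collection and file names, for illustration only.
listing = [
    "networks/direct/example/net_a.txt",
    "networks/direct/example/net_a.cfg",
    "networks/direct/example/net_b.txt",
]
missing = networks_without_cfg(listing)
```

A missing .cfg is not an error in itself, since the configuration file is optional; the listing is only a convenience for spotting networks that will rely on defaults.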
Networks are computed from profile data where each record contains a gene identifier followed by a series of numeric level measurements.
Example:
YAL001C 0.629 0.209 0.141 1.001 1.492 0.102
YAL002W 0.011 0.06 0.301 0.243 -0.046 0.14
YAL003W -0.522 -0.117 0.721 0.595 -0.402 0.315
YAL004W -0.4079 0.063 0.267 0.269 -0.627 0.276
YAL005C -0.195 0.009 1.304 2.426 -0.642 0.328
YAL007C -0.633 -0.222 -0.28 0.091 -0.447 0.267
YAL008W -0.303 -0.115 0.214 1.912 -0.598 0.223
YAL009W -0.012 0.096 -0.108 0.106 0.43 0.08
YAL010C 0.159 0.098 0.536 0.212 0.076 0.169
YAL011W 0.263 0.498 0.482 0.215 -0.045 0.32
Network metadata is specified in a corresponding .cfg file as for direct networks.
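To make the idea of computing a network from profiles concrete, the sketch below parses profile records and scores one pair of genes with the Pearson correlation. The choice of Pearson correlation is an assumption made for illustration; this page does not state which similarity measure or thresholding the build actually uses.

```python
import math


def read_profiles(lines):
    """Map each gene identifier to its list of numeric measurements."""
    profiles = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        profiles[fields[0]] = [float(value) for value in fields[1:]]
    return profiles


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


profiles = read_profiles([
    "YAL001C 0.629 0.209 0.141 1.001 1.492 0.102",
    "YAL002W 0.011 0.06 0.301 0.243 -0.046 0.14",
])
weight = pearson(profiles["YAL001C"], profiles["YAL002W"])
```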
Networks are computed from sparse binary profiles where each record contains a gene identifier followed by the name (or id) of a binary feature it possesses.
YBR218C IPR000089 IPR005479 IPR003379 IPR005481 IPR005482 IPR000891
YBR221C IPR005476 IPR005475
Network metadata is specified in a corresponding .cfg file as for direct networks.
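One way to picture a shared-neighbour style computation is to compare the feature sets of two genes. The sketch below uses the Jaccard index purely as an assumed example of such a comparison; the actual scoring is not described on this page. GENE_X is a made-up record added so that the example has an overlapping pair.

```python
def read_features(lines):
    """Map each gene identifier to the set of binary features it possesses."""
    features = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        features.setdefault(fields[0], set()).update(fields[1:])
    return features


def jaccard(a, b):
    """Shared features divided by all features seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


features = read_features([
    "YBR218C IPR000089 IPR005479 IPR003379 IPR005481 IPR005482 IPR000891",
    "YBR221C IPR005476 IPR005475",
    "GENE_X IPR005476 IPR000089",  # hypothetical gene, for illustration only
])
score = jaccard(features["YBR221C"], features["GENE_X"])
```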
Functional annotations for network combination and enrichment analysis. These are currently specified in a single tabular text file in a legacy format. Simpler formats will be supported in the future (issue #2).
The file contains 11 columns of tab-delimited data, preceded by a pair of comment rows starting with '#'.
# go db: 2014-03-08 assocdb None
# genus 'Saccharomyces' species 'cerevisiae' taxonomy id 4932
organellar small ribosomal subunit cellular_component GO:0000314 mitochondrial small ribosomal subunit cellular_component GO:0005763 15S_RRNA SGD S000007287 ISS 1
mitochondrial ribosome cellular_component GO:0005761 mitochondrial small ribosomal subunit cellular_component GO:0005763 15S_RRNA SGD S000007287 ISS 1
mitochondrial part cellular_component GO:0044429 mitochondrial small ribosomal subunit cellular_component GO:0005763 15S_RRNA SGD S000007287 ISS 1
The relevant columns are 1, 2, 3, and 7, containing the category name, GO branch, category id, and gene name. Transitive annotations must be provided in addition to direct ones, and any other desired filtering, such as by evidence code, should already have been applied when preparing this file. Downstream filtering of this file is limited to gene symbol and category size. Redundant annotations are allowed and will be removed.
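Extracting the four relevant columns and dropping redundant annotations can be sketched as below. read_annotations is a hypothetical helper; it relies only on the column positions given above and on comment rows starting with '#'.

```python
def read_annotations(lines):
    """Return unique (category name, GO branch, category id, gene) tuples."""
    annotations = set()
    for line in lines:
        if line.startswith("#"):
            continue  # comment rows at the top of the file
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # assumption: skip short or blank rows
        # Columns 1, 2, 3 and 7 (1-based) are the ones that matter.
        annotations.add((fields[0], fields[1], fields[2], fields[6]))
    return annotations


row = "\t".join([
    "organellar small ribosomal subunit", "cellular_component", "GO:0000314",
    "mitochondrial small ribosomal subunit", "cellular_component", "GO:0005763",
    "15S_RRNA", "SGD", "S000007287", "ISS", "1",
])
lines = ["# go db: 2014-03-08 assocdb None", row, row]  # second copy is redundant
annotations = read_annotations(lines)
```

Collecting the tuples in a set is what removes the redundant record; filtering by gene symbol and category size would happen after this step.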
Gene attributes are collections of binary features treated as networks by representing each feature as a clique connecting all the genes that possess that attribute.
data/attributes/
├── attrib-gene-list
│   └── {collection1}
└── gene-attrib-list
    ├── {collection2}
    └── {collection3}
Attributes are specified by a text file containing an attribute name followed by a list of genes that possess the attribute.
DB03307 CDK2
DB02059 GAPDH SIRT3 SIRT5 EEF2
Multiple records for the same attribute are allowed, so the gene list can be flattened into a column.
Network metadata is specified in a corresponding .cfg file as for direct networks.
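The clique representation of attributes can be illustrated with a short sketch: every pair of genes sharing an attribute becomes an edge. attribute_cliques is a hypothetical helper, and merging repeated records for the same key is how this sketch handles flattened lists.

```python
from itertools import combinations


def attribute_cliques(lines):
    """Map each attribute to the gene pairs of the clique it induces."""
    members = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Repeated records for the same attribute are merged.
        members.setdefault(fields[0], set()).update(fields[1:])
    return {
        attribute: list(combinations(sorted(genes), 2))
        for attribute, genes in members.items()
    }


cliques = attribute_cliques([
    "DB03307 CDK2",
    "DB02059 GAPDH SIRT3 SIRT5 EEF2",
])
```

An attribute held by a single gene, such as DB03307 here, yields no edges, while an attribute shared by n genes yields n(n-1)/2 of them.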
A corresponding .desc text file contains a descriptive string for each attribute, to include in the user display.
DB03307 4-[(6-Amino-4-Pyrimidinyl)Amino]Benzenesulfonamide
DB02059 Adenosine-5-Diphosphoribose
Attributes are specified by a text file containing a gene symbol followed by a list of attributes the gene possesses.
ENSDARG00000000086 SSF48065 SSF49562 SSF50044 SSF50729
ENSDARG00000000102 SSF48726
ENSDARG00000000102 SSF56112
ENSDARG00000000102 SSF57440
Multiple records for the same gene are allowed, so the attribute list can be flattened into a column.
Network metadata is specified in a corresponding .cfg file as for direct networks.
A descriptive file is specified in a corresponding .desc file as for attrib-gene-list.
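The two attribute layouts carry the same information, which the following sketch makes explicit by inverting gene-attrib-list records into an attribute-to-genes mapping. invert_gene_attributes is a hypothetical helper, not part of the import tooling.

```python
from collections import defaultdict


def invert_gene_attributes(lines):
    """Turn 'gene attr1 attr2 ...' records into {attribute: set of genes}."""
    genes_by_attribute = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        gene, attributes = fields[0], fields[1:]
        for attribute in attributes:
            genes_by_attribute[attribute].add(gene)
    return genes_by_attribute


genes_by_attribute = invert_gene_attributes([
    "ENSDARG00000000086 SSF48065 SSF49562 SSF50044 SSF50729",
    "ENSDARG00000000102 SSF48726",
    "ENSDARG00000000102 SSF56112",
    "ENSDARG00000000102 SSF57440",
])
```

Flattened records for the same gene need no special handling, because each line only adds that gene to the sets of the attributes it lists.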
[TODO:isn't there .gmt format support also?]