Database Architecture

Aneil Mallavarapu edited this page Mar 22, 2018 · 10 revisions

from https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit#

Data Model

Service: User

Handles user authentication, authorization and limited profile information

Model: User

Fields:

id - unique user id (used internally by Precisely to identify the user), aka the Cognito user id
email
password
first name
last name
Database: Cognito

Service: UserDataMapper

The data mapper maps a Cognito id to a separate, opaque user id for each type of health data held for that user. This separates user-identifying information from health information, and separates different types of health information from each other, providing a higher level of patient confidentiality. Note that different databases are used to store user identity (Cognito), health data (Dynamo) and the map (Postgres). This way, a compromise of any one of these systems does not by itself expose personally identifying user health data.

Model: UserDataMap

Fields:

user_id = cognito id
vendor_data_type_id = int id in VendorDatatype
data_type_user_id = string, the opaque id for that data type

Model: VendorDatatype

Fields:

id - int
vendor = string
data_type = string

E.g., Precisely genetic variant data

user_id “1234”
vendor_name “precisely”
data_type “genetics”
data_type_user_id 4178e78b-e66b-4020-b650-7a778b5e6179

E.g., Precisely survey data

user_id “1234” 
vendor_name “precisely”
data_type “survey”
data_type_user_id 5210fa35-decb-42eb-b52b-35441352ea05

E.g., Helix raw genetics data

user_id “1234”
vendor_name “helix”
data_type “mecfs_panel”
data_type_user_id PC-2SBCATWI45UENF5ZJ5WCFF3FTEUL2GCZ (the pac_id)

E.g., Helix support id

user_id “1234”
vendor_name “helix”
data_type “support_id”
data_type_user_id “US-2SB8-85CEC”  (the helix support id)
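
The lookup described for the UserDataMapper can be sketched as follows. This is only an illustration, assuming in-memory lists in place of the Postgres tables; the function name `opaque_id_for` and the integer ids 1 and 2 are hypothetical, while the field names follow the UserDataMap and VendorDatatype models above and the sample rows reuse the Precisely examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VendorDatatype:
    id: int
    vendor: str
    data_type: str


@dataclass(frozen=True)
class UserDataMap:
    user_id: str              # cognito id
    vendor_data_type_id: int  # id of a VendorDatatype row
    data_type_user_id: str    # opaque id for that data type


# In-memory stand-ins for the Postgres tables (ids 1 and 2 are made up).
VENDOR_DATATYPES = [
    VendorDatatype(1, "precisely", "genetics"),
    VendorDatatype(2, "precisely", "survey"),
]
USER_DATA_MAPS = [
    UserDataMap("1234", 1, "4178e78b-e66b-4020-b650-7a778b5e6179"),
    UserDataMap("1234", 2, "5210fa35-decb-42eb-b52b-35441352ea05"),
]


def opaque_id_for(user_id: str, vendor: str, data_type: str) -> Optional[str]:
    """Return the opaque id for (user, vendor, data type), or None if unmapped."""
    type_ids = {
        v.id for v in VENDOR_DATATYPES
        if v.vendor == vendor and v.data_type == data_type
    }
    for row in USER_DATA_MAPS:
        if row.user_id == user_id and row.vendor_data_type_id in type_ids:
            return row.data_type_user_id
    return None
```

A service holding only the opaque id returned here (for example the Survey service) never sees the Cognito id, which is the separation the mapper is meant to provide.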

Service: Report

Model: Report

Fields:

title
description
rawContent 
parsedContent
genes
createdAt
updatedAt

Database:

Dynamo

Service: Survey

Model: Survey

Model: SurveyResult

Fields:

survey_data_type_user_id
responseName string  ← what should this be?
result string
createdAt
updatedAt

Database:

Dynamo

Service: Genotype

Model: Genotype

Fields:

opaque_id   string     // from user data mapper
source_entry_id string  // range key designed to be unique for this entry, e.g., source:sample_id:chromosome:startBase
                        // 23andMe:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03:1:2933726

sample_id  string      // VCF header attribute 23andMeSampleID
source string          // If VCF headers include attributes with 23andMe prefixes, then '23andMe'
gene  string // In a data row in the INFO column, use the GENE attribute. Optional.
variantCall string // In a data row in the INFO column, use the HGVS attribute. Needs light transforming (see below).
zygosity string // In a data row in the INFO column, use the GS attribute.
startBase // In a data row, the POS column.
chromosomeName string // In a data row, the CHROM column
variantType // If 23andMe data, all variants are of type 'SNV'
quality 	string? // In a data row, the FILTER column.
createdAt   // VCF header attribute 23andMeProcessDate; the 23andMe processing time
updatedAt   // VCF header attribute bcftools_annotateCommand, sub-attribute Date
            // This is the time at which the Precise.ly VCF file was generated.
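
As a small sketch of the source_entry_id format described above (source:sample_id:chromosome:startBase), the range key could be built and split as shown below. The helper names `make_source_entry_id` and `parse_source_entry_id` are hypothetical, and the code assumes that none of the four components contains a colon.

```python
from typing import Tuple


def make_source_entry_id(source: str, sample_id: str,
                         chromosome: str, start_base: int) -> str:
    """Join the four components into the colon-separated range key."""
    return f"{source}:{sample_id}:{chromosome}:{start_base}"


def parse_source_entry_id(entry_id: str) -> Tuple[str, str, str, int]:
    """Split a range key back into (source, sample_id, chromosome, start_base)."""
    source, sample_id, chromosome, start_base = entry_id.split(":")
    return source, sample_id, chromosome, int(start_base)


entry_id = make_source_entry_id(
    "23andMe",
    "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
    "1",
    2933726,
)
```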

Note about variantCall string:

VCF format doesn't allow the use of the equals sign = nor the semi-colon ; in the INFO column, as those characters are used in the INFO attribute syntax. Unfortunately those are both used in HGVS notation. So the work-around is that before storing in the VCF file, all occurrences of = are replaced with ~, and all occurrences of ; are replaced with ///. To extract a valid HGVS notation for the locus's genotype, the string substitution needs to be reversed.
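
The substitution and its reversal can be sketched as two small helpers. The function names are hypothetical and the sample HGVS string is made up for illustration; the sketch also assumes the original HGVS string contains no literal `~` or `///`, since otherwise the round trip would not be lossless.

```python
def encode_hgvs_for_vcf(hgvs: str) -> str:
    """Replace '=' with '~' and ';' with '///' so the value fits in a VCF INFO attribute."""
    return hgvs.replace("=", "~").replace(";", "///")


def decode_hgvs_from_vcf(encoded: str) -> str:
    """Reverse the substitution to recover the HGVS notation."""
    return encoded.replace("///", ";").replace("~", "=")


# Made-up example value, not taken from real data.
original = "NC_000001.11:g.[2933726=];[2933726A>G]"
stored = encode_hgvs_for_vcf(original)
restored = decode_hgvs_from_vcf(stored)
```

Here `stored` contains neither `=` nor `;`, and `restored` equals `original`.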

Database:

Dynamo

Raw input will be stored in the S3 folders /dev-precisely-genetics-raw-23andme and /dev-precisely-genetics-raw-akesogen-cel-files.

Processed output will be in /dev-precisely-genetics-vcf.

The PreciselyGeneticsPipelineAccess IAM policy provides access to all dev-precisely-genetics-* folders, and this policy is attached to the dev-genetics-pipeline.