Database Architecture
from https://docs.google.com/document/d/1E31Oted7_QN7bCbjnJN6k1X6eP9b-rFcI-uV8vjsxbg/edit#
Cognito handles user authentication, authorization, and limited profile information:
id - unique user id (used internally by Precisely to identify the user), aka the Cognito user id
email
password
first name
last name
The data mapper provides a way of mapping a cognito id to another user id associated with a specific type of health data for that user. This separates user identifying information from health information, and separates different types of health information from each other, providing a higher level of patient confidentiality. Note that different databases are used to store user identity (cognito), health data (dynamo) and the map (postgres). This way, compromise of any one of these systems does not result in compromise of personally-identifying user health data.
user_id = cognito id
vendor_data_type_id = int id in VendorDatatype
data_type_user_id = string, the opaque id for that data type
id = int (the VendorDatatype id referenced by vendor_data_type_id)
vendor = string
data_type = string
E.g., Precisely genetic variant data
user_id “1234”
vendor_name “precisely”
data_type “genetics”
data_type_user_id 4178e78b-e66b-4020-b650-7a778b5e6179
E.g., Precisely survey data
user_id “1234”
vendor_name “precisely”
data_type “survey”
data_type_user_id 5210fa35-decb-42eb-b52b-35441352ea05
E.g., Helix raw genetics data
user_id “1234”
vendor_name “helix”
data_type “mecfs_panel”
data_type_user_id PC-2SBCATWI45UENF5ZJ5WCFF3FTEUL2GCZ (the pac_id)
E.g., Helix support id
user_id “1234”
vendor_name “helix”
data_type “support_id”
data_type_user_id “US-2SB8-85CEC” (the helix support id)
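The lookup the data mapper performs can be sketched as follows. This is a minimal in-memory illustration, not the actual Postgres schema or query: the class names, the `resolve_opaque_id` helper, and the integer VendorDatatype ids are assumptions made for the example. Only the field names and sample values come from the notes above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VendorDatatype:
    id: int
    vendor: str
    data_type: str


@dataclass(frozen=True)
class UserDataMapEntry:
    user_id: str              # Cognito id
    vendor_data_type_id: int  # id in VendorDatatype
    data_type_user_id: str    # opaque id for that data type


# Hypothetical in-memory stand-ins for the two Postgres tables.
# The integer ids 1 and 2 are invented for this example.
VENDOR_DATATYPES = [
    VendorDatatype(1, "precisely", "genetics"),
    VendorDatatype(2, "precisely", "survey"),
]
USER_DATA_MAP = [
    UserDataMapEntry("1234", 1, "4178e78b-e66b-4020-b650-7a778b5e6179"),
    UserDataMapEntry("1234", 2, "5210fa35-decb-42eb-b52b-35441352ea05"),
]


def resolve_opaque_id(user_id: str, vendor: str, data_type: str) -> Optional[str]:
    """Return the opaque id for (user, vendor, data type), or None if unmapped."""
    type_ids = {
        vdt.id
        for vdt in VENDOR_DATATYPES
        if vdt.vendor == vendor and vdt.data_type == data_type
    }
    for entry in USER_DATA_MAP:
        if entry.user_id == user_id and entry.vendor_data_type_id in type_ids:
            return entry.data_type_user_id
    return None
```

A health-data store would then be keyed only by the returned opaque id, so a record in it carries no Cognito id.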
title
description
rawContent
parsedContent
genes
createdAt
updatedAt
Dynamo
survey_data_type_user_id
responseName string ← what should this be?
result string
createdAt
updatedAt
Dynamo
opaque_id string // from user data mapper
source_entry_id string // range key designed to be unique for this entry, e.g., source:sample_id:chromosome:startBase
// 23andMe:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03:1:2933726
sample_id string // VCF header attribute 23andMeSampleID
source string // If VCF headers include attributes with 23andMe prefixes, then '23andMe'
gene string // In a data row in the INFO column, use the GENE attribute. Optional.
variantCall string // In a data row in the INFO column, use the HGVS attribute. Needs light transforming (see below).
zygosity string // In a data row in the INFO column, use the GS attribute.
startBase // In a data row, the POS column.
chromosomeName string // In a data row, the CHROM column
variantType // If 23andMe data, all variants are of type 'SNV'
quality string? // In a data row, the FILTER column.
createdAt // VCF header attribute 23andMeProcessDate; the 23andMe processing time
updatedAt // VCF header attribute bcftools_annotateCommand, sub-attribute Date
// This is the time at which the Precise.ly VCF file was generated.
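The source_entry_id range key described in the listing above could be assembled as in this sketch. The function name `make_source_entry_id` is hypothetical. The colon-separated layout and the sample value are taken from the comment on source_entry_id.

```python
def make_source_entry_id(source: str, sample_id: str,
                         chromosome: str, start_base: int) -> str:
    """Join the four components into source:sample_id:chromosome:startBase."""
    return ":".join([source, sample_id, chromosome, str(start_base)])


# Components of the example key from the field listing.
example_key = make_source_entry_id(
    "23andMe",
    "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
    "1",
    2933726,
)
```

This assumes none of the components contains a colon. If one could, the key would no longer split back into its parts unambiguously.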
Note about variantCall string:
VCF format doesn't allow the use of the equals sign (=) or the semicolon (;) in the INFO column, as those characters are used in the INFO attribute syntax. Unfortunately, both are used in HGVS notation. As a workaround, before storing in the VCF file, all occurrences of = are replaced with ~ and all occurrences of ; are replaced with ///. To extract valid HGVS notation for the locus's genotype, the string substitution needs to be reversed.
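The substitution and its reversal can be sketched as below. The function names are hypothetical, and the sample string is only an illustration of the character replacement, not a value taken from real pipeline output.

```python
def encode_hgvs_for_vcf(hgvs: str) -> str:
    """Replace characters that are reserved in the VCF INFO column."""
    return hgvs.replace("=", "~").replace(";", "///")


def decode_hgvs_from_vcf(encoded: str) -> str:
    """Reverse the substitution to recover the HGVS string."""
    return encoded.replace("///", ";").replace("~", "=")


# Illustrative string only; it shows the character swap, nothing more.
stored = "NC_000001.10:g.[2933726~]///[2933726A>G]"
variant_call = decode_hgvs_from_vcf(stored)
```

The round trip is only lossless if the original HGVS string contains neither ~ nor ///, which this sketch assumes.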
Dynamo
Raw input will be stored in S3 folders:
/dev-precisely-genetics-raw-23andme
/dev-precisely-genetics-raw-akesogen-cel-files
Processed output will be in
/dev-precisely-genetics-vcf
The PreciselyGeneticsPipelineAccess IAM policy provides access to all dev-precisely-genetics-* folders, and this policy is attached to the dev-genetics-pipeline.