This code controls the export of FlyBase data to Alliance LinkML-compliant JSON format. FlyBase and Alliance data types are represented by their own classes in the `fb_datatypes.py` and `agr_datatypes.py` files, respectively.
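As a purely illustrative sketch, the pairing looks roughly like the example below; the class and attribute names here are hypothetical, not the actual contents of those files.

```python
# Hypothetical sketch only -- the real classes in fb_datatypes.py and
# agr_datatypes.py have different names and many more attributes.
from dataclasses import dataclass, field


@dataclass
class FBGene:
    """A FlyBase gene as pulled from chado."""
    feature_id: int                 # chado feature.feature_id
    uniquename: str                 # e.g., "FBgn0000490"
    name: str = ""
    is_obsolete: bool = False


@dataclass
class GeneDTO:
    """An Alliance LinkML-shaped gene record destined for the JSON export."""
    curie: str                      # assumed slot name; e.g., "FB:FBgn0000490"
    gene_symbol_dto: dict = field(default_factory=dict)
    obsolete: bool = False
    internal: bool = False
```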
There are various handlers dedicated to the transformation of specific datatypes: e.g., `gene_handler.py`.
Specific scripts are then used to call various handlers. A script may call on multiple handlers for the generation of a single export file: e.g., Alliance AGMs encompass both FB strains and genotypes. A script may also output multiple files: e.g., `AGM_data_retrieval_curation_gene.py` generates both gene and allele-gene-association files.
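A minimal, hypothetical sketch of that script/handler pattern is below; the handler method names `get_entities()` and `generate_export_dict()` appear elsewhere in this README, but the stub classes and ingest-set key are assumptions for illustration only.

```python
# Hypothetical driver sketch -- not the actual export scripts.
class StrainHandler:
    """Stub standing in for a real handler module (e.g., a strain handler)."""
    def get_entities(self, session):
        self.entities = []                      # would query chado here

    def generate_export_dict(self):
        return list(self.entities)              # would return LinkML-shaped dicts


class GenotypeHandler(StrainHandler):
    """Stub: Alliance AGMs cover both FB strains and genotypes."""


def run_export(session, linkml_version="v1.3.1"):
    export = {"linkml_version": linkml_version, "agm_ingest_set": []}
    # A single script may drive several handlers that all contribute to
    # one export file (here, both handlers feed the AGM ingest set).
    for handler in (StrainHandler(), GenotypeHandler()):
        handler.get_entities(session)
        export["agm_ingest_set"].extend(handler.generate_export_dict())
    return export
```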
The export files are then uploaded to the Alliance persistent store.
Modifications to this repo should be made in new "release" branches corresponding to Alliance LinkML schema release branches: e.g., `v1.3.1`.
These release branches should be used to spawn git releases/tags.
These scripts are intended to be run in Docker using GoCD pipelines.
Files should be validated locally before uploading to the Alliance.
Use the Alliance Curation API (details below) to find the latest supported LinkML version.
- Write your code to this LinkML version.
- Note that different objects may have different version support as the persistent store develops.
The Alliance LinkML-based data schema is held in the agr_curation_schemas repo.
Get an OKTA account (through Chris Grove). You will need this to interact with the persistent store.
The Alliance Curation Site is the main hub. Once logged into the curation site, you'll have access to persistent store data:
- View/update entries.
- View file loading.
- Get your personal Curation API Token - required for file uploads and API use.

You can also access the persistent store using the Alliance Curation API.
- You will need to log into the API with your personal Curation API Token.
- You can find the Curation API Token under your profile (top right corner) in the Alliance Curation Site.
See the Alliance A-Team Confluence Site for more info.
Use the `Alliance_LinkML_Submission` pipeline in the `Reporting_Build` pipeline group to generate, validate and upload files.
With each run, files generated are stored locally in a directory within `/data/alliance/`.
- The directory name will include the LinkML version and date (`MMMDD`), which correspond to the `LINKML_VERSION` and `ITERATION` pipeline values.
- Specify pipeline-specific variables (a sketch of how these might be consumed follows this list).
  - Sometimes the FB release used for LinkML submission will not be the latest release build.
    - This happens when a release build has just been sent to FB-Indiana, but the public GFF file is not yet available for that release.
  - For this reason, do not rely upon the `Reporting_Build` environment values (except for `ALLIANCETOKEN`).
  - Specify these variables at the `Alliance_LinkML_Submission` pipeline level.
    - `SERVER` - flysql machine where the reporting db is located: e.g., `flysql25`
    - `DATABASE` - the name of the db to use (with `audit_chado`): e.g., `production_chado` or `fb_2022_05_reporting_audit`
    - `RELEASE` - ensure that the release matches the db used: e.g., `2022_05` (reporting) or `2022_05_EP3` (production).
    - `ITERATION` - the date `MMMDD` on which the pipeline is run.
    - `LINKML_VERSION` - the LinkML version for the data: e.g., `v1.3.1`.
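A hedged sketch of how a driver script might pick up those variables; the actual scripts may take them as command-line arguments rather than environment variables, and the output directory naming shown is an assumption.

```python
import os

# Pipeline-level variables described above (values are examples only).
server = os.environ["SERVER"]                   # e.g., "flysql25"
database = os.environ["DATABASE"]               # e.g., "fb_2022_05_reporting_audit"
release = os.environ["RELEASE"]                 # e.g., "2022_05"
iteration = os.environ["ITERATION"]             # e.g., "Dec05" (MMMDD run date)
linkml_version = os.environ["LINKML_VERSION"]   # e.g., "v1.3.1"

# Assumed output location: a per-run directory under /data/alliance/
# whose name combines the LinkML version and the run date.
output_dir = f"/data/alliance/{linkml_version}_{iteration}"
```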
The `Alliance_LinkML_Submission` pipeline automates these steps:
- Builds directory for data and log output in the `/data/alliance/` folder.
- Gets HarvDev docker container.
- Builds docker image with the FB `alliance-linkml-flybase` and Alliance `agr_curation_schemas` repos (the latter for validation).
- Fetches data using `alliance-linkml-flybase` scripts from a specified branch or release tag.
- Validates data files using the Alliance `validate_agr_schema.py` against the `jsonschema/allianceModel.schema` file.
- Sends e-mail that files are ready for review (size, contents), then pauses the pipeline.
- Has a separate, manually initiated phase for each file upload (due to dependencies).
As mentioned above, the GoCD pipeline will upload files to the Alliance.
Details of how to upload files are here.
- Your personal Curation API Token is required (available from your profile in the Alliance Curation Site).
- Expect an `OK` message upon completion of a successful file upload.
- There are dependencies: e.g., the ALLELE file should be fully loaded at the Alliance (this takes many hours) before uploading the disease file.
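For illustration only, an authenticated upload might look roughly like the sketch below. The real endpoint, path and form fields are defined in the Alliance upload documentation linked above; the URL and file name here are placeholders, not the actual API.

```python
import requests

API_TOKEN = "<your Curation API Token>"            # from your Alliance Curation Site profile
UPLOAD_URL = "https://<alliance-upload-endpoint>"  # placeholder -- see Alliance upload docs

# Hypothetical file name; real files are named by the pipeline.
with open("GENE_FB_v1.3.1_Dec05.json", "rb") as fh:
    response = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": fh},
    )
response.raise_for_status()
print(response.text)   # expect an "OK" message on a successful upload
```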
Go to the Alliance Curation Site.
- From the left options, choose Other Links (at the bottom), then Data Loads.
- For each data type uploaded, there should be a Direct (LinkML) * Loads selection on the right.
- For each data type, all files ever uploaded are listed.
  - Click on the most recent one to see if it's been loaded (it can take hours).
- Review any errors in loading any of the data objects - these may require a code or data fix.
- Note - there can be load order dependencies: e.g., make sure all alleles are loaded before loading disease annotations.
  - If you suspect that load order created problems, click on the pencil icon at the right to initiate a re-load of the file.
  - If you want to re-load a file, though, it might be best to check with the A-Team first.
- There are dependencies on references too (see below).
Loading of most LinkML data is dependent on presence of FB references at the Alliance.
These references are in a separate database, the Alliance Bibliography Central (ABC), handled by the Alliance Blue Team.
The pub schema is based on the distinct agr_schemas model.
As such, FB code for pub submission is in the related alliance-flybase repo.
- See the README for this repo for details about submission and load issues.
Submissions to the ABC are handled by the `Alliance_Pub_Export` pipeline (which runs as part of the `Epicycle` pipeline group).
The full LinkML submissions are typically made from reporting builds.
However, for curators to curate directly into the Alliance, they need access to new features (genes, alleles, genotypes) generated each epicycle.
To this end, there is the Alliance_LinkML_Incremental_Update
pipeline (in the Epicycle
pipeline group).
This pipeline generates small updates for new (or newly obsoleted) genes, alleles, constructs and genotypes/strains (AGMs).
It does so by finding what's new in `production_chado` compared to `explore_chado_last_week` (on flysql23).
The output LinkML files are only for the new things.
The pipeline is triggered by the completion of the `Load_Explore_ChadoDB` pipeline.
Update the env. variable `LINKML_VERSION` for the `Alliance_LinkML_Incremental_Update` pipeline as needed.
The code works as follows:
- The code gets objects from a reference db to create a list of previously submitted objects.
  - Passing a `-r REFERENCE_DB` parameter to the script makes it produce an incremental update LinkML file.
  - This creates a reference_session object (pointed at `explore_chado_last_week`) that is passed to the `export_chado_data()` function as a kwarg.
  - This kwarg makes the `entity_handler.get_entities()` method run on a reference db and store IDs for reference objects in the `handler.fb_reference_entity_ids` list.
    - The above step is in addition to the normal run of `entity_handler.get_entities()`, which gets data from the main current db (production or reporting).
  - This step also sets `handler.incremental_update = True`, and `handler.reference_session` stores the `RefSession()` object.
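A rough sketch of that setup, assuming simplified signatures (the real `export_chado_data()` and `get_entities()` take more arguments and do far more work):

```python
# Simplified, hypothetical sketch of the incremental-update setup.
def export_chado_data(session, handler, reference_session=None):
    if reference_session is not None:
        handler.incremental_update = True
        handler.reference_session = reference_session
        # Reference pass: record the IDs of objects already submitted,
        # i.e., everything found in explore_chado_last_week.
        handler.get_entities(reference_session, reference=True)   # assumed signature
        # ...this pass fills handler.fb_reference_entity_ids
    # Normal pass against the current db (production or reporting).
    handler.get_entities(session)
```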
- The code flags new additions.
  - When `handler.incremental_update = True`, the `entity_handler.flag_new_additions_and_obsoletes()` method runs to completion.
  - This method flags data entities as `is_new_addition = True` or `is_new_obsolete = True`.
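A hedged guess at the gist of that step; the `fb_data_entities` container and the obsolete test are assumptions, and the real logic is more careful about what counts as newly obsolete.

```python
def flag_new_additions_and_obsoletes(handler):
    """Simplified sketch: compare current-db entities to the reference-db ID list."""
    for fb_id, entity in handler.fb_data_entities.items():   # assumed container
        if fb_id not in handler.fb_reference_entity_ids:
            entity.is_new_addition = True       # present now, absent last week
        elif entity.chado_obj.is_obsolete:
            entity.is_new_obsolete = True       # in the reference set, but obsolete now
```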
- The code flags new additions as internal.
  - The `handler.flag_internal_fb_entities()` method has a special step for incremental updates.
  - If an entity has `is_new_addition = True`, then we set `entity.internal = True` (we do not want it public at the Alliance while it is private at FlyBase).
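A minimal sketch of just that incremental-update branch; the surrounding `flag_internal_fb_entities()` method does much more, and the helper name and container here are hypothetical.

```python
def flag_internal_new_additions(handler):
    """Hypothetical helper showing only the incremental-update step."""
    for entity in handler.fb_data_entities.values():   # assumed container
        if getattr(entity, "is_new_addition", False):
            # Not yet public at FlyBase, so keep it internal at the Alliance.
            entity.internal = True
```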
- The code filters for new stuff when generating the export list.
  - The `handler.generate_export_dict()` method generates the export list.
  - If `handler.incremental_update = True`, it only exports objects where `i.is_new_addition = True` or `i.is_new_obsolete = True`.
  - CRITICAL:
    - For alleles associated with insertions, the allele may be current in chado (`Allele.chado_obj.is_obsolete`), but is exported as obsolete to the Alliance (`FBAllele.is_obsolete`).
    - So, when assessing obsoleteness, one needs to assess the correct "obsolete" attribute, depending on the context.
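A hedged sketch of that filter; the real `generate_export_dict()` builds the full ingest dict, and the `linkmldto` attribute and list shape here are assumptions.

```python
def generate_export_list(handler):
    """Simplified sketch of incremental filtering during export."""
    export_list = []
    for entity in handler.fb_data_entities.values():       # assumed container
        if handler.incremental_update:
            if not (entity.is_new_addition or entity.is_new_obsolete):
                continue                                    # skip unchanged objects
        # NOTE: check the export-level obsolete flag (e.g., FBAllele.is_obsolete),
        # not chado_obj.is_obsolete -- for insertion-associated alleles they can differ.
        export_list.append(entity.linkmldto)                # assumed DTO attribute
    return export_list
```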
- The files are uploaded with `cleanUp=False` to the Alliance persistent store.
  - By default, when loading into the Alliance, any objects (e.g., genes) that are in the persistent store, but not the submission file, are made obsolete at the Alliance.
  - By setting `cleanUp=False`, objects are loaded into the persistent store, but there is no clean-up of objects in the Alliance store that are missing from the file. This is what we want for an incremental update.
  - For full submissions from reporting builds, we let the default clean-up happen.
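For illustration, the only difference from the earlier upload sketch would be passing the `cleanUp` flag; how that flag is actually supplied (query parameter vs. form field) is an assumption here, so check the Alliance upload docs.

```python
import requests

API_TOKEN = "<your Curation API Token>"
UPLOAD_URL = "https://<alliance-upload-endpoint>"   # placeholder -- see Alliance upload docs

with open("INCREMENTAL_GENE_FB.json", "rb") as fh:  # hypothetical file name
    response = requests.post(
        UPLOAD_URL,
        params={"cleanUp": "false"},                # assumed way of passing cleanUp=False
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": fh},
    )
response.raise_for_status()
```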
Current issues:
- Sometimes there is nothing new to upload.
  - I'm not sure of the best way to handle this situation.
  - When things work, a non-empty file gets validated and then uploaded if all looks good. If validation fails, the non-empty file is not uploaded (which is good).
  - If it all works, there is an email that confirms completion of all four files.
  - If an empty file is generated, the validation step fails (because it can't find the expected JSON file), and of course, nothing is uploaded (which is good).
- The four current incremental files are handled in parallel, so that if one fails, it does not prevent validation and uploading of the other files (which is good).
  - In this case, an email is sent to HarvDev about the failed step.
  - The only issue I see is that if something fails, we end up with the unsightly red box and the pipeline doesn't reach the final email phase.
  - Maybe there's a better way to handle this, though.