Take the strain, disentangle clustersets #1093
@@ -93,34 +93,34 @@ sub populate_tree {

  $compara_menu->append($self->create_node('Compara_Tree', 'Gene tree',
    [qw( image EnsEMBL::Web::Component::Gene::ComparaTree )],
-   { 'availability' => 'gene database:compara core has_gene_tree not_strain' }
+   { 'availability' => 'gene database:compara core has_gene_tree' }
Removal of the `not_strain` availability tag here and elsewhere in this module addresses ENSCOMPARASW-8481 ("Inactive homology links for Avena sativa Sang on Ensembl Plants website").
my $strain_group = $species_defs->get_config($species, 'STRAIN_GROUP');
my $related_taxon = $species_defs->get_config($species, 'RELATED_TAXON');
if ($hub->action =~ /^Strain_/) {
  unless (($strain_group && $strain_group eq $this_group) || ($related_taxon && $related_taxon eq $species_defs->RELATED_TAXON)) {
Issue ENSCOMPARASW-8511 is related to this requirement for either the `STRAIN_GROUP` or `RELATED_TAXON` to match in strain views.
This could be circumvented by checking instead whether the production name is in the `orthoset_prod_names` set for the given strain gene-tree collection (as below).
(See ENSCOMPARASW-8511 and ENSCOMPARASW-8516 for more information on the orthoset returned by `orthoset_prod_names`.)
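A minimal sketch of the membership check suggested in this comment, assuming the orthoset is available as an array reference of production names. The helper name `in_orthoset` and the placeholder production names are hypothetical; the actual arguments and return shape of `orthoset_prod_names` are not shown in this excerpt.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this PR): returns 1 if $prod_name
# appears in the list of production names, 0 otherwise.
sub in_orthoset {
  my ($prod_name, $orthoset_prod_names) = @_;
  my %lookup = map { $_ => 1 } @$orthoset_prod_names;
  return exists $lookup{$prod_name} ? 1 : 0;
}

# Placeholder production names, standing in for whatever
# orthoset_prod_names returns for a strain gene-tree collection.
my @orthoset = qw(genus_species_strain1 genus_species_strain2 genus_species_reference);

for my $prod_name (qw(genus_species_strain1 other_genus_species)) {
  my $status = in_orthoset($prod_name, \@orthoset) ? 'in orthoset' : 'not in orthoset';
  print "$prod_name: $status\n";
}
```

A check of this kind depends only on which genomes took part in the relevant gene-tree analysis, rather than on `STRAIN_GROUP` or `RELATED_TAXON` metadata matching.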
? "are not shown in the table above because they don't have any orthologue with"
: "is not shown in the table above because it doesn't have any orthologue with"
;

$html .= '<br /><a name="list_no_ortho"/>' . $self->_info(
'Species without orthologues',
These appear to be sufficient.
This may be unnecessary.
Description
Strain-group metadata to represent strain-reference relationships have become increasingly common in recent years, with 75 genomes in about 15 such groups in Ensembl Vertebrates 115.
The same period has seen strain gene-tree collections being used more broadly, to a point where Ensembl Plants release 115 cultivar gene-tree collections are expected to have almost as many genomes (123) as are in the default gene-tree collection (126).
Lines between strain and non-strain comparative data are becoming increasingly blurred. For some time, a proportion of genomes in strain collections have not themselves been strains, and as of release 115, there is at least one strain-tagged genome in a default gene-tree collection.
Finally, gene-tree configuration has become more complex in release 115 with the introduction of "interlocking" strain collections — strain species sets sharing one or more genomes in common with each other — which are challenging to handle in a system which has been built to treat strain and default gene-tree collections as binary.
With the accumulation of data changes, some unexpected hiccups have come to light, including:
In conjunction with eg-web-common PR 144, this PR would decouple strain view status from the strain status of individual genomes, making comparative web views more robust to data changes, and making it possible to disentangle orthology data from different gene-tree clustersets.
General changes
- `EnsEMBL::Web::Utils::Compara`, containing one public function, `orthoset_prod_names`, which returns a list of the appropriate genome production names for a set of orthologues based on relevant information such as the Compara database (e.g. `compara_pan_ensembl`) or whether the orthologues are in a strain view.
- `CLUSTERSET_PRODNAMES` data in `MULTI.db.packed` for use by `EnsEMBL::Web::Utils::Compara::orthoset_prod_names`, containing production names by clusterset in an Ensembl division that has at least one strain gene-tree collection (e.g. Vertebrates, Plants, Metazoa).
- Use of `EnsEMBL::Web::Utils::Compara::orthoset_prod_names` to get the set of relevant production names as needed in various modules. This ensures the appropriate set of production names is configured in orthologue views. As a side benefit, it's less often necessary to explicitly skip `ancestral_sequences` or species absent from URL lookup (e.g. Human in Ensembl Plants), since such genomes are excluded from `orthoset_prod_names`.
- Instead of `$hub->is_strain`, one or more of three cues are used: `$hub->action =~ /^Strain_/` (current view is a strain view), `$hub->referer->{'ENSEMBL_ACTION'} =~ /^Strain_/` (referer view is a strain view) or `$hub->param('strain')` (a `strain` parameter has been passed).
- The `not_strain` availability tag is dropped, so that strains can have default gene-tree/homology links.
- `has_default_compara` and `has_strain_compara` are used to determine whether default and strain Compara portal links are active.

EnsEMBL::Web::Component::Gene::ComparaOrthologs
- `EnsEMBL::Web::Utils::Compara::orthoset_prod_names`. If the answer is no, the production name is stored in `$species_not_relevant` and used to filter homologies, but not included in the list of species without orthologues, because they were not in any of the gene-tree analyses relevant to the given orthologue view.
- `$species_not_shown`, but the breakdown of strain types is stored in `$unshown_strain_types`.
- `no_ortho_count` total showing the different strain types (e.g. breeds, cultivars).
- `orthoset_prod_names`, it takes account only of genomes in the relevant clusterset(s). The effect of this on the orthologue summary is clearest for genomes in the Drosophila pangenome (e.g. Drosophila elegans), which now has only two rows in its orthologue summary: "Diptera" and "All".

EnsEMBL::Web::Component::Gene::HomologAlignment

EnsEMBL::Web::Component::Gene::ComparaTree
- `murinae`) from the `RELATED_TAXON` config.

Views affected
The various changes in this pull request will affect comparative items in the gene sidebar menu, Compara portal pages, gene-tree views, orthologue/paralogue/homologue table views, orthologue/paralogue/homologue alignment views, and the "Selected species" field set in orthologue table and orthologue alignment view config forms.
Please see related JIRA tickets for more information on the effects of these changes.
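As a rough illustration of the three strain-view cues listed under General changes, the sketch below combines them with a logical OR. This is an assumption: the PR says "one or more" of the cues are used, and individual modules may check only some of them. The helper `is_strain_view` is hypothetical and takes plain values rather than a real `$hub` object; the action names in the usage lines are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this PR). Arguments stand in for:
#   action         => $hub->action
#   referer_action => $hub->referer->{'ENSEMBL_ACTION'}
#   strain_param   => $hub->param('strain')
sub is_strain_view {
  my (%args) = @_;
  my $action         = defined $args{action}         ? $args{action}         : '';
  my $referer_action = defined $args{referer_action} ? $args{referer_action} : '';

  return 1 if $action =~ /^Strain_/;          # current view is a strain view
  return 1 if $referer_action =~ /^Strain_/;  # referer view is a strain view
  return 1 if $args{strain_param};            # a strain parameter has been passed
  return 0;
}

# Illustrative action names only.
print is_strain_view(action => 'Strain_Compara_Ortholog') ? "strain view\n" : "default view\n";
print is_strain_view(action => 'Compara_Ortholog')        ? "strain view\n" : "default view\n";
```

None of the three cues depends on whether the genome itself is tagged as a strain, which matches the stated aim of decoupling strain view status from the strain status of individual genomes.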
Possible complications
Because of the addition of `CLUSTERSET_PRODNAMES` to the `MULTI.db.packed` file, this PR would require repacking that file for Ensembl Vertebrates and Plants.

The main possible source of complications is the `ViewConfigForm::add_species_fieldset` method, which is used in both the `EnsEMBL::Web::ViewConfig::Gene::ComparaOrthologs` and `EnsEMBL::Web::ViewConfig::Gene::Family` modules, and potentially elsewhere.

However, for comparative views that do not contain strain-level genomes, the list of Compara species should remain the same as before, ultimately being taken from the `COMPARA_SPECIES` set.

Merge conflicts
None detected.
Related JIRA Issues (EBI developers only)