Take the strain, disentangle clustersets #1093

Open: 5 commits to be merged into base: main

@twalsh-ebi (Contributor) commented Jun 27, 2025

Description

Strain-group metadata representing strain-reference relationships has become increasingly common in recent years, with 75 genomes in about 15 such groups in Ensembl Vertebrates 115.

The same period has seen strain gene-tree collections being used more broadly, to a point where Ensembl Plants release 115 cultivar gene-tree collections are expected to have almost as many genomes (123) as are in the default gene-tree collection (126).

Lines between strain and non-strain comparative data are becoming increasingly blurred. For some time, a proportion of genomes in strain collections have not themselves been strains, and as of release 115, there is at least one strain-tagged genome in a default gene-tree collection.

Finally, gene-tree configuration has become more complex in release 115 with the introduction of "interlocking" strain collections (strain species sets that share one or more genomes with each other), which are challenging to handle in a system built to treat the distinction between strain and default gene-tree collections as binary.

With the accumulation of data changes, some unexpected hiccups have come to light, including:

  • inactive default homology links for a genome tagged as a strain-level genome (ENSCOMPARASW-8481);
  • orthologue table of strain collections not showing orthologues from genomes that are in another strain collection (ENSCOMPARASW-8511);
  • various strain-related issues (ENSCOMPARASW-8512).

In conjunction with eg-web-common PR 144, this PR would decouple strain view status from the strain status of individual genomes, making comparative web views more robust to data changes, and making it possible to disentangle orthology data from different gene-tree clustersets.

General changes

  • Addition of module EnsEMBL::Web::Utils::Compara containing one public function orthoset_prod_names, which returns a list of the appropriate genome production names for a set of orthologues based on relevant information such as the Compara database (e.g. compara_pan_ensembl) or whether the orthologues are in a strain view.
  • Inclusion of CLUSTERSET_PRODNAMES data in MULTI.db.packed for use by EnsEMBL::Web::Utils::Compara::orthoset_prod_names, containing production names by clusterset in an Ensembl division that has at least one strain gene-tree collection (e.g. Vertebrates, Plants, Metazoa).
  • Use of EnsEMBL::Web::Utils::Compara::orthoset_prod_names to get the set of relevant production names as needed in various modules. This ensures the appropriate set of production names is configured in orthologue views. As a side benefit, it's less often necessary to explicitly skip ancestral_sequences or species absent from URL lookup (e.g. Human in Ensembl Plants), since such genomes are excluded from orthoset_prod_names.
  • Instead of checking strain view status with $hub->is_strain, one or more of three cues are used: $hub->action =~ /^Strain_/ (current view is a strain view), $hub->referer->{'ENSEMBL_ACTION'} =~ /^Strain_/ (referer view is a strain view) or $hub->param('strain') (a strain parameter has been passed).
  • The not_strain availability tag is dropped, so that strains can have default gene-tree/homology links.
  • Availability tags has_default_compara and has_strain_compara are used to determine whether default and strain Compara portal links are active.
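
As a rough illustration of the three strain-view cues listed above, the following sketch combines them into a single check. It is not the code in this PR: `MockHub` and `in_strain_context` are hypothetical names introduced here, and the mock only imitates the `action`, `referer` and `param` accessors mentioned in the description. The action names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the web hub, exposing only the three
# accessors named in the description: action, referer and param.
package MockHub {
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub action  { return $_[0]{action} }
    sub referer { return $_[0]{referer} || {} }
    sub param   { my ($self, $name) = @_; return $self->{params}{$name} }
}

# Hypothetical helper: true if any of the three cues indicates a strain view.
sub in_strain_context {
    my ($hub) = @_;
    my $action         = $hub->action // '';
    my $referer_action = $hub->referer->{'ENSEMBL_ACTION'} // '';
    return 1 if $action         =~ /^Strain_/;   # current view is a strain view
    return 1 if $referer_action =~ /^Strain_/;   # referer view is a strain view
    return 1 if $hub->param('strain');           # a strain parameter was passed
    return 0;
}

# Illustrative usage with made-up action names.
my $strain_hub  = MockHub->new(action => 'Strain_Compara_Ortholog');
my $default_hub = MockHub->new(action => 'Compara_Ortholog');
print in_strain_context($strain_hub)  ? "strain view\n" : "default view\n";
print in_strain_context($default_hub) ? "strain view\n" : "default view\n";
```

Which of the cues a given module consults may differ from module to module; the sketch simply treats any one of them as sufficient.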

EnsEMBL::Web::Component::Gene::ComparaOrthologs

  • The question, "Should we be showing this orthologue on this page by default?" is now answered by checking whether the given production name is in the list returned by EnsEMBL::Web::Utils::Compara::orthoset_prod_names. If it is not, the production name is stored in $species_not_relevant and used to filter homologies, but it is not included in the list of species without orthologues, because that genome was not in any of the gene-tree analyses relevant to the given orthologue view.
  • Relevant species and strains lacking orthologues with the current gene are now counted together in $species_not_shown, but the breakdown of strain types is stored in $unshown_strain_types.
  • A single list of "Species without orthologues" (ordered by display name) is shown at the bottom of the orthologue view. If relevant, a strain-type breakdown is placed after the no_ortho_count total showing the different strain types (e.g. breeds, cultivars).
  • Because the orthologue summary table is generated using orthoset_prod_names, it takes account only of genomes in the relevant clusterset(s). The effect of this on the orthologue summary is most clear for genomes in the Drosophila pangenome (e.g. Drosophila elegans), which now has only two rows in its orthologue summary: "Diptera" and "All".
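
The bookkeeping described in this section can be sketched roughly as follows. This is a simplified illustration, not the component code: `partition_species` is a hypothetical helper, the orthoset is passed in as a plain list standing in for the result of orthoset_prod_names, and all production names and strain types are made up.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the bookkeeping described above.
#   $candidates     : arrayref of production names known to the site
#   $orthoset       : arrayref standing in for the orthoset_prod_names result
#   $has_orthologue : hashref, production name => true if an orthologue exists
#   $strain_type    : hashref, production name => strain type (absent for non-strains)
sub partition_species {
    my ($candidates, $orthoset, $has_orthologue, $strain_type) = @_;
    my %relevant = map { $_ => 1 } @$orthoset;
    my (%species_not_relevant, %species_not_shown, %unshown_strain_types);

    for my $prod_name (@$candidates) {
        # Genomes outside the relevant clusterset(s) are filtered out and
        # are not reported as "without orthologues".
        if (!$relevant{$prod_name}) {
            $species_not_relevant{$prod_name} = 1;
            next;
        }
        next if $has_orthologue->{$prod_name};

        # Relevant species and strains are counted together, with a
        # separate tally per strain type.
        $species_not_shown{$prod_name} = 1;
        my $type = $strain_type->{$prod_name};
        $unshown_strain_types{$type}++ if defined $type;
    }
    return (\%species_not_relevant, \%species_not_shown, \%unshown_strain_types);
}

# Illustrative usage with made-up production names.
my ($not_relevant, $not_shown, $types) = partition_species(
    [qw(genome_a genome_b breed_c breed_d)],
    [qw(genome_a breed_c breed_d)],
    { genome_a => 1 },
    { breed_c => 'breed', breed_d => 'breed' },
);
printf "%d not relevant, %d without orthologues (%d breeds)\n",
    scalar keys %$not_relevant, scalar keys %$not_shown, $types->{breed} // 0;
```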

EnsEMBL::Web::Component::Gene::HomologAlignment

  • Homology alignments are not shown if the homologue is in a genome that is not in one of the relevant clustersets.

EnsEMBL::Web::Component::Gene::ComparaTree

  • False alarms affecting the "Phylogenetic model selection" info box are reduced by fetching the strain clusterset_id (e.g. murinae) from the RELATED_TAXON config.
  • The gene-tree option to "Collapse all the nodes at the taxonomic rank" is deactivated in strain views, rather than for gene trees viewed from a non-reference-strain genome.

Views affected

The various changes in this pull request will affect comparative items in the gene sidebar menu, Compara portal pages, gene-tree views, orthologue/paralogue/homologue table views, orthologue/paralogue/homologue alignment views, and the "Selected species" field set in orthologue table and orthologue alignment view config forms.

Please see related JIRA tickets for more information on the effects of these changes.

Possible complications

Because of the addition of CLUSTERSET_PRODNAMES to the MULTI.db.packed file, this PR would require repacking that file for Ensembl Vertebrates and Plants.

The main possible source of complications is the ViewConfigForm::add_species_fieldset method, which is used in both EnsEMBL::Web::ViewConfig::Gene::ComparaOrthologs and EnsEMBL::Web::ViewConfig::Gene::Family modules, and potentially elsewhere.

However, for comparative views that do not contain strain-level genomes, the list of Compara species should remain the same as before, ultimately being taken from the COMPARA_SPECIES set.

Merge conflicts

None detected.

Related JIRA Issues (EBI developers only)

  • ENSCOMPARASW-8481
  • ENSCOMPARASW-8511
  • ENSCOMPARASW-8512
  • ENSCOMPARASW-8516

@@ -93,34 +93,34 @@ sub populate_tree {

   $compara_menu->append($self->create_node('Compara_Tree', 'Gene tree',
     [qw( image EnsEMBL::Web::Component::Gene::ComparaTree )],
-    { 'availability' => 'gene database:compara core has_gene_tree not_strain' }
+    { 'availability' => 'gene database:compara core has_gene_tree' }
Contributor Author

Removal of the not_strain availability tag here and elsewhere in this module addresses ENSCOMPARASW-8481 ("Inactive homology links for Avena sativa Sang on Ensembl Plants website").
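
To illustrate why dropping the tag activates the link, here is a minimal sketch that assumes a menu node is active only when every space-separated tag in its availability string is true. That rule, the `node_is_available` helper and the flag values are assumptions made for illustration; the actual availability logic in the web code may differ.

```perl
use strict;
use warnings;

# Hypothetical check: a node is available only if every tag in its
# space-separated availability string is true in the flags hash.
sub node_is_available {
    my ($availability, $flags) = @_;
    for my $tag (split ' ', $availability) {
        return 0 unless $flags->{$tag};
    }
    return 1;
}

# Illustrative flags for a strain-tagged genome that has a gene tree.
my %flags = (
    'gene'             => 1,
    'database:compara' => 1,
    'core'             => 1,
    'has_gene_tree'    => 1,
    'not_strain'       => 0,
);

my $old = 'gene database:compara core has_gene_tree not_strain';
my $new = 'gene database:compara core has_gene_tree';

printf "old availability: %s\n", node_is_available($old, \%flags) ? 'active' : 'inactive';
printf "new availability: %s\n", node_is_available($new, \%flags) ? 'active' : 'inactive';
```

Under these assumptions the old string leaves the node inactive for a strain-tagged genome, while the new string makes it active.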

my $strain_group = $species_defs->get_config($species, 'STRAIN_GROUP');
my $related_taxon = $species_defs->get_config($species, 'RELATED_TAXON');
if ($hub->action =~ /^Strain_/) {
  unless (($strain_group && $strain_group eq $this_group) || ($related_taxon && $related_taxon eq $species_defs->RELATED_TAXON)) {
Contributor Author

@twalsh-ebi twalsh-ebi Jun 28, 2025

Issue ENSCOMPARASW-8511 is related to this requirement for either the STRAIN_GROUP or RELATED_TAXON to match in strain views.

This could be circumvented by checking instead for whether the production name is in the orthoset_prod_names set for the given strain gene-tree collection (as below).

(See ENSCOMPARASW-8511 and ENSCOMPARASW-8516 for more information on the orthoset returned by orthoset_prod_names.)
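
A simplified comparison of the two approaches, for illustration only. The hash layout, the helper names `matches_group` and `in_orthoset`, and all genome, group and taxon names are hypothetical; the orthoset is a plain list standing in for whatever orthoset_prod_names returns for the strain gene-tree collection.

```perl
use strict;
use warnings;

# Group-matching check (simplified): keep a genome only if its strain
# group or related taxon matches that of the current species.
sub matches_group {
    my ($genome, $current) = @_;
    return 1 if $genome->{strain_group}
             && $genome->{strain_group} eq ($current->{strain_group} // '');
    return 1 if $genome->{related_taxon}
             && $genome->{related_taxon} eq ($current->{related_taxon} // '');
    return 0;
}

# Membership check: keep a genome if its production name is in the orthoset.
sub in_orthoset {
    my ($genome, $orthoset) = @_;
    my %lookup = map { $_ => 1 } @$orthoset;
    return $lookup{ $genome->{prod_name} } ? 1 : 0;
}

# Made-up example: two genomes in the same strain gene-tree collection
# but with different strain-group and related-taxon metadata.
my $current = { prod_name => 'genome_x', strain_group => 'group_x', related_taxon => 'taxon_x' };
my $other   = { prod_name => 'genome_y', strain_group => 'group_y', related_taxon => 'taxon_y' };
my @orthoset = qw(genome_x genome_y);

printf "group match: %d, orthoset membership: %d\n",
    matches_group($other, $current), in_orthoset($other, \@orthoset);
```

In this made-up case the group-matching check hides genome_y, whereas the membership check keeps it because both genomes are in the same collection.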

? "are not shown in the table above because they don't have any orthologue with"
: "is not shown in the table above because it doesn't have any orthologue with"
;

$html .= '<br /><a name="list_no_ortho"/>' . $self->_info(
'Species without orthologues',
Contributor Author

What this looks like if 1 breed lacks orthologues:
[Screenshot: ENSSSCG00110020516_no_ortho_sandbox]

What this looks like if many breeds, isolates and species lack orthologues:
[Screenshot: ENSSSCG00000048809_no_ortho_sandbox]
