Sex Chromosome Identification by Negating Kmer Densities (SCINKD) is a wrapper that implicates the sex chromosome linkage group in a haplotype-resolved genome of the heterogametic sex when the sex chromosome system is unknown. SCINKD [v2.1.0] is a Snakemake implementation of the conceptual framework below.
SCINKD is a framework to identify sex chromosomes that operates under a few generalized assumptions about a diploid genome:
- Polymorphisms are broadly uniform between haplotypes within a single diploid individual.
- Genetic differences occur at much higher densities on the sex-limited region of the sex chromosomes.
- This density is then identifiable by isolating haplotype-specific kmer densities and comparing them within and between both haplotypes (the smallest SDR identified to date has been ~1 Mb).
Here is a graphical representation of these points:
This implementation uses meryl to count and negate kmers from two genomic haplotypes.
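To make "negating" concrete, here is a minimal toy sketch of the idea in plain Python. It is not how meryl or SCINKD is implemented: the function names are hypothetical, the sequences are invented, and reverse complements (canonical kmers) are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of every k-length substring of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def negate(hap1, hap2, k):
    """Return (kmers found only in hap1, kmers found only in hap2)."""
    a, b = kmers(hap1, k), kmers(hap2, k)
    return a - b, b - a


# Toy haplotypes that differ by a single substitution (T -> A at index 7).
hap1 = "ACGTACGTTTGACCA"
hap2 = "ACGTACGATTGACCA"
only1, only2 = negate(hap1, hap2, k=5)
print(len(only1), len(only2))  # prints: 5 5
```

A single substitution produces up to k haplotype-specific kmers in each haplotype, so regions dense in differences accumulate many more of them than the rest of the genome.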
- SCINKD/SCINKD.v2.1.0.FULL = Most up-to-date SCINKD pipeline (without kmer compression).
- SCINKD/SCINKD.v2.1.0.GREEDY = Most up-to-date SCINKD pipeline with added homopolymer compression, which reduces runtime and file sizes many-fold but also reduces sensitivity enormously. This may be optimal for known systems with strong signals (e.g. mammals and birds) or in taxa with large genomes (~10 Gb+).
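For readers unfamiliar with homopolymer compression, the sketch below shows the general idea: every run of identical bases collapses to one base. This is an illustration only; the function name is hypothetical and it does not show how the GREEDY snakefile performs the compression.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical bases to a single base."""
    # groupby yields one (base, run) pair per run of consecutive equal bases.
    return "".join(base for base, _run in groupby(seq))


print(compress_homopolymers("AAACCGTTTT"))  # prints: ACGT
```

Because distinct sequences can collapse to the same compressed string, differences in homopolymer length no longer produce haplotype-specific kmers, which is consistent with the loss of sensitivity noted above.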
Running the test dataset on a cluster with a 24-core/24 GB RAM allocation reported these times upon successful completion:
time snakemake --use-conda -c 24 -s SCINKD/SCINKD.v2.1.0.FULL.snakefile
real 19m49.171s
user 37m25.996s
sys 1m2.541s
time snakemake --use-conda -c 24 -s SCINKD/SCINKD.v2.1.0.GREEDY.snakefile
real 6m22.552s
user 13m42.636s
sys 0m37.158s
To install:
git clone https://github.com/DrPintoThe2nd/SCINKD.git
mamba create -n scinkd meryl=1.4.1 snakemake=7.32.4 pigz r r-dplyr r-ggplot2 samtools --yes
mamba activate scinkd
File naming restriction: Both input haplotype fasta files MUST be gzipped (or bgzipped), and their names (or the names of symbolic links to them) MUST end in ".hap1.fasta.gz" and ".hap2.fasta.gz".
Disclaimer: This technique reports phasing differences between haplotypes, including contaminants, so it's important to look deeper into any regions of interest.
For the test dataset provided (https://doi.org/10.6084/m9.figshare.27040678.v2), the naming requirement can be met simply with symbolic links:
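A quick pre-flight check of the file names can catch naming mistakes before a run. The sketch below is a hypothetical helper, not part of SCINKD; it assumes that both files share everything before the required suffix, which is what the single "prefix" entry in the config implies.

```python
HAP1_SUFFIX = ".hap1.fasta.gz"
HAP2_SUFFIX = ".hap2.fasta.gz"


def shared_prefix(hap1_name, hap2_name):
    """Return the common prefix of a haplotype pair, or raise ValueError."""
    if not hap1_name.endswith(HAP1_SUFFIX):
        raise ValueError(f"{hap1_name} must end in {HAP1_SUFFIX}")
    if not hap2_name.endswith(HAP2_SUFFIX):
        raise ValueError(f"{hap2_name} must end in {HAP2_SUFFIX}")
    prefix1 = hap1_name[: -len(HAP1_SUFFIX)]
    prefix2 = hap2_name[: -len(HAP2_SUFFIX)]
    if prefix1 != prefix2:
        # Assumption: one prefix is expected to name both haplotypes.
        raise ValueError(f"prefixes differ: {prefix1!r} vs {prefix2!r}")
    return prefix1
```

For example, calling it with "sample.hap1.fasta.gz" and "sample.hap2.fasta.gz" returns "sample", while a file ending in ".fa.gz" raises an error.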
wget https://figshare.com/ndownloader/files/49948980
wget https://figshare.com/ndownloader/files/49948983
ln -s 49948980 Anniella_stebbinsi_HiFi_2024.asm.hic.hap1.fasta.gz
ln -s 49948983 Anniella_stebbinsi_HiFi_2024.asm.hic.hap2.fasta.gz
Then, ensure the SCINKD/config.json file reads:
{
"prefix": "Anniella_stebbinsi_HiFi_2024.asm.hic"
}
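If you prefer to generate the config rather than edit it by hand, a minimal sketch could look like the following. The helper name is hypothetical, and it assumes "prefix" is the only key the config needs, as in the example above.

```python
import json


def render_config(prefix):
    """Return the text of a config.json holding only the 'prefix' key."""
    return json.dumps({"prefix": prefix}, indent=4) + "\n"


# Usage (uncomment to write the file; the SCINKD/ directory must exist):
# with open("SCINKD/config.json", "w") as handle:
#     handle.write(render_config("Anniella_stebbinsi_HiFi_2024.asm.hic"))
```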
To run the pipeline on the provided Anniella genome on a machine with 24 available threads (and the default setting of 16 GB of available RAM):
snakemake --use-conda -np -s SCINKD/SCINKD.v2.1.0.FULL.snakefile #dry-run to test inputs
snakemake --use-conda -c 24 -s SCINKD/SCINKD.v2.1.0.GREEDY.snakefile #run SCINKD in greedy mode for quick testing
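Chromosome lengths can be calculated using samtools faidx (column two of the resulting .fai index file). Note that samtools faidx can only index compressed fasta files if they are bgzip-compressed, so plain-gzipped files need to be recompressed with bgzip first: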
samtools faidx Anniella_stebbinsi_HiFi_2024.asm.hic.hap1.fasta.gz
samtools faidx Anniella_stebbinsi_HiFi_2024.asm.hic.hap2.fasta.gz
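The lengths can then be read from the tab-separated index, whose first two columns are sequence name and length. The helper below is a hypothetical sketch for doing so in Python; the index path in the usage comment assumes samtools wrote the index next to the fasta with a ".fai" extension.

```python
def read_fai_lengths(handle):
    """Map sequence name -> length from an open samtools .fai file."""
    lengths = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths[fields[0]] = int(fields[1])
    return lengths


# Usage:
# with open("Anniella_stebbinsi_HiFi_2024.asm.hic.hap1.fasta.gz.fai") as fh:
#     hap1_lengths = read_fai_lengths(fh)
```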
Template code used in generating these plots is enclosed (Anniella_template.R) and test files useful for replicating these plots are available alongside the test dataset (https://doi.org/10.6084/m9.figshare.27040678.v2).
Downstream plotting shows the linear relationship between chromosome length and number of haplotype-specific kmers, and highlights the sex chromosomes, which deviate significantly from this expectation:
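The same expectation can be checked numerically. The sketch below fits an ordinary least-squares line of kmer count on chromosome length and reports each chromosome's residual. It is an illustration with invented numbers, not a port of Anniella_template.R, and it ranks residuals rather than testing significance.

```python
def residuals_from_fit(lengths, counts):
    """Fit counts ~ lengths by least squares; return name -> residual."""
    names = sorted(lengths)
    xs = [lengths[n] for n in names]
    ys = [counts[n] for n in names]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {n: y - (slope * x + intercept) for n, x, y in zip(names, xs, ys)}


# Invented example: chr5 is short but carries many haplotype-specific kmers.
lengths = {"chr1": 100, "chr2": 80, "chr3": 60, "chr4": 40, "chr5": 20}
counts = {"chr1": 1000, "chr2": 800, "chr3": 600, "chr4": 400, "chr5": 800}
res = residuals_from_fit(lengths, counts)
print(max(res, key=res.get))  # prints: chr5
```

Note that a strong outlier also pulls the fitted line toward itself, so a robust fit or a fit that excludes the candidate may be preferable for real data.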
Kmer densities on the Z and W are observably higher:
Regions of increased kmer densities converge on a single part of the chromosome, syntenic with chicken chromosome 11.
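Density along a chromosome can be summarized by counting haplotype-specific kmers in fixed-size windows. The sketch below assumes you already have 0-based start positions for those kmers on one chromosome; how SCINKD derives such positions is not covered here, and the function name and window size are hypothetical.

```python
def window_counts(positions, chrom_length, window):
    """Count kmer start positions falling in each fixed-size window."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 0 <= pos < chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[pos // window] += 1
    return counts


print(window_counts([5, 12, 14, 15, 31], chrom_length=40, window=10))
# prints: [1, 3, 0, 1]
```

Windows with counts well above the chromosome-wide background are candidates for the sex-determining region.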
[additional documentation to be added]
After implicating a linkage group as a putative sex chromosome, additional analyses are still needed to validate it. I'd recommend starting with a 1:1 haplotype alignment and diving deeper into that using a program like pafr https://github.com/dwinter/pafr or SVbyEye (shown) https://github.com/daewoooo/SVbyEye: