Skip to content
/ hapmatch Public

Determines lineages with given ingroup and outgroup frequencies in coalescent genealogies

License

Notifications You must be signed in to change notification settings

mpcox/hapmatch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hapmatch

This C++ program calculates haplotypes with given ingroup and outgroup frequencies in a set of coalescent genealogies. The code was used to calculate lineage frequencies for a bottlenecked Madagascar ingroup versus its source Indonesian outgroup in:

Cox MP, MG Nelson, MK Tumonggor, FX Ricaut and H Sudoyo. 2012. A small cohort of Island Southeast Asian women founded Madagascar. Proceedings of the Royal Society B 279: 2761-2768.

hapmatch reads input from Richard Hudson's ms, like the following worked example file, test.ms:

ms 10 1 –t 4 –I 2 4 6 ...
...
1    111001100000110110000011110000000011101101110001110011 ⎤
2    110001100000110110000011110000000011101101110001110011   Malagasy
3    000001000000100110000011000000010000001010001100111000 
4    000001000000100110000011000000010000001010001100111000 ⎦
5    000001000000100110000011000000010000000000001100111000 ⎤
6    000001000000100110000011000000010000001010001100111000 
7    000000000010100110000011000110010000001110001100000000   Indonesians
8    000000000010100110000011000110010000001110001100000000 
9    000000000010100110000011000110000000001110001000000000 
10   000000000010100110000011000110011000001110011100000000 ⎦

Where sequences 1–4 are simulated data representing the Madagascar ingroup and 5–10 represent the Indonesian outgroup.

Running hapmatch with an ingroup frequency cutoff of ≥0.2, an outgroup frequency cutoff of ≤0.8 and requiring exact matches:

cat test.ms | hapmatch 4 0.2 6 0.8 0

returns:

S       Sin     Sout    hapmatch
54      31      20      1

This output occurs because:

  • Ingroup sequence 1 has >2 mismatches with all six outgroup sequences.
  • Ingroup sequence 2 has >2 mismatches with all six outgroup sequences.
  • Ingroup sequence 3 has 2 mismatches with outgroup sequence 5. This meets the frequency cutoff criteria (1/4 ≥ 0.2 for the ingroup and 1/6 ≤ 0.8 for the outgroup), but not the exact match criterion.
  • Ingroup sequence 4 is an exact match with outgroup sequence 6. This meets the frequency cutoff criteria (1/4 ≥ 0.2 for the ingroup and 1/6 ≤ 0.8 for the outgroup), including the exact match criterion. This is the match identified under the 'hapmatch' heading.

INSTALLATION

hapmatch requires a working installation of Kevin Thornton's libsequence library. The current code has been updated to require libsequence v2.

The easiest way to install libsequence is via bioconda:

conda install -c bioconda libsequence

The hapmatch source code can then be compiled with:

g++ -o hapmatch hapmatch.cc -lsequence -std=c++11 -Wall

USAGE

Usage information can be found by running the command:

hapmatch

The program expects five input values: the sample size and frequency cutoff of the ingroup, the sample size and frequency cutoff of the outgroup, and a threshold for the number of allowed mismatches. A threshold of 0 returns exact matches only.

hapmatch 4 0.444 6 0.948 0

EXAMPLE

The following command simulates 3 datasets, each containing 10 chromosome copies, with the first 4 copies deriving from population 1 (the ingroup) and the remaining 6 copies from population 2 (the outgroup). hapmatch is set to find exact matches only, where haplotypes also meet the frequency cutoffs of ≥44.4% for population 1 and ≤94.8% for population 2.

ms 10 3 -t 4 -I 2 4 6 1 | hapmatch 4 0.444 6 0.948 0

The coalescent genealogies generated by ms are random, but the output formatting looks like the following, with each line containing summary values for a single input dataset:

S    Sin    Sout    hapmatch
27    26    24    0
17    12    15    1
14    8    9    0

where S is the total number of segregating sites, Sin is the number of segregating sites in the ingroup, Sout is the number of segregating sites in the outgroup, and hapmatch is the number of haplotypes that meet the required criteria.

About

Determines lineages with given ingroup and outgroup frequencies in coalescent genealogies

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published