Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: wtsi-npg/simtools
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: V1.4
Choose a base ref
...
head repository: wtsi-npg/simtools
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: master
Choose a head ref

Commits on Jul 16, 2013

  1. Merge pull request #6 from wtsi-npg/master

    update dev branch from master (I hope)
    jenniferliddle committed Jul 16, 2013
    Copy the full SHA
    e7c47d9 View commit details

Commits on Jul 22, 2013

  1. Merge pull request #7 from wtsi-npg/master

    Fixed "pure virtual method called" error, and "dereferencing type-punned...
    jenniferliddle committed Jul 22, 2013
    Copy the full SHA
    221149f View commit details

Commits on Sep 2, 2013

  1. Added help text and placeholders for QC option

    Iain Bancarz committed Sep 2, 2013
    Copy the full SHA
    5b803c5 View commit details
  2. More QC placeholders

    Iain Bancarz committed Sep 2, 2013
    Copy the full SHA
    5675994 View commit details
  3. Added simple test for QC function

    Iain Bancarz committed Sep 2, 2013
    Copy the full SHA
    bb345c0 View commit details

Commits on Sep 4, 2013

  1. Ignore emacs temp files

    Iain Bancarz committed Sep 4, 2013
    Copy the full SHA
    5189646 View commit details
  2. Initial changes to support intensity QC metrics

    Iain Bancarz committed Sep 4, 2013
    Copy the full SHA
    8142268 View commit details

Commits on Sep 5, 2013

  1. Write sample names and magnitudes to file

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    e7b685f View commit details
  2. Find magnitude with arbitrarily many intensity channels

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    9bd8562 View commit details
  3. Use new QC class

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    ff61d3c View commit details
  4. Fix signed/unsigned integer comparison warning

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    c16ba26 View commit details
  5. Copy the full SHA
    cd4bb88 View commit details
  6. Include QC.o in simtools compliation

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    302f538 View commit details
  7. Get rid of hello() method, add placeholder for xydiff

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    167bb49 View commit details
  8. Add constructor and Sim instance variable to QC class

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    ce89355 View commit details
  9. Add xydiff computation; remove sim argument from magnitude functions;…

    … use fprintf for output
    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    84824d4 View commit details
  10. Changed variable names for greater clarity

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    d62dff9 View commit details
  11. Changed help messages; require QC output to files (must specify at le…

    …ast one of magnitude, xydiff)
    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    5b5556a View commit details
  12. Updated QC tests

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    8196a23 View commit details
  13. Changed QC test filenames

    Iain Bancarz committed Sep 5, 2013
    Copy the full SHA
    7d90087 View commit details

Commits on Sep 6, 2013

  1. Fix memory leaks and support greater verbosity for QC

    Iain Bancarz committed Sep 6, 2013
    Copy the full SHA
    727bb34 View commit details

Commits on Sep 9, 2013

  1. Intensity vectors as class members; do not repeatedly create/destroy …

    …vector objects
    Iain Bancarz committed Sep 9, 2013
    Copy the full SHA
    1f51094 View commit details
  2. Reserve sufficient vector space initially

    Iain Bancarz committed Sep 9, 2013
    Copy the full SHA
    15101db View commit details

Commits on Sep 11, 2013

  1. Use pointers instead of at() for faster access to vector

    Iain Bancarz committed Sep 11, 2013
    Copy the full SHA
    19b77eb View commit details

Commits on Sep 12, 2013

  1. Read intensities into array instead of vector for QC

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    20d4626 View commit details
  2. Make inFileRaw private

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    01a3226 View commit details
  3. Read header from C-style file descriptor

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    40ff779 View commit details
  4. Only use C-style file descriptor for input; may still be a few redund…

    …ant variables to fix
    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    ccfe087 View commit details
  5. Read intensities into arrays, not vectors

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    488541f View commit details
  6. Remove output from previous tests; introduced variable for illuminus …

    …output filename
    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    829399a View commit details
  7. Bugfix; do not call fclose on inFileRaw

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    11c4a73 View commit details
  8. Remove unnecessary cerr statements

    Iain Bancarz committed Sep 12, 2013
    Copy the full SHA
    ac1e3ea View commit details

Commits on Sep 13, 2013

  1. Remove unnecessary iostream input code

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    c1dbcee View commit details
  2. Rename inFileRaw variable to inFile

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    f98e355 View commit details
  3. Remove unnecessary status chatter

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    df6d1dd View commit details
  4. Modified to work with new Sim class

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    c0e4d1a View commit details
  5. Bigger sample name buffer

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    9caa363 View commit details
  6. Add test for 'sim' executable

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    1a2f279 View commit details
  7. Do not store input path in 'filename' variable

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    a4b4526 View commit details
  8. Rename Sim.open() to Sim.openInput()

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    aadcbf0 View commit details
  9. close() now checks and closes both input and output

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    ef771be View commit details
  10. Call close() method on Sim object for QC

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    e7a01d7 View commit details
  11. Close and delete sim/qc objects before exit

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    882638f View commit details
  12. Rename createFile to openOutput

    Iain Bancarz committed Sep 13, 2013
    Copy the full SHA
    2e20257 View commit details

Commits on Sep 17, 2013

  1. Initial commit of additional test data

    Iain Bancarz committed Sep 17, 2013
    Copy the full SHA
    b6ef70f View commit details
  2. Add -lm compiler flag. Remove -ffast-math which causes errors handlin…

    …g INF/NaN values.
    Iain Bancarz committed Sep 17, 2013
    Copy the full SHA
    e37f5d7 View commit details
  3. Add CLEANUP constatns

    Iain Bancarz committed Sep 17, 2013
    Copy the full SHA
    5d2dd5f View commit details
  4. Copy the full SHA
    7b1096b View commit details
  5. Copy the full SHA
    9d1bcb0 View commit details
  6. Report non-numeric values

    Iain Bancarz committed Sep 17, 2013
    Copy the full SHA
    298b37a View commit details
19 changes: 19 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
# Emacs temporary files
*~

# Compiled Object files
*.slo
*.lo
@@ -11,3 +14,19 @@
*.lai
*.la
*.a

# Temporary test directories
temp*

# Files generated by cxxtest
test_simtools.cpp
runner
runner.cpp

# Compiled executables
sim
simtools
g2i
gtc
gtc_process
normalize_manifest
287 changes: 287 additions & 0 deletions Egt.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,287 @@

// Egt.cpp
//
// Author: Iain Bancarz <ib5@sanger.ac.uk>
//
// Copyright (c) 2014 Genome Research Ltd.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
// 1. Redistributions of source code must retain the above copyright notice,
// this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
// 3. Neither the name of Genome Research Ltd nor the names of the
// contributors may be used to endorse or promote products derived from
// software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
// IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
// OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
// IN NO EVENT SHALL GENOME RESEARCH LTD. BE LIABLE FOR ANY DIRECT, INDIRECT,
// INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
// BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
// USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
// THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

/*
* Egt is a class to parse Illumina EGT cluster files.
*
* This is a port of the EGT Python class in zCall to C++.
* There is no public documentation for the EGT binary file format.
* (However, the zCall EGT class is claimed to originate from Illumina.)
*
* Repository for zCall: https://github.com/wtsi-npg/zCall
*
* Native EGT format stores coordinates as (R, Theta) polar coordinates
* Python EGT class converts to Cartesian for storage in memory
* This class stores as polar
* (More efficient than converting and storing as Cartesian, since both are
needed for FCR output)
*
* IMPORTANT: Illumina defines its own polar coordinates:
* - Theta is rescaled, such that 1 unit = pi/2 radians
* - R is equal to x+y, *not* sqrt(x^2 + y^2)
* - R is a "Manhattan distance", not Euclidean distance
*/

#include <string>
#include <iostream>
#include <sstream>
#include "Egt.h"

using namespace std;

Egt::Egt(bool verbose)
{
this->verbose = verbose;
GENOTYPES_PER_SNP = 3;
PARAMS_PER_SNP = 12;
NUMERIC_BYTES = 4;
ENTRIES_IN_RECORD = 30;
BYTES_IN_RECORD = NUMERIC_BYTES * ENTRIES_IN_RECORD;
ENTRIES_TO_USE = 15;
}

void Egt::open(string filename)
{
this->filename = filename;
ifstream file;

file.open(filename.c_str());
if (!file) {
cerr << "Can't open file: " << filename << endl << flush;
exit(1);
}
// read header data
readHeader(file);
readPreface(file);
if (verbose) {
printHeader();
printPreface();
}
// read cluster data
// expected cluster positions, in polar coordinates (R, Theta) for (AA,AB,BB)
// Order of params is: devRAA, devRAB, devRBB, meanRAA, meanRAB, meanRBB, devThetaAA, devThetaAB, devThetaBB, meanThetaAA, meanThetaAB, meanThetaBB
// Can't use a multidimensional array because we don't know snpTotal at compile time; instead use a pseudo-multidimensional array such that the (i,j)th value is (i*WIDTH + j)
counts = new int[GENOTYPES_PER_SNP*snpTotal]; // nAA, nAB, nBB
params = new float[PARAMS_PER_SNP*snpTotal];
char *block = new char[BYTES_IN_RECORD];
for (int i=0; i<snpTotal; i++) {
file.read(block, BYTES_IN_RECORD);
int *ints = bytesToInts(block, 0, GENOTYPES_PER_SNP);
float *floats = bytesToFloats(block, GENOTYPES_PER_SNP, ENTRIES_IN_RECORD);
for (int j=0;j<GENOTYPES_PER_SNP;j++)
counts[i*GENOTYPES_PER_SNP + j] = ints[j];
for (int j=0;j<PARAMS_PER_SNP;j++)
params[i*PARAMS_PER_SNP + j] = floats[j];
delete floats;
delete ints;
}
delete block;
snpNames = new string[snpTotal];
readSNPNames(file, snpNames);
file.close();
}

void Egt::open(char *filename)
{
string f = filename;
open(f);
}


int* Egt::bytesToInts(char block[], int start, int end) {
// convert a section of a byte array into ints
// start, end indices refer to positions in the array of ints (not bytes)
int *results = new int[end - start];
numericConverter converter;
for (int i=start;i<end;i++) {
for (int j=0;j<NUMERIC_BYTES;j++) {
converter.ncChar[j] = block[i*NUMERIC_BYTES + j];
}
results[i-start] = converter.ncInt;
}
return results;
}

float* Egt::bytesToFloats(char block[], int start, int end) {
// convert a section of a byte array into floats
// start, end indices refer to positions in the array of floats (not bytes)
float *results = new float[end - start];
numericConverter converter;
for (int i=start;i<end;i++) {
for (int j=0;j<NUMERIC_BYTES;j++) {
converter.ncChar[j] = block[i*NUMERIC_BYTES + j];
}
results[i-start] = converter.ncFloat;
}
return results;
}

void Egt::getClusters(long index, float snp_params[]) {
// return all (theta, R) params for given position in the SNP manifest
// includes s.d. and mean of (r, theta) for AA, AB, BB
int start = index*PARAMS_PER_SNP;
for (int j=0; j < PARAMS_PER_SNP; j++) {
snp_params[j] = this->params[start + j];
}
}

void Egt::getMeanR(long index, float means[]) {
// find mean polar radius for AA, AB, BB at given index
int start = index*PARAMS_PER_SNP + GENOTYPES_PER_SNP;
for (int j=0; j < GENOTYPES_PER_SNP; j++) {
means[j] = this->params[start + j];
}
}

void Egt::getMeanTheta(long index, float means[]) {
// find mean polar angle for AA, AB, BB at given index
int start = index*PARAMS_PER_SNP + 3*GENOTYPES_PER_SNP;
for (int j=0; j < GENOTYPES_PER_SNP; j++) {
means[j] = this->params[start + j];
}
}

numericConverter Egt::getNextConverter(ifstream &file) {
// convenience method to read the next few bytes into a union
// can then use the union for numeric conversion
char * buffer;
buffer = new char[NUMERIC_BYTES];
file.read(buffer, NUMERIC_BYTES);
numericConverter converter;
for (int i=0; i<NUMERIC_BYTES; i++) {
converter.ncChar[i] = buffer[i];
}
delete buffer;
return converter;
}

float Egt::readFloat(ifstream &file) {
float result;
numericConverter converter = getNextConverter(file);
result = converter.ncFloat;
return result;

}

void Egt::readHeader(ifstream &file)
{
// populate instance variables with header values
file.seekg(0); // set read position to zero, if not already there
fileVersion = readInteger(file);
gcVersion = readString(file, "GC version");
clusterVersion = readString(file, "Cluster version");
callVersion = readString(file, "Call version");
normalizationVersion = readString(file, "Normalization version");
dateCreated = readString(file, "Date created");
mode = file.get();
manifest = readString(file, "Manifest name");
}

int Egt::readInteger(ifstream &file)
{
int result;
numericConverter converter = getNextConverter(file);
result = converter.ncInt;
return result;
}

void Egt::readPreface(ifstream &file) {
// read the 'preface' from the body of an EGT file
// assumes file is positioned at start of the body
dataVersion = readInteger(file);
opa = readString(file, "OPA");
snpTotal = readInteger(file);
}

void Egt::readSNPNames(ifstream &file, string names[]) {
// read SNP names from an EGT file
// assumes file is positioned at end of cluster (mean, sd) data
int pos = file.tellg();
file.seekg(pos + 13 * snpTotal); // skip SNP quality scores
for (int i=0;i<snpTotal;i++) {
// skip genotype scores
// length of strings is unknown, so must read each one and discard it
readString(file, "genotype score");
}
for (int i=0;i<snpTotal;i++) {
stringstream sstream;
sstream << "SNP name at index " << i;
names[i] = readString(file, sstream.str());
}
}

string Egt::readString(ifstream &file, string name) {
// EGT string format is as follows:
// - First byte is a *signed* char encoding the string length
// - Subsequent bytes contain the string
// Total bytes read is (length encoded in first byte)+1 -- at most 128
// Second argument is an (optional) identifier used for logging
int length = int(file.get()); // get a single byte
if (not file.good()) {
throw("Cannot read length from EGT file, file state is not good");
} else if (length < 0) {
throw("Illegal string length in EGT file");
} else if (length == 0) {
cerr << "Warning: String '" << name << "' has zero length." << endl;
return "";
}
char * buffer;
buffer = new char[length+1];
buffer[length] = '\0'; // ensure buffer ends with a null character
file.read(buffer, length);
if (not file.good()) {
stringstream sstream;
sstream << "Cannot read string '" << name << \
"' from EGT file, file state is not good";
throw(sstream.str());
}
string result = string(buffer);
delete buffer;
return result;
}

void Egt::printHeader() {
// convenience method to print file header
cout << "FILE_VERSION " << fileVersion << endl;
cout << "GC_VERSION " << gcVersion << endl;
cout << "CLUSTER_VERSION " << clusterVersion << endl;
cout << "CALL_VERSION " << callVersion << endl;
cout << "NORMALIZATION_VERSION " << normalizationVersion << endl;
cout << "DATE_CREATED " << dateCreated << endl;
cout << "MODE " << (int) mode << endl;
cout << "MANIFEST " << manifest << endl;
}

void Egt::printPreface() {
// convenience method to print preface of "main" data section
cout << "DATA_VERSION " << dataVersion << endl;
cout << "OPA " << opa << endl;
cout << "TOTAL_SNPS " << snpTotal << endl;
}
Loading