Do not add RG by default #58

okartal · 2018-11-28T13:33:21Z

Problem

I have encountered a problem with bwameth that pops up when the FASTQ comment contains a read group. In this case, bwameth only outputs the SAM header without any reads.

Details

This is the command I run:

> bwameth.py --reference ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta -t 4 data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq > data/test/test-line_A.mapped.sam

The stdout/stderr output is here:

running: /home/oender/anaconda3/envs/population-epigenetics/bin/python /home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:test-line_A-R.classified.qc\tSM:test-line_A-R.classified.qc' -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -
converting reads in data/test/test-line_A-R1.classified.qc.fastq,data/test/test-line_A-R2.classified.qc.fastq
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 320080 sequences (40000212 bp)...
[M::process] 0 single-end sequences; 320080 paired-end sequences
WARNING: 1709 reads with length < 80
       : this program is designed for long reads
[M::process] read 121626 sequences (15199052 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 97487, 4, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 215, 277)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 493)
[M::mem_pestat] mean and std.dev: (227.57, 79.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 601)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 320080 reads in 245.362 CPU sec, 61.338 real sec

['NM:i:21', 'MD:Z:40^GGAATTGTTGATTTGGATTT80G5', 'MC:Z:126M', 'AS:i:97', 'XS:i:83', 'RG:Z:test-line_A-R.classified.qc', 'XA:Z:f3,+14193782,40S86M,1;f3,+14204191,40S86M,1;', 'RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGGAATTGTTGAGAAAAGTTTATCGGGTTTGAGGAATTGTTGAGAAAAGTTTATTGGGTTTGAGGATTTGTTGATTAGGAGTGGAAATTGTTGAGAAAAATTTATTGGGTTTTAGGAA', 'YC:Z:CT']
700523F:121:CB0L6ANXX:1:1103:2712:2482
Traceback (most recent call last):
  File "/home/oender/anaconda3/envs/population-epigenetics/bin/bwameth.py", line 4, in <module>
    __import__('pkg_resources').run_script('bwameth==0.2.2', 'bwameth.py')
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1444, in run_script
    exec(code, namespace, namespace)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 509, in <module>
    main(sys.argv[1:])
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 506, in main
    set_as_failed=args.set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 331, in bwa_mem
    as_bam(cmd, fa, set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 353, in as_bam
    for aln in handle_reads(pair_list, set_as_failed):
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 376, in handle_reads
    orig_seq = aln.original_seq
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 284, in original_seq
    return next(x for x in self.other if x.startswith("YS:Z:"))[5:]
StopIteration
[M::process] 0 single-end sequences; 121626 paired-end sequences

As you can see, RG:Z:CB0L6ANXX:1:ATTCCT is the RG that was part of the FASTQ input:

> head -n1 data/test/test-line_A-R{1,2}.classified.qc.fastq
==> data/test/test-line_A-R1.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

==> data/test/test-line_A-R2.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

I think it is a bug that bwameth adds RG:Z:test-line_A-R.classified.qc although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command

bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -

(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.
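The StopIteration itself can be reproduced in isolation. This is a minimal sketch, assuming the list printed just before the traceback is the optional-field list that original_seq searches (self.other). The YS sequence is shortened for readability, and find_tag is a hypothetical diagnostic helper, not part of bwameth:

```python
# Optional SAM fields as printed in the debug output above
# (YS sequence shortened for readability; some fields omitted).
other = [
    "NM:i:21",
    "MC:Z:126M",
    "AS:i:97",
    "RG:Z:test-line_A-R.classified.qc",
    "RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGG",
    "YC:Z:CT",
]


def original_seq(fields):
    # Same lookup as in the traceback: first field that starts with "YS:Z:".
    return next(x for x in fields if x.startswith("YS:Z:"))[5:]


def find_tag(fields, prefix):
    """Hypothetical diagnostic: also look inside space-joined fields."""
    for field in fields:
        for part in field.split(" "):
            if part.startswith(prefix):
                return part[len(prefix):]
    return None


try:
    original_seq(other)
    raised = False
except StopIteration:
    raised = True

print("lookup raised StopIteration:", raised)
print("YS found inside a fused field:", find_tag(other, "YS:Z:"))
```

No element starts with YS:Z: because that tag is joined by a space to the RG tag coming from the FASTQ comment, so next() has nothing to return.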

Suggestion

As I see it, the problem arises from how the read group argument is handled. You can probably leave the function bwa_mem as it is and change only how it is called. I am not certain, but I suspect that, in the call to bwa_mem, the argument

rg=args.read_group or rname(*args.fastqs)

causes the trouble when I do not supply a read group on the command line, because it falls back to a read group derived from the FASTQ file names. Alternatively, adding the RG to the header would have to be decoupled from adding RG tags to individual reads.
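To make the first option concrete, here is a rough sketch of building the bwa mem command so that -R is passed only when a read group was actually given. The function build_bwa_cmd and its parameters are hypothetical and not taken from bwameth; the bwa mem flags and the @RG format are copied from the "running:" line in the log above:

```python
def build_bwa_cmd(index, threads, read_group=None):
    """Hypothetical helper: assemble the bwa mem call, adding -R only on request."""
    cmd = ["bwa", "mem", "-T", "40", "-B", "2", "-L", "10", "-CM", "-U", "100", "-p"]
    if read_group:
        # Expand a bare name into a full @RG line, as in the logged command.
        # bwa expects a literal backslash-t here, hence the escaped "\\t".
        if not read_group.startswith("@RG"):
            read_group = "@RG\\tID:{rg}\\tSM:{rg}".format(rg=read_group)
        cmd += ["-R", read_group]
    cmd += ["-t", str(threads), index, "-"]
    return cmd


# No read group on the command line: no -R, so RG tags from the FASTQ pass through.
print(" ".join(build_bwa_cmd("TAIR10.fasta.bwameth.c2t", 4)))

# Read group given explicitly: -R is added as before.
print(" ".join(build_bwa_cmd("TAIR10.fasta.bwameth.c2t", 4, "test-line_A")))
```

With this approach the caller would pass args.read_group directly instead of falling back to rname(*args.fastqs). Whether the header then still needs an @RG line is a separate question.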

brentp · 2018-11-28T14:52:13Z

Thanks for the careful description. I don't intend to fix this, but I will accept a PR that does. I think that would also require adding the RG to the header manually.

okartal · 2018-11-30T12:18:08Z

@brentp I will fork and try it
