Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not add RG by default #58

Open
okartal opened this issue Nov 28, 2018 · 2 comments
Open

Do not add RG by default #58

okartal opened this issue Nov 28, 2018 · 2 comments

Comments

@okartal
Copy link

okartal commented Nov 28, 2018

Problem

I have encountered a problem with bwameth that pops up when the FASTQ comment contains a read group. In this case, bwameth only outputs the SAM header without any reads.

Details

This is the command I run:

> bwameth.py --reference ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta -t 4 data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq > data/test/test-line_A.mapped.sam

The stdout/stderr output is here:

running: /home/oender/anaconda3/envs/population-epigenetics/bin/python /home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -R '@RG\tID:test-line_A-R.classified.qc\tSM:test-line_A-R.classified.qc' -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -
converting reads in data/test/test-line_A-R1.classified.qc.fastq,data/test/test-line_A-R2.classified.qc.fastq
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 320080 sequences (40000212 bp)...
[M::process] 0 single-end sequences; 320080 paired-end sequences
WARNING: 1709 reads with length < 80
       : this program is designed for long reads
[M::process] read 121626 sequences (15199052 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 97487, 4, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 215, 277)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 493)
[M::mem_pestat] mean and std.dev: (227.57, 79.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 601)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 320080 reads in 245.362 CPU sec, 61.338 real sec

['NM:i:21', 'MD:Z:40^GGAATTGTTGATTTGGATTT80G5', 'MC:Z:126M', 'AS:i:97', 'XS:i:83', 'RG:Z:test-line_A-R.classified.qc', 'XA:Z:f3,+14193782,40S86M,1;f3,+14204191,40S86M,1;', 'RG:Z:CB0L6ANXX:1:ATTCCT YS:Z:TTTGGATTTGGAATTGTTGAGAAAAGTTTATCGGGTTTGAGGAATTGTTGAGAAAAGTTTATTGGGTTTGAGGATTTGTTGATTAGGAGTGGAAATTGTTGAGAAAAATTTATTGGGTTTTAGGAA', 'YC:Z:CT']
700523F:121:CB0L6ANXX:1:1103:2712:2482
Traceback (most recent call last):
  File "/home/oender/anaconda3/envs/population-epigenetics/bin/bwameth.py", line 4, in <module>
    __import__('pkg_resources').run_script('bwameth==0.2.2', 'bwameth.py')
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1444, in run_script
    exec(code, namespace, namespace)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 509, in <module>
    main(sys.argv[1:])
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 506, in main
    set_as_failed=args.set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 331, in bwa_mem
    as_bam(cmd, fa, set_as_failed)
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 353, in as_bam
    for aln in handle_reads(pair_list, set_as_failed):
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 376, in handle_reads
    orig_seq = aln.original_seq
  File "/home/oender/anaconda3/envs/population-epigenetics/lib/python3.6/site-packages/bwameth-0.2.2-py2.7.egg-info/scripts/bwameth.py", line 284, in original_seq
    return next(x for x in self.other if x.startswith("YS:Z:"))[5:]
StopIteration
[M::process] 0 single-end sequences; 121626 paired-end sequences

As you can see, RG:Z:CB0L6ANXX:1:ATTCCT is the RG that was part of the FASTQ input:

> head -n1 data/test/test-line_A-R{1,2}.classified.qc.fastq
==> data/test/test-line_A-R1.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

==> data/test/test-line_A-R2.classified.qc.fastq <==
@700523F:121:CB0L6ANXX:1:1103:2712:2482	RG:Z:CB0L6ANXX:1:ATTCCT

I think it is a bug that bwameth adds RG:Z:test-line_A-R.classified.qc although I did not supply any read group parameter and actually want to pass through the RGs in the FASTQs. Indeed, when I run the command

bwameth.py c2t data/test/test-line_A-R1.classified.qc.fastq data/test/test-line_A-R2.classified.qc.fastq |bwa mem -T 40 -B 2 -L 10 -CM -U 100 -p -t 4  ../data/arabidopsis_thaliana/genome_assembly/TAIR10.fasta.bwameth.c2t -

(i.e., explicitly removing -R '...') everything works, although the SAM has to be converted back.

Suggestion

As I see it, the problem arises because of the way in which the read group argument is handled. Probably, you can leave the function bwa_mem as it is but change how it is called. It is not quite clear but I guess in the call of bwa_mem,

rg=args.read_group or rname(*args.fastqs)

causes the trouble if I do not supply a read group parameter on the command line. Or you have to disentangle the addition of RG to the header from RGs for individual reads.

@brentp
Copy link
Owner

brentp commented Nov 28, 2018

thanks for the careful description. I don't intend to fix, but will accept a PR that does. I think that would also require manually setting the addition of the RG to the header.

@okartal
Copy link
Author

okartal commented Nov 30, 2018

@brentp I will fork and try it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants