Allow chunking uBAM #141
Conversation
@marcelm I did a proof of concept here: https://github.com/marcelm/cutadapt/pull/812/files It is possible to get multithreaded BAM reading going with just the changes in this PR. In the future we need to think about how to handle the header. Ideally, the header is copied from the input file with an additional PG line added by cutadapt. EDIT: I propose delaying the 1.3.0 release until BAM writing and tag support are implemented. Some of the tags, mainly the methylation tag, also need to be cut properly if cutadapt is going to be useful for Nanopore data. It is not going to be "fun" to implement this, but on the other hand it will elevate dnaio and cutadapt over the competition, making cutadapt the only viable tool.
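To illustrate the header idea above, here is a minimal sketch of appending a @PG line to SAM-format header text. The helper name and its parameters are hypothetical, not part of dnaio or cutadapt:

```python
def add_pg_line(header_text: str, program: str, version: str, command: str) -> str:
    """Append a @PG line to SAM-format header text.

    Hypothetical helper for illustration only; the real implementation
    would also need to handle PP chaining and ID collisions.
    """
    pg_line = f"@PG\tID:{program}\tPN:{program}\tVN:{version}\tCL:{command}\n"
    if header_text and not header_text.endswith("\n"):
        header_text += "\n"
    return header_text + pg_line

new_header = add_pg_line(
    "@HD\tVN:1.6\tSO:unsorted\n",
    "cutadapt", "4.0", "cutadapt -a AGATCGG in.bam",
)
```

A real implementation would copy the input file's full header and append this line, rather than constructing the header from scratch.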
Awesome, thanks! Will hopefully be able to review this towards the end of the week, no time today.
@marcelm Friendly reminder ping. I am currently watching cutadapt blaze through some Nanopore files on only one core, which reminded me of this PR. Please do not take this as pressure; I am happy to wait a few more weeks.
Awesome, this looks good! And sorry for the delay. It did not actually take that long to review; I should have done it earlier.
One thought I had is that this way of chunking only works for single-end reads. So that is one more issue that would need to be solved in order to enable paired-end BAM input.
With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.
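The parser-thread idea above can be sketched with today's `threading` and `queue` modules. This is a toy illustration, not dnaio code: the "records" are plain strings, the "work" is a stand-in for trimming, and the batching simply keeps adjacent records (i.e. paired-end mates) in the same batch:

```python
import queue
import threading

_SENTINEL = object()  # signals end of input to the workers

def parser_thread(records, out_q, batch_size):
    """Dedicated parser: because it sees record content, it can ensure
    both mates of a pair (adjacent records here) land in one batch."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            out_q.put(batch)
            batch = []
    if batch:
        out_q.put(batch)
    out_q.put(_SENTINEL)

def worker_thread(in_q, out, lock):
    while True:
        batch = in_q.get()
        if batch is _SENTINEL:
            in_q.put(_SENTINEL)  # re-post so the other workers also stop
            return
        processed = [rec.upper() for rec in batch]  # stand-in for trimming
        with lock:
            out.extend(processed)

records = ["read1/1", "read1/2", "read2/1", "read2/2"]
q = queue.Queue(maxsize=8)
results = []
lock = threading.Lock()
workers = [threading.Thread(target=worker_thread, args=(q, results, lock))
           for _ in range(2)]
for w in workers:
    w.start()
parser = threading.Thread(target=parser_thread, args=(records, q, 2))
parser.start()
parser.join()
for w in workers:
    w.join()
```

Under the current GIL the workers would not actually run in parallel; the point is that with free threading this structure replaces byte-level chunking entirely, since no pickling between processes is needed.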
There is no haste at all. I can run code from git if needed.
Let's think about that when the use case arises. In that event, only supporting name-sorted reads seems viable. Indexing support is best done through htslib, and in that case pysam is a much more viable option.
Yes, that would be great! It looks like we still need to wait a bit for that. I am curious what the impact of that change will be.
Feel free to merge when ready.
Done. Thanks for the review!
This allows reading raw BAM records with no header present. This format cannot be auto-detected, so it must be specified explicitly; "bam_no_header" was chosen as the format name.
A chunker is also included that yields chunks of BAM records. These chunks can then be read using the "bam_no_header" format in dnaio.open.
With these additions, cutadapt should be able to set up multithreaded reading of BAM records.
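For readers unfamiliar with the format: each BAM record begins with a 4-byte little-endian `block_size` field giving the number of bytes that follow, which is what makes headerless chunking possible. The sketch below shows the principle on an in-memory buffer; dnaio's actual chunker operates on file objects and may differ in detail:

```python
import struct

def split_bam_records(buf: bytes):
    """Split a buffer of headerless, uncompressed BAM records at record
    boundaries using each record's leading little-endian block_size.
    Returns (complete_records, leftover_bytes). Illustration only."""
    records = []
    offset = 0
    while offset + 4 <= len(buf):
        (block_size,) = struct.unpack_from("<I", buf, offset)
        end = offset + 4 + block_size
        if end > len(buf):
            break  # partial record at the tail belongs to the next chunk
        records.append(buf[offset:end])
        offset = end
    return records, buf[offset:]

# Two fake "records" plus a truncated one at the end of the buffer:
rec1 = struct.pack("<I", 3) + b"abc"
rec2 = struct.pack("<I", 5) + b"defgh"
tail = struct.pack("<I", 10) + b"xx"  # claims 10 bytes, only 2 present
records, leftover = split_bam_records(rec1 + rec2 + tail)
```

Because record boundaries can be found without parsing record contents, a chunker can hand each worker a byte range that starts and ends exactly on a record boundary.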