
Allow chunking uBAM #141

Merged: rhpvorderman merged 8 commits from chunk_ubam into marcelm:main on Nov 4, 2024
Conversation

rhpvorderman (Collaborator)

This PR allows reading raw BAM records with no header present. Because this format cannot be auto-detected, it must be specified explicitly; "bam_no_header" was chosen as the format name.

A chunker is also included that yields chunks of whole BAM records. These chunks can be read using the "bam_no_header" format in dnaio.open.

With these additions, cutadapt should be able to set up multithreaded reading of BAM records.
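
For illustration, a minimal sketch of how the pieces could fit together. Hedged throughout: the format name "bam_no_header" comes from this PR, but routing the new chunker through dnaio.read_chunks (as for FASTQ), decompressing the BGZF stream with gzip first, and the chunker skipping the header are all assumptions.

```python
import gzip
import io

import dnaio

# BGZF is a gzip variant, so the standard gzip module can decompress a
# BAM file (it reads concatenated gzip members).
with gzip.open("reads.unmapped.bam", "rb") as f:
    # Assumed: the chunker skips the BAM header and yields buffers that
    # end on whole-record boundaries.
    for chunk in dnaio.read_chunks(f):
        data = bytes(chunk)  # copy; the underlying buffer may be reused
        with dnaio.open(io.BytesIO(data), fileformat="bam_no_header") as reader:
            for record in reader:
                print(record.name, len(record.sequence))
```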

rhpvorderman (Collaborator, Author) commented Oct 8, 2024

@marcelm I did a proof of concept here: https://github.com/marcelm/cutadapt/pull/812/files

It is possible to get multithreaded BAM reading going with just the changes in this PR.

In the future we need to think about how to handle the header. Ideally, the header is copied from the input file with an additional @PG line added by cutadapt.
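
For concreteness, a hypothetical sketch (not part of this PR or of dnaio's API) of appending such a line to the SAM-text form of the header; the @PG tag names ID, PN, VN, and CL come from the SAM specification:

```python
# Hypothetical helper: copy the input's SAM-formatted header text and
# append a @PG line recording the cutadapt invocation.
def amend_header(header_text: str, version: str, command_line: str) -> str:
    pg_line = f"@PG\tID:cutadapt\tPN:cutadapt\tVN:{version}\tCL:{command_line}\n"
    return header_text + pg_line
```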

EDIT: I propose delaying the 1.3.0 release until BAM writing and tag support are implemented. Some of the tags, mainly the methylation tag, also need to be cut properly if cutadapt is going to be useful for Nanopore data. Implementing this is not going to be "fun", but on the other hand it will elevate dnaio and cutadapt over the competition, making cutadapt the only viable tool.

marcelm (Owner) commented Oct 8, 2024

Awesome, thanks! I will hopefully be able to review this towards the end of the week; no time today.

rhpvorderman (Collaborator, Author)

@marcelm Friendly reminder ping. I am currently watching cutadapt blaze through some Nanopore files on only one core, which reminded me of this PR. Please do not take this as pressure; I am happy to wait a few weeks more.

marcelm (Owner) left a comment

Awesome, this looks good! And sorry for the delay. It did not actually take that long to review; I should have done it earlier.

One thought I had is that this way of chunking only works for single-end reads, so that is one more issue that would need to be solved in order to enable paired-end BAM input.

With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.
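
As a rough sketch of that design (illustrative names only; it assumes a free-threaded Python build so the workers actually run in parallel):

```python
# One parser thread batches records onto a bounded queue; worker threads
# consume the batches. Not dnaio/cutadapt code, just the shape of the idea.
import queue
import threading

def parse_into_batches(reader, work_queue, n_workers, batch_size=10000):
    batch = []
    for record in reader:
        batch.append(record)
        if len(batch) == batch_size:
            work_queue.put(batch)
            batch = []
    if batch:
        work_queue.put(batch)
    for _ in range(n_workers):
        work_queue.put(None)  # one sentinel per worker: input exhausted

def worker(work_queue, process_record):
    while (batch := work_queue.get()) is not None:
        for record in batch:
            process_record(record)
```

A bounded queue (queue.Queue(maxsize=...)) keeps the parser from running arbitrarily far ahead of the workers, and because the parser sees every record, it can choose batch boundaries so that both mates of a pair land in the same batch.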

rhpvorderman (Collaborator, Author)

> Awesome, this looks good! And sorry for the delay. It did not actually take that long to review; I should have done it earlier.

There is no haste at all. I can run code from git if needed.

> One thought I had is that this way of chunking only works for single-end reads, so that is one more issue that would need to be solved in order to enable paired-end BAM input.

Let's think about that when the use case arises. In that event, only supporting name-sorted reads seems viable. Indexing support is best done through htslib, and in that case pysam is a much more viable option.
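
As a hypothetical illustration of why name-sorted input makes this tractable: both mates carry the same QNAME and sit next to each other, so pairing reduces to consuming records two at a time.

```python
# Illustrative only, not proposed API: pair up adjacent mates from a
# name-sorted uBAM record stream.
def iter_pairs(records):
    it = iter(records)
    for r1 in it:
        r2 = next(it, None)
        if r2 is None:
            raise ValueError(f"Missing mate for {r1.name!r}")
        if r1.name != r2.name:
            raise ValueError(f"Mate not adjacent to {r1.name!r}")
        yield r1, r2
```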

> With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.

Yes, that would be great! It looks like we still need to wait a bit for that. I am curious what the impact of that change will be.

marcelm (Owner) commented Nov 4, 2024

Feel free to merge when ready.

rhpvorderman merged commit ec56277 into marcelm:main on Nov 4, 2024. 16 checks passed.
rhpvorderman deleted the chunk_ubam branch on November 4, 2024 at 14:11.
rhpvorderman (Collaborator, Author)

Done. Thanks for the review!
