
Allow chunking uBAM #141

Merged: rhpvorderman merged 8 commits from chunk_ubam into marcelm:main on Nov 4, 2024
Conversation

rhpvorderman (Collaborator)

This PR allows reading raw BAM records with no header present. Because this format cannot be auto-detected, it must be specified explicitly; "bam_no_header" was chosen as the format name.

A chunker is also included that yields chunks of whole BAM records. These chunks can be read using the "bam_no_header" format in dnaio.open.

With these additions, cutadapt should be able to set up multithreaded reading of BAM records.
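
For illustration, a minimal sketch of how the pieces could fit together. Hedged throughout: the format name "bam_no_header" comes from this PR, but routing the new chunker through dnaio.read_chunks (as for FASTQ), decompressing the BGZF stream with gzip first, and the chunker skipping the header are all assumptions.

```python
import gzip
import io

import dnaio

# BGZF is a gzip variant, so the standard gzip module can decompress a
# BAM file (it reads concatenated gzip members).
with gzip.open("reads.unmapped.bam", "rb") as f:
    # Assumed: the chunker skips the BAM header and yields buffers that
    # end on whole-record boundaries.
    for chunk in dnaio.read_chunks(f):
        data = bytes(chunk)  # copy; the underlying buffer may be reused
        with dnaio.open(io.BytesIO(data), fileformat="bam_no_header") as reader:
            for record in reader:
                print(record.name, len(record.sequence))
```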

rhpvorderman (Collaborator, Author) commented Oct 8, 2024

@marcelm I did a proof of concept here: https://github.com/marcelm/cutadapt/pull/812/files

It is possible to get multithreaded BAM reading going with just the changes in this PR.

In the future we need to think about how to handle the header. Ideally, the header is copied from the input file with an additional @PG line added by cutadapt.
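
For concreteness, a hypothetical sketch (not part of this PR or of dnaio's API) of appending such a line to the SAM-text form of the header; the @PG tag names ID, PN, VN, and CL come from the SAM specification:

```python
# Hypothetical helper: copy the input's SAM-formatted header text and
# append a @PG line recording the cutadapt invocation.
def amend_header(header_text: str, version: str, command_line: str) -> str:
    pg_line = f"@PG\tID:cutadapt\tPN:cutadapt\tVN:{version}\tCL:{command_line}\n"
    return header_text + pg_line
```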

EDIT: I propose delaying the 1.3.0 release until BAM writing and tag support are implemented. Some of the tags, mainly the methylation tag, also need to be cut properly if cutadapt is going to be useful for Nanopore data. Implementing this is not going to be "fun", but on the other hand it will elevate dnaio and cutadapt over the competition, making cutadapt the only viable tool.

marcelm (Owner) commented Oct 8, 2024

Awesome, thanks! I will hopefully be able to review this towards the end of the week; no time today.

rhpvorderman (Collaborator, Author)

@marcelm Friendly reminder ping. I am currently watching cutadapt blaze through some Nanopore files on only one core, which reminded me of this PR. Please do not take this as pressure; I am happy to wait a few weeks more.

marcelm (Owner) left a comment

Awesome, this looks good! And sorry for the delay. It did not actually take that long to review; I should have done it earlier.

One thought I had is that this way of chunking only works for single-end reads, so that is one more issue that would need to be solved in order to enable paired-end BAM input.

With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.
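
As a rough sketch of that design (illustrative names only; it assumes a free-threaded Python build so the workers actually run in parallel):

```python
# One parser thread batches records onto a bounded queue; worker threads
# consume the batches. Not dnaio/cutadapt code, just the shape of the idea.
import queue
import threading

def parse_into_batches(reader, work_queue, n_workers, batch_size=10000):
    batch = []
    for record in reader:
        batch.append(record)
        if len(batch) == batch_size:
            work_queue.put(batch)
            batch = []
    if batch:
        work_queue.put(batch)
    for _ in range(n_workers):
        work_queue.put(None)  # one sentinel per worker: input exhausted

def worker(work_queue, process_record):
    while (batch := work_queue.get()) is not None:
        for record in batch:
            process_record(record)
```

A bounded queue (queue.Queue(maxsize=...)) keeps the parser from running arbitrarily far ahead of the workers, and because the parser sees every record, it can choose batch boundaries so that both mates of a pair land in the same batch.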

rhpvorderman (Collaborator, Author)

> Awesome, this looks good! And sorry for the delay. It did not actually take that long to review; I should have done it earlier.

There is no haste at all. I can run code from git if needed.

> One thought I had is that this way of chunking only works for single-end reads, so that is one more issue that would need to be solved in order to enable paired-end BAM input.

Let's think about that when the use case arises. In that event, only supporting name-sorted reads seems viable. Indexing support is best done through htslib, and in that case pysam is a much more viable option.
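
As a hypothetical illustration of why name-sorted input makes this tractable: both mates carry the same QNAME and sit next to each other, so pairing reduces to consuming records two at a time.

```python
# Illustrative only, not proposed API: pair up adjacent mates from a
# name-sorted uBAM record stream.
def iter_pairs(records):
    it = iter(records)
    for r1 in it:
        r2 = next(it, None)
        if r2 is None:
            raise ValueError(f"Missing mate for {r1.name!r}")
        if r1.name != r2.name:
            raise ValueError(f"Mate not adjacent to {r1.name!r}")
        yield r1, r2
```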

> With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.

Yes, that would be great! It looks like we still need to wait a bit for that. I am curious what the impact of that change will be.

marcelm (Owner) commented Nov 4, 2024

Feel free to merge when ready.

rhpvorderman merged commit ec56277 into marcelm:main on Nov 4, 2024. 16 checks passed.
rhpvorderman deleted the chunk_ubam branch on November 4, 2024 at 14:11.
rhpvorderman (Collaborator, Author)

Done. Thanks for the review!
