High memory use when using Python and threads #855

cjw85 · 2022-01-05T17:40:15Z

The program align.py uses mappy to align reads in Python using multiple worker threads. After loading the index, memory usage quickly jumps to >20Gb and then continues to climb steadily through 40Gb and beyond.

This issue was first discovered in bonito and isolated to mappy. The data flow in the example mirrors that in bonito but reduced to using only Python stdlib functionality.

mappy: v2.24
pysam: v0.18 (just for optionally reading fastq inputs)
python: v3.8.6

Run program, creating query sequences from index on the fly

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --threads 48

or using a directory containing *.fastq* files:

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --fastq_dir FAQ32498 --threads 48

The inputs I am using are available in the AWS S3 bucket at:

s3://ont-research/misc/mappy-mem/FAQ32498.tar
s3://ont-research/misc/mappy-mem/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi

I've not fully ascertained if using lots of threads exacerbates the problem or simply makes the symptom apparent more quickly.


cjw85 · 2022-01-06T10:46:04Z

@lh3 I or might be able to spare some time to dig through the Cython (though @marcus1487 is more a Cython person than me). valgrind gave me quite a bit of noise when I quickly ran it yesterday.

cjw85 · 2022-01-06T18:39:26Z

I've looked at this a little today. If I modify the program to not reuse the ThreadBuffer for each call to aligner.map() I don't observe such egregious memory use.

@lh3 Am I correct in thinking the minimap2 program does not use persistent mm_tbuf_ts for its entire lifetime? I'm starting to think this isn't a leak as such but an expansion in a buffer within mm_tbuf_ts as pathological reads/alignments are processed, with a buffer not being shrunk afterwards?

lh3 · 2022-01-06T19:40:46Z

Sorry that I don't use python threads and I don't know how python threads handle global and thread-local memory. Anyway, a ThreadBuffer only grows and never shrinks, until it gets destroyed. It is intended to be used through the life span of a thread. Minimap2 allocates one ThreadBuffer inside a newly spawned thread and uses the same buffer for multiple reads the thread processes. Minimap2 deallocates the buffer towards the end of the thread.

lh3 · 2022-01-06T19:51:33Z

> Anyway, a ThreadBuffer only grows and never shrinks

Actually a ThreadBuffer may shrink. The following block means that if the size of the buffer exceeds opt->cap_kalloc (which defaults to 1GB in v2.24) or the largest memory block is over 256MB, the thread buffer is reallocated.

minimap2/map.c

Lines 367 to 378 in 06fedaa

    
if (b->km) {
    km_stat(b->km, &kmst);
    if (mm_dbg_flag & MM_DBG_PRINT_QNAME)
        fprintf(stderr, "QM\t%s\t%d\tcap=%ld,nCore=%ld,largest=%ld\n", qname, qlen_sum, kmst.capacity, kmst.n_cores, kmst.largest);
    assert(kmst.n_blocks == kmst.n_cores); // otherwise, there is a memory leak
    if (kmst.largest > 1U<<28 || (opt->cap_kalloc > 0 && kmst.capacity > opt->cap_kalloc)) {
        if (mm_dbg_flag & MM_DBG_PRINT_QNAME)
            fprintf(stderr, "[W::%s] reset thread-local memory after read %s\n", __func__, qname);
        km_destroy(b->km);
        b->km = km_init();
    }
}
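To make the reset condition easier to reason about, here is a small Python model of the test in the C block above. It is only an illustration: `should_reset` and `LARGEST_BLOCK_LIMIT` are hypothetical names, not part of minimap2 or mappy, and the arguments stand in for `kmst.capacity`, `kmst.largest` and `opt->cap_kalloc`.

```python
# Illustrative model of the reset test in map.c; not part of minimap2 or mappy.
LARGEST_BLOCK_LIMIT = 1 << 28  # 256 MB, the hard-coded threshold in the C snippet


def should_reset(capacity, largest, cap_kalloc):
    """Return True when the C code would destroy and re-create b->km.

    capacity   -- total size of the thread buffer in bytes (kmst.capacity)
    largest    -- size of the largest memory block in bytes (kmst.largest)
    cap_kalloc -- opt->cap_kalloc; a value of 0 disables the capacity test
    """
    return largest > LARGEST_BLOCK_LIMIT or (cap_kalloc > 0 and capacity > cap_kalloc)
```

With a 1GB cap, a buffer holding 900MB made of small blocks would not be reset under this model, so each thread can sit just below the cap indefinitely.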

cjw85 · 2022-01-07T20:13:08Z

After a bit more prodding from both myself and @jts, I'm fairly well convinced that the high memory use I've observed is simply an accumulation in the size of the thread buffer, with nothing untoward in Python or Cython. I do occasionally see sizeable (i.e. 1Gb) deallocations.

If the example Python program is changed to periodically use a new ThreadBuffer in each thread, or not pass one to aligner.map() calls, memory use is more controlled. Both of us have also observed that when kalloc is disabled the example Python program does not have excessive memory use.
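A minimal sketch of the "periodically use a new ThreadBuffer" workaround, written so it runs without mappy installed. The function name and the `renew_every` parameter are hypothetical; the assumption is that, with mappy, `make_buffer` would be `mappy.ThreadBuffer` and `map_read` would wrap `aligner.map(seq, buf=buf)`. This is not the code from align.py.

```python
def map_with_renewed_buffer(reads, map_read, make_buffer, renew_every=1000):
    """Map reads in one worker thread, replacing the buffer every `renew_every` reads.

    map_read(seq, buf) -- callable doing the alignment; with mappy this could be
                          lambda seq, buf: list(aligner.map(seq, buf=buf))
    make_buffer()      -- factory for a fresh buffer; with mappy, mappy.ThreadBuffer
    renew_every        -- hypothetical tuning knob, not a mappy option
    """
    buf = make_buffer()
    results = []
    for i, seq in enumerate(reads, start=1):
        results.append(map_read(seq, buf))
        if i % renew_every == 0:
            # Rebinding drops the last reference to the old buffer so it can be
            # destroyed, releasing whatever memory it had accumulated.
            buf = make_buffer()
    return results
```

Each worker thread would call this with its own share of the reads, so no buffer is ever shared between threads.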

The part that I am still perplexed by is why this is happening in the Python program but not in minimap2 when applied to the same dataset. I have a theory it might simply come down to how work is being processed by the thread pools in the two cases and how often the allocation cap is therefore being hit and the thread buffer being reset.

cjw85 · 2024-03-14T17:13:07Z

After studying things more, I'm relatively well satisfied that in a sense this is the intended behaviour of the code and not a bug per se. (I will change the title of this issue to reflect this.)

I have datasets where, when aligning HG002 reads to GRCh38 with the minimap2 program (not mappy), I see runaway memory usage up to around 65GB on top of a baseline of around 20GB. I see there are various other issues reporting similar behaviour.

(using minimap2 v2.27: `minimap2 -t 64 -a -x map-ont grch38.fastq.gz reads.fastq.gz`)

If I disable use of kalloc I see much more stable memory usage, and no loss in performance. This raises the question: when does the use of kalloc outperform vanilla use of malloc in minimap2?

lh3 · 2024-03-14T17:48:36Z

Malloc performance is system dependent. When minimap2 was developed in 2018, kalloc was giving considerable performance improvement, on our server, over glibc (CentOS 6), musl and rpmalloc and minor improvement over tcmalloc and jemalloc. Similarly for bwa-mem, some users and myself could observe large performance increase with tcmalloc but some other users didn't see this.

Minimap2 does frequent heap allocation per read and across threads. Allocators are usually sensitive to this pattern. It is safer to enable kalloc for consistent performance across systems. One thing I may try is to reset kalloc much more frequently, for example, reset per million query bases. The resetting logic is currently implemented here:

minimap2/map.c

Lines 362 to 373 in 9b0ff24

    
if (b->km) {
    km_stat(b->km, &kmst);
    if (mm_dbg_flag & MM_DBG_PRINT_QNAME)
        fprintf(stderr, "QM\t%s\t%d\tcap=%ld,nCore=%ld,largest=%ld\n", qname, qlen_sum, kmst.capacity, kmst.n_cores, kmst.largest);
    assert(kmst.n_blocks == kmst.n_cores); // otherwise, there is a memory leak
    if (kmst.largest > 1U<<28 || (opt->cap_kalloc > 0 && kmst.capacity > opt->cap_kalloc)) {
        if (mm_dbg_flag & MM_DBG_PRINT_QNAME)
            fprintf(stderr, "[W::%s] reset thread-local memory after read %s\n", __func__, qname);
        km_destroy(b->km);
        b->km = km_init();
    }
}

Resetting for every read would look like:

if (b->km) {
    km_destroy(b->km);
    b->km = km_init();
}

lh3 · 2024-03-14T17:53:36Z

Looking at the source code, I realized another way to control kalloc resetting frequency is to add --cap-kalloc. It defaults to 1GB, which is partly why the memory in your run peaked at ~1 GB per thread. You may set a smaller --cap-kalloc and see what happens.
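As a back-of-envelope check of the per-thread figure: if each of the 64 threads in the command above holds up to the 1GB cap, the thread buffers alone could account for about 64GB, in line with the roughly 65GB of growth reported earlier. The helper name below is hypothetical, and the result is only an estimate: judging by the debug message in the C block, the reset test runs after a read has been processed, so a buffer may exceed the cap in between.

```python
GB = 10**9  # decimal gigabyte, used here only for a rough estimate


def kalloc_ceiling_bytes(threads, cap_kalloc_bytes):
    """Approximate total memory held by per-thread kalloc buffers.

    Assumes every thread has grown its buffer right up to the cap, which is
    the worst case under the reset rule quoted above.
    """
    return threads * cap_kalloc_bytes


print(kalloc_ceiling_bytes(64, 1 * GB) / GB)  # 64.0
```

The same arithmetic with a 100MB cap gives about 6.4GB across 64 threads, which suggests why a smaller cap is worth trying.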

cjw85 · 2024-03-14T18:15:50Z

For what it's worth, I'm using:

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

so nothing blazingly new, but not terribly crusty either. Maybe I'll spend my evening going into the weeds of glibc changes.

I've got a few experiments running, including setting --cap-kalloc smaller. I'd like to be comfortably below 32GB as a baseline; the vanilla malloc test shows that's certainly possible (for this dataset at least).

By the way, I noticed that the define HAVE_KALLOC appears to only apply to some parts of the code; I don't know if that was intentional or not.

cjw85 · 2024-03-14T22:20:00Z

Setting --cap-kalloc 100m --cap-sw-mem 50m (no particular reason for those choices, other than being smaller than the defaults) does provide more controlled memory usage as intended. The performance isn't noticeably worse so far with these settings.
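For anyone scripting this from Python, a sketch of how the command line with these two options might be assembled. The function name is hypothetical; the flags, preset and file names are the ones quoted in this thread, and the argument list is only built and printed here, not executed.

```python
import shlex


def minimap2_command(reference, reads, threads, cap_kalloc="100m", cap_sw_mem="50m"):
    """Build the minimap2 argument list with the memory caps used in this thread."""
    return [
        "minimap2", "-t", str(threads), "-a", "-x", "map-ont",
        "--cap-kalloc", cap_kalloc,   # cap on the per-thread kalloc buffer
        "--cap-sw-mem", cap_sw_mem,   # the second cap tried above
        reference, reads,
    ]


cmd = minimap2_command("grch38.fastq.gz", "reads.fastq.gz", threads=64)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` without going through a shell.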

Tomorrow I may look at the Python code to see if it can be made to expose these options.

lh3 added the help wanted label Jan 5, 2022

Adoni5 mentioned this issue Jun 26, 2023

Memory leaks jguhlin/minimap2-rs#40

Merged

cjw85 changed the title ~~Memory leak when using Python and threads~~ High memory use when using Python and threads Mar 14, 2024

jodjo86 mentioned this issue Feb 4, 2025

Problem running EMU with some samples treangenlab/emu#53

Closed
