Multithreading vs. multiprocessing #171

Open
hagenw opened this issue Apr 12, 2024 · 4 comments
Labels: question (Further information is requested)

Comments

@hagenw
Member

hagenw commented Apr 12, 2024

At the moment the default is multiprocessing=False, but I wonder what the reasoning behind it was/is.

When browsing the web, I find statements like the following:

  • multi-threading is good for IO-bound processes like reading or downloading files
  • multi-processing is good for computationally heavy tasks (see the sketch below)
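
The usual explanation for this rule of thumb is CPython's GIL: pure-Python, CPU-bound code cannot run in parallel across threads, while the GIL is released during blocking IO (and inside many C extensions). A minimal sketch, independent of audinterface, that shows the effect with concurrent.futures:

import concurrent.futures
import time

def cpu_bound(n):
    # pure-Python loop, holds the GIL the whole time
    return sum(i * i for i in range(n))

def io_bound(seconds):
    # sleeping stands in for waiting on IO; the GIL is released while blocked
    time.sleep(seconds)

if __name__ == "__main__":
    for executor_cls in [
        concurrent.futures.ThreadPoolExecutor,
        concurrent.futures.ProcessPoolExecutor,
    ]:
        with executor_cls(max_workers=4) as pool:
            t0 = time.time()
            list(pool.map(cpu_bound, [1_000_000] * 8))
            t_cpu = time.time() - t0
            t0 = time.time()
            list(pool.map(io_bound, [0.2] * 8))
            t_io = time.time() - t0
        print(f"{executor_cls.__name__}: cpu-bound {t_cpu:.2f} s, io-bound {t_io:.2f} s")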

When doing a simple test:

import audb
import audinterface
import audmath
import time

def process_func(signal, sampling_rate):
    return audmath.db(audmath.rms(signal))

db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audinterface.Feature(
            ["rms"],
            process_func=process_func,
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")

it returns (when run a second time)

multiprocessing=False, num_workers=1: 0.16 s                                                        
multiprocessing=False, num_workers=5: 0.26 s
multiprocessing=True, num_workers=1: 0.16 s
multiprocessing=True, num_workers=5: 0.11 s

Even though we don't do heavy processing here, multi-processing seems to be faster in this case. Is this expected?

/cc @ureichel, @ChristianGeng, @frankenjoe, @maxschmitt, @audeerington, @schruefer

hagenw added the question label on Apr 12, 2024
@frankenjoe
Collaborator

I sometimes run into problems with multi-processing, e.g. an older version of opensmile did not support it, I think.

@hagenw
Member Author

hagenw commented Apr 12, 2024

Yes, I also remember that multiprocessing=False seemed to be the safer choice, and in audb it does provide the expected speed-up when downloading files. But I wonder if this might be different when executing the process function in audinterface.

@maxschmitt
Contributor

I think "heavy processing" is always relative, but either way the overhead might still account for most of the computing time.

Measuring time spent in the processing function:

import audb
import audinterface
import audmath
import time

def process_func(signal, sampling_rate):
    global tsum
    tx = time.time()
    res = audmath.db(audmath.rms(signal))
    tsum += time.time() - tx
    return res

db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audinterface.Feature(
            ["rms"],
            process_func=process_func,
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        tsum = 0.
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s, "
              f"processing time: {tsum:.2f} s")

multiprocessing=False, num_workers=1: 0.87 s, processing time: 0.06 s
multiprocessing=False, num_workers=5: 0.47 s, processing time: 0.60 s
multiprocessing=True, num_workers=1: 0.40 s, processing time: 0.05 s
multiprocessing=True, num_workers=5: 0.39 s, processing time: 0.00 s

The figure in the last row (multiprocessing with 5 workers) is of course not correct with this method, since the global tsum is then updated in the worker processes and never reaches the parent. But for the runs with one worker we see that only a small part of the execution time is spent in process_func, so the differences are likely mainly due to overhead.
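
One way to sanity-check this independently of the interface is to time the bare processing loop itself. A rough sketch (assuming emodb is already cached and db.files holds absolute paths, which is audb's default):

import time

import audb
import audiofile
import audmath

db = audb.load("emodb", version="1.4.1")
t0 = time.time()
for file in db.files:
    # read each file and compute the feature sequentially,
    # without any interface or worker-pool overhead
    signal, sampling_rate = audiofile.read(file)
    audmath.db(audmath.rms(signal))
print(f"sequential read + process_func: {time.time() - t0:.2f} s")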

@hagenw
Member Author

hagenw commented Apr 15, 2024

I repeated the measurement with opensmile:

import audb
import opensmile
import time

db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]: 
        interface = opensmile.Smile(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")

and there it makes no difference whether we use multi-processing or not:

multiprocessing=False, num_workers=1: 20.27 s                                                       
multiprocessing=False, num_workers=5: 6.29 s
multiprocessing=True, num_workers=1: 20.32 s
multiprocessing=True, num_workers=5: 6.54 s

But when testing with another feature extractor:

import audb
import audmld
import time

db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]: 
        interface = audmld.Mld(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")

there is indeed a difference:

multiprocessing=False, num_workers=1: 118.00 s                                                      
multiprocessing=False, num_workers=5: 189.54 s
multiprocessing=True, num_workers=1: 106.39 s
multiprocessing=True, num_workers=5: 46.43 s

So I guess this indicates that we made some (wrong?) choice in its implementation, with the result that it only scales when using multiprocessing?
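
A plausible explanation (an assumption, not verified against the audmld code): opensmile does its feature extraction inside a C++ extension that releases the GIL, so multi-threading scales fine, whereas a process func that spends its time in pure Python keeps the GIL and only benefits from multi-processing. A minimal sketch of the two cases with audinterface (gil_releasing and gil_holding are made-up example functions):

import time

import audb
import audinterface
import numpy as np

def gil_releasing(signal, sampling_rate):
    # heavy work inside numpy releases the GIL for most of the time
    x = np.tile(signal.astype(np.float64), 50)
    return float(np.sqrt(np.mean(x * x)))

def gil_holding(signal, sampling_rate):
    # pure-Python loop keeps the GIL the whole time
    total = 0.0
    for value in signal.flatten().tolist()[:200_000]:
        total += value * value
    return total

db = audb.load("emodb", version="1.4.1")
for process_func in [gil_releasing, gil_holding]:
    for multiprocessing in [False, True]:
        interface = audinterface.Feature(
            ["value"],
            process_func=process_func,
            num_workers=5,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{process_func.__name__}, {multiprocessing=}: {t:.2f} s")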
