Parallel processing locks when the OOM killer comes for a worker #290

Open
mdekstrand opened this issue Dec 15, 2021 · 1 comment
Labels
internals Internal infrastructure (parallelism, math, etc.)


@mdekstrand
Member

When LensKit runs work in parallel (e.g. `batch.recommend`) and the OOM killer kills a worker process, the parent LensKit process sometimes hangs instead of terminating.

We should detect this case and abort the entire evaluation if the pool breaks down.
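A minimal sketch of that abort behavior follows. It uses `concurrent.futures.ProcessPoolExecutor` directly rather than LensKit's own parallel layer. The names `run_or_abort`, `WorkerPoolBroken`, `square`, and `crash_on_three` are hypothetical. The sketch assumes Linux with the `fork` start method. When a worker dies abruptly, the executor fails the pending futures with `BrokenProcessPool`, and the wrapper converts that into a single error that stops the run.

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool


class WorkerPoolBroken(RuntimeError):
    """Hypothetical error signalling that the parallel evaluation must stop."""


def run_or_abort(func, items, n_jobs=2):
    """Apply ``func`` to each item in a process pool; abort if a worker dies.

    Sketch only: uses the Unix "fork" start method and assumes ``func``
    and every item are picklable. Results come back in input order.
    """
    items = list(items)
    results = [None] * len(items)
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=n_jobs, mp_context=ctx) as pool:
        try:
            futures = {pool.submit(func, item): i for i, item in enumerate(items)}
            for fut in as_completed(futures):
                results[futures[fut]] = fut.result()
        except BrokenProcessPool as exc:
            # A worker exited abruptly (e.g. SIGKILL from the OOM killer).
            # The executor fails every pending future, so stop the whole
            # run with one explicit error.
            raise WorkerPoolBroken("a worker process died; aborting evaluation") from exc
    return results


def square(x):
    return x * x


def crash_on_three(x):
    """Stand-in for an OOM kill: the worker SIGKILLs itself when given 3."""
    if x == 3:
        os.kill(os.getpid(), signal.SIGKILL)
    return x


if __name__ == "__main__":
    print(run_or_abort(square, range(5)))  # [0, 1, 4, 9, 16]
    try:
        run_or_abort(crash_on_three, [1, 2, 3])
    except WorkerPoolBroken as err:
        print("aborted:", err)
```

Catching `BrokenProcessPool` in one place makes the failure mode explicit. The caller gets an exception that names the cause instead of a partial result set.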

@mdekstrand mdekstrand added the internals Internal infrastructure (parallelism, math, etc.) label Dec 15, 2021
@mdekstrand
Member Author

I tried to reproduce this with worker processes that kill themselves with `os.kill(os.getpid(), 9)` (SIGKILL), and the parent process terminated correctly.
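The comment does not include the script that was used. The following is a hypothetical reconstruction of that experiment with a watchdog timeout added. It again assumes Linux, the `fork` start method, and Python 3.9+ (for `cancel_futures`). `task` and `parent_terminates` are illustrative names. Only the last task kills its own worker with signal 9, and the parent reports a hang if any futures are still pending after `timeout` seconds.

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor, wait
from concurrent.futures.process import BrokenProcessPool


def task(x, kill_self=False):
    """Hypothetical work item; optionally SIGKILLs its own worker (signal 9)."""
    if kill_self:
        os.kill(os.getpid(), signal.SIGKILL)
    return x * 2


def parent_terminates(n_tasks=8, timeout=30.0):
    """Return True if the parent sees the dead worker, False if it appears hung."""
    ctx = multiprocessing.get_context("fork")  # Linux-only assumption
    pool = ProcessPoolExecutor(max_workers=2, mp_context=ctx)
    # Only the last task kills its worker, so every submit() happens first.
    futures = [pool.submit(task, i, i == n_tasks - 1) for i in range(n_tasks)]
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
    if not_done:
        # Still waiting after the timeout: treat this as the reported hang.
        # (A genuinely hung pool may also keep the interpreter from exiting.)
        return False
    return any(isinstance(f.exception(), BrokenProcessPool) for f in done)


if __name__ == "__main__":
    print("parent terminated" if parent_terminates() else "parent hung")
```

If the pool detects the dead worker, the affected futures finish with `BrokenProcessPool` and `parent_terminates()` returns `True`. A `False` result would correspond to the hang described in this issue.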

OOM-induced deadlocks in Python multiprocessing appear to be among the bugs fixed in `concurrent.futures.ProcessPoolExecutor` in Python 3.7 and newer, but we saw this hang on Python 3.8.
