
pyscenic grn in singularity: workers continuously restarting after memory warnings #54

Closed
kverstae opened this issue Feb 27, 2019 · 13 comments
Labels
bug Something isn't working

Comments

@kverstae

kverstae commented Feb 27, 2019

Hi,
I am trying to use pySCENIC in a Singularity container (version 0.9.5), but I keep getting warnings that eventually result in errors and workers continuously restarting (see below).
The system I get these errors on has 22 GB of RAM, and I used 4 workers.
When I run the exact same thing (same Singularity image, same inputs, same command) on a machine with more RAM (64 GB), everything works perfectly.

During the warnings, I checked the memory usage and it never went over 14 GB.
Is there something I can do to make this also work on the machine with 22 GB of RAM, or is this problem caused by the 'limited' amount of RAM available?

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 25440 was killed by unknown signal
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.nanny - WARNING - Restarting worker
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Worker is at 95% memory usage. Pausing worker.  Process memory: 978.70 MB -- Worker memory limit: 1.03 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 732.21 MB -- Worker memory limit: 1.03 GB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36289
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 181, in read
    n_frames = yield stream.read_bytes(8)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 1607, in gather_dep
    who=self.address)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 2671, in get_data_from_worker
    max_connections=max_connections)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 457, in send_recv
    response = yield comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 201, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 130, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36289
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 181, in read
    n_frames = yield stream.read_bytes(8)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 1607, in gather_dep
    who=self.address)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 2671, in get_data_from_worker
    max_connections=max_connections)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 457, in send_recv
    response = yield comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 201, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 130, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed

This keeps going until I manually stop the process.

@ghuls
Member

ghuls commented Feb 27, 2019

pySCENIC is quite memory-hungry. Try running with fewer threads on your 22 GiB RAM machine.

The warning comes from Dask:
distributed.nanny - WARNING - Worker process 25440 was killed by unknown signal

If you run dmesg, you will probably see that the OOM killer killed one of the Python processes (look for 'Out of memory:' lines).
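For example (assuming you can read the kernel log on the host and that your kernel's message wording matches), something like this should show which process the OOM killer terminated:

> dmesg -T | grep -iE 'out of memory|killed process'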

@kverstae
Author

kverstae commented Feb 27, 2019

Even with only 1 thread, I get the following output:

[kevinve@cp0583 scenic]$ singularity exec pySCENIC_095.sif pyscenic grn --num_workers 1 -o expr_mat.adjacencies.tsv agg1.loom allTFs.txt
2019-02-27 13:31:06,266 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.
2019-02-27 13:31:12,938 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
preparing dask client
parsing input
/opt/conda/lib/python3.6/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  expression_matrix = expression_data.as_matrix()
creating dask graph
1 partitions
computing dask graph
/opt/conda/lib/python3.6/site-packages/distributed/worker.py:2791: UserWarning: Large object of size 1.04 MB detected in task graph: 
  (["('from-delayed-c986605b75c672a32a9441417475c9d8 ... d8', 17803)"],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s))
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 30863 was killed by unknown signal
not shutting down client, client was created externally
finished
distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 1488, in _gather
    exception = st.exception
AttributeError: exception

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/pyscenic", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/pyscenic/cli/pyscenic.py", line 402, in main
    args.func(args)
  File "/opt/conda/lib/python3.6/site-packages/pyscenic/cli/pyscenic.py", line 67, in find_adjacencies_command
    network = method(expression_data=ex_mtx, tf_names=tf_names, verbose=True, client_or_address=client)
  File "/opt/conda/lib/python3.6/site-packages/arboreto/algo.py", line 41, in grnboost2
    early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)
  File "/opt/conda/lib/python3.6/site-packages/arboreto/algo.py", line 135, in diy
    .compute(graph, sync=True) \
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 2505, in compute
    result = self.gather(futures)
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 1656, in gather
    asynchronous=asynchronous)
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 676, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 1493, in _gather
    None)
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
concurrent.futures._base.CancelledError: finalize-72c5478fec9086e20724b1d7df3e49d4
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-2, started daemon)>
distributed.nanny - WARNING - Worker process 30905 was killed by unknown signal

@bramvds added the 'bug' label on Mar 8, 2019
@Liang-Wu-01

Liang-Wu-01 commented Nov 1, 2019

Hello, I got the same warning when running the pyscenic grn function (version 0.9.17):
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.42 GB -- Worker memory limit: 4.23 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.14 GB -- Worker memory limit: 4.23 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.11 GB -- Worker memory limit: 4.23 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.42 GB -- Worker memory limit: 4.23 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.14 GB -- Worker memory limit: 4.23 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.11 GB -- Worker memory limit: 4.23 GB

@bramvds
Contributor

bramvds commented Nov 4, 2019

Dear Liang,

pySCENIC is known to be quite memory-hungry. Does the process hang, or are you able to get results from the grn step?

Kindest regards.
Bram

@fbrundu

fbrundu commented Mar 30, 2021

Hi all,

I have a similar problem running pySCENIC through Docker on a 64 GB machine (on AWS).

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.scheduler - ERROR - Couldn't gather keys {'finalize-71d032c1a1d0bee1f19ab68b246122c4': ['tcp://127.0.0.1:33991']} state: ['waiting'] workers: ['tcp://127.0.0.1:33991']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:33991'], finalize-71d032c1a1d0bee1f19ab68b246122c4
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'finalize-71d032c1a1d0bee1f19ab68b246122c4': ('tcp://127.0.0.1:33991',)}
distributed.nanny - WARNING - Restarting worker

I checked the memory usage, but it seems to stay at around 25% at most. I am using 8 workers on an 8-core machine.

> docker stats
CONTAINER ID        NAME                 CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
70eeaf3c3c0d        quirky_ardinghelli   798.28%             22.75GiB / 62.04GiB   36.67%              14.6kB / 0B         0B / 0B             207

The process has been running for several days on a large count matrix (55k cells, 20k genes). I used pySCENIC's sparse-matrix support to reduce memory usage.

  • How can I fine-tune the number of workers and the memory limit for each worker to avoid worker rescheduling?
  • When a worker is restarted, is the computation resumed from where it left off, or is the task started from scratch?

Thanks,
Francesco

@cflerin
Contributor

cflerin commented Mar 30, 2021

Hi @fbrundu ,

What version of pySCENIC are you using? I would recommend updating to 0.11.0 if you haven't already.

If you have problems with the Dask implementation, I would suggest trying the multiprocessing script (see here). From your logs, it looks like it might help, unless updating to 0.11.0 fixes this.
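Roughly, the invocation looks like this, run inside the same image (a sketch only: the script is named arboreto_with_multiprocessing.py in recent pySCENIC versions, the mounted paths and filenames here are placeholders, and you should check its --help inside your container to confirm the flags):

> docker run -it --rm -v /path/to/data:/data aertslab/pyscenic:0.11.0 \
    arboreto_with_multiprocessing.py \
        /data/expr_mat.loom \
        /data/allTFs.txt \
        --method grnboost2 \
        --output /data/expr_mat.adjacencies.tsv \
        --num_workers 8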

But I think the larger problem is that running a 55k x 20k expression matrix is simply going to take a very long time with 8 threads. It's hard to say exactly, but I would guess something like 3-6 days. You can check the Dask dashboard for some kind of progress indicator (see here for an example, though I haven't tried to connect to a running instance in Docker; it might be tricky to access the correct port, usually 8787). If you use the multiprocessing script, it will print the progress right to the terminal.
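If you do end up (re)starting the container, one way to avoid the port issue is to publish 8787 when the container is created, for example (paths and filenames are placeholders):

> docker run -it --rm -p 8787:8787 -v /path/to/data:/data aertslab/pyscenic:0.11.0 \
    pyscenic grn --num_workers 8 \
        -o /data/expr_mat.adjacencies.tsv \
        /data/expr_mat.loom /data/allTFs.txt

You can then browse to http://<host>:8787 while the grn step runs (on AWS you would still need to open that port in the security group, or tunnel it over ssh).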

@fbrundu

fbrundu commented Mar 30, 2021

Thanks @cflerin.

I got the pySCENIC Docker image, version 0.11.0 (from https://hub.docker.com/r/aertslab/pyscenic).

No ports seem to be exposed by the Docker container (if I am reading it correctly).

> docker ps
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS               NAMES
70eeaf3c3c0d        aertslab/pyscenic:0.11.0   "pyscenic grn --num_…"   4 days ago          Up 4 days                               quirky_ardinghelli

I also checked with netstat on the AWS instance with similar results.

I'll wait for the 10-day mark, I think, then try the multiprocessing implementation.

Thanks

@fbrundu

fbrundu commented Apr 6, 2021

Hi @cflerin

I was able to connect to the Dask dashboard at the 10-day mark:

[Screenshot: Dask dashboard task-progress bars]

How should I read this plot? From the Dask docs, I think it indicates that progress is, on average, 1/2 - 2/3 of the way through, but I am not sure which of the tasks takes the most time to execute. Am I correct? Should I expect 15-20 days of runtime?

Thanks

Update

I think the bottleneck might be the "infer_partial_network-from-delayed" task, if I understood the dashboard results correctly. The tasks before it ("csc_matrix" and "from-delayed") have many results in memory (darker color), while the infer step seems to be the limiting one (right now only 10 results are in memory, with 9393 / 21668 completed overall). At that rate (10 days x 21668 / 9393), a more appropriate estimate could be ~23 days of runtime.

Addendum

For anyone trying to access the Dask dashboard in a Docker container and/or on a running AWS instance: this is a bit convoluted, so I am adding some notes on how I managed to do it. This is just a quick workaround, by no means the best or most secure solution, so use it at your own risk.

First, get the IP address of the Docker container from within the AWS instance.

# get name of running container
> docker ps
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS               NAMES
70eeaf3c3c0d        aertslab/pyscenic:0.11.0   "pyscenic grn --num_…"   10 days ago         Up 10 days                              quirky_ardinghelli
# get IP address (within the AWS instance)
> docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' quirky_ardinghelli
172.17.0.2

Check the port of the web interface inside the container (usually 8787), and install/start the ssh daemon:

> docker exec -it quirky_ardinghelli /bin/bash
# within container from now on (root user)
> apt update
> apt install lsof ssh
> service ssh start
> lsof -i -P -n
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
pyscenic    1 root   15u  IPv4 173772      0t0  TCP *:8787 (LISTEN)
pyscenic    1 root   16u  IPv6 173773      0t0  TCP *:8787 (LISTEN)
...

Allow root login to the Docker container over ssh (one way to script these steps is sketched after the list):

  • Install a text editor within the docker container
  • Edit /etc/ssh/sshd_config, changing the 'PermitRootLogin' option to PermitRootLogin yes
  • Change the root password with passwd inside the Docker container
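For example, inside the container (a quick-and-dirty sketch; using sed avoids the need for a text editor, and if the PermitRootLogin line is missing from the file, just append it manually):

> sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
> passwd                # set a root password interactively
> service ssh restart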

In a separate shell on the AWS instance, map the instance's port 8787 to the Docker container's port 8787:

> ssh -L 8787:172.17.0.2:8787 root@172.17.0.2

Change the security group in the AWS console to allow connections from your local IP to the instance on port 8787 (Inbound rules).

In a local shell, map local port 8787 to the AWS instance's port 8787 (which is already mapped to the Docker container's port 8787):

# Assumptions (change accordingly):
# - <key>.pem is the key you use to log into the AWS instance
# - <user> is the user you use to log into the AWS instance
ssh -i <key>.pem -L 8787:127.0.0.1:8787 <user>@ec2-<IP_instance>.compute-1.amazonaws.com

Now you can access the dashboard at your local address 127.0.0.1:8787.

@cflerin
Contributor

cflerin commented Apr 7, 2021

Hi @fbrundu ,

Thanks for the detailed updates on running this in AWS and on the port mapping! It seems like enabling root access here could be a security issue, but in your use case on AWS it's probably fine -- perhaps I should expose port 8787 in the Docker container going forward.

On the progress plot, I think you should go by the from-delayed progress, as this is the main function processing the data (it in turn calls infer_partial_network for each TF-gene combination). I'm surprised by the 20+ day runtime estimate. But it is a large matrix...

@fbrundu

fbrundu commented Apr 8, 2021

Hi @cflerin,

In the end, the process ran for 13 days (there was probably some latency at the beginning).
However, after the process completed, the output adjacency matrix was an empty file, and the process restarted.
I am not sure what the problem was, since I didn't get any other warnings.
I'll probably use the multiprocessing script...

@fbrundu

fbrundu commented Apr 10, 2021

A few additional considerations:

I reran the grn process on a bigger machine (16 cores and 128 GB vs. 8 cores and 64 GB), BUT I still gave pySCENIC only 8 cores. I also removed the --sparse flag. The process completed correctly in less than 2 days (compared with the previous ~13 days).

It is possible that the sparse configuration doesn't work properly in some circumstances (and that runtime increases non-linearly with the number of genes). Alternatively, the procedure may require more cores than are declared when instantiating the Docker container (so some additional free cores should be available for it to run correctly).

Also, I think the worker-restart warnings may be a good reason to stop the process early, since in my experience they will probably cause the entire procedure to fail eventually.

@cflerin
Contributor

cflerin commented Apr 15, 2021

Hi @fbrundu ,

Sorry to hear that your run failed after 13 days! That's super frustrating... It's strange that there was no error, but I'm glad to hear that you finally got it finished. Thanks for the feedback on the sparse usage as well. I'm not sure exactly why this is the case, but it's helpful to know at least.

cflerin added a commit that referenced this issue Apr 15, 2021
- Expose 8787 for dask dashboard (#54)
- Update scanpy
@cflerin
Contributor

cflerin commented Apr 15, 2021

I'm going to close this since the original issue was from an older version, and I think this is a problem with Arboreto specifically (aertslab/arboreto/issues/15).

@cflerin closed this as completed on Apr 15, 2021