
grnboost2 fails on large datasets #41

Open
dmalzl opened this issue Oct 1, 2024 · 0 comments

dmalzl commented Oct 1, 2024

Hi,

I am currently trying to use grnboost2 to infer GRNs from a dataset of around 120k cells, but I am unable to get it to run due to a hard limit imposed by a dependency of the dask distributed package (see here). In brief, dask has a hard limit of 4GB on the size of the dataset (data chunk) it can serialise. Anything above this results in the following error:

distributed.protocol.core - CRITICAL - Failed to Serialize
ValueError: bytes object is too large

To circumvent this, the developers suggest moving data generation into a separate task so that the workers generate their own data locally. So a workaround would be to allow passing paths to the data files and to move the read into the worker, so that only a couple of strings have to be serialised instead of the whole dataset.

I know this requires a bit more thought, especially to figure out the best strategy for doing it (e.g. a system that creates the data chunks ahead of time, writes them to files, and then lets the workers read them back in), but it might be worthwhile in order to support larger datasets. A rough sketch of the pattern is below.
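For illustration, here is a minimal sketch of the pattern the dask developers recommend: only file paths are shipped to the workers, and each worker reads its own chunk locally. The chunk file names, `load_chunk`, and `infer_on_chunk` are hypothetical placeholders, not part of the grnboost2 API.

```python
import dask
import pandas as pd
from distributed import Client


@dask.delayed
def load_chunk(path):
    # Each worker reads its own chunk from disk, so only the path
    # string (not the data) is serialised by the scheduler.
    return pd.read_parquet(path)


@dask.delayed
def infer_on_chunk(chunk):
    # Placeholder for the per-chunk work; the actual GRN inference
    # step would go here.
    return chunk.shape


if __name__ == "__main__":
    client = Client()
    # Hypothetical pre-written chunk files.
    chunk_paths = ["chunk_000.parquet", "chunk_001.parquet"]
    results = dask.compute(*(infer_on_chunk(load_chunk(p)) for p in chunk_paths))
    print(results)
```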
