
Run time very long for large data set #2

Open
nvpatin opened this issue Mar 29, 2024 · 4 comments
@nvpatin (Owner) commented Mar 29, 2024

I tested a larger 16S/18S data set, and even the first PCA replicate was taking a very long time to run (it hadn't finished after about an hour). The original 16S data set was 7K observations of 131 variables and the 18S was 10K observations, while the new 16S set was about 67K observations of 473 variables (25K observations for 18S).

Is there any way to add a multithreading option at least for the PCA calculations?

@EricArcher (Collaborator)

For clarity: the PCA itself took over an hour to run?
The base PCA function (prcomp()) doesn't have multithreading capabilities. I'm not certain there are parallelizable components of a PCA. If this is really an issue, we may have to find another solution like subsampling for the PCA.
How was the data set larger? More ASVs, samples, or both?
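If subsampling does become necessary, one version of the idea is to fit `prcomp()` on a random subset of rows and then project the full matrix onto those components. A sketch only; `asv_mat` and the 10K row cap are hypothetical placeholders, not part of the package:

```r
# Hypothetical sketch: fit the PCA on a random subsample of rows,
# then project every observation onto the resulting components.
set.seed(42)
n_sub <- min(nrow(asv_mat), 10000)        # cap the rows fed to prcomp()
sub_rows <- sample(nrow(asv_mat), n_sub)
pca <- prcomp(asv_mat[sub_rows, ], center = TRUE, scale. = FALSE)

# Scores for all rows, including those not used to fit the PCA:
scores_all <- scale(asv_mat, center = pca$center, scale = FALSE) %*% pca$rotation
```

Whether the subsample preserves enough of the variance structure would need checking against the full-data result on a smaller set.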

@nvpatin (Owner, Author) commented Mar 29, 2024

I tried again and took a closer look; it seems like it's the Bayesian model that is actually taking a long time, not the PCA itself (see attached screenshot).
[Screenshot 2024-03-29 at 1:40:43 PM]

The larger data sets had more of both: 67K ASVs in 473 samples for 16S (vs 7K ASVs in 131 samples) and 25K ASVs vs 10K ASVs for 18S.

@EricArcher (Collaborator)

Phew! That's expected.
I'm working on a different project, parts of which have Bayesian model components similar to this one's. Both use Bernoulli (0/1) switches, which I couldn't find a way to code in Stan (which is considerably faster). Luckily, I've found someone who knows how to do it, and he'll be walking me through it soon. Once I learn how, I'll recode this in Stan as well, and it should make a big difference.
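For context, the standard workaround for discrete parameters in Stan is to marginalize the Bernoulli switch out of the likelihood, since Stan's sampler cannot handle discrete parameters directly. A hypothetical fragment illustrating the pattern (`theta`, `mu0`, `mu1`, `sigma`, and `y` are placeholders, not this package's actual parameters):

```stan
// Stan cannot sample a discrete switch z, so it is summed out:
//   p(y) = theta * p(y | z = 1) + (1 - theta) * p(y | z = 0)
for (n in 1:N) {
  target += log_mix(theta,
                    normal_lpdf(y[n] | mu1, sigma),   // z = 1 component
                    normal_lpdf(y[n] | mu0, sigma));  // z = 0 component
}
```

`log_mix()` computes the log of the two-component mixture on the log scale, which keeps the marginalized likelihood numerically stable.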

@nvpatin (Owner, Author) commented Mar 29, 2024

Sounds good! I'll leave this issue open for now.
