
Run time very long for large data set #2

Open
nvpatin opened this issue Mar 29, 2024 · 4 comments
@nvpatin (Owner) commented Mar 29, 2024

I tested a larger 16S/18S data set, and even the first PCA replicate was taking a very long time to run (it hadn't finished after about an hour). The original 16S data set was 7K observations of 131 variables and the 18S was 10K observations, while the new 16S set was about 67K observations of 473 variables (25K observations for 18S).

Is there any way to add a multithreading option at least for the PCA calculations?

@EricArcher (Collaborator)

For clarity: the PCA itself took over an hour to run?
The base PCA function (prcomp()) doesn't have multithreading capabilities. I'm not certain there are parallelizable components of a PCA. If this is really an issue, we may have to find another solution like subsampling for the PCA.
How was the data set larger? More ASVs, samples, or both?
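If subsampling does become necessary, one version of the idea is to fit `prcomp()` on a random subset of rows and then project the full matrix onto those components. A sketch only; `asv_mat` and the 10K row cap are hypothetical placeholders, not part of the package:

```r
# Hypothetical sketch: fit the PCA on a random subsample of rows,
# then project every observation onto the resulting components.
set.seed(42)
n_sub <- min(nrow(asv_mat), 10000)        # cap the rows fed to prcomp()
sub_rows <- sample(nrow(asv_mat), n_sub)
pca <- prcomp(asv_mat[sub_rows, ], center = TRUE, scale. = FALSE)

# Scores for all rows, including those not used to fit the PCA:
scores_all <- scale(asv_mat, center = pca$center, scale = FALSE) %*% pca$rotation
```

Whether the subsample preserves enough of the variance structure would need checking against the full-data result on a smaller set.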

@nvpatin (Owner, Author) commented Mar 29, 2024

I tried again and took a closer look; it seems like it's the Bayesian model that is actually taking a long time, not the PCA itself (see attached screenshot).
[Screenshot 2024-03-29 at 1:40:43 PM]

The larger data sets had more of both: 67K ASVs in 473 samples for 16S (vs 7K ASVs in 131 samples) and 25K ASVs vs 10K ASVs for 18S.

@EricArcher (Collaborator)

Phew! That's expected.
I'm working on a different project, parts of which have Bayesian model components similar to this one's. Both use Bernoulli (0/1) switches, which I couldn't find a way to code in Stan (which is considerably faster). Luckily, I've found someone who knows how to do it, and he'll be walking me through it soon. Once I learn how, I'll recode this in Stan as well, and it should make a big difference.
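For context, the standard workaround for discrete parameters in Stan is to marginalize the Bernoulli switch out of the likelihood, since Stan's sampler cannot handle discrete parameters directly. A hypothetical fragment illustrating the pattern (`theta`, `mu0`, `mu1`, `sigma`, and `y` are placeholders, not this package's actual parameters):

```stan
// Stan cannot sample a discrete switch z, so it is summed out:
//   p(y) = theta * p(y | z = 1) + (1 - theta) * p(y | z = 0)
for (n in 1:N) {
  target += log_mix(theta,
                    normal_lpdf(y[n] | mu1, sigma),   // z = 1 component
                    normal_lpdf(y[n] | mu0, sigma));  // z = 0 component
}
```

`log_mix()` computes the log of the two-component mixture on the log scale, which keeps the marginalized likelihood numerically stable.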

@nvpatin (Owner, Author) commented Mar 29, 2024

Sounds good! I'll leave this issue open for now.
