Comment on the section: 3.2.4. Data distribution check #3

massonix · 2024-07-02T16:55:34Z

Hi,

Thanks for developing this wonderful analysis workflow.

I'm running the pipeline outlined in the notebook LME_Classification.ipynb. This section has the following two plots:

These plots show the distribution of mean expression across all genes. The notebook interprets them as follows:

In the plot on the right, the expression values start off very low and then rise before dropping down. This pattern suggests potential RNA degradation, which can compromise the reliability and accuracy of downstream analyses. In contrast, the distribution plot on the left shows good-quality gene expression data. Deviations from such distributions may indicate gene degradation, should be carefully investigated and, if necessary, corrected to ensure high-quality data.

This is what my distribution looks like:

However, I don't understand why this would be problematic. A common pre-processing step in any RNA-seq analysis is to exclude lowly expressed genes, which do not contain enough information for robust statistical analysis. This is the plot in my R markdown notebook where I choose the expression threshold for excluding genes:

which looks like the plot on the left. Thus, after filtering all I'm left with is highly expressed, reliable genes. What does that have to do with RNA degradation?
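
To make the filtering step concrete, here is a minimal Python sketch of what I mean. It is only an illustration: the function name `filter_low_expression`, the `min_mean` cutoff of 1.0, and the toy matrix are assumptions made up for this example, not values from my analysis or from the notebook.

```python
import numpy as np

def filter_low_expression(expr, min_mean=1.0):
    """Keep genes whose mean expression across samples is >= min_mean.

    expr: 2-D array, rows = genes, columns = samples (log-scale values).
    min_mean: hypothetical cutoff, chosen here only for illustration.
    Returns the filtered matrix and the boolean mask of kept genes.
    """
    gene_means = expr.mean(axis=1)   # one mean per gene
    keep = gene_means >= min_mean    # True for genes above the cutoff
    return expr[keep], keep

# Toy matrix: 4 genes x 3 samples (made-up numbers)
expr = np.array([
    [0.0, 0.1, 0.2],   # lowly expressed, mean 0.1
    [5.0, 6.0, 7.0],   # highly expressed, mean 6.0
    [0.5, 0.4, 0.6],   # lowly expressed, mean 0.5
    [3.0, 2.0, 4.0],   # highly expressed, mean 3.0
])

filtered, keep = filter_low_expression(expr, min_mean=1.0)
print(keep)            # mask of genes that pass the cutoff
print(filtered.shape)  # (genes kept, samples)
```

After a step like this, the low-expression part of the distribution is gone by construction, which is why I don't see how its shape before filtering matters.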

If you could explain this, it would be super useful.

Thanks!

Ramon
