Hello Biostars,
I have been dealing with an interesting issue regarding 9 RNA-seq green algae samples. These samples come from three conditions, each having 3 biological replicates. To remove rRNA from the samples, rRNA was depleted with varying success in order to keep chloroplast and mitochondrial transcripts.
When estimating the proportion of rRNA per sample, 4 seemed to have a high number of rRNA left over. Varying from 28.5% to 35.5%.
Estimated rRNA proportions per sample:
Chlo_auto_A 28.51%
Chlo_auto_B 9.54%
Chlo_auto_C 2.73%
Chlo_mixo_A 9.02%
Chlo_mixo_B 3.56%
Chlo_mixo_C 35.51%
Chlo_hetero_A 28.69%
Chlo_hetero_B 2.43%
Chlo_hetero_C 29.86%
Moreover, when performing DE very few genes come up being differentially expressed due to the high heterogeneity between the biological replicates. Because of this I created a PCA plot using CPM values. The samples cluster as expected, except auto_A, hetero_A, hetero_C and mixo_C. Which, prior to cleaning had a high rRNA proportion.
Has anyone experienced anything similar? Or have any idea of how I could use this data to perform a DE analysis?
Any and all advice would be appreciated!
It seems that all the samples that don't cluster together are easily explained by higher rRNA proportions. Could you remove all rRNA reads from the samples are rerun the PCA? How many reads are you left with after removing rRNA?
If there are still a reasonable number of reads, and samples cluster together without rRNA, you could still use these samples for DE. The simplest solution is just run your DE workflow, and don't include the rRNA in any downstream analyses.
That was my thinking as well. However, this PCA plot is done with the rRNA reads removed. Here are the numbers of reads left over once the rRNA reads were removed:
I did a DE analysis with this set as well and there are very few differentially expressed genes found. And the variance between the samples within the same condition is higher than between conditions
I guess that if the auto_A and hetero_B samples were exchanged, it would somewhat make sense.
Oh, then that is more strange than I had though if rRNA is already removed. But I agree, it does look like maybe two samples were swapped at some point, but given the clustering on the PCA, it's not so clear. You have plenty of reads left, so I can't think of any obvious issues related to that.
Have you looked at a presence/absence PCA?
I can't think of any other suggestions for now, but I'll write more if something comes to mind.
Can you check the number of genes with counts of zero per sample and correlate that with the rRNA percentage? It could be that the rRNAs consume a lot of reads, leading to dropouts for many genes in those that have high rRNA. Even after removal of the rRNA counts these samples might simply be undersequenced and that drives up variation. Correlate both the total readcount (with rRNA removed) and number of genes with 0 counts against rRNA.
Hi,
here's a table with the total read count, number of genes with a count of 0 and the proportion of rRNA reads:
There seems to be very few genes with count 0, in total there are about 12000 annotated genes. And it does not seem to correlate with the high rRNA proportion. And I believe that there are still enough reads even in the samples where many were removed.
So far I haven't performed presence/absence PCA on data like this.
Just another interesting observation, prior to cleaning the rRNA reads the correlation matrix between two conditions looked like this:
but when the rRNA reads were cleaned the mixo_C sample seemed to show less correlation to other samples of the same condition: