I've recently switched to nf-core pipelines (easier to use on my offline workstation). The latest dataset I was asked to analyze is bulk RNA-seq, prepared with the QIAseq UPXome RNA Library Kit and sequenced on a NextSeq.
I've used the nf-core rnaseq pipeline for the analyses. I got what I think is a lower number of counts than expected.
Starting from 10 to 15 million raw reads, I obtained only 2 to 6 million protein-coding counts.
So I decided to take a serious look at the MultiQC reports.
The SortMeRNA step shows that the ribodepletion didn't work as intended: around 15% of the reads are rRNA (up to 30% in some samples).
But there are two results I am unable to interpret:
1- RSeQC read distribution. To me there are too many intronic reads. The UPXome kit is a 3' RNA-seq library prep kit. If I'm correct, more than 50% of the reads should fall in exons (CDS + UTRs). I do not know whether this is an issue. Does Salmon take intronic reads into account when counting?
2- dupRadar plots. All the plots I obtained look much the same. After reading the reference paper, it seems these plots indicate that the libraries (obtained from low-input samples) have low complexity. I'm not sure I'm reading them correctly.
But that could be an issue. One proposed way to make up for the low read numbers is to resequence the libraries to increase depth. But if the complexity of the libraries is too low, I'm afraid the results will not be reliable.
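For point 1, here is a minimal sketch of how the exonic share could be computed from the text output of RSeQC's read_distribution.py. It assumes the usual four-column table (Group, Total_bases, Tag_count, Tags/Kb). The function names and the numbers in EXAMPLE are hypothetical, made up for illustration; they are not taken from my data.

```python
# Groups that RSeQC reports as exonic (CDS + UTRs).
EXONIC_GROUPS = ("CDS_Exons", "5'UTR_Exons", "3'UTR_Exons")

# Hypothetical read_distribution.py output, for illustration only.
EXAMPLE = """\
Total Reads                   1000000
Total Tags                    1200000
Total Assigned Tags           1000000
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           36000000            200000              5.56
5'UTR_Exons         5000000             20000               4.00
3'UTR_Exons         25000000            180000              7.20
Introns             1400000000          500000              0.36
=====================================================================
"""


def parse_read_distribution(text):
    """Return {group: tag_count} from the table rows of the report."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Table rows have exactly four whitespace-separated columns.
        if len(fields) != 4 or fields[0] == "Group":
            continue
        try:
            counts[fields[0]] = int(fields[2])
        except ValueError:
            # Header lines such as "Total Assigned Tags N" also split
            # into four fields but have no integer in column three.
            continue
    return counts


def exonic_fraction(counts):
    """Exonic tags divided by exonic + intronic tags."""
    exonic = sum(counts.get(group, 0) for group in EXONIC_GROUPS)
    intronic = counts.get("Introns", 0)
    total = exonic + intronic
    return exonic / total if total else float("nan")


if __name__ == "__main__":
    fraction = exonic_fraction(parse_read_distribution(EXAMPLE))
    print(f"Exonic share of exonic + intronic tags: {fraction:.1%}")
```

The denominator deliberately ignores the TSS/TES flanking groups, so the value is only the exon versus intron split, not the share of all assigned tags.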
This is a low-input (100-500 pg RNA) RNA-seq prep kit.
Have you checked the alignments using IGV? Do you see pileups of reads in intronic regions, or are the alignments spread uniformly across the genome? Since more or less all samples show this characteristic, it is possible that you have some DNA contamination in your libraries, which would show up as alignments spread across the genome. It is also possible that the libraries themselves are not of good quality.
Thank you for your insights.
I've checked the alignments with IGV. Here is an example. Reads map outside exon positions but within gene boundaries. I do not see a lot of reads spread across the genome (I do not see a lot of reads at all, but that's another issue between me, the data analyst, and my colleagues at the lab).
I would say "no DNA contamination" but a lot of immature (unspliced) RNA.