Question

Difficulties to assess bulk RNA-seq results

0

Entering edit mode

3 days ago

charles.chapus • 0

I've recently switched to nf-core pipelines (easier to use on my offline workstation). The latest data I was asked to analyze is a bulk RNA-seq, obtained by using QIAseq UPXome RNA Library Kits on a NextSeq sequencer.

I've used the rnaseq nf-core pipeline for the analyses. I've got what I think is a lower number of counts than expected. From raw reads between 10 and 15 millions, I obtained between 2 and 6 millions protein coding counts.

So I decided to check seriously the multiQC files.

The SortMeRNA steps show that the ribodepletion didn't work as intended. I've around 15% or rRNA reads (up to 30%).

But I failed to assess two results:

1- RSeqC read distribution. For me there is too much intronic reads. UPXome kit is a 3' RNA-seq prep library kits. If I'm correct, we should have more than 50% of Exons (CDS+UTRs). I do not know if it's an issue. Does Salmon take into account the intronic reads in the counting ?

2- Dupradar plots. All the plots I obtained are quite the same. After reading the reference paper, it seems these plots indicated that the libraries (obtained from samples with low quantities) have low complexities. I'm not sure if I'm correct.

But that could be an issue. One proposed way to correct the low numbers of reads is to rerun the libraries to increase the numbers. But if the complexities of the libraries are too low, I'm afraid that the results will not be reliable.

Dupradar nf-core RNAseq Rseqc intron • 302 views

ADD COMMENT • link 14 hours ago by charles.chapus • 0

0

Entering edit mode

UPXome kit

This is a low-input (100 -500 pg RNA) RNAseq prep kit.

Have you checked the alignments using IGV? Do you see pileup of reads in intronic areas or are the alignments uniform across the genome. Since more or less all samples are showing this characteristic it is possible that you may have some DNA contamination in your libraries, which would show alignments spread across genome. It is possible that the libraries themselves are not of good quality.

ADD REPLY • link 3 days ago by GenoMax 147k

0

Entering edit mode

Thank you for your insights.

I've checked alignments with IGV. Here is an example. Reads are mapped outside the exons positions but within gene positions. I do not see lot of reads spread across the genome (I do not see a lot of reads at all but that another issue between me - the data analyst - and my colleagues at the lab).

enter image description here

I would say "no DNA contamination" but lot of unmature RNA.

ADD REPLY • link 15 hours ago by charles.chapus • 0

score 0 · Answer 1 · 2024-11-22

On the distribution of reads, yes, I'd say this on the edge of usable. Particularly for 3' end seq. As you say, for total RNA seq, I'd expect to see between 33%-50% of reads map to exons. Here its more like 20%-30%.

Both low input amounts and 3'-end seq reduce the complexity of libraries, so the low complexity is to be expected. Indeed, I don't think that dupradar output will be valid for 3'-end seq, as it probably assumes that reads can come from anywhere in the transcripts, and this assumption affects how it calculates things.

For Salmon, it won't take alignments to introns into account, only alignments to exons. Again, Salmon assumes that reads can come from anywhere in the (mature) transcript with equal probabilities, unless you run it with the --noLengthCorrection option.

However, are you sure this is 3' end seq? The UPXome kit webpage says that it can be used for either whole transcript or 3' end seq, and the distribution of reads you have does not suggest 3' end seq to me (when the majority of reads that do map to exons, should map to 3' UTRs).