Difficulties to assess bulk RNA-seq results
1
0
Entering edit mode
4 weeks ago

I've recently switched to nf-core pipelines (easier to use on my offline workstation). The latest data I was asked to analyze is a bulk RNA-seq, obtained by using QIAseq UPXome RNA Library Kits on a NextSeq sequencer.

I've used the rnaseq nf-core pipeline for the analyses. I've got what I think is a lower number of counts than expected. From raw reads between 10 and 15 millions, I obtained between 2 and 6 millions protein coding counts.

So I decided to check seriously the multiQC files.

The SortMeRNA steps show that the ribodepletion didn't work as intended. I've around 15% or rRNA reads (up to 30%).

But I failed to assess two results:

1- RSeqC read distribution. RSEQC read distribution For me there is too much intronic reads. UPXome kit is a 3' RNA-seq prep library kits. If I'm correct, we should have more than 50% of Exons (CDS+UTRs). I do not know if it's an issue. Does Salmon take into account the intronic reads in the counting ?

2- Dupradar plots. All the plots I obtained are quite the same. Dupradar plot After reading the reference paper, it seems these plots indicated that the libraries (obtained from samples with low quantities) have low complexities. I'm not sure if I'm correct.

But that could be an issue. One proposed way to correct the low numbers of reads is to rerun the libraries to increase the numbers. But if the complexities of the libraries are too low, I'm afraid that the results will not be reliable.

Dupradar nf-core RNAseq Rseqc intron • 459 views
ADD COMMENT
0
Entering edit mode

UPXome kit

This is a low-input (100 -500 pg RNA) RNAseq prep kit.

Have you checked the alignments using IGV? Do you see pileup of reads in intronic areas or are the alignments uniform across the genome. Since more or less all samples are showing this characteristic it is possible that you may have some DNA contamination in your libraries, which would show alignments spread across genome. It is possible that the libraries themselves are not of good quality.

ADD REPLY
1
Entering edit mode

Thank you for your insights.

I've checked alignments with IGV. Here is an example. Reads are mapped outside the exons positions but within gene positions. I do not see lot of reads spread across the genome (I do not see a lot of reads at all but that another issue between me - the data analyst - and my colleagues at the lab).

enter image description here

I would say "no DNA contamination" but lot of unmature RNA.

ADD REPLY
0
Entering edit mode
4 weeks ago

On the distribution of reads, yes, I'd say this on the edge of usable. Particularly for 3' end seq. As you say, for total RNA seq, I'd expect to see between 33%-50% of reads map to exons. Here its more like 20%-30%.

Both low input amounts and 3'-end seq reduce the complexity of libraries, so the low complexity is to be expected. Indeed, I don't think that dupradar output will be valid for 3'-end seq, as it probably assumes that reads can come from anywhere in the transcripts, and this assumption affects how it calculates things.

For Salmon, it won't take alignments to introns into account, only alignments to exons. Again, Salmon assumes that reads can come from anywhere in the (mature) transcript with equal probabilities, unless you run it with the --noLengthCorrection option.

However, are you sure this is 3' end seq? The UPXome kit webpage says that it can be used for either whole transcript or 3' end seq, and the distribution of reads you have does not suggest 3' end seq to me (when the majority of reads that do map to exons, should map to 3' UTRs).

ADD COMMENT
0
Entering edit mode

Thank.

I've high doubts from the start that it's 3' end seq. I had a meeting with the PI this morning about that. She confirmed it, but I asked for some verifications. Indeed the Qualimap gene coverage shows an "uniform" distribution on the transcript.

Qualimap transcript coverage

So it should mean whole transcriptome. In this case, it will explain the low number of reads on exon (even it's still too low). But it also means that Dupradar plots show issues in my librairies.

And even more, the number of reads per sample for whole transcript should be higher. My 5-15 millions STAR-mapped reads are not enough. Sequencing again could solve this issue, but I don't have a lot of confidence in the librairies.

ADD REPLY
0
Entering edit mode

Low complexity is to be expected from a low input library. If you've got an issue with duplication rate, then doing more sequencing is unlikely to help. Generally if the libraries are low input, they are low input for a reason (there just isn't a lot of starting material to be had), and thus a low complexity library is unavoidable. If more lab work was going to be done, I'd suggest making more replicates, rather than trying to make these libraries better.

If you are interested purely in the expression of well charactorised, generally highly expressed, protein coding genes, then you will probably find that there is some signal to be extracted. The standard tools should be fairly robust to low counts, and what you'll find is that you don't get many significant genes, but that better than getting too many that you aren't confident in.

ADD REPLY

Login before adding your answer.

Traffic: 1856 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6