Question

Transcript level quantification and gene length scaling for rRNA depleted libraries

1

Entering edit mode

3 months ago

marc.zimmerli ▴ 10

Hello everyone

I'm trying to do differential gene expression analysis based on paired-end RNA sequencing data generated using a protocol that includes rRNA depletion during library prep. The tools I used are the nextflow nf-core/rnaseq pipeline for (pseudo) gene count quantification based on the raw reads (using STAR-salmon for gene quantification) and DESeq2 for the differential expression analysis. When Importing my gene counts for DESeq I use tximport with 'countsFromAbundance = "lengthScaledTPM"' to account for potential differences in gene lengths.

I then noticed that a lot of my top differentially expressed genes seem to be only detected as such specifically BECAUSE of differences in gene lenghts. As in: the raw gene counts are often similar across samples, but the gene lengths (and thus 'lengthScaledTPM') differ by a lot - sometimes by a factor of 10 or more. When looking at the transcript level data I found that these differences in gene length usually stem from differences in the composition of transcripts contributing to the gene count rather than differences in transcript length. All of that would be fine If I can assume that these differences in splice variant usage are genuine - I think that's the whole point of the length scaling(?).

However, since I'm using rRNA depleted and not poly-A enriched RNA that means I would also expect to capture a significant amount of nascent RNA which has not yet been spliced into a mature transcript. This is also how I interpret a rather high percentage of reads being mapped to intronic regions according to the MultiQC report (around 50% for most samples).

What I am now doubtful about is if and how these nascent transcripts throw off the transcript level quantification. If I understand correctly, the pseudoquantification tries to model what combination of mature transcripts is most likely to produce the the observed reads/alignments or something along those lines? So how would this work If i have RNA molecules in the mix with exon combinations not found in any 'valid' mature transcript?

Is this a problem at all and if so how would I account for this? Or am I simply misunderstanding how one of the tools I'm using works and this is a non-issue.

END OF QUESTION - rest of text is just an example

Example: When ranked by p-value the top differentially expressed gene between my groups of interest happens to be ENSG00000241743 which has one somewhat 'suspicious' annotated transcript which is 350 kilobases long and the rest of the transcript variants are between ca. 200-2000 bases long. Now for most of my samples ONLY this super long transcript is ever detected at counts of ca. 0-100 and for some few samples a single other variant is reported with a pseudocount of ca. 1.The gene level count in these second group of samples is completely different post length scaling because the short 200 base transcript of course counts ca. 1750 times as much as the 350'000 base long variant. Which again would be fine if I can be confident that this difference is genuine.

However, what I'm worried about is that in reality the reads stem from some nascent RNAs in the process of being spliced with an actual length somewhere between 350000 (basically the entire genomic region over which the exons of the gene are streched out) and 200. I assume all RNA fragments containing ANY bases not found in the short mature transcripts would automatically be mapped to the all encompassing super long one and be subject to super restrictive length scaling.

Of course for genes with less 'suspicious' annotations the problem would also be way less pronounced but I think the general process of mapping nascent RNAs to annotations of mature transcripts would pose similar problems in general, no?

Thanks in advance for any input

RNASeq • 581 views

ADD COMMENT • link updated 3 months ago by dsull ★ 7.6k • written 3 months ago by marc.zimmerli ▴ 10

0

Entering edit mode

Interesting observation. Can you add plots to illustrate this? For example for this gene, the salmon output per samples, the tximport results for it, raw and normalized counts the the average tx length estimates?

ADD REPLY • link 3 months ago by ATpoint 88k

0

Entering edit mode

Haven’t had time to think about this in its entirety but lncRNAs are poorly annotated (in part because their biology is tricky) and TPM length-correction is known to not be entirely accurate either. Introns are not part of the standard transcriptome annotation, so it’s only the exonic mappings that are considered in quantification. But yes, I’m pretty sure you’re seeing an artifact for that gene.

ADD REPLY • link 3 months ago by dsull ★ 7.6k