I am curious whether trimming of reads of RNA-Seq dataset affects the final outcome of gene expression or not? If so, then is there a reason behind it? Say, for example, low quality read trimming results in fewer number of genes. How is this justified?
this is indeed quite a debated topic. It will pay off to look into recent literature about this.
here is one to start already: Alignment and mapping methodology influence transcript abundance estimation
Interesting article. I know different alignment and mapping methods vary in terms of accuracy, but my question here is more about the trimming of reads, and how it affects the genes. Can I get your insight on that?
yes, you are right. I meant to include a different paper : Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level or bit older one: Trimming of sequence reads alters RNA-Seq gene expression estimates
that does not mean I agree with those publications though (they're just some pointers for further reading) and I personally would also live by the advise of ATpoint
I personally do not feel well giving data to tools that are contaminated with adapters. Easy to check with
fastqc
. I always trim if the adapter contamination curve it produces is not basically < 1%. Call it paranoid, but that is how I do it. Better save than sorry.I totally agree! Feeding raw data into tools for downstream analysis without trimming is not advisable, especially when the reads contain adapter sequences in it. I agree with you, and I too trim my fastq data based on per base sequence quality and adapter contamination, and then check the quality using FASTQC. But I was wondering how does trimming the low-quality reads affect the final gene list? What if trimming removes some of the genes of my interest?
I do not see how this would be possible. Trimming is done on reads. I cannot think of way how trimming would remove counts f a specific gene. I fear you are overthinking this entire trimming issue, and frankly, if you start worrying about these kinds of things like a certain standard procedure might introduce some very specific biases then you will have a hard time and constant headache. I would simply apply standard procedures and then later, and only if you have strong evidence that something is going completely wrong, question these procedures and start digging. If a gene that you expect is not showing up then there are probably many potential reasons, but adapter trimming is most likely none of it.
@ATpoint
Thanks for clarifying that! However, I have another question. I have performed Trimmomatic on my paired-end reads (SLIDING WINDOW 4:20) and I am confused with the output I have got. The number of transcripts/ transcripts IDs post-trimming are exactly the same as the number of transcripts pre-trimming. Trimming makes reads shorter, right? Correct me if I'm wrong, but doesn't trimming of reads result in less number of transcripts?
Also, I have done Trimmomatic with different parameters (SLIDING WINDOW 2:20, 10:20). But the results post-trimming are exactly the same for all! It's almost as if trimming didn't have any affect on the reads.
Can you please help me understand the reason behind this?
If there is nothing tro trim, then input=output.
Would that be the case even if I use different trimming options- Sliding window, MINLEN? I am trimming in the first place because I have some low quality reads in my library. That would mean trimmomatic removes those reads, and I interpreted that with FASTQC report post-trimming. Shouldn't it be reflected on the number of transcripts as well?
When you say transcripts... what do you mean exactly? After aligning the newly trimmed reads the number of alignments to transcripts is the same.
Yes I would say if any trimming occurred it would be very suspicious to see EXACTLY the same number of alignments. If you mean the total number of genes detected (e.g. genes with > 1 reads) is the same... then that doesn't seem too unusual. Your trimming should just remove low-quality reads... but any genes that are considered expressed should have more than just a handful of reads.
Why don't you check the logs and see how many reads were removed by the trimming? If it wasn't very many then this makes sense. Either way to make sure that you aren't making a mistake (e.g. using the old files) somewhere you should just check the actual COUNTS (read alignments mapped to features) because in that case they should differ (assuming at least 1 read was tossed).
Yes, the number of alignments of newly trimmed reads to the transcripts is the same. It gives the same number of alignments even after read trimming.
The total number of genes, however, is the same, which you say is not unusual. Trimmomatic hasn't produced a log file, so I cannot say how many number of reads are removed by trimming. But the number of read alignments are the same.
It is also good to keep in mind the overall alignment parameters you are using.
A common artifact that I notice in RNA-seq experiments are coverage spikes. Overtrimming or undertrimming could both contribute to the formation of these artifacts. However, I would agree with ATpoint that it is better to be safe than sorry thus I would always recommend being conservative and removing adapter sequences that you know are present.
And in fact if you read the paper someone linked above you would note that the alignment rates only change by less than 10% in their case study with simulated reads. I am a big believer in that if your results change drastically after a making small change to your analysis (e.g. trimming vs not) then the signal probably isn't as strong as you think it is.
I would be extremely skeptically of genes that were present in a DGE analysis but disappeared after you trimmed your known adapters. Unless something in the logs suggests that there was a problem...
My trimviz tool can trace the 'fate' of trimmed reads in your alignments, if that helps:
https://github.com/MonashBioinformaticsPlatform/trimviz
You can see the alignment fate of trimmed reads in trimviz FQ, by adding in the bam file and reference genome (in addition to the pre-trimmed and post-trimmed fastq files) and using 'diff' mode - it will then highlight differences between the trimmed part of the read and the genome flanking the trim point. If there are not many differences, then trimming to that extent was not necessary.