I am new to learning about RNAseq analysis and am confused as to what exactly sequencing depth refers to. For example, if I need to calculate "how deep each sample was sequenced" does this refer to the total number of paired end reads that came out of the sequencer OR the number of paired end reads that mapped?
Depth is a term commonly used for genome or exome sequencing, where it means the number of reads covering each position. For RNA-seq that measure is largely meaningless, because the coverage pattern is so uneven due to differences in expression between genes.
More commonly, in RNA-seq the term "number of reads" is used, for example, 10 million reads or 100 million reads. If all goes well, a high percentage of the reads should align, ~90%. That makes the number of reads out of the sequencer not so different from the number of mapped reads. I would report the number of reads sequenced unless it's very different from the number aligned. In that case you should also figure out why the difference is so big.
Thank you for the clear explanation! I was wondering because I've got some data to play around with in which most of the samples have around a 70% alignment rate (not too surprising as the genome quality isn't very good). In this case, say I have 30 million total reads of which 70% mapped... in this case my number of reads would be 21 million?
Number of reads is still 30M out of which 70% mapped. Why the rest did not is something you could investigate. They could be rRNA, contamination or just plain who knows what (though that fraction should generally be very small).
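To make the arithmetic from the question explicit, here is a minimal Python sketch. The function name `mapped_reads` is made up for illustration; the 30M total and the 70% alignment rate come from the example above.

```python
def mapped_reads(total_reads: int, alignment_rate: float) -> int:
    """Number of reads that aligned, given the total and the alignment rate (0-1)."""
    return round(total_reads * alignment_rate)


total = 30_000_000   # reads that came off the sequencer
rate = 0.70          # fraction of reads that aligned

mapped = mapped_reads(total, rate)
unmapped = total - mapped

# The "number of reads" you report is still `total`;
# `mapped` and `unmapped` describe what happened to them afterwards.
print(f"sequenced: {total:,}  mapped: {mapped:,}  unmapped: {unmapped:,}")
```

So the sample was sequenced to 30M reads, of which 21M mapped and 9M did not.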
Right. I'm asking though because I'm playing around with different packages, mostly just to learn, and for one (CQN package) it specifically asks for a vector containing "... the sizeFactors which simply tells us how deep each sample was sequenced". So in my example, would I use 30M or 21M for this?
Those sizeFactors would only matter if there is a big difference between samples, say one sequenced to 20M reads and another to 80M reads. If the alignment fraction is similar for all samples, then again it doesn't really matter which of the two numbers you use.
Got it! My alignments range from around 69% to around 73%, so pretty similar? So for this specific example, I could set the sizeFactors to NULL and it would be fine? Thank you so much!!
Seems pretty similar indeed. I don't know about this particular package so I don't know about setting it to NULL, but setting the sizeFactor to the total number of reads (sequenced or aligned - whatever) might be the most accurate.
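A rough way to see why the choice barely matters here: compare each sample's library size relative to the mean, once using sequenced reads and once using aligned reads. This is only a sketch in Python with made-up sample values (three samples of 30M reads with alignment rates of 69%, 71% and 73%, matching the range mentioned above). The helper `relative_size_factors` is hypothetical and is not how CQN or any other package computes its size factors internally.

```python
def relative_size_factors(library_sizes):
    """Scale each library size by the mean library size (illustrative only)."""
    mean_size = sum(library_sizes) / len(library_sizes)
    return [size / mean_size for size in library_sizes]


# Assumed example data: equal sequencing depth, alignment rates of 69-73%.
sequenced = [30_000_000, 30_000_000, 30_000_000]
rates = [0.69, 0.71, 0.73]
aligned = [round(n * r) for n, r in zip(sequenced, rates)]

sf_sequenced = relative_size_factors(sequenced)
sf_aligned = relative_size_factors(aligned)

# Largest disagreement between the two choices, per sample.
max_diff = max(abs(a - b) for a, b in zip(sf_sequenced, sf_aligned))
print(f"largest difference between the two choices: {max_diff:.3f}")

# Contrast: samples sequenced to very different depths (20M vs 80M).
sf_unbalanced = relative_size_factors([20_000_000, 80_000_000])
print(sf_unbalanced)
```

With similar alignment rates the two sets of factors differ by under 3%, whereas the 20M vs 80M case gives factors that differ fourfold, which is the situation where accounting for depth really matters.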
Since they removed genes with 0 counts in all samples they are only considering those that have mapped reads. As @Wouter points out if the numbers are unbalanced across the dataset, then you would need to account for that.
I want to give you a very intuitive but maybe not very accurate explanation: you can imagine each base (A/G/C/T) as a grain of rice. Suppose you sequenced a bag of rice (say 100M bases) and then poured it into a very narrow but very long rice box with width = 1 grain of rice. The height of the rice in the box would be the sequencing depth. For WGS, the length of the box would be the length of the whole genome. For RNA-seq, people might only consider reads mapped onto exons, and take the length to be the total length of the union of exons.
"How deep each sample was sequenced" means "how many reads (or pairs)" were sequenced for each sample. Some results will vary with the number of sequenced reads (the sequencing depth), so it is an important piece of information to report.
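The rice-box picture boils down to one division: average depth = total sequenced bases / length of the region they are spread over. A small Python sketch, where the 100M bases come from the analogy and the 5 Mb box length, read count and read length are arbitrary assumptions chosen only to give round numbers:

```python
def total_bases(n_reads: int, read_length: int) -> int:
    """Total number of sequenced bases (the 'grains of rice')."""
    return n_reads * read_length


def mean_depth(bases: int, target_length: int) -> float:
    """Average depth: bases spread evenly over a target of the given length."""
    return bases / target_length


bases = total_bases(1_000_000, 100)   # assumed: 1M reads of 100 bp = 100M bases
box_length = 5_000_000                # assumed: a 5 Mb genome (or exon union)

depth = mean_depth(bases, box_length)
print(f"{bases:,} bases over {box_length:,} bp gives {depth:.0f}x average depth")
```

This is an average only. As noted earlier in the thread, RNA-seq coverage is very uneven across genes, so the real "rice height" varies a lot from position to position.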
Hi Carmacae,
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Cheers,
Wouter