Hi Biostars,
Is it legitimate to sum up raw read counts from technical replicates of RNA-seq and use these summed counts for DE analysis? I would appreciate detailed and justified answers.
Thanks
Technical reproducibility in RNA-seq is considered to be excellent (provided that the same kit/lab/... is used). So yes, technical replicates can be combined. I think the best stage to do this is at the fastq or bam level, although I can't think of any problems with just adding the read counts.
My 2 cents: no, it's very bad practice to do that.
Detailed explanation:
Technical replicates are different RNA-seq runs performed on the same sample; they have to be averaged. You are taking a snapshot of the transcriptional profile of your sample, and you are sequencing it twice to avoid biases and to reduce the chance of falling for sequencing errors.
The same can be said for biological replicates, but in that case you are sequencing more than one sample, to prevent batch effects and sample preparation errors from biasing your downstream analysis.
There are two equally respectable opinions, the ones we just brought up. I gave you a detailed explanation for my opinion, and I am truly convinced of what I say. But I am eager to hear the reasons why you would add them; perhaps I can change my mind if provided with enough evidence (ain't that what science is about?)
I guess it also depends on when you make technical replicates:
Are those the same cells from which you isolate twice, or the same RNA from which you created two libraries, or the same library sequenced in two runs, or even the same library sequenced in multiple lanes on the same sequencer?
Wouldn't adding up and averaging result in the same thing, provided that you normalize for total read count before doing the rest of your analysis? You just get bigger numbers, as if you had sequenced deeper.
But the amount of sequenced reads has to be connected to the amount of transcripts that you have in your sample, which ultimately defines the expression concept. To me, adding technical replicates logically means taking the transcripts of the same sample twice. I totally get your point, I'm more on a philosophical / theoretical dimension now.
Agree, and I think what matters is not the sequencing depth itself or the numbers we get, but rather the proportionality between them, since in the end what we compare are relative expression values, which are basically proportions. I think it would not be wrong to sum up read counts that come from more or less similar library sizes (you catch lowly expressed genes, which are most of the genes, in the same proportions), but you would violate proportionality if you sum up very different library sizes.
But calculating the relative expression values of the average of two replicates is the same as calculating the relative expression values of the sum of two replicates...
In the hypothetical situation that one replicate is sequenced twice as deep as the other, we have the following raw read counts:

        rep1  rep2
geneA     2     4
geneB     3     6
geneC     5    10
total    10    20

Averaging means:

geneA   3
geneB   4.5
geneC   7.5
total  15

which as proportions gives:

geneA   0.2
geneB   0.3
geneC   0.5

Summing means:

geneA   6
geneB   9
geneC  15
total  30

which gives the same proportions:

geneA   0.2
geneB   0.3
geneC   0.5
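The arithmetic above can be checked in a few lines. This is just a toy sketch of the example numbers, not a real pipeline; the gene names and counts are the ones made up in this thread:

```python
# Toy counts from the example above: rep2 was sequenced twice as deep as rep1.
rep1 = {"geneA": 2, "geneB": 3, "geneC": 5}
rep2 = {"geneA": 4, "geneB": 6, "geneC": 10}

def proportions(counts):
    """Normalize raw counts to fractions of the library total."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Averaging the replicates vs. summing them
avg = {g: (rep1[g] + rep2[g]) / 2 for g in rep1}
summed = {g: rep1[g] + rep2[g] for g in rep1}

print(proportions(avg))     # geneA 0.2, geneB 0.3, geneC 0.5
print(proportions(summed))  # identical proportions
```

After library-size normalization, averaging and summing are indistinguishable, which is exactly the point being made here.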
In your example, gene counts and totals are exactly proportional to each other. In a real experiment that's not always the case: imagine you did a DE analysis with 10 million reads and got 100 DE genes. If the proportionality were always preserved, then in theory the same analysis with 20 million reads would give you exactly the same 100 genes, and it wouldn't make sense to generate more read depth :) This is of course specific to genome size; for human, for example, you can have a look at this paper https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt688 — after a certain depth we don't see many changes, but with lower depth you can skew proportions because of lowly expressed genes, and I think this is where the problem comes from. Please correct me if I am wrong, I would be glad to discuss.
You are very right - what my example didn't include is variability due to sampling (a Poisson distribution).
Note that sequencing deeper will give you a better estimate of the true proportions of a transcript in the total mixture (since the variance of a Poisson distribution depends on the mean).
Indeed, sequencing deeper will influence the differential expression analysis - exactly because your abundance estimate is getting more accurate with increased depth.
Technical replicates in my case would be the same library sequenced on the same machine but on different days. For our experiment we need to reach a depth of 20 million reads for a given sample (biological replicate). For one of the samples we reached only 10 million. And now we need to decide whether to make a new run of 10 million reads (this would be a technical replicate) and add it to the existing data, or make a new run of 20 million reads and use only that data.
For our experiment we need to reach a depth of 20 million reads for a given sample
I think, if the funding allows it, making a 20 million reads run is better. When in a year you'll be submitting the paper, the reviewers could easily ask you 1) why you summed the read counts, and 2) why you sequenced 10M twice instead of 20M once. I think it's better to be safe than sorry :)
The new comment in this discussion just made me realize that we had this thread ~10 months ago, and now @grant and I are sitting next to each other in the same lab. XD
Science: it's a small world!
Apart from that, maybe it would be interesting to tell the readers what, in the end, you did with this experiment (I also don't know!) :)
Hehehe, indeed very small :) That particular experiment was quite complex, and in the first round of sequencing we tried to sequence as deep as possible. But eventually we still had to resequence some samples using the same libraries, so we didn't avoid technical replicates. However, in the end, instead of the planned 20 million reads we got around 30-35 million in total and decided to use all those technical replicates :) On the data analysis side, I first ensured that the technical replicates were similar by calculating correlations and making PCA plots (and they actually were very similar). Then I summed up the counts and performed differential expression analysis with DESeq2 :)
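The "check agreement, then pool" step described above can be sketched as follows. The count vectors and the 0.99 correlation cutoff are made-up illustrations, not values from the actual experiment, and a real analysis would use DESeq2's own tooling rather than hand-rolled code:

```python
import math

# Hypothetical raw counts for five genes, one sample sequenced in two runs
run1 = [120, 35, 0, 410, 88]
run2 = [131, 30, 1, 395, 90]

def pearson(x, y):
    """Pearson correlation between two count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(run1, run2)
print(f"correlation between runs: {r:.4f}")

# Only pool the runs if they agree well (threshold is an arbitrary example)
if r > 0.99:
    pooled = [a + b for a, b in zip(run1, run2)]
    print("pooled counts:", pooled)
```

DESeq2 itself provides a `collapseReplicates` helper for exactly this pooling step, which is the safer route in practice.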
What are you losing by doing that?
Sequencing the same sample twice is not the same as sequencing it once but twice as deep. In my experience, the rare transcripts are only going to pop out in the second case.
Sequencing the same sample twice is not the same as sequencing it once but twice as deep. In my experience, the rare transcripts are only going to pop out in the second case.
That doesn't make sense to me, because you are sampling from a distribution of molecules, and the likelihood of "catching" low-abundance transcripts is directly proportional to the total amount of sampling you perform, whether you sample 20M molecules once or 10M molecules twice.
It's all probabilistic. Let's say that for every single read there is a 1-in-10-million chance that it comes from the gene we're talking about. Then the likelihood of catching at least one read of the gene is 1 - (9,999,999/10,000,000)^5,000,000 ≈ 0.3935 for 5M reads, 1 - (9,999,999/10,000,000)^10,000,000 ≈ 0.6321 for 10M reads, and 1 - (9,999,999/10,000,000)^20,000,000 ≈ 0.8647 for 20M reads. I think Wouter was thinking this way, and at first I tended to agree with him.
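Those three numbers are easy to reproduce; this is just the formula from the post wrapped in a function, with the same 1-in-10-million per-read probability:

```python
def p_at_least_one_read(p_per_read, depth):
    """Probability of seeing a transcript at least once, given the
    per-read hit probability and the total number of reads."""
    return 1 - (1 - p_per_read) ** depth

p = 1 / 10_000_000  # one hit per 10 million reads, as in the example
for depth in (5_000_000, 10_000_000, 20_000_000):
    print(f"{depth:>10} reads -> {p_at_least_one_read(p, depth):.7f}")
# 5M  -> ~0.3934694
# 10M -> ~0.6321206
# 20M -> ~0.8646647
```

Note that only the total depth matters in this model: two 10M runs and one 20M run give the same probability, which is what the rest of the thread debates.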
But, this kind of probabilistic maths assumes that the probabilities don't change as a function of time during the sequencing, which might be false depending on the technology or the library construction method. I'm not sure how this might be modeled or conceived on the level of cells, molecules and ultimately the technology, but there might be some Monty Hall-problemesque weirdness going on. Maybe an exhaustion phenomenon with the sample in question. Anyway, I'm just a beginning bioinformatician so this was an interesting discussion to follow.
Hi, that's a great thread on this topic, thanks! Too bad I didn't find it when I had this question :) Can you move your reply to the answers so I can accept it?
It has been moved
great, cheers!