Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?
Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?
The general consensus seems to be to NOT remove duplicates from RNA-seq. Read the biostar discussions:
and this seqanswers thread and the other threads it links to.
The general consensus I see from those discussions, rather than "no", is "it depends". The first link: Malachi Griffith writes: "Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes." Second link: Istvan Albert writes: "My personal opinion is to investigate the duplication rates and remove them if there is indication that these are artificial ones." 3rd link: Ketil writes: "I think the number of duplicates depend on many factors, so it is hard to give any general and useful rules of thumb. Usually, duplicates are correlated with too little sample material, and/or difficulties in the lab." And the seqanswers thread goes both ways.
But are type I/II errors from duplication rates expression/quartile dependent?
Coefficient of Variation is often the lowest in the most abundant genes, which are also the genes with the highest probability of duplicates (optical, amplification, or true duplicates) by virtue of the amount of data they consume. Variance is often artificially inflated for these genes in the end anyways in most regularization protocols for differential expression, so the conclusions from DE analyses should be largely the same... In contrast, genes with low expression stand to be the most affected by duplicates for this exact same reason (i.e. their variance is artificially reduced by regularization AND duplicates are a larger proportion of their count totals to begin with, thus increasing type I error). The odds of these arguments affecting opinions is low, but I think it's a conversation worth having.
I would recommend re-running differential expression analyses with/without deduplication and then comparing the resulting gene sets for each quartile of expression. My educated guess is that conclusions for the top quartile would be fairly consistent (do to the low CoV) while results could be quite different in genes with low expression where a larger percentage of the counts may be optical or technical duplicates (in comparison to genes in the top quartile).
Edit: rephrasing and grammar
This is an interesting comment and I'd like to understand it better. Are you using the word "regularization" in the statistical sense, for example as in "Tikhonov regularization"? If so, can you describe the statistical methods you have in mind? Also, how do you know that there are more duplicates per read for low-expressed genes?
Short answer - NO!
You are checking for differential expression of genes, number of reads mapping to a gene is a measure of its expression. By removing reads, you will bias the true expression measurements.
Well I have not worked with paired-end data so far, can't say about them. But I produced 'tag' files to track the number of copies of 'duplicate reads' for single-end seq. data. The very highly expressed genes tend to have more of such 'duplicate' reads. We also did validations with qPCR that confirm the higher expression. So I don't think you can say duplicate reads result purely from PCR amplification artifacts.
I really depends. I have seen alignments with so many same reads at a certain position (5 and 3 prime exactly matching), that it were for sure PCR duplicates (on other positions in the alignment there was no such coverage). However, you might loose information about strongly expressed genes this way. I would check for un/equal coverage for every gene/contig.
If you have paired-end reads, I definitely think you should remove duplicates (alignments that start at the same locations at both read 1 and read 2). These are very unlikely to occur by chance because of the variation in fragment size.
If you have a small amount of RNA going into the experiment, you will have run a lot of PCR cycles before sequencing and the representation of some fragments will have become very biased. Duplicate removal is a way to mitigate this effect although it will not solve it.
It continues to baffle me why people keep saying that you have to keep duplicates because you will lose information - but apparently it's perfectly fine to get grossly distorted read counts because of amplification artifacts! There is no way to avoid bias completely.
The problem is that your first premise that "These are very unlikely to occur by chance because of the variation in fragment size" is true for DNA but NOT necessarily true for RNA. If you have a short RNA (i.e., small transcript, miRNA, etc) that is very highly expressed there might be many, many legitimate duplicate copies of that RNA with exactly the same fragment size/position. The size of that transcript might even be such that effectively no fragmentation occurs. Also fragmentation may not occur in a totally random fashion across the transcript. By removing those duplicates you may be introducing a new source of "grossly distorted read counts". As, you say, there is no way to avoid bias completely. But, depending on the complexity and quality of your library you may be better off living with a small amount of amplification artifacts.
OK, but for miRNA etc you wouldn't do paired-end sequencing anyway; what I had in mind was the typical mRNA case. Anyway, your point is noted, and again underlines that one has to think about the biases, and not just blindly follow a rule, such as "always remove duplicates" or "never remove duplicates".
The EXPRESS pipeline (berkeley) does this for you. It calculates a coverage distribution and then removes the "spikes" from that distribtion. Because duplicates are expected in RNA Seq, removing them "blindly" is a bad idea. You should be smoothing, not removing.
How much of an issue is read duplicates in RNA seq experiments, both single fragment and PE sequenced? Is it affecting a large proportion of the data? What are the duplicate rates like for an average human sample experiment? Is it in the order of 1% duplicates? Or 10% duplicates?
The rate of natural duplication of the reads coming from a transcript will be a function of the length of the transcript, the expression levels of the transcript and the overall coverage of the sample. Hence there is no rule that you could use a priori to separate natural duplicates (that should be kept) from artificial duplicates (that should be removed)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
how do you define a duplicated read? If you delete duplicate reads, how will you get information on expression levels?
Duplicated read is not the same a duplicated transcript. The level of natural read duplications will be a function of coverage and expression levels. And since most of the time the latter is not known we can only estimate the natural duplication rate.
Reads which are exactly the same and on the same strand. They get aligned at the same place and looks like multiple copies of one read. So, I assume the answer is NO?