How much of an issue are duplicate reads in RNA-seq experiments?
Whether the library is sequenced single-end or paired-end (PE), does read duplication affect a large proportion of the data? What are duplicate rates like for a typical human sample? Are they on the order of 1%, or 10%?
I have seen it mentioned in another question here on Biostars, and I would like to get a feeling for how important this is in the field:
The "average duplication rate" is not useful in this context (or probably any context). It varies, depending on your amount of genetic material, amplification protocol, and sequencing methodology; furthermore, even for a supposedly fixed protocol, it still varies wildly and can easily exceed 1000% in some experiments. Duplicates should never be removed in any quantitative experiment, such as RNA-seq. Also, as much as possible, amplification should be avoided in quantitative experiments. If there is no amplification, duplicate removal should not be performed.
"Duplicates should never be removed in any quantitative experiment, such as RNA-seq." Mmm... I think that's debatable. In my experience, removing duplicates from most pull-down or enrichment experiments (e.g. ChIP-Seq, FAIRE-Seq, etc.) gives better signal to noise. On the other hand, enrichment experiments are expected to generate duplicates. But RNA-Seq definitely should not have duplicates removed.
EDIT: My apologies, this should have been a comment to Brian's answer. Clicked the wrong button!
Most protocols for ChIP-Seq and related methods (including the ENCODE SOPs), as well as variant-calling procedures, strongly recommend removing potential duplicates. I noticed the GATK RNA-Seq protocol recommends duplicate removal as well. So it truly depends on the procedure, but I agree that in these cases removal/marking helps improve signal to noise.
Interesting. I disagree on theoretical grounds with removing duplicates in anything quantitative - meaning the number of reads mapped to a locus is the ultimate output (or a linear function of the ultimate output), as in RNA-seq. Variant-calling is not quantitative. I'm only somewhat familiar with ChIP-Seq, so I don't know whether the size or shape of the peaks is more important. But, with high enough coverage, duplicate removal will destroy both the size and shape of your peaks, so it should not be done. With low coverage... it shouldn't be necessary, but might be useful if you used very high amplification.
That said - if you amplify to the point that duplicate-removal improves your experimental results in a quantitative experiment, I would say that your entire experiment has already been compromised.
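The saturation argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: single-end reads, one strand, start positions drawn uniformly along a transcript, and no PCR amplification at all, so every "duplicate" it finds is a false positive. The function name and the transcript and read lengths are made up for illustration.

```python
import random

def simulate_counts(true_fragments, transcript_len, read_len=50, seed=0):
    """Return (raw_count, dedup_count) for single-end reads on one transcript.

    Assumptions (illustrative only): uniform start positions, one strand,
    and no amplification, so every read is an independent fragment.
    """
    rng = random.Random(seed)
    n_starts = transcript_len - read_len + 1  # number of possible start positions
    starts = [rng.randrange(n_starts) for _ in range(true_fragments)]
    raw = len(starts)
    # Position-based deduplication keeps at most one read per start position.
    dedup = len(set(starts))
    return raw, dedup

if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        raw, dedup = simulate_counts(n, transcript_len=1_000)
        print(f"fragments={n:>7}  raw={raw:>7}  after dedup={dedup:>4}")
```

With a 1,000 bp transcript and 50 bp reads there are only 951 possible start positions, so the deduplicated count can never exceed 951 no matter how highly expressed the transcript is. The raw count keeps tracking expression, while the deduplicated count flattens out.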
In general we mark duplicates (i.e. do not remove them), and only for data from WGS/exome experiments or from analyses where amplification artifacts might be a problem (ChIP-Seq, for example). I believe some folks also do this for some single-cell analyses where amplification may be used.
Keep in mind that the methods used to detect duplicates (such as Picard) actually identify potential PCR duplicates from alignment position and CIGAR string, so the chance of false-positive 'duplicates' goes up quite dramatically for high-coverage data (as seen in some regions with RNA-Seq). I recall reading elsewhere (possibly on SEQanswers) that true PCR duplicates, assessed using random barcoding, are actually quite a bit rarer than these methods predict, particularly if amplification is kept to a minimum.
Also, IIRC optical duplicates are now noted and removed during the run for newer versions of Illumina's pipeline, so these aren't as commonly detected as they had been with older versions of the CASAVA pipeline.
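To make the difference between position-based marking and random barcoding concrete, here is a minimal sketch. It is not how Picard or any other tool is implemented: the `Read` tuple, the `umi` field and the `mark_duplicates` function are hypothetical, and real tools are more careful about coordinates and about which read they keep as the representative. The only real-world detail relied on is that the SAM flag bit 0x400 means "PCR or optical duplicate".

```python
from collections import namedtuple

DUP_FLAG = 0x400  # SAM flag bit: read is a PCR or optical duplicate

# Hypothetical minimal read record; 'umi' stands in for a random barcode.
Read = namedtuple("Read", "name chrom pos strand umi flag")

def mark_duplicates(reads, use_umi=False):
    """Flag (not drop) every read after the first one seen with the same key.

    The key is (chrom, pos, strand), optionally extended with the UMI.
    """
    seen = set()
    marked = []
    for r in reads:
        key = (r.chrom, r.pos, r.strand) + ((r.umi,) if use_umi else ())
        if key in seen:
            r = r._replace(flag=r.flag | DUP_FLAG)  # mark, keep the read
        else:
            seen.add(key)
        marked.append(r)
    return marked

reads = [
    Read("r1", "chr1", 100, "+", "AAGT", 0),
    Read("r2", "chr1", 100, "+", "AAGT", 0),  # same molecule amplified: true duplicate
    Read("r3", "chr1", 100, "+", "CCTA", 0),  # different molecule, same start position
    Read("r4", "chr1", 250, "-", "GGAC", 0),
]

by_pos = mark_duplicates(reads)
by_umi = mark_duplicates(reads, use_umi=True)
print([r.name for r in by_pos if r.flag & DUP_FLAG])  # position only
print([r.name for r in by_umi if r.flag & DUP_FLAG])  # position + barcode
```

In this toy input, position-only marking flags both r2 and r3, while the barcode-aware key flags only r2. Because reads are flagged and kept, downstream tools can still choose whether to count them.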
Maybe it's very late to come back to this topic, but I was searching for answers, and after reading here I found a very recent article in Nature Methods about PCR duplicates. It says duplicates are artefacts that should be removed, and suggests software such as Picard's MarkDuplicates or samtools rmdup to do this easily. Hope this information is still useful... ;)
Good find. However, the paper mostly discusses single-cell RNA-seq.
Regarding normal bulk RNA-seq, they say:
In the Illumina labs, the team also experimented with purposefully
generating a large number of PCR duplicates. The team compared data
from unique reads and duplicates—'good' and 'bad' data. “Essentially—I
was even a little surprised by this—you couldn't really tell the
difference; the good and the bad data were identical,” says Schroth.
This experimental outcome reinforced his notion that under certain
conditions, such as typical RNA-seq assays, PCR duplicates are not
problematic.
Very interesting paper, thanks for pointing this out!