Should We Remove Duplicated Reads In Rna-Seq ?
5
18
Entering edit mode
12.2 years ago

Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?

rna-seq read • 57k views
ADD COMMENT
0
Entering edit mode

how do you define a duplicated read? If you delete duplicate reads, how will you get information on expression levels?

ADD REPLY
4
Entering edit mode

Duplicated read is not the same a duplicated transcript. The level of natural read duplications will be a function of coverage and expression levels. And since most of the time the latter is not known we can only estimate the natural duplication rate.

ADD REPLY
0
Entering edit mode

Reads which are exactly the same and on the same strand. They get aligned at the same place and looks like multiple copies of one read. So, I assume the answer is NO?

ADD REPLY
37
Entering edit mode
12.2 years ago

The general consensus seems to be to NOT remove duplicates from RNA-seq. Read the biostar discussions:

and this seqanswers thread and the other threads it links to.

ADD COMMENT
5
Entering edit mode

The general consensus I see from those discussions, rather than "no", is "it depends". The first link: Malachi Griffith writes: "Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes." Second link: Istvan Albert writes: "My personal opinion is to investigate the duplication rates and remove them if there is indication that these are artificial ones." 3rd link: Ketil writes: "I think the number of duplicates depend on many factors, so it is hard to give any general and useful rules of thumb. Usually, duplicates are correlated with too little sample material, and/or difficulties in the lab." And the seqanswers thread goes both ways.

ADD REPLY
1
Entering edit mode

But are type I/II errors from duplication rates expression/quartile dependent?

Coefficient of Variation is often the lowest in the most abundant genes, which are also the genes with the highest probability of duplicates (optical, amplification, or true duplicates) by virtue of the amount of data they consume. Variance is often artificially inflated for these genes in the end anyways in most regularization protocols for differential expression, so the conclusions from DE analyses should be largely the same... In contrast, genes with low expression stand to be the most affected by duplicates for this exact same reason (i.e. their variance is artificially reduced by regularization AND duplicates are a larger proportion of their count totals to begin with, thus increasing type I error). The odds of these arguments affecting opinions is low, but I think it's a conversation worth having.

I would recommend re-running differential expression analyses with/without deduplication and then comparing the resulting gene sets for each quartile of expression. My educated guess is that conclusions for the top quartile would be fairly consistent (do to the low CoV) while results could be quite different in genes with low expression where a larger percentage of the counts may be optical or technical duplicates (in comparison to genes in the top quartile).

Edit: rephrasing and grammar

ADD REPLY
0
Entering edit mode

This is an interesting comment and I'd like to understand it better. Are you using the word "regularization" in the statistical sense, for example as in "Tikhonov regularization"? If so, can you describe the statistical methods you have in mind? Also, how do you know that there are more duplicates per read for low-expressed genes?

ADD REPLY
0
Entering edit mode

Yes. I should have said that the general consensus seems to be to not blindly remove duplicates.

ADD REPLY
7
Entering edit mode
12.2 years ago
VS ▴ 740

Short answer - NO!

You are checking for differential expression of genes, number of reads mapping to a gene is a measure of its expression. By removing reads, you will bias the true expression measurements.

ADD COMMENT
3
Entering edit mode

I don't agree at all. By not removing duplicates, you bias the expression measurements towards PCR artifacts, especially from paired-end data.

ADD REPLY
3
Entering edit mode

Well I have not worked with paired-end data so far, can't say about them. But I produced 'tag' files to track the number of copies of 'duplicate reads' for single-end seq. data. The very highly expressed genes tend to have more of such 'duplicate' reads. We also did validations with qPCR that confirm the higher expression. So I don't think you can say duplicate reads result purely from PCR amplification artifacts.

ADD REPLY
0
Entering edit mode

OK, that's an interesting finding actually. Sure, duplicate reads are not always from PCR amplication artifacts, but sometimes they are! Probably you should ideally go on a case-by-case.

ADD REPLY
0
Entering edit mode

But highly expressed genes will have higher number of reads in general - I would be very interested in ratios (dupl/all) for highly/medium/low expressed genes if you could provide this information please.

ADD REPLY
7
Entering edit mode
12.2 years ago

I really depends. I have seen alignments with so many same reads at a certain position (5 and 3 prime exactly matching), that it were for sure PCR duplicates (on other positions in the alignment there was no such coverage). However, you might loose information about strongly expressed genes this way. I would check for un/equal coverage for every gene/contig.

ADD COMMENT
6
Entering edit mode
12.2 years ago

If you have paired-end reads, I definitely think you should remove duplicates (alignments that start at the same locations at both read 1 and read 2). These are very unlikely to occur by chance because of the variation in fragment size.

If you have a small amount of RNA going into the experiment, you will have run a lot of PCR cycles before sequencing and the representation of some fragments will have become very biased. Duplicate removal is a way to mitigate this effect although it will not solve it.

It continues to baffle me why people keep saying that you have to keep duplicates because you will lose information - but apparently it's perfectly fine to get grossly distorted read counts because of amplification artifacts! There is no way to avoid bias completely.

ADD COMMENT
7
Entering edit mode

The problem is that your first premise that "These are very unlikely to occur by chance because of the variation in fragment size" is true for DNA but NOT necessarily true for RNA. If you have a short RNA (i.e., small transcript, miRNA, etc) that is very highly expressed there might be many, many legitimate duplicate copies of that RNA with exactly the same fragment size/position. The size of that transcript might even be such that effectively no fragmentation occurs. Also fragmentation may not occur in a totally random fashion across the transcript. By removing those duplicates you may be introducing a new source of "grossly distorted read counts". As, you say, there is no way to avoid bias completely. But, depending on the complexity and quality of your library you may be better off living with a small amount of amplification artifacts.

ADD REPLY
0
Entering edit mode

OK, but for miRNA etc you wouldn't do paired-end sequencing anyway; what I had in mind was the typical mRNA case. Anyway, your point is noted, and again underlines that one has to think about the biases, and not just blindly follow a rule, such as "always remove duplicates" or "never remove duplicates".

ADD REPLY
5
Entering edit mode
11.0 years ago
earonesty ▴ 250

The EXPRESS pipeline (berkeley) does this for you. It calculates a coverage distribution and then removes the "spikes" from that distribtion. Because duplicates are expected in RNA Seq, removing them "blindly" is a bad idea. You should be smoothing, not removing.

ADD COMMENT
0
Entering edit mode

How much of an issue is read duplicates in RNA seq experiments, both single fragment and PE sequenced? Is it affecting a large proportion of the data? What are the duplicate rates like for an average human sample experiment? Is it in the order of 1% duplicates? Or 10% duplicates?

ADD REPLY
0
Entering edit mode

The rate of natural duplication of the reads coming from a transcript will be a function of the length of the transcript, the expression levels of the transcript and the overall coverage of the sample. Hence there is no rule that you could use a priori to separate natural duplicates (that should be kept) from artificial duplicates (that should be removed)

ADD REPLY
0
Entering edit mode
I have seen genome-wide duplication rates up to 50% in our own data (50bp SE) and think this is nothing unusual.
ADD REPLY

Login before adding your answer.

Traffic: 1661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6