I am doing differential gene expression analysis by next-generation sequencing. NGS generates read duplicates, and several programs are available to remove them. I suspect that removing these duplicates may affect the final results, since a large dynamic range is an advantage of NGS over microarrays. Reports on this in the literature are rare, so I am asking for help from anyone who knows this topic well. Thanks
written 12.5 years ago by fanx; updated 2.9 years ago by Ram
I'd refer you to this reply from seqanswers by lh3. In short, for DGE analysis I wouldn't remove PCR duplicates: there is no way of knowing whether a read is a PCR duplicate or simply one of several fragments that happened to be identical. Of course, paired-end sequencing helps resolve this to a certain extent, since it is even more unlikely for two independent fragments to have identical start and end positions.
However, most of the pipelines constructed so far remove duplicates for SNP calling, not for DGE, and I think leaving them in for DGE is the way to go. But then, I also understand this could be very subjective.
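To illustrate the paired-end point above, here is a minimal sketch (hypothetical read tuples and a made-up `duplicate_count` helper, not any pipeline's actual code) of why keying duplicates on the fragment start alone over-flags reads compared to keying on both start and end:

```python
from collections import Counter

# Hypothetical reads: (chrom, start, end, strand). With single-end data a
# duplicate is typically defined by (chrom, start, strand) only; with
# paired-end data the fragment end is also known, so the full key
# (chrom, start, end, strand) is far less likely to collide by chance.
reads = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),   # true PCR duplicate (same fragment twice)
    ("chr1", 100, 310, "+"),   # same start, different end: a distinct fragment
    ("chr1", 100, 275, "-"),
]

def duplicate_count(reads, paired_end):
    """Count reads that would be flagged as duplicates under each keying."""
    if paired_end:
        keys = [(c, s, e, st) for c, s, e, st in reads]
    else:
        keys = [(c, s, st) for c, s, e, st in reads]
    counts = Counter(keys)
    # every read beyond the first with the same key is flagged
    return sum(n - 1 for n in counts.values())

print(duplicate_count(reads, paired_end=False))  # over-flags the distinct fragment
print(duplicate_count(reads, paired_end=True))   # flags only the true duplicate
```

With single-end keying the distinct fragment ending at 310 is wrongly flagged; with paired-end keying only the genuinely repeated fragment is.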
Thanks Arun and Istvan Albert! Both paired-end information and the read distribution help solve this issue to some extent. However, for meta-sequencing without a reference, examining the read distribution is not possible. I have spoken with several people in the field and there is no clear answer. I think the final answer may depend on a series of model experiments, including the estimation of parameters such as coverage depth, initial template amount and many others.
Read duplication may be natural (the same DNA fragment occurs and is sequenced twice) or artificial (during the sequencing procedure a copy of the same read is created and sequenced).
Some approaches are more sensitive to read duplication than others. I have also noticed that samples coming from labs with less experience in NGS library preparation typically show very high read duplication rates (80% or more!). Perhaps this is due to starting with insufficient DNA that later needs to be amplified for the protocol.
My personal opinion is to investigate the duplication rates and remove duplicates if there is an indication that they are artificial (rates far above what a natural duplication level would be). That being said, very precise ChIP-Seq-type technologies (like ChIP-exo) can produce very high rates of natural duplicates, often indistinguishable from artificial ones.
Looking at the read distribution around high-duplication sites is a way to evaluate whether a location is naturally or artificially enriched. A natural site exhibits a smoother distribution, with roughly equal numbers of reads on both strands. An artificial site tends to show heavy strand imbalances, with most reads being exactly identical rather than forming a distribution around the site.
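As a toy illustration of those two signals (invented read tuples and a hypothetical `site_stats` helper, not a published method), one can quantify strand imbalance and how dominated a site is by one identical read:

```python
from collections import Counter

# Hypothetical reads at one genomic site: (start, strand). A "natural" site
# shows a spread of start positions on both strands; an "artificial" (PCR)
# pile-up is dominated by many copies of one identical read on one strand.
natural_site    = [(100, "+"), (102, "+"), (105, "-"), (98, "-"), (101, "+"), (104, "-")]
artificial_site = [(100, "+")] * 9 + [(103, "-")]

def site_stats(reads):
    """Return (strand imbalance in 0..1, fraction of reads equal to the
    single most common identical read). High values on both suggest an
    artificial pile-up rather than natural enrichment."""
    plus = sum(1 for _, strand in reads if strand == "+")
    imbalance = abs(plus - (len(reads) - plus)) / len(reads)
    top_fraction = Counter(reads).most_common(1)[0][1] / len(reads)
    return imbalance, top_fraction

print(site_stats(natural_site))     # low imbalance, low top fraction
print(site_stats(artificial_site))  # high imbalance, high top fraction
```

Real tools use richer models, but these two numbers capture the intuition in the comment above.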
Nice comments. Can you shed some more light on how you would go about distinguishing artificial duplicates from natural ones? From your comments I understood that 1) artificial duplicates will exist on only one strand and not the other. But I didn't get what you mean by "around the site". Is there a way to look into the neighbouring region to distinguish artificial from natural duplicates?
Hi, I'm studying the same problem while analyzing single-end data from Illumina sequencing. What is the experimental or computational reason why "artificial duplicates tend to show heavy imbalances by strand"?
I meant "site" as a location in the genome that could produce natural duplicates: for example, a binding site with a high level of occupancy in a ChIP-Seq experiment, or a short gene that is very highly expressed. For whole-genome sequencing via random DNA shearing there are simple formulas (those that describe coverage) to estimate the likelihood of high coverage occurring. The higher the coverage, the more likely you are to get natural duplicates.
Could you point me to some of the formulas you mentioned above? Today I was asked to determine whether the duplicates in some high-duplicate samples are artifacts, and your answer is very helpful!
Thanks a lot,
Anna
written 10.2 years ago by Anna S; updated 2.9 years ago by Ram
Look for the Lander-Waterman equation and you'll find the formula for the coverage distribution. That being said, it is usually far too optimistic and is only valid for random shotgun sequencing (not ChIP-Seq or RNA-Seq). Natural duplicates tend to ratchet up and down in smooth patterns, like steps on both sides of a highly covered region. Artificial duplicates shoot up as a huge tower at a single location.
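The Lander-Waterman model treats read start positions as Poisson-distributed along the genome. A back-of-the-envelope sketch (my own illustration under that assumption, not a validated estimator) of the expected natural-duplicate fraction under random shearing:

```python
import math

def expected_natural_duplicate_fraction(n_reads, genome_size):
    """Under the Lander-Waterman (Poisson) model of random shotgun
    sequencing, read starts fall uniformly over the genome. With
    lam = n_reads / genome_size expected starts per position, the expected
    number of *distinct* start positions is G * (1 - exp(-lam)); every read
    beyond the first at a position counts as a natural duplicate."""
    lam = n_reads / genome_size
    distinct_starts = genome_size * (1 - math.exp(-lam))
    return (n_reads - distinct_starts) / n_reads

# 30 million single-end reads on a 3 Gb genome: natural duplicates are rare
print(expected_natural_duplicate_fraction(30e6, 3e9))
# The same read count on a 3 Mb bacterial genome: many more natural collisions
print(expected_natural_duplicate_fraction(30e6, 3e6))
```

If the observed duplication rate in a sample is far above this baseline, the excess is a reasonable candidate for PCR artifacts; as noted above, this only applies to random shotgun data, not ChIP-Seq or RNA-Seq.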