Duplicates In The Context Of Deep Sequencing
12.0 years ago
KCC ★ 4.1k

Common advice in DNA-seq experiments is to remove duplicate reads, on the presumption that they are optical or PCR duplicates. However, when samples are sequenced deeply (more than 10X), it becomes entirely expected that some reads will be duplicated by chance. If we stick with the idea of throwing away duplicates, it effectively limits the sequencing depth to 1X.

In cases where there is a dramatic shift to higher GC relative to the input and a strongly skewed distribution, one clearly feels inclined to remove the duplicates.

However, in many of the deep-sequencing datasets I work with, I see very little shift to higher GC: the GC-content histograms are very nearly symmetric and quite close to the input.

In these cases, I often feel I should leave the duplicated reads in. On the other hand, for certain regions of the genome I see huge numbers of tags, with overlapping-tag counts in the tens of thousands, that do not seem to represent genuine biology.

What solutions are there for those of us who would like to use deep sequencing but want a principled way to filter out these clear artifacts?

sequencing duplicates

Read this post. You misunderstood the purpose of duplicate removal. As to GC bias, you can barely detect the bias by comparing the GC content of the genome with that of the reads. For Illumina, typical GC bias shows up as significantly lower coverage at extremely high or low GC.


@lh3: Sorry to be a bother. Can you specify which aspect I am misunderstanding?


Removing duplicates has nothing to do with GC content. Read that post first. EDIT: don't stop after the first few replies; finish the whole thread. There is quite a lot of information there from others and from me.


One of the early posts says "The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction". The first thing I say is "Common advice in DNA-seq experiments is to remove duplicate reads. These are presumed to be optical or PCR duplicates. "


@lh3: As to your comment about GC content: PCR duplicates cause a change in the distribution of GC content of tags. Look at this paper: http://nar.oxfordjournals.org/content/early/2012/02/08/nar.gks001.long, which states: "This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias". We can quibble about whether it is the most important cause, but it seems reasonable to consider it a contributor to the GC distribution.


Also, I do get that there are other reasons in play, such as increasing the sensitivity and specificity of peak calls; at least, that is what I take from this paper: http://www.nature.com/nmeth/journal/v9/n6/full/nmeth.1985.html


@lh3: Also, I just wanted to say I have actually read that thread before. Rereading it now, I'm thinking I probably learned about optical duplicates from you!

12.0 years ago
lh3 33k

You are talking about two largely unrelated issues: duplicates and GC bias. Removing duplicates has little to do with GC bias. For resequencing, we remove duplicates mainly to reduce recurrent errors. When the duplicate rate is high, duplicates lead to a higher false-positive rate for SNP calling and also for SV detection. Given ~30X paired-end data, duplicate removal barely affects non-duplicates (I showed the formula for computing that). For 10X single-end data, duplicate removal will only remove a small fraction of non-duplicates (the formula is even simpler).
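
For a rough sense of the scale in the single-end case, here is a minimal back-of-the-envelope sketch (a simple Poisson approximation, not necessarily the exact formula referred to above), assuming duplicates are defined by an identical 5' position and strand, uniformly distributed read starts, a human-sized genome and 100 bp reads:

L      <- 3e9                   # genome length (human-sized; an assumption for illustration)
rl     <- 100                   # read length (assumed)
depth  <- 10                    # sequencing depth
m      <- depth * L / rl        # number of single-end reads
lambda <- m / ( 2 * L )         # expected reads per (start position, strand) site
1 - ( 1 - exp( -lambda ) ) / lambda   # fraction of reads discarded as coincidental duplicates
# ~0.025, i.e. only a few percent of genuinely distinct reads would be lost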

Then there is Illumina GC bias. Library construction is a major cause of the bias, but it is not the only one. Sequencing machines also introduce bias, to a lesser extent. When the same library is sequenced on different machines/runs, you may sometimes observe GC differences. The typical way to see the GC bias is to compute the coverage in different GC bins. Because, for human, only a tiny fraction of the genome is in >80% or <20% GC, you can barely see the bias by comparing overall GC-content distributions -- they are dominated by the vast majority of normal-GC regions. Nonetheless, although there are few GC-extreme regions, the bias leads to false CNV calls, breaks de novo assemblies and causes false negatives. It also makes it much harder to sequence some small genomes with extreme GC. It would be good if there were no such bias.
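
As a small illustration of the binning idea (a made-up simulation, not real data): simulate per-window GC and a coverage signal that is depressed at GC extremes, then compare mean coverage per GC bin with the genome-wide GC summaries. The dip is obvious in the bins but nearly invisible in the overall GC comparison, because extreme-GC windows are so rare.

set.seed( 1 )
gc   <- rbeta( 1e5, 5, 5 )                      # per-window GC fraction, mostly mid-range
bias <- ifelse( gc < 0.2 | gc > 0.8, 0.3, 1 )   # depressed coverage at GC extremes
cov  <- rpois( 1e5, lambda=30 * bias )          # simulated per-window coverage

bins <- cut( gc, breaks=seq( 0, 1, by=0.05 ) )
tapply( cov, bins, mean )       # coverage per GC bin: the drop in the tail bins stands out
mean( gc )                      # "genome" GC
weighted.mean( gc, w=cov )      # read-weighted GC: barely different from the genome GC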

EDIT: You are actually talking about a third issue, weird regions in the alignment, which has little to do with duplicates or GC bias. Most of these regions are caused by the imperfect reference genome, and some by structural variations.


@lh3: Thanks for your reply. Can you clarify your formula, 0.28*m/s/L? In particular, where does the 0.28 come from, and in what order is m/s/L evaluated? Is it m/(s/L) or (m/s)/L?


0.28 is 1/[2 sqrt(pi)]

This comes from a model that assumes a uniform distribution of fragment starting positions and a Gaussian distribution p of fragment lengths with standard deviation s. Under these assumptions, and with some simplifications (the genome is much longer than a fragment, and s is less than one third of the mean fragment length), the probability that two independent fragments have the same starting position and the same length l just by chance is (m/L) p(l)^2, where p(l) is the Gaussian density at length l and the assumption L >> l holds (otherwise the denominator is L - l). When you marginalize p(l)^2 over all fragment lengths l, the result is 1/[2 s sqrt(pi)] = 0.28/s regardless of the mean fragment length (approximately, but very closely). This is how one arrives at 0.28 m/(sL); in other words, the expression is evaluated left to right, i.e. (0.28*m/s)/L = 0.28*m/(s*L). More details can be found in the samtools supporting documentation (see https://lh3lh3.users.sourceforge.net/download/samtools.pdf, page 5).
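
For completeness, the marginalization step written out (a standard Gaussian integral, in LaTeX notation):

\int \left[ \frac{1}{s\sqrt{2\pi}} e^{-(l-\mu)^2/(2s^2)} \right]^2 \mathrm{d}l
  = \frac{1}{2\pi s^2} \int e^{-(l-\mu)^2/s^2} \mathrm{d}l
  = \frac{s\sqrt{\pi}}{2\pi s^2}
  = \frac{1}{2 s \sqrt{\pi}} \approx \frac{0.28}{s}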

You can easily check the accuracy of the result:

L <- 1e6     # genome length
n <- 2e4     # number of fragments
l <- 300     # mean fragment length
s <- 10      # standard deviation of fragment length

set.seed( 0 )
# Uniform fragment start positions, Gaussian fragment lengths:
x <- sample( 1:( L - l - 3 * s ), size=n, replace=TRUE )
y <- as.integer( x + l + rnorm( n=n, mean=0, sd=s ) )

# Observed fraction of unique sequences:
all_frags <- paste( as.character( x ), '_', as.character( y ), sep='' )
table( table( all_frags ) )[[ '1' ]] / n
# [1] 0.9994 - Notice: this model completely excludes fragments that have duplicates, 
# even the unique representatives from the duplicated subset of fragments,
# and only accounts for true singletons! Actual deduping procedures keep 
# one copy from each set of duplicated reads, so the resulting unique fraction 
# will be somewhat higher than this estimate: 
length( unique( all_frags ) ) / n
# [1] 0.9997

# Predicted fraction of unique (singleton!) sequences based on lh3's formula: 
1 - n / ( 2 * sqrt( pi ) * s * L ) 
# [1] 0.9994358 # Quite close!

# Prediction based on direct integration: 
p <- dnorm( 0:( l + 3 * s ), mean=l, sd=s ) # Distribution of fragment lengths 
1 - ( n / L ) * sum( p^2 ) # This gives the same numerical result as lh3's formula: 
# [1] 0.9994358 # Same as lh3
