Hi all,
I have data coming from sequencing ancient DNA data. The genome in question is mitochondrial. Libraries were sometimes PE, sometimes SE. With duplicates we get very high coverage ~1000-2000x. After remving duplicates we still can maintain ~150x on average. However the data loss is substantial and questions the amount of sequencing used, and then wasted with Picard duplicates removal, which does remove substantial amount of data <- one thing for sure, mitochondrion is very short ~16700 so given the amount of sequencing, it's only natural, that reads are found as duplicates. The other thing is that with aDNA, there is no guarantee of target concentration nor it's completeness (some fragments may just not be present). So this situation represents the most extreme situation for Heng's Li simple equation:
dups = 0.5*m/N m - sequencing reads N - DNA molecules before amplification http://seqanswers.com/forums/showthread.php?t=6854
Finally mitochondrial sequences are to be assembled to a samples consensus, and SNPs to be determined. Is it wise to remove duplicates from those data sets? Neither the consensus nor SNP calls change with duplicates removal/non removal.
Also worth mentioning is the fact, that following enrichment of mitochondrion, the target is in minority, with most of the sequences steming from bacterial contamination.
So. Removing, not removing? Or any other way to try and estimate it's amount (just theoretical equations?) To support myself I went with preseq library complexity estimation also to look for some correlations in case, I won't find out any definite answer to this PCR duplicates problem
Kind regards
First it all depends on the question you are trying to answer, regarding SNPs removing duplication (Marking duplication) will help in finding true positive snps. In case of assembly you need not to remove duplication (except PCR duplication and optical duplicate )
it;s actually both. First obtaining consensus assembly, and then applying the variants. Mt is haploid thankfully. Either way I find the take away message from your answer, that I should remove PCR and optical duplicates (I'm using Picard for this) even if my coverage is sky high. Many thanks for prompt reply