Removing duplicates in high coverage ancient DNA mitochondrial data
8.2 years ago
stolarek.ir ▴ 700

Hi all,

I have data from sequencing ancient DNA; the genome in question is mitochondrial. Libraries were sometimes PE, sometimes SE. With duplicates we get very high coverage, ~1000-2000x; after removing duplicates we can still maintain ~150x on average. However, the data loss is substantial, which calls into question how much of the sequencing was effectively wasted on reads that Picard duplicate removal then discards. One thing is certain: the mitochondrial genome is very short (~16,700 bp), so given the amount of sequencing it is only natural that many reads are identified as duplicates. The other thing is that with aDNA there is no guarantee of the target's concentration nor of its completeness (some fragments may simply not be present). So this situation represents the most extreme case for Heng Li's simple equation:

dups = 0.5 * m / N, where m is the number of sequenced reads and N is the number of DNA molecules before amplification (http://seqanswers.com/forums/showthread.php?t=6854)
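A minimal sketch of this back-of-the-envelope estimate (variable names are my own; the approximation only holds while m is much smaller than N):

```python
def expected_dup_rate(m, n_molecules):
    """Approximate PCR-duplicate fraction for m sequenced reads (or read pairs)
    drawn from n_molecules unique library molecules; valid for m << n_molecules."""
    return 0.5 * m / n_molecules

# A well-behaved case: 100,000 reads from 1,000,000 unique molecules -> ~5% duplicates.
print(expected_dup_rate(100_000, 1_000_000))   # 0.05

# The aDNA/mtDNA case: reads vastly outnumber the surviving unique molecules,
# the value exceeds 1 and the approximation breaks down - almost every extra read
# is a duplicate, matching the ~1000-2000x -> ~150x drop described above.
print(expected_dup_rate(1_000_000, 100_000))   # 5.0
```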

Finally, the mitochondrial sequences are to be assembled into a per-sample consensus, and SNPs are to be determined. Is it wise to remove duplicates from these data sets? Neither the consensus nor the SNP calls change whether duplicates are removed or not.

Also worth mentioning is that, even after enrichment for the mitochondrion, the target is in the minority, with most of the sequences stemming from bacterial contamination.

So: remove, or not remove? Or is there any other way to estimate the duplicate fraction (perhaps just theoretical equations)? To support myself I also ran preseq library complexity estimation, looking for correlations, in case I don't find a definite answer to this PCR duplicates problem.
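For context, a rough way to measure the observed duplicate fraction directly from a Picard-marked BAM with pysam might look like this (file and contig names are placeholders, and the BAM is assumed to be coordinate-sorted and indexed, with duplicates marked rather than removed):

```python
import pysam

# Count reads carrying the duplicate flag (0x400) set by Picard MarkDuplicates.
bam = pysam.AlignmentFile("sample.markdup.bam", "rb")   # placeholder path
total = dups = 0
for read in bam.fetch("chrM"):                          # or "MT", depending on the reference
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        continue
    total += 1
    if read.is_duplicate:
        dups += 1
bam.close()

print(f"{dups}/{total} primary mitochondrial reads are flagged as duplicates "
      f"({dups / total:.1%})")
```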

Kind regards

aDNA • duplicates • Picard
ADD COMMENT

First, it all depends on the question you are trying to answer. Regarding SNPs, removing (or marking) duplicates will help in finding true-positive SNPs. In the case of assembly you do not need to remove duplicates (except PCR duplicates and optical duplicates).

ADD REPLY

It's actually both: first obtaining a consensus assembly, and then applying the variants. The mitochondrion is haploid, thankfully. Either way, the take-away message from your answer is that I should remove PCR and optical duplicates (I'm using Picard for this) even if my coverage is sky high. Many thanks for the prompt reply.
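For reference, one way to invoke Picard MarkDuplicates from Python (a sketch assuming the `picard` wrapper script is on the PATH; file names are placeholders):

```python
import subprocess

# Mark (rather than delete) duplicates with Picard MarkDuplicates.
subprocess.run(
    [
        "picard", "MarkDuplicates",
        "I=sample.sorted.bam",
        "O=sample.markdup.bam",
        "M=sample.dup_metrics.txt",
        "REMOVE_DUPLICATES=false",   # set to true to drop duplicate reads entirely
    ],
    check=True,
)
```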

ADD REPLY
8.2 years ago
Brice Sarver ★ 3.8k

If your coverage is extremely high, additional information is not providing you more support for your conclusion - it's just duplicated data. You can identify duplicates and remove them down to an appropriate coverage (say, 30X or greater).

For de novo assembly, I recommend downsampling to somewhere around 60X or less. This is because many assemblers expect a certain coverage and will begin to split contigs that have excessive coverage. I have done a lot of mitochondrial assemblies and I noticed this all the time. For example, if my average coverage was 200X and some regions dipped below 100X (still very high), the contigs would be split at that point. Some assemblers will allow you to specify coverage a priori and attempt to alleviate this.
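As a rough way to pick a subsampling fraction for such a target (numbers are illustrative; the post-deduplication mean coverage would come from e.g. samtools depth):

```python
# Illustrative only: choose a fraction of reads to keep so the mean coverage
# lands near the assembler-friendly target suggested above.
target_coverage = 60
mean_coverage = 150            # post-deduplication mean from the question
fraction = min(1.0, target_coverage / mean_coverage)
print(f"keep ~{fraction:.2f} of reads, e.g. samtools view -s {fraction:.2f} ...")   # -> 0.40
```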

ADD COMMENT

Yeah, in aDNA the coverage bumps are huge (due to variation in sequence survival), so I employed a simple sequence consensus. I think it's justified by the fact that mtDNA is haploid and doesn't undergo recombination.

Grateful for supporting the logic I developed, that additional ultra-high coverage doesn't bring anything new (95% of the sequence is covered at least at 50X after duplicate removal, so that settles it). Thanks for the exact numbers!
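A minimal majority-rule consensus along these lines might look like the sketch below (file and contig names are placeholders; a real aDNA pipeline would also handle post-mortem damage and stricter filtering):

```python
from collections import Counter
import pysam

MIN_DEPTH = 10   # arbitrary depth threshold for illustration

bam = pysam.AlignmentFile("sample.dedup.bam", "rb")    # placeholder path
consensus = {}
for col in bam.pileup("chrM", min_base_quality=20):    # contig name depends on the reference
    bases = [b.upper() for b in col.get_query_sequences()
             if b.upper() in {"A", "C", "G", "T"}]
    if len(bases) >= MIN_DEPTH:
        # Majority vote is reasonable for a haploid, non-recombining target.
        consensus[col.reference_pos] = Counter(bases).most_common(1)[0][0]
bam.close()
```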

ADD REPLY

Hi, is sequence coverage calculated as (total read length / reference genome length) before or after removing duplicated and contained reads?

I am doing repeat identification based on read frequency, flagging a read as a repeat if its frequency is higher than double the coverage. When I calculate the coverage before removing duplicates it is too high, so no read gets flagged as a repeat, although I know the genome has repetitive regions longer than the read length.
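A toy illustration of how much duplicates can inflate that coverage estimate (all numbers are made up):

```python
# Made-up numbers: the same library measured before and after deduplication.
genome_length = 16_569                 # e.g. the rCRS mitochondrial reference
total_bases_with_dups = 25_000_000     # sum of aligned read lengths, duplicates kept
total_bases_dedup = 2_500_000          # sum of aligned read lengths, duplicates removed

print(total_bases_with_dups / genome_length)   # ~1509x -> a 2x-coverage repeat threshold is unreachable
print(total_bases_dedup / genome_length)       # ~151x  -> a more realistic baseline for flagging repeats
```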

ADD REPLY
