Question

Very High Percentage Of Reads Are PCR Duplicates - Ion Torrent

11.3 years ago

Davy ▴ 410

Hi all. Recently I have been given some targeted Ion Torrent sequencing data to play with. It's not a large amount of data, only ~18,000 unpaired reads. I aligned the reads with BWA, pretty much the same way I have always done with Illumina FASTQ files. (About 80% aligned, which seemed a bit low, but whatever, I pushed on.)

I then went on to mark the PCR duplicates with Picard. According to the metrics file, and to flagstat on the resulting BAM file, a large portion (>70%) of the reads are duplicates. This doesn't seem quite right to me, and I was wondering if anyone has come across this before or has any suggestions as to what to do next. (Surely I can't use the data after removing over 70% of it, can I?)

Here is the output of flagstat before and after marking the duplicates:

>samtools flagstat sample002.s.bam
17795 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14258 + 0 mapped (80.12%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

After marking with Picard

java -Xmx8g -jar MarkDuplicates.jar I=sample002.s.bam O=sample002.ds.bam M=./metrics/sample002.markdups_metrics.txt AS=true VALIDATION_STRINGENCY=LENIENT
>samtools flagstat sample002.ds.bam
17795 + 0 in total (QC-passed reads + QC-failed reads)
13064 + 0 duplicates
14258 + 0 mapped (80.12%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
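
As a quick check on the figures above, here is a minimal Python sketch that pulls the counts out of the flagstat text and works out the duplicate rate. The helper name `parse_flagstat` is made up for illustration, and the input is just the three relevant lines pasted from the output above. It reports the rate against mapped reads as well as total reads, since (as far as I understand) Picard only flags mapped reads as duplicates.

```python
import re

def parse_flagstat(text):
    """Extract QC-passed counts for total, duplicate and mapped reads
    from samtools flagstat text (hypothetical helper, illustration only)."""
    counts = {}
    for line in text.splitlines():
        # Each flagstat line looks like: "<passed> + <failed> <label>"
        m = re.match(r"\s*(\d+) \+ (\d+) (.*)", line)
        if not m:
            continue
        passed, label = int(m.group(1)), m.group(3)
        if label.startswith("in total"):
            counts["total"] = passed
        elif label.startswith("duplicates"):
            counts["duplicates"] = passed
        elif label.startswith("mapped ("):
            counts["mapped"] = passed
    return counts

# The three relevant lines from the post-MarkDuplicates flagstat output.
FLAGSTAT = """\
17795 + 0 in total (QC-passed reads + QC-failed reads)
13064 + 0 duplicates
14258 + 0 mapped (80.12%:-nan%)
"""

counts = parse_flagstat(FLAGSTAT)
dup_of_total = 100.0 * counts["duplicates"] / counts["total"]
dup_of_mapped = 100.0 * counts["duplicates"] / counts["mapped"]
print(f"duplicates / total reads:  {dup_of_total:.1f}%")   # 73.4%
print(f"duplicates / mapped reads: {dup_of_mapped:.1f}%")  # 91.6%
```

So the ">70%" is relative to all reads; relative to the reads that actually mapped it is over 90%.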

Cheers, Davy

sequencing pcr samtools • 9.1k views

I have not used Ion Torrent myself, but I would be curious to see what the FastQC report looked like (quality of data). The 80% mapping rate may be due to poor-quality data, which may need some trimming before more of it maps, although BWA-MEM effectively trims automatically by soft-clipping read ends that do not align. Which BWA algorithm did you use?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 11.3 years ago by rob234king ▴ 610

I used the standard bwa aln in version 0.6.2. The FastQC reports showed the tails of the reads to be of quite low quality, so I will try BWA-MEM to see if the alignment quality improves.

Cheers.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 11.3 years ago by Davy ▴ 410

score 1 · Answer 1 · 2013-08-28

11.3 years ago

arno.guille ▴ 420

Your library was probably constructed with AmpliSeq, which includes a PCR amplification step. In other words, your result is normal. Don't remove duplicates with Picard in this case.

Can you explain why I wouldn't need to remove the duplicates? If there is an error early in the PCR cycles, won't it propagate and cause spurious SNP detection, in addition to artificially inflating the read depth?

ADD REPLY • link 11.3 years ago by Davy ▴ 410

With Ion Torrent, and especially with targeted sequencing, it's normal to have a lot of duplicates. That's why the MarkDuplicates step flags so many reads, and if you remove them you will miss too many true SNPs and indels. By contrast, in whole exome sequencing you expect very few duplicates, and in that case it's appropriate to remove them. For the alignment I suggest using bwasw, which is specifically designed for long reads.
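
For reference, a small Python sketch of how the two long-read alignment commands mentioned in this thread are laid out. It only builds and prints the command lines and does not run them. The helper `bwa_command` and the file names `reference.fa` and `sample002.fastq` are placeholders, and it assumes the reference has already been indexed with `bwa index`. Note that, as far as I know, `bwa mem` only appeared in the 0.7.x releases, so version 0.6.2 would offer `bwasw` but not `mem`.

```python
import shlex

def bwa_command(algorithm, reference, reads):
    """Build a bwa command line for single-end reads (hypothetical helper).

    Both `bwa mem` and `bwa bwasw` take the index prefix and the read
    file as positional arguments and write SAM to standard output.
    """
    if algorithm not in {"mem", "bwasw"}:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return ["bwa", algorithm, reference, reads]

# Placeholder file names; redirect stdout to a .sam file when running for real.
for algo in ("bwasw", "mem"):
    print(shlex.join(bwa_command(algo, "reference.fa", "sample002.fastq")))
```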

ADD REPLY • link 11.3 years ago by arno.guille ▴ 420

I agree that the high number of "PCR duplicates" is probably normal if you have high coverage over a small region (just compute the probability of having two reads starting and ending in exactly the same position...). The decision whether to keep or remove them is hard and depends on the experimental design. Keeping them can cause false positive SNPs to emerge in the case you suggested (an early error in PCR; I have observed one such instance). If your coverage is high and you have individual data (not pooled), I don't think removing them should cause loss of SNPs, but I am not 100% sure. The best thing would be to check...
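
The probability mentioned above can be roughed out with a simple occupancy calculation. This is a minimal sketch, assuming reads fall independently and uniformly on a fixed number of distinct alignment "signatures" (for example start position plus strand) and that every read after the first on a signature gets flagged. The function name and the signature counts 400 and 40000 are made-up illustrations, not values from this dataset; only the 14258 mapped reads come from the flagstat output in the question.

```python
def expected_duplicate_fraction(n_reads, n_signatures):
    """Expected fraction of reads flagged as duplicates when n_reads are
    placed uniformly at random on n_signatures distinct signatures."""
    if n_reads <= 0 or n_signatures <= 0:
        raise ValueError("both arguments must be positive")
    # Probability that a given signature receives no read at all.
    p_empty = (1.0 - 1.0 / n_signatures) ** n_reads
    # Expected number of signatures hit at least once (one read each is kept).
    expected_distinct = n_signatures * (1.0 - p_empty)
    return 1.0 - expected_distinct / n_reads

mapped_reads = 14258  # from the flagstat output in the question

# Hypothetical amplicon-like case: very few possible signatures.
few = expected_duplicate_fraction(mapped_reads, 400)
# Hypothetical shotgun-like case over a larger region: many signatures.
many = expected_duplicate_fraction(mapped_reads, 40000)

print(f"400 signatures:   {few:.1%} expected duplicates")
print(f"40000 signatures: {many:.1%} expected duplicates")
```

With only a few hundred possible signatures almost every read is expected to look like a duplicate, whatever its true origin, which is why a high rate alone does not say much for amplicon data.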

ADD REPLY • link 11.3 years ago by Fabio Marroni ★ 3.0k

Thanks Fabio and Arno. I will continue to seek opinions, but this does make sense to me, so I will continue on with the pipeline for now. Cheers!

ADD REPLY • link 11.3 years ago by Davy ▴ 410