Question

Problem using PicardTools MarkDuplicates: Value Was Put Into Pairinfomap More Than Once

1

Entering edit mode

9.4 years ago

anp375 ▴ 190

High, I'm having a problem using PicardTools MarkDuplicates. I had two fastq files for forward and reverse reads. They were both from the same lane, etc. I aligned them with bwa mem using the -M option and got the bam file. I used PicardTools FixMateInformation, SortSam, and samtools index. When I use MarkDuplicates, I get this error:

Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 4: Sample_AD_096:HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:407) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134) [E::hts_open] fail to open file 'Sample_AD_096_b37_aln-pe.bam'

When I grepped the read name, I got 4 lines back: Primary forward read, primary reverse read, separated by a SAMFlag value of 80 or 16. Secondary forward read, secondary reverse read separated by a SAMFlag value of 80 or 16.

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 387 5 154316645 6 0 39M62H 2 233666062 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACC BBBFFFFFFFFFFIFFFBFFFFFIFBFFBFFFFFIFFFF SA:Z:2,233666062,+,35S66M,60,0;M D:Z:39 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:39 XS:i:0

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 371 5 154316648 6 0 36M65H 2 233666062 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACC 77FFBFFFFFFFFFFFFFFFFBBBBBFFFFFFB7BB SA:Z:2,233666062,-,32S69M,60,0;M D:Z:36 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:36 XS:i:0

I tried isolating primary alignments only, with samtools view -b -F 0x4 -F 0x100 -F 0x800 both.bam > primary.bam, but got the same error even though a grep showed only the primary forward and reverse reads separated by a value of 80 or 16. I did remember to index this. I don't know how to solve this. I have two fastq files so I can't use MergeBamAlignment. Should I combine the fastq files into one? My grep output from the primaries only:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

PicardTools Picard MarkDuplicates Pairinfomap • 4.7k views

ADD COMMENT • link updated 9.4 years ago by Matt Shirley 10k • written 9.4 years ago by anp375 ▴ 190

score 2 · Answer 1 · 2016-03-28

It looks like you have actual duplicate reads based on the read name and not the flags. e.g.:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ...

My guess is that you had some newish data from a HiSeq where the read pair identifiers come after the read names, separated by a space:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 1:x:xx:xxxxx 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 2:x:xx:xxxxx 115 2 233666062 6 0 32S69M 5 154316648 0 ...

One of your pre-processing steps probably stripped everything in the read name after the space, and now Picard thinks these are duplicates.

I forgot what the exact format for the 1:x:xx:xxxxx portion is, but you can find the definition from illumina. It will be something like 1:Y:0:ATCGAT which is read number:passing filter:control flag:barcode.