Question

Why Did Samtools Rmdup Find So Many Duplicates In My Data?

2

13.8 years ago

Haiping ▴ 110

Hi. I used samtools rmdup to remove PCR duplicates from the sorted BAM file of my data. But some of the genes lost most of their aligned reads during this step (see below). Is that right? Also, the more reads align to a gene, the higher the percentage of reads that rmdup removes. Why does the command remove so many reads? It seems impossible for my data to contain that many duplicates.

before  after
235132  15438
  2410   1535
  1740   1489
  2926   2493
   636    548
  2666   2258
  1866   1581
  2390   2009
  1040    885
  8019   3467
  1668   1418
  2218   1928
  2011   1730
  4902   1924
   120    103
 14634   4432
 25263   3206
  1047    844
 36094   4895
  9222   6558
177499  19560
390835  25276
   240    195
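To make the trend in these counts easier to see, here is a minimal Python sketch that computes the fraction of reads removed for each gene and lists the genes by read depth. The counts are copied from the table; the names `counts` and `fraction_removed` are only illustrative.

```python
# (before, after) read counts per gene, copied from the table in the question
counts = [
    (235132, 15438), (2410, 1535), (1740, 1489), (2926, 2493),
    (636, 548), (2666, 2258), (1866, 1581), (2390, 2009),
    (1040, 885), (8019, 3467), (1668, 1418), (2218, 1928),
    (2011, 1730), (4902, 1924), (120, 103), (14634, 4432),
    (25263, 3206), (1047, 844), (36094, 4895), (9222, 6558),
    (177499, 19560), (390835, 25276), (240, 195),
]


def fraction_removed(before, after):
    """Fraction of a gene's reads that rmdup discarded."""
    return (before - after) / before


# Sort by the 'before' count so the depth-versus-loss trend is visible
for before, after in sorted(counts):
    print(f"{before:>7} {after:>6} {fraction_removed(before, after):7.1%}")
```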

samtools next-gen sequencing duplicates • 5.8k views

ADD COMMENT • link updated 13.8 years ago by Jeremy Leipzig 23k • written 13.8 years ago by Haiping ▴ 110

0

What type of experiment is this? WGS, capture?

ADD REPLY • link 13.8 years ago by Drio ▴ 920

0

The next step is to call SNPs. Any comments?

ADD REPLY • link 13.8 years ago by Haiping ▴ 110

Ram · Answer 1 · 2011-07-06

4

13.8 years ago

Jeremy Leipzig 23k

samtools rmdup identifies duplicates by mapping position (POS); it does not look at the read sequence at all.

Two distinct sequences with identical mapping positions:
samtools view /tmp/b.bam 
HWI-ST431_52:1:1:5514:60320/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   ATGAGGTAGTAGGTTGTATAGTTATC  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0
HWI-ST431_52:1:1:5514:60321/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   GGGGGGGGGGGGGGGGGGGGGGGGGG  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0

Only the first sequence is kept:
samtools rmdup -s /tmp/b.bam /tmp/b.nodup.bam
[bam_rmdupse_core] 1 / 2 = 0.5000 in library '  '
samtools view /tmp/b.nodup.bam 
HWI-ST431_52:1:1:5514:60320/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   ATGAGGTAGTAGGTTGTATAGTTATC  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0
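The behaviour shown above can be modelled in a few lines of Python. This is a simplified sketch of position-based duplicate removal for single-end reads, not samtools' actual code: the `Read` type and `dedup_by_position` are hypothetical names, and the tie-breaking rule (prefer higher MAPQ, otherwise keep the first read seen) is an assumption made for illustration.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    ref: str
    pos: int
    reverse: bool  # True if the read maps to the reverse strand
    mapq: int
    seq: str


def dedup_by_position(reads):
    """Keep one read per (reference, position, strand).

    The sequence is never inspected, so different sequences that map
    to the same place are treated as duplicates of each other.
    """
    best = {}
    for r in reads:
        key = (r.ref, r.pos, r.reverse)
        # Assumption: prefer higher MAPQ; on a tie the first read seen wins
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return list(best.values())


# The two records from the example above (FLAG 0 means forward strand)
reads = [
    Read("HWI-ST431_52:1:1:5514:60320/1", "hsa-let-7a-1", 5, False, 28,
         "ATGAGGTAGTAGGTTGTATAGTTATC"),
    Read("HWI-ST431_52:1:1:5514:60321/1", "hsa-let-7a-1", 5, False, 28,
         "GGGGGGGGGGGGGGGGGGGGGGGGGG"),
]

kept = dedup_by_position(reads)
for r in kept:
    print(r.name, r.pos, r.seq)
```

One consequence of keying on position and strand is that a region can retain at most one read per position per strand, however many reads originally mapped there.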

ADD COMMENT • link 13.8 years ago by Jeremy Leipzig 23k

1

Thanks for your comments. I would also like to know whether there are any differences between samtools rmdup and fastx_collapser. Thanks all!

ADD REPLY • link 13.8 years ago by Haiping ▴ 110

0

So does this mean that if the coverage of my data is high, I should skip samtools rmdup before SNP calling to avoid losing a lot of information? Is there other, more powerful software I can use to remove the duplicates, or do I not actually need to do this?

ADD REPLY • link 13.8 years ago by Haiping ▴ 110

0

Impossible to say without knowing more about your data. In general, removing duplicates is a good idea. If you lose 85% of your data during duplicate removal, you're probably looking at some artifact.
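A rough way to judge whether a loss looks like an artifact is to estimate how many reads would survive position-based duplicate removal by chance alone. The sketch below assumes single-end reads whose start positions fall uniformly at random over a region, with one read kept per position and strand. Both are simplifying assumptions, and the region length and read counts in the example are hypothetical, not taken from the data in this thread.

```python
def expected_unique(n_reads, n_positions, strands=2):
    """Expected number of distinct (position, strand) slots hit by n_reads.

    Classic occupancy result: with S equally likely slots, the expected
    number of occupied slots after n draws is S * (1 - (1 - 1/S) ** n).
    """
    slots = n_positions * strands
    return slots * (1 - (1 - 1 / slots) ** n_reads)


# Hypothetical 10 kb region at increasing read depth
region_length = 10_000
for n_reads in (1_000, 10_000, 100_000, 1_000_000):
    kept = expected_unique(n_reads, region_length)
    print(f"{n_reads:>9} reads -> about {kept:,.0f} kept "
          f"({1 - kept / n_reads:.1%} removed)")
```

Under these assumptions the number of surviving reads can never exceed `n_positions * strands`, so the removed fraction rises with depth even when no PCR duplicates are present.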

ADD REPLY • link 13.8 years ago by Marvin ▴ 900

0

Actually, I combined genome resequencing data from 10 samples. Each sample covers the reference only 2-fold. Is that the reason samtools removes so many reads? If the data came from a single sample with low coverage, it seems impossible to see so many duplicates.

ADD REPLY • link 13.8 years ago by Haiping ▴ 110

0

Also, is there any software that considers not only the positions but also the sequences when removing duplicates? Thanks all.

ADD REPLY • link 13.8 years ago by Haiping ▴ 110

0

There is fastx_collapser: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage
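For contrast with position-based removal, here is a minimal Python sketch of sequence-based collapsing, the general idea behind fastx_collapser: identical sequences are merged and counted, and mapping position plays no part. This is an illustration only, not the tool's code or output format, and `collapse_identical` is a hypothetical name.

```python
from collections import Counter


def collapse_identical(seqs):
    """Merge identical sequences; return (sequence, count) pairs,
    most frequent first."""
    return Counter(seqs).most_common()


seqs = [
    "ATGAGGTAGTAGGTTGTATAGTTATC",
    "GGGGGGGGGGGGGGGGGGGGGGGGGG",
    "ATGAGGTAGTAGGTTGTATAGTTATC",
]

for seq, count in collapse_identical(seqs):
    print(count, seq)
```

Here the two different sequences both survive, because only exact sequence matches are merged. Position-based removal would instead keep a single read if they shared a mapping position and strand.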

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 13.8 years ago by Jeremy Leipzig 23k