Why Did Samtools Rmdup Find So Many Duplicates In My Data?
13.5 years ago
Haiping ▴ 110

Hi. I used samtools rmdup to remove PCR duplicates from the sorted BAM file of my data. However, some genes lost most of their aligned reads during this step (see below). Is that expected? Also, the more reads aligned to a gene, the higher the percentage of reads removed by rmdup. Why does the command remove so many reads? It seems impossible that my data contains this many duplicates.

before  after
235132  15438
2410    1535
1740    1489
2926    2493
636     548
2666    2258
1866    1581
2390    2009
1040    885
8019    3467
1668    1418
2218    1928
2011    1730
4902    1924
120     103
14634   4432
25263   3206
1047    844
36094   4895
9222    6558
177499  19560
390835  25276
240     195
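
To make the trend in these numbers explicit, here is a small Python sketch that computes the fraction of reads removed per gene and sorts the genes by read count. The counts are copied from the table above; the helper name `removed_fraction` is only illustrative.

```python
# Read counts per gene before and after rmdup, copied from the table above.
counts = [
    (235132, 15438), (2410, 1535), (1740, 1489), (2926, 2493),
    (636, 548), (2666, 2258), (1866, 1581), (2390, 2009),
    (1040, 885), (8019, 3467), (1668, 1418), (2218, 1928),
    (2011, 1730), (4902, 1924), (120, 103), (14634, 4432),
    (25263, 3206), (1047, 844), (36094, 4895), (9222, 6558),
    (177499, 19560), (390835, 25276), (240, 195),
]


def removed_fraction(before, after):
    """Fraction of reads discarded by duplicate removal."""
    return (before - after) / before


# Sort by read count so the relationship with depth is easy to see.
for before, after in sorted(counts):
    print(f"{before:>7} {after:>6} {removed_fraction(before, after):6.1%}")
```

Genes with a few hundred to a few thousand reads lose roughly 15%, while the two deepest genes lose about 93%.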

samtools next-gen sequencing duplicates • 5.5k views

What type of experiment was this? WGS, capture?


The downstream goal is to call SNPs. Any comments?

13.5 years ago

samtools rmdup works on the mapping position (POS); it doesn't really care what your sequence is.

two distinct sequences with identical mapping positions:
samtools view /tmp/b.bam 
HWI-ST431_52:1:1:5514:60320/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   ATGAGGTAGTAGGTTGTATAGTTATC  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0
HWI-ST431_52:1:1:5514:60321/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   GGGGGGGGGGGGGGGGGGGGGGGGGG  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0

only the first sequence is kept:
samtools rmdup -s /tmp/b.bam /tmp/b.nodup.bam
[bam_rmdupse_core] 1 / 2 = 0.5000 in library '  '
samtools view /tmp/b.nodup.bam 
HWI-ST431_52:1:1:5514:60320/1   0   hsa-let-7a-1    5   28  26M75H  *   0   0   ATGAGGTAGTAGGTTGTATAGTTATC  HHHFHHHHHFHHHHHGFGGFHHHEHH  PG:Z:novoalign  AS:i:60 UQ:i:60 NM:i:2  MD:Z:23T1A0
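
The behaviour shown above can be mimicked with a few lines of Python. This is only a sketch of position-based duplicate removal, not samtools' actual implementation: the `Read` record and `rmdup_by_position` are hypothetical names, the key is assumed to be (reference, position, strand), and the first read seen is kept, as in the output above. The real tool may choose which read to keep differently.

```python
from collections import namedtuple

# Minimal stand-in for an aligned single-end read (hypothetical type).
Read = namedtuple("Read", "name ref pos reverse seq")


def rmdup_by_position(reads):
    """Keep the first read seen for each (ref, pos, strand); drop the rest."""
    seen = set()
    kept = []
    for read in reads:
        # The read sequence is deliberately NOT part of the key.
        key = (read.ref, read.pos, read.reverse)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("HWI-ST431_52:1:1:5514:60320/1", "hsa-let-7a-1", 5, False,
         "ATGAGGTAGTAGGTTGTATAGTTATC"),
    Read("HWI-ST431_52:1:1:5514:60321/1", "hsa-let-7a-1", 5, False,
         "GGGGGGGGGGGGGGGGGGGGGGGGGG"),
]
kept = rmdup_by_position(reads)
print(f"{len(reads) - len(kept)} / {len(reads)} removed")
```

With a key like this, at most one read per start position and strand can survive, so the deeper a region is covered, the larger the fraction of reads that gets discarded.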

Thanks for your comments. I would also like to know whether there are any differences between samtools rmdup and fastx_collapser. Thanks all!


So does this mean that if the coverage of my data is high, I should skip samtools rmdup before SNP calling to avoid losing a lot of information? Is there other, more powerful software I can use to remove the duplicates, or do I not actually need to do this?


Impossible to say without knowing more about your data. In general, removing duplicates is a good idea. If you lose 85% of your data during duplicate removal, you're probably looking at some artifact.


Actually, I combined the genome resequencing data from 10 samples. Each sample covers the reference at only about 2-fold. Is this the reason samtools removes so many reads? If the data came from a single sample with low coverage, it seems impossible to see so many duplicates.
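
The rmdup log line earlier in the thread reports its count "in library", which suggests duplicates are tallied per library. Assuming the pooled BAM carries no library information (the thread does not say either way), reads from different samples that start at the same position cannot be told apart from PCR duplicates. A minimal Python sketch of that effect, with a hypothetical `Read` record whose `library` field stands in for the read-group library tag:

```python
from collections import namedtuple

# Hypothetical minimal read record; `library` stands in for the library tag.
Read = namedtuple("Read", "name library ref pos reverse")


def rmdup(reads, per_library):
    """Keep the first read per position, optionally treating libraries separately."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.ref, read.pos, read.reverse)
        if per_library:
            key = (read.library,) + key
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("a1", "sampleA", "chr1", 100, False),
    Read("a2", "sampleA", "chr1", 100, False),  # duplicate within sample A
    Read("b1", "sampleB", "chr1", 100, False),  # independent read from sample B
]
pooled = rmdup(reads, per_library=False)
separate = rmdup(reads, per_library=True)
print(len(pooled), "kept when pooled;", len(separate), "kept per library")
```

In this toy case the independent read from sample B is lost when the samples are pooled but survives when each library is handled on its own.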


Also, is there any software that considers not only the positions but also the sequences when removing duplicates? Thanks all.
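
To make the difference between these criteria concrete, here is a hedged Python sketch that applies three duplicate keys to the same toy reads: position only (the behaviour described in the answer above), sequence only (roughly what collapsing identical sequences before alignment does), and position plus sequence. All names and the toy reads are illustrative and do not correspond to any particular tool's options.

```python
from collections import namedtuple

# Hypothetical minimal read record for illustration.
Read = namedtuple("Read", "name ref pos reverse seq")


def dedup(reads, key):
    """Keep the first read for each distinct value of key(read)."""
    seen = set()
    kept = []
    for read in reads:
        k = key(read)
        if k not in seen:
            seen.add(k)
            kept.append(read)
    return kept


def by_position(read):
    return (read.ref, read.pos, read.reverse)


def by_sequence(read):
    return read.seq


def by_position_and_sequence(read):
    return (read.ref, read.pos, read.reverse, read.seq)


reads = [
    Read("r1", "chr1", 100, False, "ACGTACGT"),
    Read("r2", "chr1", 100, False, "ACGTACGT"),  # same position, same sequence
    Read("r3", "chr1", 100, False, "ACGTACGA"),  # same position, one base differs
    Read("r4", "chr1", 200, False, "ACGTACGT"),  # same sequence, other position
]
for key in (by_position, by_sequence, by_position_and_sequence):
    print(key.__name__, [r.name for r in dedup(reads, key)])
```

A sequence-aware key keeps reads that differ by a single base, which preserves reads carrying a real variant but also lets true PCR duplicates containing a sequencing error slip through.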
