10.3 years ago
devinliao0918
Does anyone have experience using Samtools to remove duplicate reads from a BAM file?
I ran the command samtools rmdup -sS in.bam out.bam
but got the following error message:
[bam_rmdupse_core] 78733374 / 149832972 = 0.5255 in library '2810277'
As Pierre mentioned, that is just a log message, not an error. It says that 52.55% of the aligned reads in your BAM file for the library "2810277" were duplicates and were removed.
Does the log tell that 52.55% of the aligned reads are retained instead of being removed? You see, I checked the file size and found that the ratio of the size of out.bam to that of in.bam is around 52.55%.
One more question, do I need to sort the in.bam file before running the command line "samtools rmdup -sS in.bam out.bam"?
Does the log tell that 52.55% of the aligned reads are retained instead of being removed?
It says that 52.55% of the reads were removed because they were duplicates of other reads.
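To make the arithmetic concrete, here is a small sketch that recomputes the fraction from the two numbers in the log line, reading the first as reads removed and the second as reads examined. The variable names are only for illustration.

```shell
# Numbers copied from the log line:
# [bam_rmdupse_core] 78733374 / 149832972 = 0.5255 in library '2810277'
removed=78733374     # first number, read here as duplicates removed
total=149832972      # second number, read here as reads examined

# Shell arithmetic is integer-only, so use awk for the division.
fraction=$(awk -v r="$removed" -v t="$total" 'BEGIN { printf "%.4f", r / t }')

# Reads expected to remain under this reading of the log.
retained=$((total - removed))

echo "removed fraction: $fraction"
echo "reads retained:   $retained"
```

Under this reading, a little under half of the examined reads should remain in out.bam.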
You see, I checked the file size and found that the ratio of the size of out.bam to that of in.bam is around 52.55%.
The sizes of the input and output BAM files should not be used to estimate how many reads were removed, because BAM is compressed and the compression ratio depends on the content. Instead, count the reads in each file with "samtools view -c".
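A minimal sketch of that check, assuming samtools is on your PATH and the files are named in.bam and out.bam as in the question. The helper removed_fraction is a hypothetical name, not part of samtools. Note that "samtools view -c" counts every record in the file, so the result may not match the log exactly if rmdup did not examine every record (unmapped reads, for example).

```shell
# Hypothetical helper: fraction of reads removed, given counts before and after.
removed_fraction() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.4f", (b - a) / b }'
}

# Only run the counting step if samtools and both files are available.
if command -v samtools >/dev/null 2>&1 && [ -f in.bam ] && [ -f out.bam ]; then
  before=$(samtools view -c in.bam)   # records in the input
  after=$(samtools view -c out.bam)   # records in the output
  echo "before: $before  after: $after"
  echo "fraction removed: $(removed_fraction "$before" "$after")"
fi
```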
do I need to sort the in.bam file before running the command line "samtools rmdup -sS in.bam out.bam"?
Yes, it should be run on a coordinate-sorted BAM file. I have never tried it on an unsorted BAM file, but I assume samtools would throw an error in that case.
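A rough sketch of sorting before deduplicating, assuming a samtools 1.x install (where "samtools sort" takes -o for the output file; the old 0.1.x releases used an output prefix instead). The helper header_is_coordinate_sorted and the file name in.sorted.bam are hypothetical, and the header check is only a heuristic, since it trusts the SO tag in the @HD line.

```shell
# Hypothetical helper: reads a SAM header on stdin and succeeds if the
# @HD line declares coordinate sort order.
header_is_coordinate_sorted() {
  grep -q '^@HD.*SO:coordinate'
}

# Only run if samtools and the input file are available.
if command -v samtools >/dev/null 2>&1 && [ -f in.bam ]; then
  if samtools view -H in.bam | header_is_coordinate_sorted; then
    sorted=in.bam
  else
    # Sort by coordinate first (samtools 1.x syntax).
    samtools sort -o in.sorted.bam in.bam
    sorted=in.sorted.bam
  fi
  # Same rmdup call as in the question, now on sorted input.
  samtools rmdup -sS "$sorted" out.bam
fi
```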
One more question: is it common to remove 52.55% of the aligned reads because they are duplicates of other reads? If so, why do people keep these reads in the original BAM file when they are going to be discarded in many downstream applications or analyses?
E.g., I know that when calling variants with GATK, the default filters exclude reads that are marked as duplicates.