Samtools: remove reads that are marked as duplicates
10.3 years ago

Does anyone have experience using Samtools to remove duplicate reads from a BAM file?

I tried the command "samtools rmdup -sS in.bam out.bam" but got the following error message:

[bam_rmdupse_core] 78733374 / 149832972 = 0.5255 in library '2810277'
next-gen • 5.5k views
It's not an error message, it's just a log line:
fprintf(stderr, "[bam_rmdupse_core] %lld / %lld = %.4lf in library '%s'\n", (long long)q->n_removed,
                    (long long)q->n_checked, (double)q->n_removed/q->n_checked, kh_key(aux, k));

As Pierre mentioned, it is just a log. It says that 78,733,374 of the 149,832,972 reads checked (52.55%) in library "2810277" of your BAM file were duplicates and were removed.
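
To make the arithmetic behind that log line explicit, here is a small Python sketch. The two counts are copied from the log above; the variable names are only illustrative, and the comments map them to the fields in the fprintf call quoted earlier.

```python
# Numbers copied from the log line:
# [bam_rmdupse_core] 78733374 / 149832972 = 0.5255 in library '2810277'
removed = 78_733_374    # q->n_removed: reads dropped as duplicates
checked = 149_832_972   # q->n_checked: reads examined in this library

fraction_removed = removed / checked
retained = checked - removed

print(f"removed : {removed:>11,} ({fraction_removed:.4f})")
print(f"retained: {retained:>11,} ({retained / checked:.4f})")
```

So, if the log is read this way, about 71.1 million reads (roughly 47.45%) should remain in out.bam.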

Does the log say that 52.55% of the aligned reads were retained rather than removed? I checked the file sizes, and out.bam is about 52.55% of the size of in.bam.

One more question: do I need to sort in.bam before running "samtools rmdup -sS in.bam out.bam"?

> Does the log tell that 52.55% of the aligned reads are retained instead of being removed?

No. It says that 52.55% of the reads were removed because they were duplicates of other reads.

> You see, I checked the file size and found that the ratio of the size of out.bam to that of in.bam is around 52.55%.

File sizes should not be used to estimate how many reads were removed, because BAM files are compressed and the compression ratio depends on the content. You can count the reads in your input and output BAM files with "samtools view -c".

> do I need to sort the in.bam file before running the command line "samtools rmdup -sS in.bam out.bam"?

Yes, it should be run on a coordinate-sorted BAM file. I have never tried it on an unsorted BAM file, but I assume samtools would throw an error if the input is unsorted.
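
As a rough cross-check of the "samtools view -c" suggestion, here is a minimal Python sketch that counts alignment records by walking the BAM record layout directly. It assumes a well-formed BAM file and relies on BGZF being readable with Python's gzip module. The function name count_bam_records is made up for illustration, and for a file with around 150 million reads samtools itself will be far faster.

```python
import gzip
import struct


def count_bam_records(path):
    """Count alignment records in a BAM file using only the standard library."""
    with gzip.open(path, "rb") as fh:  # BGZF is a series of gzip members
        if fh.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not look like a BAM file")
        (l_text,) = struct.unpack("<I", fh.read(4))
        fh.read(l_text)  # skip the SAM header text
        (n_ref,) = struct.unpack("<I", fh.read(4))
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<I", fh.read(4))
            fh.read(l_name + 4)  # skip the reference name and its 4-byte length
        count = 0
        while True:
            size_bytes = fh.read(4)
            if not size_bytes:
                break  # clean end of file
            if len(size_bytes) < 4:
                raise ValueError("truncated record size")
            (block_size,) = struct.unpack("<I", size_bytes)
            if len(fh.read(block_size)) != block_size:
                raise ValueError("truncated alignment record")
            count += 1
        return count


# Hypothetical usage with the file names from this thread:
# before = count_bam_records("in.bam")
# after = count_bam_records("out.bam")
# print(before, after, f"{1 - after / before:.4f} removed")
```

Comparing the two counts, rather than the two file sizes, gives the fraction of reads that rmdup actually removed.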

One more question: is it common to remove 52.55% of the aligned reads because they are duplicates of other reads? If so, why do people keep these reads in the original BAM file, given that they are discarded in many downstream applications and analyses?

For example, I know that when GATK is used to call variants, its default filter excludes reads that are marked as duplicates.
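
For context on what "marked as duplicates" means: in the SAM/BAM format a duplicate is indicated by bit 0x400 (decimal 1024) of the FLAG field, so a tool can skip such reads without deleting them from the file, whereas rmdup removes them outright. The sketch below shows only the bit test. The function names and the example FLAG values are made up for illustration and are not taken from any particular tool.

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG field means "PCR or optical duplicate".
DUPLICATE_FLAG = 0x400


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_marked_duplicates(flags):
    """Keep only the FLAG values whose duplicate bit is not set."""
    return [f for f in flags if not is_marked_duplicate(f)]


# Illustrative FLAG values: 1123 = 99 + 1024 and 1171 = 147 + 1024.
example_flags = [99, 147, 1123, 1171]
kept = drop_marked_duplicates(example_flags)
print(kept)
```

Marking instead of removing keeps the original data intact, so the duplicates remain available to any analysis that does want them.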

10.3 years ago
Ian 6.1k

Did you sort your BAM file first? The help page suggests '<input.srt.bam>' as input.

That might explain the error message.

Do you know how to check whether a BAM file is sorted? I believe in.bam is already sorted.
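
One way to check, sketched here under some assumptions: the BAM header may contain an @HD line whose SO: field declares the sort order (for example "coordinate", "queryname" or "unsorted"). "samtools view -H in.bam" prints that header; the Python below reads the same text with only the standard library. The function names are made up for illustration. Note that SO: is only a declaration written by whichever tool produced the file, and it can be missing, so it is a hint rather than proof.

```python
import gzip
import struct


def read_bam_header_text(path):
    """Return the SAM header text stored at the start of a BAM file."""
    with gzip.open(path, "rb") as fh:  # BGZF is a series of gzip members
        if fh.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not look like a BAM file")
        (l_text,) = struct.unpack("<I", fh.read(4))
        return fh.read(l_text).decode("ascii", "replace").rstrip("\x00")


def declared_sort_order(header_text):
    """Return the SO: value from the @HD line, or None if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Hypothetical usage with the file name from this thread:
# print(declared_sort_order(read_bam_header_text("in.bam")))
```

If this reports "coordinate", the file was at least written by a tool that claims coordinate order, which is what rmdup expects.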
