Efficient Collapsing Of Bam Reads By Sequence
11.9 years ago
user ▴ 950

Is there an efficient utility for collapsing BAM files by sequence, i.e. keeping only one read per distinct sequence (ideally with some control over which read to keep when identical sequences have different quality scores)? Thanks.

To clarify, I'd like to remove duplicates only if their sequences are identical, i.e. keep reads with the same alignment position if they have distinct sequences.

rna-seq sequence sam samtools • 5.5k views

The downside to using this is that the generated BAM files can only be used with GATK's tools. Also (correct me if I'm wrong), I thought this was part of the GATK v2 code that isn't open.

At least there is a specification.

11.9 years ago

Well, there is the rmdup command of samtools, which its documentation describes as follows:

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, 
only retain the pair with highest mapping quality.
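
To make the rule quoted above concrete, here is a minimal sketch in plain Python. It is not the samtools implementation: it is simplified to single-end reads keyed on chromosome, leftmost position and strand (rmdup itself works on read pairs and their external coordinates), and the `Read` type and `dedup_by_position` function are hypothetical names made up for illustration. Note that the sequence is never consulted.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, stripped-down stand-in for a SAM/BAM record."""
    name: str
    chrom: str
    pos: int        # leftmost aligned position
    reverse: bool   # True if the read maps to the reverse strand
    mapq: int       # mapping quality
    seq: str


def dedup_by_position(reads):
    """Keep one read per (chrom, pos, strand), preferring the highest MAPQ.

    Single-end simplification of coordinate-based duplicate removal;
    the read sequence plays no role in the decision.
    """
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.reverse)
        kept = best.get(key)
        if kept is None or read.mapq > kept.mapq:
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, False, 30, "ACGT"),
    Read("r2", "chr1", 100, False, 60, "ACGA"),  # same position, different sequence
    Read("r3", "chr1", 200, False, 60, "ACGT"),
]
kept_by_position = dedup_by_position(reads)
print([r.name for r in kept_by_position])  # ['r2', 'r3']
```

Here r1 is dropped even though its sequence differs from r2, because only the coordinates and the mapping quality are compared.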

I've seen several references saying that rmdup is not recommended even by its own creators. But if it's not recommended, why is it still part of samtools? If the goal is to remove PCR duplicates, why isn't Picard's program identical to samtools rmdup? What does it mean for one to be better than the other if both are advertised to do the same thing, namely remove PCR duplicate reads (reads on the same strand and position, regardless of sequence)?


If I am not mistaken, the creator of this tool (Heng Li, user lh3) actually recommends using Picard MarkDuplicates instead of samtools rmdup.


Interesting, I did not know that.


rmdup doesn't look at the sequences themselves. See: "Why Did Samtools Rmdup Find So Many Duplicates In My Data?"


Indeed, it all depends on what the OP really needs. People often use the two concepts interchangeably (identical reads vs. reads that map to the same location), though I agree that they are not the same.
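
For the "identical reads" interpretation, which seems to be what the OP describes, a rough sketch in plain Python could look like the one below. It is an illustration under assumptions, not an existing tool: the `Read` type, `mean_quality` and `collapse_by_sequence` are hypothetical names, and "keep the copy with the highest mean base quality" is just one possible tie-breaking rule. Reads are treated as duplicates only if position, strand and sequence all match.

```python
from typing import NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical, stripped-down stand-in for a SAM/BAM record."""
    name: str
    chrom: str
    pos: int                 # leftmost aligned position
    reverse: bool            # True if the read maps to the reverse strand
    seq: str
    quals: Tuple[int, ...]   # per-base Phred quality scores


def mean_quality(read):
    """Average base quality; 0.0 for a read without quality values."""
    return sum(read.quals) / len(read.quals) if read.quals else 0.0


def collapse_by_sequence(reads):
    """Keep one read per (chrom, pos, strand, sequence).

    Among exact duplicates, the copy with the highest mean base quality
    is kept (an assumed rule, chosen only for illustration).
    """
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.reverse, read.seq)
        kept = best.get(key)
        if kept is None or mean_quality(read) > mean_quality(kept):
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, False, "ACGT", (30, 30, 30, 30)),
    Read("r2", "chr1", 100, False, "ACGT", (40, 40, 40, 40)),  # identical sequence, better quality
    Read("r3", "chr1", 100, False, "ACGA", (20, 20, 20, 20)),  # same position, different sequence
]
collapsed = collapse_by_sequence(reads)
print([r.name for r in collapsed])  # ['r2', 'r3']
```

r3 survives despite sharing its position with r1 and r2, because its sequence differs. Dropping `chrom`, `pos` and `reverse` from the key would collapse identical sequences regardless of where they map.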

11.9 years ago

I had better experience with Picard's MarkDuplicates than with samtools rmdup: http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

However, I did not find an exact explanation of how it works. I noticed in IGV that it removed exactly identical reads (sequences) at exactly the same positions, and that the samtools flagstat output improved, but I did not examine in depth what else it did. Hope this helps.


If they remove the same reads, then what made the experience with MarkDuplicates better?
