Question

How does samblaster use library tag information if at all?

1


10.5 years ago

Carlos Borroto ★ 2.1k

After seeing a couple of mentions of samblaster by none other than @lh3, I decided to try it out. I'm in the middle of a massive data-processing run for a large cohort, and Picard's MarkDuplicates is taking up a good chunk of the processing time.

My main question is: how does samblaster use the library (LB) read group tag? The author mentions that the input SAM needs to be sorted by read group ID, which makes me think duplicate marking is limited to reads coming from the same '@RG ID'. In our case we resequence the same sample library a few times. It is my understanding that you need to mark duplicates across all the data coming from the same library, not just within a single read group ID.

Imagine this situation.

sample: S; library: S; sequence runs: 1, 2

In order to use samblaster I would map with something like:

bwa mem -R '@RG\tID:S.1\tSM:S\tPL:ILLUMINA\tPU:1\tLB:S' index S.1.r1.fq S.1.r2.fq | samblaster | samtools view -Sb - > S.1.out.bam
bwa mem -R '@RG\tID:S.2\tSM:S\tPL:ILLUMINA\tPU:2\tLB:S' index S.2.r1.fq S.2.r2.fq | samblaster | samtools view -Sb - > S.2.out.bam

In this case I would not be marking duplicates across all the data coming from the same library, even if samblaster correctly uses the LB tag. Do you see a way of using piping (data streaming) while still marking duplicates correctly in this situation?

Thanks,
Carlos.

samblaster markduplicates • 3.8k views

Updated 3.2 years ago by Ram 45k • written 10.5 years ago by Carlos Borroto ★ 2.1k

0


Another question: does MarkDuplicates use both ID and LB when deciding which reads to mark as duplicates, or just LB?

Updated 3.2 years ago by Ram 45k • written 10.5 years ago by brentp 24k

0


That's a good question. I assumed Picard uses LB only, but I have no evidence for that.

Written 10.5 years ago by Carlos Borroto ★ 2.1k

Ram · Answer 1 · 2015-03-05

samblaster currently ignores both the LB and RG tags. The input file must be grouped by QNAME (often also called "read id"). That is, the file need not be sorted by QNAME so long as all the alignments for a given QNAME are in contiguous lines in the input file. This is the natural order for the output of essentially all aligners.
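If you want to sanity-check that grouping requirement on a file you already have, something along these lines should work. This is only a rough sketch: the function name `check_qname_grouped` is made up for illustration (it is not part of samblaster or samtools), it assumes SAM text on stdin, and it keeps every QNAME it has seen in memory, so it is meant for spot checks rather than production use.

```shell
# Hypothetical helper (not part of samblaster or samtools).
# Reads SAM text on stdin and reports whether all alignments that share
# a QNAME sit on contiguous lines. Header lines (starting with '@') are skipped.
check_qname_grouped() {
    awk -F '\t' '
        /^@/ { next }                       # ignore SAM header lines
        $1 != prev {                        # a new run of identical QNAMEs begins
            if ($1 in seen) {               # this QNAME already appeared in an earlier run
                print "not grouped: " $1
                bad = 1
                exit
            }
            seen[$1] = 1
            prev = $1
        }
        END {
            if (!bad) print "grouped"
            exit bad                        # exit status 0 if grouped, 1 otherwise
        }
    '
}
```

For a BAM file, one could pipe `samtools view in.bam` into the function (`in.bam` being a placeholder name). Note that this checks contiguity only; it does not require the file to be sorted by QNAME.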

I hope this answers your questions.

Greg