Question

sam/bam file handling -> new tools?

10.2 years ago

Richard ▴ 600

Hi all,

We are thinking about ways to make our production pipeline run faster. Right now we're settled on an aligner, but everything that gets us from SAM/BAM creation to a sorted, merged, duplicate-marked BAM could be updated.

We'll need tools to help us:

Sort
Merge
Mark duplicates
Flagstats
Make BAM indices (.bai)

On my list of tools to evaluate, I have some combination of the following:

samtools
picard
sambamba
samblaster

Are there any tools that I am missing? Are there any combinations of tools that people find particularly effective?

Our current workflow is:

Align with BWA
Convert to BAM and sort (samtools)
Mark duplicates in each lane's BAM (Picard)
Merge and duplicate-mark all the lanes for a sample (Picard)
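
For reference, the four steps above can be sketched roughly as follows. This is only an illustration that builds the shell commands as strings; it is not the exact production commands. It assumes `bwa mem` as the BWA mode, a samtools version whose `sort` takes `-o` and reads SAM from `-`, and Picard's legacy `I=`/`O=`/`M=` argument style. The function name, the `picard.jar` path and all file names are made up.

```python
import shlex


def legacy_workflow(ref, lanes, sample, picard_jar="picard.jar"):
    """Return the per-lane and per-sample shell commands, in order.

    lanes: list of (lane_id, fastq_1, fastq_2) tuples; all names are placeholders.
    """
    commands = []
    lane_bams = []
    for lane_id, fq1, fq2 in lanes:
        sorted_bam = f"{lane_id}.sorted.bam"
        marked_bam = f"{lane_id}.dedup.bam"
        # Steps 1-2: align with BWA, then convert to BAM and coordinate-sort.
        align = shlex.join(["bwa", "mem", ref, fq1, fq2])
        sort = shlex.join(["samtools", "sort", "-o", sorted_bam, "-"])
        commands.append(f"{align} | {sort}")
        # Step 3: mark duplicates within the lane.
        commands.append(shlex.join([
            "java", "-jar", picard_jar, "MarkDuplicates",
            f"I={sorted_bam}", f"O={marked_bam}", f"M={lane_id}.metrics.txt",
        ]))
        lane_bams.append(marked_bam)
    # Step 4: MarkDuplicates takes several I= inputs and writes one merged BAM.
    merge = ["java", "-jar", picard_jar, "MarkDuplicates"]
    merge += [f"I={bam}" for bam in lane_bams]
    merge += [f"O={sample}.bam", f"M={sample}.metrics.txt"]
    commands.append(shlex.join(merge))
    return commands
```

Calling `legacy_workflow("ref.fa", [("L1", "a_1.fq", "a_2.fq")], "S1")` returns three command strings; each lane adds two more, and each lane also writes two intermediate BAM files to disk.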

Looking forward to your suggestions!

sam markduplicates sort bam • 5.4k views

updated 3.0 years ago by Ram 45k • written 10.2 years ago by Richard ▴ 600

Ram · Answer 1 · 2015-03-18

Interesting question.

Five years ago, we needed multiple libraries and multiple Illumina runs to sequence a human individual to high coverage. At that time, reads were short, base quality was poor, and the indel models used by the SNP callers were primitive. The best practices of the time were appropriate for that data.

Things are very different now. We typically sequence one human sample from one library in one run. Reads are longer. Quality is better calibrated. The indel models of modern callers are much better. Nowadays, I usually recommend:

Use Unix pipes and avoid generating temporary files as much as possible. Brad Chapman and several others do pre-filtering, mapping, SAM-to-BAM conversion, duplicate marking and sorting in one command line.
Use samblaster to mark duplicates; this is much easier and faster when the two ends of a pair are grouped together. Note that for samblaster to work, you need to map one library in one go.
Personally, I have found sambamba faster than samtools for sorting, but I have not done careful comparisons. Sambamba also generates the BAM index while sorting, which could save a wall-clock hour or so.
Skip base recalibration. The GATK implementation is slow and, most of the time, recalibration is no longer necessary. I have also seen recalibration lead to slightly unexpected results.
Skip GATK indel realignment if you use gatk-hc, freebayes or platypus. These modern callers realign reads on the fly while calling SNPs/INDELs. The GATK team has shown that realignment has almost no effect when we use gatk-hc. In addition, for >100bp reads, false variants caused by indels are usually not a big concern. Given 100bp reads, even older callers like samtools and gatk-ug work reasonably well without realignment. Note that I still think a more sophisticated indel realigner might be able to improve indel calling, but no public tools are available for now.
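
To make the piping suggestion concrete, here is a minimal sketch that assembles such a one-line pipeline as a shell string. It is not Brad Chapman's or anyone else's actual command. It assumes `bwa mem` as the mapper, a samtools version where `view -b -` reads SAM from stdin, and that `sambamba sort` can read BAM from `/dev/stdin`. The function name, file names, thread count and read-group values are placeholders.

```python
import shlex


def build_pipeline(ref, fq1, fq2, out_bam, threads=8, sample="sample1"):
    """Return one shell command line: align, mark duplicates, convert, sort.

    Nothing is written to disk between the stages except the sorter's own
    temporary files.
    """
    # bwa expects the literal two characters "\t" in -R and expands them itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tLB:{sample}"
    stages = [
        # bwa mem writes SAM to stdout with the two ends of a pair adjacent.
        ["bwa", "mem", "-t", str(threads), "-R", read_group, ref, fq1, fq2],
        # samblaster marks duplicates on that read-name-grouped SAM stream.
        ["samblaster"],
        # Convert SAM on stdin to BAM on stdout.
        ["samtools", "view", "-b", "-"],
        # Coordinate-sort; per the answer, sambamba also writes the index.
        ["sambamba", "sort", "-t", str(threads), "-o", out_bam, "/dev/stdin"],
    ]
    return " | ".join(shlex.join(stage) for stage in stages)


if __name__ == "__main__":
    print(build_pipeline("ref.fa", "r1.fq.gz", "r2.fq.gz", "sample1.sorted.bam"))
```

The printed line can be pasted into a shell. If you run it under bash, consider `set -o pipefail` so that a failure in an early stage is not masked by the exit status of the last one.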

Following these practices, you can typically get an "analysis-ready" 30X BAM in <24 hours on 8-16 CPU cores.

As to other tools, I think vt is for VCF processing, not BAM processing (right?). elPrep sounds very interesting, but it seems to require more RAM than everyone has. In addition, BGI has a GPU-powered pipeline which is much faster. It requires SOAP for mapping. ADAM provides massively parallelized MarkDuplicates, realignment and variant calling, though it probably does not save CPU time. The SNAP group from UC Berkeley has shown me a pipeline that does SNAP mapping, deduping and sorting in one go in a very short time.

score 1 · Answer 2 · 2015-03-18

10.2 years ago

Pierre Lindenbaum 166k

GATK: realign, recalibrate ... see Best Practices: http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0-retired

Ram · Answer 3 · 2015-03-18

10.1 years ago

Ying W ★ 4.3k

You might also consider elPrep as a replacement for samtools/Picard (it is supposed to be fast, but I have not used it myself).

I would also recommend looking into what speedseq does, since it might be doing something similar to what you are trying to achieve.

Keep in mind that GNU Parallel (see "Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them") and piping (see "Piping With Samtools, Bwa And Bedtools") could also help you increase speed.
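
As a rough illustration of the GNU Parallel idea for per-file steps such as indexing, the sketch below builds a `parallel` invocation that runs one command per BAM. `{}` is GNU Parallel's placeholder for the current input, `:::` introduces the input list, and `-j` sets the number of simultaneous jobs. The helper name and the file names are made up.

```python
import shlex


def parallel_command(template, inputs, jobs=4):
    """Build a GNU Parallel command line that runs `template` once per input.

    template: a shell command containing GNU Parallel's {} placeholder.
    inputs:   the arguments substituted for {} (here, BAM file names).
    """
    return shlex.join(["parallel", "-j", str(jobs), template, ":::", *inputs])


if __name__ == "__main__":
    # Prints: parallel -j 4 'samtools index {}' ::: lane1.bam lane2.bam
    print(parallel_command("samtools index {}", ["lane1.bam", "lane2.bam"]))
```

Keep `jobs` times the threads used by each command within the number of cores you have; several multi-threaded sorts running at once can easily oversubscribe a machine.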

+1 for GNU parallel. Great tool.

I'll check out elPrep and speedseq.

updated 3.0 years ago by Ram 45k • written 10.1 years ago by Richard ▴ 600

score 0 · Answer 4 · 2015-03-18

10.1 years ago

brentp 24k

You should also use vt normalize on your BAM files to left-align and trim.

See this paper for the difference it can make: http://bioinformatics.oxfordjournals.org/content/early/2015/02/19/bioinformatics.btv112.abstract
