Efficient And Fastest Way To Sort Large (>100GB) BAM Files?
8
13
12.7 years ago
Rm 8.3k

What are the fastest possible ways to sort large BAM files (greater than 100GB)? With newer versions of the HiSeq, the data will only increase in size.

samtools sort -m 6000000000

Picard's SortSam

Are there any other tools, or multicore/parallel versions? What are the best options?

How are large sequencing centers dealing with this problem?

sort bam samtools picard • 39k views
15
12.7 years ago
lh3 33k

A utility from Novocraft is the best, but its license expires in 15 days, if I remember correctly. There is a multithreaded sort I implemented as a weekend project (in the "mt" branch), but it is not as efficient, because full parallelization needs a reimplementation and I only have time for small modifications. With samtools-mt, you can run:

samtools sort -@ 8 -m 4G

Note that each thread will use about 4-4.5GB of memory in this setting, so make sure you have enough memory when pushing up -m and -@.
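
To make the memory note concrete, here is a minimal sketch. It assumes a modern samtools 1.x command line (`-o` for the output file) and uses a hypothetical helper, `per_thread_mem_mb`, which is not part of samtools. The helper splits a total memory budget across threads and leaves roughly 15% headroom, since each thread can use somewhat more than its `-m` value. The 40GB budget and the file names are placeholders, and the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper (not a samtools feature): derive a per-thread -m value.
# Usage: per_thread_mem_mb TOTAL_GB THREADS  -> prints megabytes per thread
per_thread_mem_mb() {
    budget_gb=$1
    n_threads=$2
    # Keep ~15% of the budget free, because a thread given -m 4G was
    # reported above to use about 4-4.5GB in practice.
    echo $(( budget_gb * 1024 * 85 / 100 / n_threads ))
}

threads=8
mem_mb=$(per_thread_mem_mb 40 "$threads")

# Dry run: print the command instead of executing it.
# input.bam and sorted.bam are placeholder file names.
echo "samtools sort -@ $threads -m ${mem_mb}M -o sorted.bam input.bam"
```

Once the numbers look sane for your machine, drop the `echo` to run the sort for real.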

Sorting a 100GB BAM takes a little more than 1 day as I remember. Multithreaded samtools saves you about half a day with 4-8 cores. Novocraft is even better.

2

I may be wrong: novosort seems to be free as a binary...

1

Actually, I was correct on the novosort licensing. When the license expires, multi-threading will be switched off, though its single-threaded version is faster than samtools' single-threaded sort.

1

I believe the "mt" branch of samtools has now been merged into the trunk on GitHub. I am posting this comment to warn other readers, because I spent quite some time unsuccessfully trying to find the "mt" branch.

0

I mean the free license expires in 15 days, if I am correct.

0

Thanks @lh3: I need to test novosort on multiple threads, and also need to see whether the sorted BAM is compatible with GATK and other variant-calling tools.

0

Actually, I was correct on the novosort licensing. When the license expires, you will be left with a single-threaded version. I do not know how much novosort costs, but it is really great.

3
10.5 years ago
Richard ▴ 590

I know I'm late to the party, but I just tried this tool: http://lomereiter.github.io/sambamba/

and it sorted my BAM 4 times faster than samtools sort.

2
10.2 years ago
Sparks ▴ 70

Hi,

The latest Novosort can also mark or remove duplicates while sorting. This adds negligible time to the sort, so effectively you get duplicate marking for free. Marking is similar to the Picard process.

Best, Colin

2
10.2 years ago
Charles Plessy ★ 2.9k

One more tool to consider if the other ones do not give you satisfaction: biobambam. Disclaimer: I have not tested it yet.

1
12.7 years ago
Ying W ★ 4.3k

I think I saw someone on the samtools listserv saying the best way would be to split the BAM into uncompressed BAMs, sort them, and then merge them back together. That way the sorting would be done in parallel.

Edit: actually, never mind, see here: https://sourceforge.net/mailarchive/message.php?msg_id=28795254

0

@ying: In that case, it is best for pipelines to split the initial reads and process them in parallel up to the BAM merge. But I have noticed that even the merging step takes many hours, or a day or more...

0

Interesting. Do we need to sort again after merging the small sorted BAMs?
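
On the re-sorting question: merging inputs that are each already sorted gives sorted output, provided the merge uses the same ordering as the sort. As far as I know, `samtools merge` expects sorted inputs and writes a sorted file, so a second sort should not be needed. Below is a toy sketch of the idea with coreutils `sort -m`, using made-up text records in place of BAM chunks (the file names and coordinates are illustrative only).

```shell
#!/bin/sh
# Toy stand-in for sorted BAM chunks: text files of "chrom position" records,
# each already sorted by chromosome name, then numerically by position.
workdir=$(mktemp -d)
printf 'chr1 100\nchr1 900\nchr2 50\n' > "$workdir/chunk1.txt"
printf 'chr1 400\nchr2 10\nchr2 700\n' > "$workdir/chunk2.txt"

# sort -m merges already-sorted inputs without re-sorting them.
merged=$(LC_ALL=C sort -m -k1,1 -k2,2n "$workdir/chunk1.txt" "$workdir/chunk2.txt")
echo "$merged"

# sort -c exits non-zero if its input is out of order,
# so this verifies that the merged stream needs no further sorting.
if echo "$merged" | LC_ALL=C sort -c -k1,1 -k2,2n; then
    echo "merged output is already sorted"
fi

rm -rf "$workdir"
```

The merge only has to compare the head of each input, which is why it is much cheaper than a full sort, though on very large files it is still bound by I/O and compression.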

1
12.4 years ago
Sparks ▴ 70

Hi, the latest Novosort has a trial license until 1st September; after that, a license is $299/yr. The latest version has:

  1. Properly sets @HD
  2. Sort & Merge and Index in one run.
  3. Picard-like handling of @PG & @RG when merging (i.e. a numeric suffix to avoid duplicates)
  4. Ability to add/replace @RG
  5. Name sorting (string compare only)
  6. Coordinate sorting can include strand
  7. Smart handling of @SQ can merge files with different @SQ entries or order.
  8. Can set compression level for temporary and final files.

Cheers, Colin

0
8.2 years ago
Sparks ▴ 70

The latest Novosort can also mark or remove duplicates while sorting, with virtually no performance loss compared to just sorting.

Duplicate detection also allows use of molecular bar code tags in the signature.

0
7.4 years ago
cmdcolin ★ 4.0k

One more option, from DNAnexus, tries to obtain a speedup by using a RocksDB backend (to improve file I/O, from what I can gather): http://devblog.dnanexus.com/faster-bam-sorting-with-samtools-and-rocksdb/ It uses a fork of samtools on GitHub.
