samtools merge extremely slow
9.0 years ago
ehzed ▴ 40

Hello,

For parallel processing, I split the original FASTQ files into smaller ones (1,000,000 reads per file) and performed trimming, alignment, post-processing, etc. Now I want to merge the resulting 487 BAM files for base recalibration and SNP calling. I'm finding that samtools merge is extremely slow, and I've looked for ways to improve efficiency on GitHub (like this issue: https://github.com/samtools/samtools/issues/203). They reported being able to merge 4500 BAM files in less than 3 hours after making a code change, although they didn't specify the size of each BAM file. For me, each BAM file is about 1.8 MB, and the merge is still not complete even after 30 hours. I was wondering if something is wrong, or if there is another tool that's faster? Thanks!
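For reference, the kind of merge described here can be sketched as below. This is only an illustration, not the exact command used in the post: the chunk file names, `bams.txt`, `merged.bam`, and the thread count are all hypothetical. It relies on the `-b` (file listing the input BAMs) and `-@` (additional compression threads) options of `samtools merge`; whether extra threads help depends on the samtools version and on whether the job is limited by CPU or by disk I/O.

```python
import shlex
import tempfile
from pathlib import Path


def build_merge_command(bam_paths, list_file, output_bam, threads=4):
    """Write the input paths to a list file and return a samtools merge argv."""
    # One path per line, the format read by `samtools merge -b`.
    Path(list_file).write_text("".join(f"{p}\n" for p in bam_paths))
    return [
        "samtools", "merge",
        "-@", str(threads),    # additional compression threads
        "-b", str(list_file),  # take input BAM names from this file
        str(output_bam),
    ]


if __name__ == "__main__":
    # Hypothetical chunk names standing in for the 487 per-chunk BAM files.
    chunks = [f"chunk_{i:03d}.bam" for i in range(1, 488)]
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "bams.txt"
        cmd = build_merge_command(chunks, list_file, "merged.bam", threads=8)
        # Only print the command; running it would be
        # subprocess.run(cmd, check=True) on a node with samtools installed.
        print(shlex.join(cmd))
```

Passing the inputs through a list file keeps the command line short, which is convenient with several hundred files.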

next-gen snp samtools genome alignment • 5.2k views
9.0 years ago

Is it making progress? Is the output file growing, and the process using CPU cycles according to top? Make sure it's still running and you didn't run out of space or something.
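A rough way to script the first two checks, assuming you can read the output file from wherever you run this: compare the output size before and after a pause, and look at the free space on the same filesystem. The function names and the `merged.bam` path are made up for illustration. Since output is written in buffered chunks, a short interval may show no change even for a healthy job.

```python
import os
import shutil
import time


def output_is_growing(path, interval_seconds=60, wait=time.sleep):
    """Return True if the file at `path` got larger during the interval."""
    size_before = os.path.getsize(path)
    wait(interval_seconds)  # pause between the two size measurements
    size_after = os.path.getsize(path)
    return size_after > size_before


def free_gib(directory):
    """Free space, in GiB, on the filesystem that holds `directory`."""
    return shutil.disk_usage(directory).free / 1024**3


if __name__ == "__main__":
    # Hypothetical output name; point this at the real merge output.
    target = "merged.bam"
    if os.path.exists(target):
        print("growing:", output_is_growing(target, interval_seconds=60))
        out_dir = os.path.dirname(os.path.abspath(target))
        print("free GiB:", round(free_gib(out_dir), 1))
```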

Also, I think you meant 1.8GB. 1.8MB would be pretty small for 1M reads. But normally I would expect that to finish fairly quickly, in an hour or so.


Hi Brian, it just finished running, after nearly 40 hours, and I did make a mistake: each file is about 180MB. Does the long running time indicate that something is wrong with my files? And do you mind clarifying the point about CPU cycles? I'm not really sure what exactly I should look for... so far I've been running everything on a cluster. Thanks!


If you ssh into the node where your job is running (ask your admin how to do that), you can run the command "top", which lists the processes running on that node. By default they are sorted by CPU usage. If you see a process named "samtools" with your username, and a number greater than zero (hopefully 100 or so) in the %CPU column, that means your job is working correctly. There may be tons of random processes in the list; typically, yours would be at or near the top. (...when it's running. It won't be there now, since it finished.)
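If you'd rather not watch top interactively, the same check can be sketched with `ps`, assuming a Linux node where `ps` accepts `-u` and repeated `-o` options. The helper names are hypothetical. Note that the `pcpu` value from `ps` on Linux is averaged over the lifetime of the process, so it is a coarser signal than the live figure in top.

```python
import getpass
import os
import subprocess


def parse_ps(ps_output, command_name):
    """Return (pid, cpu_percent) for each line whose command matches."""
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # expected columns: pid, %cpu, command
        if len(fields) != 3:
            continue
        pid, cpu, command = fields
        # Compare on the basename in case ps prints a full path.
        if os.path.basename(command.strip()) == command_name:
            matches.append((int(pid), float(cpu)))
    return matches


def my_processes(command_name="samtools"):
    """List the current user's processes with the given command name."""
    result = subprocess.run(
        ["ps", "-u", getpass.getuser(), "-o", "pid=", "-o", "pcpu=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, command_name)


if __name__ == "__main__":
    for pid, cpu in my_processes("samtools"):
        print(f"samtools pid={pid} cpu={cpu}%")
```

An empty result while the job is supposedly running would suggest it is on a different node, or that it has already exited.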

Whether the long time indicated a problem with your input is hard to say. If it finished successfully, then no. Again, ask your admin how to determine whether a job finished successfully (rather than timing out, for example). But it does sound like your cluster either has major bottlenecks or is oversubscribed. It's important to differentiate between time spent in qw (waiting to run) versus r (running). If the job was waiting to run for the first 39 hours, then everything's fine: the cluster was just busy, so you had to wait in line. If it actually ran for 40 hours, then your cluster may have a serious hardware problem.
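The waiting-versus-running split is simple arithmetic once you have the three timestamps for the job (submitted, started, finished), which the scheduler's accounting records normally provide. A small sketch, with made-up timestamps that mirror the "39 hours in the queue" case:

```python
from datetime import datetime


def queue_and_run_hours(submitted, started, finished):
    """Split a job's wall-clock time into hours queued and hours running."""
    queued = (started - submitted).total_seconds() / 3600
    running = (finished - started).total_seconds() / 3600
    return queued, running


if __name__ == "__main__":
    # Hypothetical timestamps; take the real ones from your scheduler.
    submitted = datetime(2015, 1, 1, 0, 0)
    started = datetime(2015, 1, 2, 15, 0)
    finished = datetime(2015, 1, 2, 16, 0)
    queued, running = queue_and_run_hours(submitted, started, finished)
    print(f"queued {queued:.1f} h, running {running:.1f} h")
```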


Thanks for that detailed reply. It did finish successfully (we get error messages and are notified if a job is killed), and the job did indeed run for 40 hours. I will investigate further to see if this is a cluster issue or something else. Again, thanks so much!
