Best Practices / Faster Samtools Mpileup, Any Tips?
12.5 years ago
Kevin ▴ 640

I need to run samtools mpileup on 38 individuals for whole genome sequencing. I intend to parallelize the process by splitting by chromosome. I thought of splitting by regions to get more parallel chunks, but I was told that each mpileup process consumes a fair bit of memory and will segfault if it runs out.

I am looking for tips on how to speed up the mpileup calls; from past experience, it took 2 weeks to run the mpileup calls on 100 individuals for chr1.

I also use separate ref.fa files for male and female subjects. Is it alright to use the male ref.fa for all individuals?

Cheers

samtools mpileup

What kind of hardware do you have access to?


20 Linux servers with ~32 GB RAM, but not a whole lot of disk space, which is the bummer. The shared HPC resource has way more cores but only 500 GB of scratch, which is just enough for one chromosome from 38 individuals, I guess.


Why not code your own pileup module in C/C++ with htslib? You would get more control over what data to pile up, how to pile it up, and what to do with the pileup directly on the fly.

12.5 years ago

Using xargs, you can parallelize your call by chromosome (-P is the number of processes started in parallel; don't use more than the number of cores you have):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"
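If you need joint calling across many individuals (as in the question), a hedged adaptation of the same one-liner: take the chromosome names from any one of the BAMs and pass the full list to mpileup with -b. Here first.bam, bam.list (one BAM path per line) and the -P value are placeholders, not something from the original post:

# chromosome names from one BAM's header; joint pileup/calling over every BAM listed in bam.list
samtools view -H first.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 8 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} -b bam.list | bcftools view -vcg - > tmp.{}.vcf"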

To merge the results afterwards, you might want to do something like this (not the best way):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'

I use xargs quite a lot, but I cannot say anything about the memory usage of mpileup. So you should check the memory usage before starting too many parallel jobs.
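One hedged way to do that check before scaling up: run a single per-chromosome job under GNU time and look at the reported maximum resident set size (the chromosome name and file names below are placeholders):

# -v makes GNU time print "Maximum resident set size", a rough view of the pipeline's peak memory
/usr/bin/time -v sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r 1 yourFile.bam | bcftools view -vcg - > tmp.1.vcf"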


There is a small error

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'

should be

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'
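A hedged variant that also avoids repeating the VCF header from every per-chromosome file (same placeholder file names as above): write the header once from the first chromosome's file, then append only the record lines.

# header from the first per-chromosome VCF only
grep "^#" tmp.$(samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | head -n 1).vcf > yourFile.vcf
# then the records from all chromosomes, in header order
samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} sh -c 'grep -v "^#" tmp.{}.vcf' >> yourFile.vcf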

If there were Biostars Gold I would give it to you; very nice piece of code.

12.3 years ago
Kevin ▴ 640

OK, I just learnt about the new option for calling by region in mpileup.

Calling SNPs/INDELs in small regions (see http://samtools.sourceforge.net/mpileup.shtml):

 vcfutils.pl splitchr -l 500000 ref.fa.fai | xargs -i \
     echo samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list \| bcftools view -bcvg - \> part-{}.bcf
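Note that the splitchr/xargs line above only echoes the commands. One hedged way to actually execute them a few at a time (the -P value is a placeholder for however many cores, and however much memory, you can afford):

# echo each per-region command as one line, then run them 8 at a time
vcfutils.pl splitchr -l 500000 ref.fa.fai | \
    xargs -i echo "samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list | bcftools view -bcvg - > part-{}.bcf" | \
    xargs -P 8 -I CMD sh -c CMD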

http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html

Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bioinformaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.
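For example, a minimal sketch of that random access across several samples (reference, coordinates and BAM names are placeholders):

# pileup at a single 1-based position, one column set per sample
samtools mpileup -f ref.fa -r chr1:1234567-1234567 sample1.bam sample2.bam sample3.bam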


I confirm it is about 1 second per lookup now, but that adds up when I have 1000s of SNPs per bacterial genome, and 100s of genomes... but thanks for the info, it's very useful.
