Best Practices / Faster Samtools Mpileup, Any Tips?
12.4 years ago
Kevin ▴ 640

I need to run samtools mpileup on 38 individuals for whole-genome sequencing. I intend to parallelize the process by splitting it by chromosome. I thought of splitting by smaller regions to get more parallel chunks, but I was told that each mpileup process consumes quite a fair bit of memory and will segfault if it runs out.

I am looking for tips on how to speed up the mpileup calls; I think from past experience the mpileup calls on 100 individuals took 2 weeks for chr1 alone.

I also keep a separate ref.fa for male and female subjects. Is it alright if I use the male ref.fa for all individuals?

Cheers

samtools mpileup

What kind of hardware do you have access to?

20 Linux servers with ~32 GB of RAM, but not a whole lot of disk space, which is the bummer. The shared HPC resource has way more cores but only 500 GB of scratch, which I guess is just enough for one chromosome from 38 individuals.

Why not code your own pileup module in C/C++ with htslib? You would get more control over what data to pile up, how to pile it up, and what to do with the pileup on the fly.

12.4 years ago

Using xargs, you can parallelize your calls by chromosome (-P is the number of parallel processes started; don't use more than the number of cores you have):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"
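
Since the question involves 38 individuals, the same pattern should also work for a multi-sample call by passing a file of BAM paths with -b (a sketch, assuming all BAMs were aligned to the same reference; bam.list and the paths in it are made up):

# build a list of BAM files, one path per line (hypothetical paths)
ls /path/to/bams/*.bam > bam.list

# take the chromosome names from the first BAM's header and run one multi-sample
# mpileup job per chromosome, 12 at a time (-P 12 is an arbitrary choice)
samtools view -H $(head -n 1 bam.list) | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 12 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} -b bam.list | bcftools view -vcg - > tmp.{}.vcf"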

To merge the results afterwards, you might want to do something like this (not the best way):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'
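
Since a plain cat duplicates the VCF header for every chromosome, here is a minimal sketch that keeps only the first header (assuming the tmp.<chrom>.vcf naming from above):

first=1
for chrom in $(samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1); do
    if [ $first -eq 1 ]; then
        cat tmp.$chrom.vcf > yourFile.vcf            # keep the header from the first chromosome
        first=0
    else
        grep -v "^#" tmp.$chrom.vcf >> yourFile.vcf  # append records only
    fi
done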

I use xargs quite a lot, but I cannot say anything about the memory usage of mpileup, so you should check the memory usage before using too many parallel jobs.

There is a small error

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'

should be

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'

If there was Biostars Gold I would give it to you, very nice piece of code here.

12.2 years ago
Kevin ▴ 640

OK, I just learnt about the new calling-by-region option in mpileup.

Calling SNPs/INDELs in small regions (see http://samtools.sourceforge.net/mpileup.shtml):

 vcfutils.pl splitchr -l 500000 ref.fa.fai | xargs -i \
     echo samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list \| bcftools view -bcvg - \> part-{}.bcf
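
Note that the xargs -i ... echo line above only prints one mpileup command per 500 kb window. A sketch of one way to actually run those jobs in parallel, reusing the xargs pattern from the answer above (-P 12 is an arbitrary choice, not from the samtools documentation):

 vcfutils.pl splitchr -l 500000 ref.fa.fai | \
     xargs -i -n 1 -P 12 sh -c "samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list | bcftools view -bcvg - > part-{}.bcf"

The resulting part-*.bcf files still need to be concatenated in genomic order afterwards; if I remember correctly, the bcftools bundled with samtools 0.1.x has a cat subcommand for that.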

http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html

Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.
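
As a quick illustration of that random access (the coordinates and file names here are made up):

# pileup at a single position of interest across all samples in bam.list
samtools mpileup -f ref.fa -r chr1:1234567-1234567 -b bam.list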

I confirm it is about 1 second per lookup now, but that adds up when I have 1000s of SNPs per bacterial genome, and 100s of genomes... but thanks for the info, it's very useful.
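
If the positions are known in advance, it might be faster to do one pass per BAM instead of thousands of single-position lookups, since mpileup can take a list of positions/regions with -l (a sketch; snp_positions.bed is a made-up file of the SNP coordinates):

# one pass over the BAM, restricted to the listed positions
samtools mpileup -f ref.fa -l snp_positions.bed yourGenome.bam > snp_positions.pileup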
