I need to run samtools mpileup on whole-genome sequencing data from 38 individuals. I intend to parallelize the process by splitting by chromosome. I thought of splitting by region to get more parallel chunks, but I was told that each mpileup process consumes a fair amount of memory and will segfault if it runs out.
I am looking for tips on how to speed up the mpileup calls: from past experience, it took 2 weeks to run mpileup on chr1 for 100 individuals.
I also have separate ref.fa files for male and female subjects. Is it all right to use the male ref.fa for all individuals?
Cheers
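For reference, the region-splitting idea in the question could look roughly like the sketch below. It reads contig names and lengths in the style of a `ref.fa.fai` index (as written by `samtools faidx`), cuts each contig into fixed-size 1-based regions, and builds one `samtools mpileup` argument list per region using the `-f`, `-r`, `-b` and `-o` options. The function names, the `bams.txt` list file, the output naming and the chunk size are all assumptions made for illustration. Nothing is executed here, and how many chunks to run at once is left to whatever memory each job turns out to need.

```python
def read_fai(lines):
    """Parse .fai-style lines into (contig_name, length) pairs."""
    contigs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contigs.append((fields[0], int(fields[1])))
    return contigs


def split_regions(contigs, chunk_size):
    """Cut each contig into 1-based, inclusive regions of at most chunk_size bases."""
    regions = []
    for name, length in contigs:
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


def mpileup_command(region, ref_fasta, bam_list):
    """Build (but do not run) a samtools mpileup argument list for one region.

    ref_fasta and bam_list are placeholder paths; the output file name is a
    hypothetical convention derived from the region string.
    """
    out_name = region.replace(":", "_") + ".pileup"
    return ["samtools", "mpileup",
            "-f", ref_fasta,   # reference FASTA
            "-r", region,      # restrict the pileup to this region
            "-b", bam_list,    # text file listing one BAM path per line
            "-o", out_name]    # write the pileup to a per-region file


if __name__ == "__main__":
    # Toy index lines; a real run would read ref.fa.fai instead.
    fai_lines = ["chr1\t2500\t6\t60\t61\n", "chr2\t900\t2600\t60\t61\n"]
    for region in split_regions(read_fai(fai_lines), chunk_size=1000):
        print(" ".join(mpileup_command(region, "ref.fa", "bams.txt")))
```

Each printed line is one independent job, so the list can be handed to whatever scheduler or job runner is available, and finished per-region outputs can be moved off scratch before the next batch starts.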
What kind of hardware do you have access to?
20 Linux servers with ~32 GB RAM, but not a whole lot of HDD space, which is the bummer. The shared HPC resource has way more cores but only 500 GB of scratch, which I guess is just enough for one chromosome from 38 individuals.
Why not code your own pileup module in C/C++ with htslib? That would give you more control over what data to pile up, how to pile it up, and what to do with the pileup directly on the fly.