I'm running GATK on 500 samples to call variants in a few megabases of hg18, and it's going surprisingly slowly. For instance, I have UnifiedGenotyper running on some 1 kb regions at the moment, and many have been running for over 12 hours without completing. This could be because parts of the regions I'm targeting for calling were capture-targeted, and the pileup of Illumina reads aligned to those regions can be very deep. So my next experiment is to try to mitigate the effect of these deeply covered regions by running GATK with a relatively low -dcov value, say around 50. If this could be expected to substantially affect accuracy, I would be grateful to learn about it.
Here are the options I'm running GATK with, in case I'm doing something silly:
-T UnifiedGenotyper -glm BOTH -L $region \
-R .../human_b36_both.chr.fasta -o $outpath \
-I <bamfile> -I <bamfile> ...
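For reference, the planned downsampling experiment would just add -dcov to the same invocation. This is a sketch, not a tested command: the jar path and -Xmx value are placeholders, and -dcov is (as I understand it) the classic-GATK shorthand for --downsample_to_coverage:

```shell
# Sketch of the planned experiment: same command as above, with
# per-sample coverage at each position capped at ~50x via -dcov.
# Jar path and heap size are placeholders, not from the actual run.
java -Xmx8g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper -glm BOTH -L $region \
  -R .../human_b36_both.chr.fasta -o $outpath \
  -dcov 50 \
  -I <bamfile> -I <bamfile> ...
```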
Also, I understand there's a Markov chain underlying UG's calls, and I suspect slow convergence might be the main factor. Is there an option to tell UG to punt on a site after the Markov chain has run for a certain length?
Are you putting all 500 BAMs through UG at the same time?
Yes. Is that too much?
I was going to recommend posting at getsatisfaction.com, but it seems you already have. I thought your command line might be too long, or that you'd maxed out the memory, but that doesn't seem likely having seen your GSA post.
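If command-line length ever does become a problem with 500 -I flags, one possible workaround (assuming classic GATK, which I believe accepts an -I argument ending in .list as a file of BAM paths, one per line) would be:

```shell
# Collapse 500 -I flags into a single argument: classic GATK should
# read an -I argument ending in .list as a file listing one BAM path
# per line. The directory below is a placeholder.
ls /path/to/bams/*.bam > all_bams.list
# then pass just: -I all_bams.list
```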
Cross-ref: http://getsatisfaction.com/gsa/topics/speeding_up_the_unifiedgenotyper_on_493_samples