Here to bother you again :-)
I currently use GATK a lot to analyze sequencing data, but many steps take a really long time.
The GATK wiki page on parallelism recommends using scatter/gather to speed things up. However, I don't fully understand how to do it.
First, how do I merge the results of scatter/gather? For example, this code is from the GATK wiki:
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:1-125,000,000 -o my.1.vcf &
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:125,000,001-249,250,621 -o my.2.vcf &
and the wiki says: "When these two jobs finish, I just merge the two VCFs together and I've got a complete data set in half the time".
But VCF files have headers, so how do I merge them automatically? I have the same problem with BAM files, but I found "MergeSamFiles" in Picard. Is it a solution for merging BAM files? Will it handle different headers in the BAM files?
Second, to specify multiple chromosomes, should I use
-L chr1 chr2 chr3
or
-L chr1
-L chr2
Thanks!
What kind of computational resources do you have access to? If you have a cluster that is compatible with it, you might want to check out GATK Queue (http://www.broadinstitute.org/gatk/guide/topic?name=tutorials), which can handle scatter/gather automatically.
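On the VCF header question, here is a minimal sketch of the "gather" step in Python. It is not what GATK or Queue does internally. It assumes the shards are passed in genomic order, cover non-overlapping intervals, and were called on the same samples against the same reference, so their headers are interchangeable. The function names merge_vcf_shards and merge_vcf_files are made up for illustration. A dedicated tool such as GATK's CombineVariants is probably the safer choice for real data.

```python
from typing import Iterable, List


def merge_vcf_shards(shards: Iterable[List[str]]) -> List[str]:
    """Concatenate VCF shards, keeping only the first shard's header.

    Assumptions (not checked here): shards are given in genomic order,
    cover non-overlapping intervals, and share samples and reference.
    """
    merged: List[str] = []
    for index, lines in enumerate(shards):
        for line in lines:
            # Header lines ("##..." metadata and the "#CHROM" column line)
            # are taken from the first shard only and skipped afterwards.
            if line.startswith("#") and index > 0:
                continue
            merged.append(line)
    return merged


def merge_vcf_files(paths: List[str], out_path: str) -> None:
    """File-based wrapper: read each shard, write one merged VCF."""
    shards = []
    for path in paths:
        with open(path) as handle:
            shards.append(handle.read().splitlines())
    with open(out_path, "w") as out:
        for line in merge_vcf_shards(shards):
            out.write(line + "\n")
```

With the two jobs from the question, the call would look like merge_vcf_files(["my.1.vcf", "my.2.vcf"], "my.merged.vcf"), where my.merged.vcf is just an example output name. Because the records are simply appended, the order of the paths has to match the order of the intervals.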