To speed up variant calling (SNP/indel) on a set of large BAM files, I want to run the variant calling in parallel on a Sun Grid Engine cluster.
I have already managed to split the BAMs by reference sequence (chromosome) using bamtools (https://github.com/pezmaster31/bamtools/). Afterwards I use VCFtools to concatenate the per-chromosome VCF files back into one VCF file per sample.
These per-chromosome BAM files are still relatively big, so I would like to split the BAMs further at large poly-N stretches (runs of Ns) in the reference. I already have a file with the locations of these N runs in our reference.
Is there a tool that already has the functionality to split BAM files at poly-N stretches?
Or, failing that, what would be the easiest way to use this region information to split the BAM files myself?
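One way to use the region information: instead of splitting at the N runs directly, compute the complement of the N-run intervals, which gives you the callable regions to parallelise over. A minimal sketch, assuming the N-run file can be parsed into 0-based (chrom, start, end) tuples and that you have the chromosome lengths available (both are assumptions about your inputs):

```python
# Sketch: turn poly-N gap locations into the complementary "callable"
# regions, which are the natural chunks for parallel variant calling.
# Assumes 0-based half-open (chrom, start, end) gaps and a dict of
# chromosome lengths -- adapt to your actual reference metadata.
from collections import defaultdict

def callable_regions(gaps, chrom_lengths):
    """gaps: list of (chrom, start, end) tuples for poly-N runs.
    Returns a list of (chrom, start, end) regions between the gaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in gaps:
        by_chrom[chrom].append((start, end))
    regions = []
    for chrom, length in chrom_lengths.items():
        pos = 0
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start > pos:                      # region before this gap
                regions.append((chrom, pos, start))
            pos = max(pos, end)
        if pos < length:                         # region after the last gap
            regions.append((chrom, pos, length))
    return regions

if __name__ == "__main__":
    gaps = [("chr1", 100, 200), ("chr1", 500, 600)]
    print(callable_regions(gaps, {"chr1": 1000}))
    # -> [('chr1', 0, 100), ('chr1', 200, 500), ('chr1', 600, 1000)]
```

Writing these regions out as a BED file then gives you something you can feed to samtools per region.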
Is I/O your limiting factor here? If not, just place a single BAM on a network-accessible disk and use samtools to extract the reads you need: samtools view file.bam 2:1234-5678 | variantcaller
I can't pipe the samtools output; I need to pass a BAM file path to a 3rd-party variant caller.
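If the caller needs a real BAM on disk, one option is to materialise each region into its own indexed BAM with samtools view -b plus samtools index, and hand the caller that path. A sketch that only builds the command lines (the output naming scheme is an assumption, adjust to taste):

```python
# Sketch: build the samtools command lines that extract one indexed BAM
# per region, so a variant caller that requires a file path (not a pipe)
# can be pointed at each piece. Output file naming is an assumption.

def region_bam_commands(bam_path, chrom, start, end, out_dir="."):
    # samtools region strings are 1-based inclusive; inputs here are 0-based
    region = "{}:{}-{}".format(chrom, start + 1, end)
    out_bam = "{}/{}_{}_{}.bam".format(out_dir, chrom, start + 1, end)
    return [
        "samtools view -b {} {} > {}".format(bam_path, region, out_bam),
        "samtools index {}".format(out_bam),
    ]

if __name__ == "__main__":
    for cmd in region_bam_commands("sample.chr2.bam", "chr2", 1233, 5678):
        print(cmd)
```

Note that extracting by region requires the input BAM to be coordinate-sorted and indexed.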
Not a direct answer, but GATK 2 now has support for reduced BAMs, which might shrink your BAM sizes for multi-sample calling.
But of course, if your issue is getting more chunks to parallelise, then samtools view -L regions.bed seems like a good idea.
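On SGE, the per-region jobs fit naturally into a single array job (-t), with each task picking its region from the BED file via SGE_TASK_ID. A sketch that generates such a submission script; "variantcaller" and all file names are placeholders for your actual tool and paths:

```python
# Sketch: one SGE array job, one task per region. Each task reads line
# $SGE_TASK_ID of a BED-like regions file (chrom start end), extracts
# that region into its own BAM with samtools, and runs the variant
# caller on the resulting file. "variantcaller" is a placeholder.

SCRIPT = """#!/bin/bash
#$ -N split_call
#$ -t 1-{n_regions}
#$ -cwd

read CHROM START END < <(sed -n "${{SGE_TASK_ID}}p" {regions_bed})
REGION="${{CHROM}}:$((START + 1))-${{END}}"
OUT="${{CHROM}}_${{START}}_${{END}}.bam"

samtools view -b {bam} "$REGION" > "$OUT"
samtools index "$OUT"
variantcaller "$OUT" > "${{OUT%.bam}}.vcf"
"""

def array_job_script(bam, regions_bed, n_regions):
    # Fill in the template; submit the result with: qsub script.sh
    return SCRIPT.format(bam=bam, regions_bed=regions_bed,
                         n_regions=n_regions)

if __name__ == "__main__":
    print(array_job_script("sample.bam", "regions.bed", 48))
```

The per-region VCFs can then be concatenated per sample the same way you already merge the per-chromosome ones.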