Hello,
I'm new to the community, so hello everybody! :-) I tried to find an answer to my question but didn't manage to find anything, so here it comes.
I'm trying to run GATK HaplotypeCaller on a single human sample. To speed things up, I want to split the analysis into several independent chunks. For this, I'm using GATK SplitIntervals to generate an arbitrary number of chunks. If I run HaplotypeCaller with these intervals, I get most of the results (compared against a single HaplotypeCaller run without intervals), but a few calls are missing or slightly different (the ones that cross interval boundaries). The "solution" is to add some interval padding. With that padding the results are identical, but then I cannot merge the files with just GatherVcfs: I need to decompress the files and take care of the duplicates (which fall, approximately, in the padded regions).
What is the usual way to perform this operation so that I can still use the quick GatherVcfs and avoid including the duplicate lines?
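To make the duplicate handling concrete, here is a minimal Python sketch of the manual filtering I would like to avoid. It assumes each chunk was called on a padded interval and keeps only the records whose POS falls inside the original, unpadded interval, so every record belongs to exactly one chunk. The function name `keep_core_records` and the toy coordinates are made up for illustration, and I don't know whether this is how it is normally done.

```python
def keep_core_records(vcf_lines, chrom, core_start, core_end):
    """Yield header lines plus the records whose POS lies inside the
    unpadded interval [core_start, core_end] (1-based, inclusive).

    Hypothetical helper: assumes plain-text, tab-separated VCF lines.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A record is owned by the chunk whose core interval contains its POS.
        if fields[0] == chrom and core_start <= int(fields[1]) <= core_end:
            yield line


# Toy chunk called on chr1:900-2100 (core 1000-2000 plus 100 bp of padding).
chunk = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t950\t.\tA\tG\t50\tPASS\t.",   # in the left padding, dropped
    "chr1\t1500\t.\tC\tT\t60\tPASS\t.",  # in the core interval, kept
    "chr1\t2050\t.\tG\tA\t40\tPASS\t.",  # in the right padding, dropped
]
kept = list(keep_core_records(chunk, "chr1", 1000, 2000))
```

After this kind of trimming the chunks no longer overlap, but it requires decompressing and parsing every file, which is exactly the cost I was hoping to skip.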
A secondary issue with splitting into many chunks is that most of the time I get a block at the beginning and at the end of each file telling me there are no variants, and these blocks are not joined when merging, so the final file is bigger than needed. This is related to the initial issue: when using interval padding, I did not find a way to ask for a strict range when a block crosses the boundary. Either I get the requested positions and a little bit more (the whole block) using the -r/-R options of bcftools, or I get only the inner blocks (so fewer bases than requested) using the -t/-T options.
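To show what I mean by a "strict range", here is a small Python sketch of the clipping I am after. It assumes the no-variant blocks are GVCF-style reference blocks that store their last position in an `END=` INFO field. The helpers `block_end` and `clip_block` are hypothetical names of mine, not part of GATK or bcftools. The sketch only computes coordinates: actually moving a block's start would also mean rewriting its REF base from the reference genome, which is not shown.

```python
import re


def block_end(pos, info):
    """Return the last position covered by a record: the END= value from
    the INFO column if present, otherwise POS itself."""
    match = re.search(r"(?:^|;)END=(\d+)", info)
    return int(match.group(1)) if match else pos


def clip_block(pos, end, range_start, range_end):
    """Clip the block [pos, end] to [range_start, range_end] (1-based,
    inclusive). Returns (new_start, new_end), or None if they do not overlap."""
    new_start = max(pos, range_start)
    new_end = min(end, range_end)
    if new_start > new_end:
        return None
    return new_start, new_end


# A block starting in the padding (950) and ending inside the range 1000-2000.
end = block_end(950, "END=1100")
clipped = clip_block(950, end, 1000, 2000)
```

With this behaviour, a block crossing the boundary would be cut exactly at the requested position instead of being kept whole or dropped.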
Any help with these issues would be very welcome :-) I'm just entering the world of VCF files, and almost all the tools are new to me.
Thanks a lot!