What is the best way to find intervals for parallelization of joint variant calling?
In general I can think of 3 ways:
- Split on poly-N regions in the reference genome. e.g. via picard ScatterIntervalsByNs
- Split on regions without mapped reads
- Split every 1Mbp, book-ended or overlapping
With improvements in sequencing quality it is becoming more difficult to find areas that are obviously safe to split the analysis on.
- Less and smaller poly-N regions
- More regions of the genome can have reads mapped (if using e.g. (long) reads of variable length)
This leads to the risk of not much parallelization (i.e. max per chromosome) or splitting right in the middle of a variant.
What is the best way to parallelize joint variant calling?
(e.g. GATK GenomicDBImport and GenotypeGVCFs on many GVCFs)?
Without losing variants or getting double variants?
I am currently stuck because for some reference genomes it is difficult to find the following in a systematic computational way
as you said, you could use ScatterIntervalsByNs, use the regions with sharing 0 coverage or sharing a too high coverage. For human, there are some blacklisted regions defined by ENCODE.
I prefer to use the fasta for this. BAM coverage analysis is not fully deterministic, it changes with the set of BAM files under analysis and is computational expensive if looking at many samples/bam files. Some non human model organism now have really high quality reference genomes without many (long) poly-N regions.