Question

Best way to find intervals for parallelization of joint variant calling

2

Entering edit mode

4.4 years ago

William ★ 5.3k

What is the best way to find intervals for parallelization of joint variant calling?

In general I can think of 3 ways:

Split on poly-N regions in the reference genome. e.g. via picard ScatterIntervalsByNs
Split on regions without mapped reads
Split every 1Mbp, book-ended or overlapping

With improvements in sequencing quality it is becoming more difficult to find areas that are obviously safe to split the analysis on.

Less and smaller poly-N regions
More regions of the genome can have reads mapped (if using e.g. (long) reads of variable length)

This leads to the risk of not much parallelization (i.e. max per chromosome) or splitting right in the middle of a variant.

What is the best way to parallelize joint variant calling?

(e.g. GATK GenomicDBImport and GenotypeGVCFs on many GVCFs)?

Without losing variants or getting double variants?

variants parallel • 1.9k views

ADD COMMENT • link updated 4.4 years ago by Pierre Lindenbaum 166k • written 4.4 years ago by William ★ 5.3k

score 0 · Answer 1 · 2020-11-26

0

Entering edit mode

4.4 years ago

Pierre Lindenbaum 166k

I wrote a tool named: FaidxSplitter http://lindenb.github.io/jvarkit/FaidxSplitter.html , I'm not sure if it fulfill your needs.

it takes as input:

the indexed reference genome
a BED of gaps where I don't want any calling
a BED of genes/regions where I don't want any 'cut'.
w the size of each region
x the overlap of between to region

the output is a BED file with all the regions to be called.

I then call the variants for each bed and I concatenate the vcf with bcftools, removing the duplicates.

ADD COMMENT • link 4.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I am currently stuck because for some reference genomes it is difficult to find the following in a systematic computational way

a BED of gaps where I don't want any calling

ADD REPLY • link 4.4 years ago by William ★ 5.3k

0

Entering edit mode

as you said, you could use ScatterIntervalsByNs, use the regions with sharing 0 coverage or sharing a too high coverage. For human, there are some blacklisted regions defined by ENCODE.

ADD REPLY • link 4.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I prefer to use the fasta for this. BAM coverage analysis is not fully deterministic, it changes with the set of BAM files under analysis and is computational expensive if looking at many samples/bam files. Some non human model organism now have really high quality reference genomes without many (long) poly-N regions.

ADD REPLY • link 4.4 years ago by William ★ 5.3k