Best way to find intervals for parallelization of joint variant calling
1
2
Entering edit mode
4.0 years ago
William ★ 5.3k

What is the best way to find intervals for parallelization of joint variant calling?

In general I can think of 3 ways:

  1. Split on poly-N regions in the reference genome. e.g. via picard ScatterIntervalsByNs
  2. Split on regions without mapped reads
  3. Split every 1Mbp, book-ended or overlapping

With improvements in sequencing quality it is becoming more difficult to find areas that are obviously safe to split the analysis on.

  • Less and smaller poly-N regions
  • More regions of the genome can have reads mapped (if using e.g. (long) reads of variable length)

This leads to the risk of not much parallelization (i.e. max per chromosome) or splitting right in the middle of a variant.

What is the best way to parallelize joint variant calling?

(e.g. GATK GenomicDBImport and GenotypeGVCFs on many GVCFs)?

Without losing variants or getting double variants?

variants parallel • 1.7k views
ADD COMMENT
0
Entering edit mode
4.0 years ago

I wrote a tool named: FaidxSplitter http://lindenb.github.io/jvarkit/FaidxSplitter.html , I'm not sure if it fulfill your needs.

it takes as input:

  • the indexed reference genome
  • a BED of gaps where I don't want any calling
  • a BED of genes/regions where I don't want any 'cut'.
  • w the size of each region
  • x the overlap of between to region

the output is a BED file with all the regions to be called.

I then call the variants for each bed and I concatenate the vcf with bcftools, removing the duplicates.

ADD COMMENT
0
Entering edit mode

I am currently stuck because for some reference genomes it is difficult to find the following in a systematic computational way

a BED of gaps where I don't want any calling

ADD REPLY
0
Entering edit mode

as you said, you could use ScatterIntervalsByNs, use the regions with sharing 0 coverage or sharing a too high coverage. For human, there are some blacklisted regions defined by ENCODE.

ADD REPLY
0
Entering edit mode

I prefer to use the fasta for this. BAM coverage analysis is not fully deterministic, it changes with the set of BAM files under analysis and is computational expensive if looking at many samples/bam files. Some non human model organism now have really high quality reference genomes without many (long) poly-N regions.

ADD REPLY

Login before adding your answer.

Traffic: 1932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6