Question

A question on GenomicsDBImport (GATK)

5.2 years ago

Laven9 • 0

I am trying to use GenomicsDBImport (GATK). I have a .bed file for my WES data. Should I split the .bed file into smaller .bed files to make it faster (it is much faster if only <=100 intervals are given), or should I run each chromosome separately to produce complete GVCFs?

GenomicsDBImport • 3.5k views

I still have questions:
1) Is it better to use the .bed file provided by the company rather than "wgs_calling_regions.hg38.interval_list"?
2) Once I split the .bed file into smaller ones, this creates multiple "gendb://GDB" workspaces. How can I merge them to run "CreateSomaticPanelOfNormals"? Would simply adding more -V gendb://GDB arguments work?

updated 5.2 years ago by GenoMax 147k • written 5.2 years ago by Laven9 • 0

Thank you, Pierre and Nicolas! It does help me a lot!

5.2 years ago by Laven9 • 0

score 0 · Answer 1 · 2019-09-19

5.2 years ago

Pierre Lindenbaum 164k

Split: it will be faster and parallelizable.

For hg38, the Broad provides a list of intervals, "wgs_calling_regions.hg38.interval_list": https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0

score 0 · Answer 2 · 2019-09-19

As you have exome sequencing, my strategy would be to:

1. Split your target interval file (the regions targeted by your exome kit) using unix split.
2. For each piece, run an instance of GenomicsDBImport (followed by GenotypeGVCFs) in parallel. If you have a cluster running Slurm, you could easily use a job array for this. Otherwise, xargs -P nthreads should also work (replace nthreads with the number of jobs to run at once, based on the number of split files and CPUs you have). Ideally, the number of split files should match the number of available CPUs.
3. Merge all resulting VCFs using GATK GatherVcfs.
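
The three steps above could be sketched roughly as follows. This is only a dry-run illustration in Python, not a tested pipeline: it splits a BED file into consecutive chunks and builds the GATK command lines without executing them. The helper names (split_bed, chunk_commands, gather_command), the chunk file naming, and the per-chunk workspace layout are all hypothetical choices made for this example. It also assumes the BED file is already sorted in reference order, since GatherVcfs expects its inputs in genomic order.

```python
from pathlib import Path


def split_bed(bed_path, n_chunks, out_dir):
    """Split a BED file into at most n_chunks files of consecutive intervals."""
    lines = [
        ln for ln in Path(bed_path).read_text().splitlines()
        if ln.strip() and not ln.startswith(("#", "track", "browser"))
    ]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Ceiling division, so that no more than n_chunks files are produced.
    size = max(1, -(-len(lines) // n_chunks))
    chunks = []
    for start in range(0, len(lines), size):
        chunk = out_dir / f"chunk_{start // size:04d}.bed"
        chunk.write_text("\n".join(lines[start:start + size]) + "\n")
        chunks.append(chunk)
    return chunks


def chunk_commands(chunk, gvcfs, reference):
    """Build (import_cmd, genotype_cmd, output_vcf) for one BED chunk."""
    # Hypothetical layout: one workspace and one output VCF next to each chunk.
    workspace = chunk.parent / f"{chunk.stem}.gdb"
    vcf = chunk.parent / f"{chunk.stem}.vcf.gz"
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", str(workspace),
                  "-L", str(chunk)]
    for gvcf in gvcfs:
        import_cmd += ["-V", str(gvcf)]
    genotype_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", str(reference),
                    "-V", f"gendb://{workspace}",
                    "-O", str(vcf)]
    return import_cmd, genotype_cmd, vcf


def gather_command(vcfs, output):
    """Build the GatherVcfs command; vcfs must already be in genomic order."""
    cmd = ["gatk", "GatherVcfs"]
    for vcf in vcfs:
        cmd += ["-I", str(vcf)]
    return cmd + ["-O", str(output)]
```

Each command list could then be passed to subprocess.run, for instance from a concurrent.futures.ThreadPoolExecutor with one worker per CPU, or written out one per line for a Slurm job array or xargs -P. Within a chunk, the GenotypeGVCFs command has to run after the GenomicsDBImport command for the same chunk has finished.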