A question on GenomicsDBImport (GATK)
5.2 years ago
Laven9 • 0

I am trying to use GenomicsDBImport (GATK), and I have a .bed file for my WES data. Should I split the .bed file into smaller .bed files to speed things up (it is much faster when no more than 100 intervals are given), or should I run one chromosome at a time to produce the complete GVCFs?

GenomicsDBImport • 3.5k views

I still have two questions:
1) Is it better to use the .bed file the company provided rather than "wgs_calling_regions.hg38.interval_list"?
2) Once I split the .bed file into smaller ones, I will end up with multiple "gendb://GDB" workspaces. How can I merge them to run "CreateSomaticPanelOfNormals"? Would simply adding more -V gendb://GDB arguments work?


Thank you, Pierre and Nicolas! It does help me a lot!

5.2 years ago

Split it: it will be faster and parallelizable.

For hg38, the Broad provides a list of intervals, "wgs_calling_regions.hg38.interval_list": https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0

5.2 years ago

As you have exome sequencing, my strategy would be to:

  1. Split your target interval file (the regions targeted by your exome kit) using Unix split.
  2. For each piece, run an instance of GenomicsDBImport (followed by GenotypeGVCFs) in parallel. If you have a cluster running Slurm, you could easily use a job array for this. Otherwise, xargs -P nthreads should also work (replace nthreads with the number of split files or CPUs you have). Ideally, the number of split files should match the number of available CPUs.
  3. Merge all the resulting VCFs using GATK GatherVcfs.
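
To make the three steps concrete, here is a rough Python sketch of the same idea; treat it as an illustration under stated assumptions, not a tested pipeline. It assumes the BED file is already sorted in reference order (GatherVcfs expects its inputs in order and non-overlapping), and every file name, directory and helper function below is hypothetical. The GATK options used (`--genomicsdb-workspace-path`, `-L`, `-V`, `-R`, `-O`) are the standard ones, but check them against your GATK version.

```python
from pathlib import Path


def read_bed(path):
    """Return the interval lines of a BED file, skipping blanks and headers."""
    lines = []
    for line in Path(path).read_text().splitlines():
        if line.strip() and not line.startswith(("#", "track", "browser")):
            lines.append(line)
    return lines


def split_intervals(intervals, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces.

    Contiguous pieces keep the original order, so the per-piece VCFs
    can later be gathered in the same order.
    """
    n_chunks = max(1, min(n_chunks, len(intervals)))
    size, extra = divmod(len(intervals), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(intervals[start:end])
        start = end
    return chunks


def write_chunks(chunks, out_dir, prefix="chunk"):
    """Write each piece to its own BED file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"{prefix}_{i:04d}.bed"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


def gatk_commands(bed, gvcfs, reference, work_dir):
    """Build (not run) the import and genotyping commands for one piece."""
    stem = Path(bed).stem
    workspace = (Path(work_dir) / f"{stem}_gdb").as_posix()
    vcf = (Path(work_dir) / f"{stem}.vcf.gz").as_posix()
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace,
                  "-L", str(bed)]
    for gvcf in gvcfs:
        import_cmd += ["-V", str(gvcf)]
    genotype_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", str(reference),
                    "-V", f"gendb://{workspace}",
                    "-O", vcf]
    return import_cmd, genotype_cmd


def gather_command(vcfs, output):
    """Build the GatherVcfs command; vcfs must be in reference order."""
    cmd = ["gatk", "GatherVcfs"]
    for vcf in vcfs:
        cmd += ["-I", str(vcf)]
    return cmd + ["-O", str(output)]
```

The functions only build argument lists, so you can print them into a Slurm job-array script, or pass each one to `subprocess.run(cmd, check=True)` from a `concurrent.futures.ThreadPoolExecutor` with as many workers as you have CPUs. For each piece, the GenotypeGVCFs command has to run after its GenomicsDBImport command has finished.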