Question

GATK genomicsDBimport intervals for WGS

1

5.7 years ago

Nicolas Rosewick 11k

We have a bunch of WGS samples and would like to import them with GenomicsDBImport before joint genotyping. For this project we are interested in coding sequences. Which is better:

To use -L with the GENCODE coding-sequence annotation and set --merge-input-intervals to true?
To split the analysis and run one instance of GenomicsDBImport per chromosome (e.g. -L chr1)? My idea would be to use a job array on my local Slurm cluster (one job per chromosome). But what about the merging? Should I then use the same --genomicsdb-workspace-path for all jobs?
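
For the first option, the -L file has to come from somewhere. A minimal sketch, assuming a GENCODE GTF as input (the function name `gtf_cds_to_bed` and the file names in the usage note are hypothetical): GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is shifted by one.

```shell
#!/bin/sh
# Sketch only: turn the CDS records of a GTF into a sorted, de-duplicated BED.
# Assumes a standard tab-separated GTF (column 3 = feature, 4 = start, 5 = end).

gtf_cds_to_bed() {
    # $1 = path to the GTF file
    awk -F '\t' '$0 !~ /^#/ && $3 == "CDS" { print $1 "\t" ($4 - 1) "\t" $5 }' "$1" |
        sort -k1,1 -k2,2n -k3,3n -u
}
```

Usage would look like `gtf_cds_to_bed annotation.gtf > cds.bed`, with `cds.bed` then passed to GenomicsDBImport through -L together with --merge-input-intervals. Whether that beats a per-chromosome import is exactly what this question asks.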

GATK version: 4.1

Thanks

gatk genomicsdbimport • 7.1k views

ADD COMMENT • link updated 6 months ago by Sd • 0 • written 5.7 years ago by Nicolas Rosewick 11k

2

I asked a similar question in the GATK help/discussion community. From the answers I gathered, discontinuous intervals are not recommended. In fact, they suggested that the smallest interval should be one whole chromosome. This avoids problems at the edges of intervals, because GATK does local assembly around each variant site. As for merging, I would merge the results at the final joint-called VCF level.
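
Merging at the VCF level could be sketched as follows, assuming one GenomicsDB workspace per chromosome and one joint-called VCF per workspace. The helper names `genotype_cmd` and `gather_cmd` and all paths are hypothetical. The functions only print the GATK commands so they can be inspected first. GatherVcfs expects its inputs in reference order, so the contigs should be listed in that order.

```shell
#!/bin/sh
# Sketch only: print the joint-genotyping and gathering commands, do not run them.

# One GenotypeGVCFs call per per-chromosome workspace.
genotype_cmd() {
    # $1 = reference FASTA, $2 = workspace parent dir, $3 = contig, $4 = VCF output dir
    printf 'gatk GenotypeGVCFs -R %s -V gendb://%s/%s -O %s/%s.vcf.gz\n' \
        "$1" "$2" "$3" "$4" "$3"
}

# One GatherVcfs call concatenating the per-contig VCFs.
gather_cmd() {
    # $1 = VCF dir, $2 = merged output, remaining args = contigs in reference order
    dir=$1
    out=$2
    shift 2
    cmd='gatk GatherVcfs'
    for c in "$@"; do
        cmd="$cmd -I $dir/$c.vcf.gz"
    done
    printf '%s -O %s\n' "$cmd" "$out"
}
```

For example, `gather_cmd vcf all.vcf.gz chr1 chr2` prints a single command with two -I arguments. Piping the printed line to `sh` would execute it.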

ADD REPLY • link 5.6 years ago by Vitis ★ 2.6k

0

Hello, can you give more details about WGS intervals? Do I need to run the GenomicsDBImport command separately for each chromosome? If yes, do I need to use a different workspace (--genomicsdb-workspace-path) for each?

ADD REPLY • link 5.5 years ago by MatthewP ★ 1.4k

0

These steps are run per chromosome largely because there are not enough computational resources to run the entire genome in one shot. If you do run them separately, I think you need to run separate commands and use a different workspace path for each.
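
A rough sketch of that layout for a Slurm job array, with one workspace per chromosome. The `.fai` index, the environment variable names `FAI`, `SAMPLE_MAP` and `GDB_DIR`, and the helper names are all assumptions for illustration. Each array task looks up its contig name in the reference index and prints the matching GenomicsDBImport command.

```shell
#!/bin/sh
# Sketch only: map a Slurm array task id to a contig and print the import command.
# Assumed placeholders: FAI (reference .fai index), SAMPLE_MAP, GDB_DIR.

# Print the Nth contig name (1-based) from a .fai index.
contig_for_task() {
    # $1 = path to .fai, $2 = task id
    awk -v n="$2" 'NR == n { print $1 }' "$1"
}

# Print the GenomicsDBImport command for one contig, each with its own workspace.
import_cmd() {
    # $1 = contig name, $2 = sample map, $3 = parent directory for workspaces
    printf 'gatk GenomicsDBImport --genomicsdb-workspace-path %s/%s --sample-name-map %s -L %s\n' \
        "$3" "$1" "$2" "$1"
}

# Only acts inside an array job, e.g. one submitted with: sbatch --array=1-24 job.sh
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f "${FAI:-}" ]; then
    contig=$(contig_for_task "$FAI" "$SLURM_ARRAY_TASK_ID")
    import_cmd "$contig" "${SAMPLE_MAP:-sample_name_map.txt}" "${GDB_DIR:-gdb}"
fi
```

The array range has to match the number of contigs you actually want from the index. The command is printed rather than executed so it can be checked before piping it to `sh`.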

ADD REPLY • link 5.5 years ago by Vitis ★ 2.6k

0

Will this also be the case for exome data? Ideally I'd like to run all chromosomes at once too.

A second question: what would the syntax be for the X and Y chromosomes? Is it chrX, chrY or X, Y?

ADD REPLY • link 4.7 years ago by Maverick77 • 0

0

If you do have to do them all separately, can they all be gathered up and easily studied together when joint-called using GenotypeGVCFs?

ADD REPLY • link 4.7 years ago by Maverick77 • 0

0

It depends on the reference genome you used. The names should match the chromosome names in that reference genome.
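
One way to check this is to look the name up in the reference index. A small sketch, assuming the reference has a `.fai` index (as written by `samtools faidx`), whose first column holds the contig names. The function name `has_contig` and the file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed if a contig name appears in column 1 of a .fai index.

has_contig() {
    # $1 = path to .fai, $2 = contig name to look for (exact string match)
    awk -v c="$2" '$1 "" == c "" { found = 1 } END { exit !found }' "$1"
}
```

For instance, `has_contig ref.fa.fai chrX && echo "use chrX"` only prints when the reference really names the contig chrX. With a reference that uses plain X, the same call fails.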

ADD REPLY • link 3.4 years ago by samuelandjw ▴ 260

0

Did you ever get a final answer to this?

ADD REPLY • link 3.4 years ago by cocchi.e89 ▴ 290

0

I also have a similar question. I sliced the genomic BED file into 50 kb windows with 1 kb padding, giving ~700 BED files of 90 windows each. I want to run GenomicsDBImport separately for each of these interval BED files, creating a database for ~1500 WGS GVCF files in my database directory with the --genomicsdb-workspace-path option. For example, for chr1-0_chr1-4411000.bed I create a database with --genomicsdb-workspace-path /GenomicsDBImport/my_databases/chr1-0_chr1-4411000, which creates the chr1-0_chr1-4411000 directory. At the end, I will have ~700 directories. Then I will run GenotypeGVCFs on each of these ~700 databases separately and merge all the outputs afterwards. Do you think it is possible to do it this way? Or do you have any suggestions?

Note: I ran a single interval covering the whole of chr22, for example, and it took a very long time to finish. I created smaller BED files to run in parallel and decrease computation time.

This is the head of one of my interval files, chr1-0_chr1-4411000.bed (90 lines in total):

chr1    0       50000
chr1    49000   99000
chr1    98000   148000
chr1    147000  197000
chr1    196000  246000
chr1    245000  295000
chr1    294000  344000
chr1    343000  393000
chr1    392000  442000

GenomicsDBImport commands:

/gatk-4.4.0.0/./gatk --java-options "-Xms20G -Xmx20G -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport \
--genomicsdb-workspace-path /GenomicsDBImport/my_databases/chr1-0_chr1-4411000 \
--intervals /GenomicsDBImport/intervals/chr1-0_chr1-4411000.bed \
--tmp-dir /GenomicsDBImport/GDB_TMP/ \
--sample-name-map /GenomicsDBImport/sample_name_map.txt \
--batch-size 90 \
--reader-threads 4
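
The one-workspace-per-BED-file naming described in this post could be scripted roughly as below. This is a sketch under assumptions: the helper names are hypothetical, only a subset of the options from the command above is shown, and the commands are printed rather than run so they can be handed to a scheduler.

```shell
#!/bin/sh
# Sketch only: derive one workspace path per BED file and print the import commands.

# Workspace path = parent directory + BED file name without the .bed suffix.
workspace_for_bed() {
    # $1 = BED file path, $2 = parent directory for workspaces
    printf '%s/%s\n' "$2" "$(basename "$1" .bed)"
}

# Print one GenomicsDBImport command per BED file found in a directory.
print_import_cmds() {
    # $1 = directory of BED files, $2 = workspace parent, $3 = sample name map
    for bed in "$1"/*.bed; do
        [ -e "$bed" ] || continue   # skip the unexpanded pattern when no BED files exist
        printf 'gatk GenomicsDBImport --genomicsdb-workspace-path %s --intervals %s --sample-name-map %s\n' \
            "$(workspace_for_bed "$bed" "$2")" "$bed" "$3"
    done
}
```

Called as `print_import_cmds /GenomicsDBImport/intervals /GenomicsDBImport/my_databases /GenomicsDBImport/sample_name_map.txt`, this would print one line per interval file, each pointing at its own workspace directory.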

ADD REPLY • link 6 months ago by Sd • 0