Question

GATK GenomicsDBImport update database error using snakemake

4 months ago

Peter Chung ▴ 210

I am new to snakemake and have managed to use it for the GATK GenomicsDBImport step, combining 500 genotype VCF files. I now have 200 more genotype VCF files to add, so I tried the GenomicsDBImport --genomicsdb-update-workspace-path argument, but I get an error and the run also deletes my existing database. I could import all 700 genotype VCF files in one go, but I would like to know how to add gVCF files to the database incrementally. My script is below.

This is how I tried to update the database:

  # Snakefile
  import os

  # Define the path to the GATK binary and Java options
  GATK_PATH = "/opt/conda/envs/gvcf/bin/gatk"
  JAVA_OPTIONS = "-Xmx16g -Xms16g -XX:ParallelGCThreads=8"

  # Define the sample name map file
  SAMPLE_MAP = "batch1_1.tsv"

  # Define the temporary directory
  TMP_DIR = "/home/tmp"

  # Define the list of chromosomes to process
  CHROMOSOMES = ["21", "22"]

  rule all:
      input:
          expand("/data2/chr{chrom}_db", chrom=CHROMOSOMES)

  rule genomics_db_import:
      input:
          sample_map=SAMPLE_MAP,
      output:
          directory("/data2/chr{chrom}_db"),
      params:
          gatk=GATK_PATH,
          java_options=JAVA_OPTIONS,
          chrom="{chrom}",
          batch_size=50,
          tmp_dir=TMP_DIR,
          reader_threads=20,
          consolidate=True,
      shell:
          "{params.gatk} --java-options '{params.java_options}' \
          GenomicsDBImport \
          --genomicsdb-update-workspace-path {output} \
          --batch-size {params.batch_size} \
          -L chr{params.chrom} \
          --sample-name-map {input.sample_map} \
          --tmp-dir {params.tmp_dir} \
          --reader-threads {params.reader_threads} \
          --consolidate {params.consolidate}"

Then I ran it to update the chr21 and chr22 directories:

  snakemake -s genomicdbUpdate.smk --cores 16 &

The error message is:

A USER ERROR has occurred: We require an existing valid workspace when incremental import is set

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
  09:12:09.255 INFO  IntervalArgumentCollection - Processing 46709983 bp from intervals
  09:12:09.256 INFO  GenomicsDBImport - Done initializing engine
  09:12:09.257 INFO  GenomicsDBImport - Shutting down engine
  [July 5, 2024 at 9:12:09 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.01 minutes.
  Runtime.totalMemory()=17179869184
  ***********************************************************************

  A USER ERROR has occurred: We require an existing valid workspace when incremental import is set

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
  [Fri Jul  5 09:12:09 2024]
  Error in rule genomics_db_import:
      jobid: 2
      input: batch1_1.tsv
      output: /data2/chr22_db
      shell:
          /opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8'         GenomicsDBImport         --genomicsdb-update-workspace-path /data2/chr22_db         --batch-size 50         -L chr22         --sample-name-map batch1_1.tsv         --tmp-dir /home/tmp         --reader-threads 20         --consolidate True
          (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

  Removing output files of failed job genomics_db_import since they might be corrupted:
  /data2/chr22_db
  [Fri Jul  5 09:12:09 2024]
  Error in rule genomics_db_import:
      jobid: 1
      input: batch1_1.tsv
      output: /data2/chr21_db
      shell:
          /opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8'         GenomicsDBImport         --genomicsdb-update-workspace-path /data2/chr21_db         --batch-size 50         -L chr21         --sample-name-map batch1_1.tsv         --tmp-dir /home/tmp         --reader-threads 20         --consolidate True
          (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

  Removing output files of failed job genomics_db_import since they might be corrupted:
  /data2/chr21_db
  Shutting down, this might take some time.
  Exiting because a job execution failed. Look above for error message
  Complete log: .snakemake/log/2024-07-05T091205.769438.snakemake.log

I can run the same command as a standalone shell script without error, but when I run it through snakemake it does not recognize my existing database (same name and directory). Does anyone have experience with this? Please advise, thanks.

GenomicsDBImport snakemake gatk • 340 views

updated 4 months ago by Jeremy Leipzig 22k • written 4 months ago by Peter Chung ▴ 210

score 0 · Answer 1 · 2024-07-08

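I think GenomicsDBImport wants /data2/chr{chrom}_db to exist before the job runs, but Snakemake clears a job's declared output before running it, and again if the job fails (that is the "Removing output files of failed job" message in your log). So don't declare the workspace itself as the output. Make {output} a sentinel flag file instead and pass the workspace path through params:

  output:
      "/data2/chr{chrom}_db/i_am_done",
  params:
      workspace="/data2/chr{chrom}_db",
      # keep the other params (gatk, java_options, chrom, ...) from your rule
  shell:
      "mkdir -p {params.workspace} && \
      {params.gatk} --java-options '{params.java_options}' \
      GenomicsDBImport \
      --genomicsdb-update-workspace-path {params.workspace} \
      --batch-size {params.batch_size} \
      -L chr{params.chrom} \
      --sample-name-map {input.sample_map} \
      --tmp-dir {params.tmp_dir} \
      --reader-threads {params.reader_threads} \
      --consolidate {params.consolidate} && \
      touch {output}"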
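
If you want a single rule to handle both the first import and later updates, one option is to pick the GATK flag from the state of the workspace. The sketch below is only an illustration: `workspace_flag` and `is_existing_workspace` are hypothetical helper names, and the marker files checked (`callset.json`, `vidmap.json`) are an assumption about what a populated GenomicsDB workspace contains, so check them against one of your own chr*_db directories first.

```python
import os

# Assumption: files that GenomicsDBImport leaves in a populated workspace.
# Verify these names against an existing workspace before relying on them.
WORKSPACE_MARKERS = ("callset.json", "vidmap.json")


def is_existing_workspace(path):
    """Return True if `path` contains every assumed marker file."""
    return all(
        os.path.isfile(os.path.join(path, name)) for name in WORKSPACE_MARKERS
    )


def workspace_flag(path):
    """Choose the GenomicsDBImport flag: update if populated, else create."""
    if is_existing_workspace(path):
        return "--genomicsdb-update-workspace-path"
    return "--genomicsdb-workspace-path"
```

Placed at the top of the Snakefile, this could be wired in with something like `ws_flag=lambda wildcards: workspace_flag(f"/data2/chr{wildcards.chrom}_db")` under params, with the shell command using `{params.ws_flag} {params.workspace}` in place of the hard-coded update flag.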