GATK GenomicsDBImport update database error using snakemake
4 months ago
Peter Chung ▴ 210

I am new to Snakemake and have used it to run GATK GenomicsDBImport, combining 500 gVCF files. I now have 200 more gVCF files to add, so I tried the GenomicsDBImport --genomicsdb-update-workspace-path argument. The run fails with an error, and it also deletes my existing database. I could import all 700 gVCF files in one go, but I would like to know how to add gVCF files to the database incrementally. My script is below.

This one is how I updated the database:

  # Snakefile
  import os

  # Define the path to the GATK binary and Java options
  GATK_PATH = "/opt/conda/envs/gvcf/bin/gatk"
  JAVA_OPTIONS = "-Xmx16g -Xms16g -XX:ParallelGCThreads=8"

  # Define the sample name map file
  SAMPLE_MAP = "batch1_1.tsv"

  # Define the temporary directory
  TMP_DIR = "/home/tmp"

  # Define the list of chromosomes to process
  CHROMOSOMES = ["21", "22"]

  rule all:
      input:
          expand("/data2/chr{chrom}_db", chrom=CHROMOSOMES)

  rule genomics_db_import:
      input:
          sample_map=SAMPLE_MAP,
      output:
          directory("/data2/chr{chrom}_db"),
      params:
          gatk=GATK_PATH,
          java_options=JAVA_OPTIONS,
          chrom="{chrom}",
          batch_size=50,
          tmp_dir=TMP_DIR,
          reader_threads=20,
          consolidate=True,
      shell:
          "{params.gatk} --java-options '{params.java_options}' \
          GenomicsDBImport \
          --genomicsdb-update-workspace-path {output} \
          --batch-size {params.batch_size} \
          -L chr{params.chrom} \
          --sample-name-map {input.sample_map} \
          --tmp-dir {params.tmp_dir} \
          --reader-threads {params.reader_threads} \
          --consolidate {params.consolidate}"

I then ran it to update the chr21 and chr22 workspaces:

  snakemake -s genomicdbUpdate.smk --cores 16 & 

The error message is:

A USER ERROR has occurred: We require an existing valid workspace when incremental import is set

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
  09:12:09.255 INFO  IntervalArgumentCollection - Processing 46709983 bp from intervals
  09:12:09.256 INFO  GenomicsDBImport - Done initializing engine
  09:12:09.257 INFO  GenomicsDBImport - Shutting down engine
  [July 5, 2024 at 9:12:09 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.01 minutes.
  Runtime.totalMemory()=17179869184
  ***********************************************************************

  A USER ERROR has occurred: We require an existing valid workspace when incremental import is set

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
  [Fri Jul  5 09:12:09 2024]
  Error in rule genomics_db_import:
      jobid: 2
      input: batch1_1.tsv
      output: /data2/chr22_db
      shell:
          /opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8' GenomicsDBImport --genomicsdb-update-workspace-path /data2/chr22_db --batch-size 50 -L chr22 --sample-name-map batch1_1.tsv --tmp-dir /home/tmp --reader-threads 20 --consolidate True
          (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

  Removing output files of failed job genomics_db_import since they might be corrupted:
  /data2/chr22_db
  [Fri Jul  5 09:12:09 2024]
  Error in rule genomics_db_import:
      jobid: 1
      input: batch1_1.tsv
      output: /data2/chr21_db
      shell:
          /opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8' GenomicsDBImport --genomicsdb-update-workspace-path /data2/chr21_db --batch-size 50 -L chr21 --sample-name-map batch1_1.tsv --tmp-dir /home/tmp --reader-threads 20 --consolidate True
          (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

  Removing output files of failed job genomics_db_import since they might be corrupted:
  /data2/chr21_db
  Shutting down, this might take some time.
  Exiting because a job execution failed. Look above for error message
  Complete log: .snakemake/log/2024-07-05T091205.769438.snakemake.log

I can run the same command as a standalone shell script without error, but when I run it through Snakemake it does not recognize my existing database (same name and directory). Does anyone have experience with this? Any advice would be appreciated, thanks.

GenomicsDBImport snakemake gatk • 340 views
4 months ago

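I think GATK needs /data2/chr{chrom}_db to exist before the job runs, but Snakemake clears a rule's declared outputs before executing it (and removes them again when the job fails, as your log shows). So don't declare the workspace itself as the output; make {output} a sentinel flag file and pass the workspace path separately: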

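  # Declare a flag file, not the workspace directory, as the rule's output,
  # so Snakemake never deletes the existing GenomicsDB workspace.
  output:
      "/data2/chr{chrom}_db/i_am_done",
  params:
      workspace="/data2/chr{chrom}_db",
      # keep gatk, java_options, chrom, batch_size, tmp_dir,
      # reader_threads and consolidate as in your rule
  shell:
      "{params.gatk} --java-options '{params.java_options}' \
      GenomicsDBImport \
      --genomicsdb-update-workspace-path {params.workspace} \
      --batch-size {params.batch_size} \
      -L chr{params.chrom} \
      --sample-name-map {input.sample_map} \
      --tmp-dir {params.tmp_dir} \
      --reader-threads {params.reader_threads} \
      --consolidate {params.consolidate} && \
      touch {output}"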
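
If you want one rule to handle both the first import and later updates, a small helper in the Snakefile could pick the argument for you. This is only a sketch: the helper names are made up for illustration, and treating callset.json and vidmap.json as the marker of a usable workspace is my assumption about what GenomicsDBImport writes, not an official validity check.

```python
from pathlib import Path

# Metadata files a GenomicsDB workspace is ASSUMED to contain.
# This is a heuristic, not GATK's own validation.
EXPECTED_FILES = ("callset.json", "vidmap.json")


def workspace_looks_valid(workspace: str) -> bool:
    """Return True if the directory exists and holds the expected metadata files."""
    root = Path(workspace)
    return root.is_dir() and all((root / name).is_file() for name in EXPECTED_FILES)


def workspace_flag(workspace: str) -> str:
    """Choose the GenomicsDBImport argument: update an existing workspace or create a new one."""
    if workspace_looks_valid(workspace):
        return f"--genomicsdb-update-workspace-path {workspace}"
    return f"--genomicsdb-workspace-path {workspace}"
```

In the rule you could then set something like ws_flag=lambda wc: workspace_flag(f"/data2/chr{wc.chrom}_db") under params and put {params.ws_flag} in the shell command in place of the hard-coded argument. The workspace still has to stay out of the rule's output for this to work.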