I am new to Snakemake and have used it successfully for the GATK GenomicsDBImport step, combining 500 genotype VCF files. I now have 200 more genotype VCF files to add, so I tried GenomicsDBImport's --genomicsdb-update-workspace-path argument, but I get an error, and the run also deletes my existing database. I know I could combine all 700 genotype VCF files in one go, but I would like to know how to add gVCF files to the existing database incrementally. My script is below.
This is how I tried to update the database:
# Snakefile
import os

# Define the path to the GATK binary and Java options
GATK_PATH = "/opt/conda/envs/gvcf/bin/gatk"
JAVA_OPTIONS = "-Xmx16g -Xms16g -XX:ParallelGCThreads=8"

# Define the sample name map file
SAMPLE_MAP = "batch1_1.tsv"

# Define the temporary directory
TMP_DIR = "/home/tmp"

# Define the list of chromosomes to process
CHROMOSOMES = ["21", "22"]

rule all:
    input:
        expand("/data2/chr{chrom}_db", chrom=CHROMOSOMES)

rule genomics_db_import:
    input:
        sample_map=SAMPLE_MAP,
    output:
        directory("/data2/chr{chrom}_db"),
    params:
        gatk=GATK_PATH,
        java_options=JAVA_OPTIONS,
        chrom="{chrom}",
        batch_size=50,
        tmp_dir=TMP_DIR,
        reader_threads=20,
        consolidate=True,
    shell:
        "{params.gatk} --java-options '{params.java_options}' \
        GenomicsDBImport \
        --genomicsdb-update-workspace-path {output} \
        --batch-size {params.batch_size} \
        -L chr{params.chrom} \
        --sample-name-map {input.sample_map} \
        --tmp-dir {params.tmp_dir} \
        --reader-threads {params.reader_threads} \
        --consolidate {params.consolidate}"
I then ran it to update the chr21 and chr22 database directories:
snakemake -s genomicdbUpdate.smk --cores 16 &
The error message is:
A USER ERROR has occurred: We require an existing valid workspace when incremental import is set
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
09:12:09.255 INFO IntervalArgumentCollection - Processing 46709983 bp from intervals
09:12:09.256 INFO GenomicsDBImport - Done initializing engine
09:12:09.257 INFO GenomicsDBImport - Shutting down engine
[July 5, 2024 at 9:12:09 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=17179869184
***********************************************************************
A USER ERROR has occurred: We require an existing valid workspace when incremental import is set
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
[Fri Jul 5 09:12:09 2024]
Error in rule genomics_db_import:
jobid: 2
input: batch1_1.tsv
output: /data2/chr22_db
shell:
/opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8' GenomicsDBImport --genomicsdb-update-workspace-path /data2/chr22_db --batch-size 50 -L chr22 --sample-name-map batch1_1.tsv --tmp-dir /home/tmp --reader-threads 20 --consolidate True
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job genomics_db_import since they might be corrupted:
/data2/chr22_db
[Fri Jul 5 09:12:09 2024]
Error in rule genomics_db_import:
jobid: 1
input: batch1_1.tsv
output: /data2/chr21_db
shell:
/opt/conda/envs/gvcf/bin/gatk --java-options '-Xmx16g -Xms16g -XX:ParallelGCThreads=8' GenomicsDBImport --genomicsdb-update-workspace-path /data2/chr21_db --batch-size 50 -L chr21 --sample-name-map batch1_1.tsv --tmp-dir /home/tmp --reader-threads 20 --consolidate True
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job genomics_db_import since they might be corrupted:
/data2/chr21_db
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-07-05T091205.769438.snakemake.log
I can run the same GATK command as a standalone shell script without error, but when I run it through Snakemake it does not recognize my existing database (same name, same directory). Does anyone have experience with this? Any advice would be appreciated, thanks.
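A possible explanation, not confirmed for this setup: Snakemake normally removes a job's declared output files before running the job, and the log above shows it removing them again after the failure. If that is what happens here, the workspace listed under output no longer exists by the time GATK starts, which would match the "existing valid workspace" error. One way around this might be to pass the workspace path through params and declare a small marker file as the rule's output instead. The helper below is only a sketch of the command-building part: the function name build_import_command and the file name batch2.tsv are hypothetical, and it uses only the GenomicsDBImport flags that already appear in the script above plus --genomicsdb-workspace-path for the first import.

```python
import os
import shlex
import tempfile


def build_import_command(workspace, chrom, sample_map,
                         gatk="gatk", batch_size=50):
    """Return the GenomicsDBImport argument list for one chromosome.

    Hypothetical helper: it picks the update flag only when the workspace
    directory already exists, and otherwise creates a new workspace.
    """
    if os.path.isdir(workspace):
        # Existing workspace: add the new samples incrementally.
        workspace_flag = "--genomicsdb-update-workspace-path"
    else:
        # No workspace yet: do a first-time import.
        workspace_flag = "--genomicsdb-workspace-path"
    return [
        gatk, "GenomicsDBImport",
        workspace_flag, workspace,
        "--batch-size", str(batch_size),
        "-L", f"chr{chrom}",
        "--sample-name-map", sample_map,
    ]


if __name__ == "__main__":
    # A temporary directory stands in for an existing workspace.
    with tempfile.TemporaryDirectory() as existing:
        print(shlex.join(build_import_command(existing, "21", "batch2.tsv")))
    # A path that does not exist falls back to a first-time import.
    print(shlex.join(build_import_command("/no/such/chr22_db", "22", "batch2.tsv")))
```

In a rule, the returned list could be joined with shlex.join and run from a run: block, followed by touching the marker file, so that Snakemake only ever creates or deletes the marker and never the workspace itself. Since a failed incremental import can leave a workspace in a bad state, keeping a backup copy of the workspace before updating seems prudent.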