I've extracted chromosome 4 from a whole genome bam file as follows:
samtools view -h "$BAM" chr4 > "$EXT/temp/"$PREFIX"_chr4.sam"
samtools view -bS "$EXT"/temp/$PREFIX"_chr4.sam" > "$EXT"/temp/$PREFIX"_chr4.bam"
Then added read groups, as required by GATK
picard AddOrReplaceReadGroups I="$BAM" O="$EXT"/temp/$PREFIX"_chr4_rg.bam" RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20
Index the bam:
samtools index "$BAM"
Download the reference chromosome 4, index it and create the dictionary for gatk:
curl https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr4.fa.gz | gunzip -c > ~/refs/hg19/chr4.fa
samtools faidx $REF
bwa index $REF
bowtie2-build $REF $REF
picard CreateSequenceDictionary REFERENCE="$REF" OUTPUT=~/refs/hg19/$ACC.dict
Then I run the normal GATK variant calling with this, which works.
gatk HaplotypeCaller -R $REF -I "$BAM" -O "$DIR"/gatk/$PREFIX"_chr4_HaplotypeCaller.vcf"
But when I run it with the GVCF option, I get the error below:
gatk HaplotypeCaller -R $REF -I "$BAM" -O "$DIR"/gatk/$PREFIX"_chr4_HaplotypeCallerPGT.vcf" -ERC GVCF
The important line in the error seems to be:
A USER ERROR has occurred: Contig: chrM not found in reference dictionary.
Please check that you are using a compatible reference for your data.
Reference Contigs: [chr4]
But here is the full error:
$ gatk HaplotypeCaller -R $REF -I "$BAM" -O "$DIR"/gatk/$PREFIX"_chr4_HaplotypeCallerPGT.vcf" -ERC GVCF
Using GATK jar /Users/michaelflower/opt/anaconda3/envs/gatk4/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/michaelflower/opt/anaconda3/envs/gatk4/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar HaplotypeCaller -R /Users/michaelflower/refs/hg19/chr4.fa -I /Volumes/Seagate Expansion Drive/temp/125QiPSC_chr4_rg.bam -O /Users/michaelflower/Documents/ACL/Research/Projects/HTT ASO/2021.11.17 125Q vcf query/gatk/125QiPSC_chr4_HaplotypeCallerPGT.vcf -ERC GVCF
15:51:48.424 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/michaelflower/opt/anaconda3/envs/gatk4/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
Nov 17, 2021 3:51:48 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:51:48.827 INFO HaplotypeCaller - ------------------------------------------------------------
15:51:48.828 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.3.0
15:51:48.828 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
15:51:48.828 INFO HaplotypeCaller - Executing as michaelflower@Michaels-MacBook-Pro-2.local on Mac OS X v10.16 x86_64
15:51:48.828 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_302-b08
15:51:48.828 INFO HaplotypeCaller - Start Date/Time: 17 November 2021 15:51:48 GMT
15:51:48.829 INFO HaplotypeCaller - ------------------------------------------------------------
15:51:48.829 INFO HaplotypeCaller - ------------------------------------------------------------
15:51:48.829 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
15:51:48.829 INFO HaplotypeCaller - Picard Version: 2.25.4
15:51:48.829 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
15:51:48.829 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:51:48.830 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:51:48.830 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:51:48.830 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:51:48.830 INFO HaplotypeCaller - Deflater: IntelDeflater
15:51:48.830 INFO HaplotypeCaller - Inflater: IntelInflater
15:51:48.830 INFO HaplotypeCaller - GCS max retries/reopens: 20
15:51:48.830 INFO HaplotypeCaller - Requester pays: disabled
15:51:48.830 INFO HaplotypeCaller - Initializing engine
15:51:49.333 INFO HaplotypeCaller - Done initializing engine
15:51:49.335 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
15:51:49.341 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
15:51:49.341 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
15:51:49.359 INFO NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Users/michaelflower/opt/anaconda3/envs/gatk4/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
15:51:49.500 WARN NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
15:51:49.500 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
15:51:49.501 INFO NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Users/michaelflower/opt/anaconda3/envs/gatk4/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
15:51:49.712 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
15:51:49.713 WARN IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
15:51:49.713 INFO PairHMM - Using the AVX-accelerated native PairHMM implementation
15:51:49.780 INFO ProgressMeter - Starting traversal
15:51:49.781 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
15:51:49.834 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
15:51:49.834 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
15:51:49.834 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
15:51:49.835 INFO HaplotypeCaller - Shutting down engine
[17 November 2021 15:51:49 GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=493879296
***********************************************************************
A USER ERROR has occurred: Contig: chrM not found in reference dictionary.
Please check that you are using a compatible reference for your data.
Reference Contigs: [chr4]
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
(gatk4)
That's true, I've added the script above, but I re downloaded just chr4 reference and used that. Shall I use the whole hg19 reference genome then?
yes
Under way, I'll let you know how it goes
Unfortunately it's still not working with the full hg19 reference. The vcf file produced only has one line:
I didn't make the original bam file, but the people who did say they used hg19 from UCSC, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips
The steps I used to make the vcf file above were:
Download hg19 reference, index it and create dictionary for GATK:
Then the steps above, including variant calling with:
But this gives the following output:
please, read the error message: the problem is unrelated to your original question.
set
-XmX
see https://stackoverflow.com/questions/37335/how-to-deal-with-java-lang-outofmemoryerror-java-heap-space-errorand "4. Adding Java arguments" in https://gatk.broadinstitute.org/hc/en-us/articles/360035531892-GATK4-command-line-syntax
Thank you, I've had a good read of those pages. I added the argument
But unfortunately I still get the
java.lang.OutOfMemoryError: Java heap space
error