Entering edit mode
3.0 years ago
michael.flower.14
▴
200
I've got sequencing data for a small 500 bp amplicon from a few samples. GATK best principles suggest running VariantRecalibrator on the GVCF files I generate. I'm trying to get this working, but I get an error about "Found annotations with zero variances". Reading the gatk manual and other posts (eg1, eg2), I believe this is because I don't have enough samples in my vcf file. I don't know how to 'pad' my vcf file with data from 1000G or exac though! Any help much appreciated.
Here's my pipeline to get to this point, and the error I see at the bottom.
#====================================================
# Alignment with BWA mem
#====================================================
bwa mem -M -t 1 -R $(echo "@RG\tID:$ID\tSM:$ID"_"$SM\tLB:$ID"_"$SM\tPL:ILLUMINA") "$REF" "$R1" "$R2" > "$OUT"/results/alignment/${SN}_bwa.sam
samtools view -Sb "$OUT"/results/alignment/${SN}_bwa.sam > "$OUT"/results/alignment/${SN}_bwa.bam
#====================================================
# Sort and mark duplicated with picard
#====================================================
picard SortSam I="$OUT"/results/alignment/${SN}_bwa.bam O="$OUT"/results/alignment/${SN}_bwa_sorted.bam SORT_ORDER=coordinate USE_JDK_DEFLATER=true USE_JDK_INFLATER=true
picard MarkDuplicates I="$OUT"/results/alignment/${SN}_bwa_sorted.bam O="$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam M="$OUT"/results/alignment/marked_dup_metrics.txt USE_JDK_DEFLATER=true USE_JDK_INFLATER=true
#====================================================
# Run base-recalibrator and applyBSQR to recalibrate base qualities
#====================================================
# Base calibration
gatk --java-options "-Xmx4g" BaseRecalibrator -I "$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam -R "$REF" --known-sites "$knownINDELS" -O "$OUT"/results/alignment/recal_data.table
gatk --java-options "-Xmx4g" AnalyzeCovariates -bqsr "$OUT"/results/alignment/recal_data.table -plots "$OUT"/results/alignment/AnalyzeCovariates.pdf
gatk --java-options "-Xmx4g" ApplyBQSR -R "$REF" -I "$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam --bqsr-recal-file "$OUT"/results/alignment/recal_data.table -O "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam
#====================================================
# Index the bam file
#====================================================
samtools index "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam
#====================================================
# Run HaplotypeCaller
#====================================================
gatk --java-options "-Xmx4g" HaplotypeCaller \
--intervals "$INTERVALS" \
-R "$REF" \
-I "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam \
-O "$OUT"/results/variants/${SN}_g.vcf.gz \
-ERC GVCF
#====================================================
# Genotype GVCFs
#====================================================
gatk --java-options "-Xmx4g" GenotypeGVCFs --allow-old-rms-mapping-quality-annotation-data -R "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" -V "$OUT"/results/variants/cohort.g.vcf.gz -O "$OUT"/results/variants/cohort.vcf.gz
#====================================================
# Normalise and index the vcf
#====================================================
# Normalise indels
bcftools norm "$OUT"/results/variants/cohort.vcf.gz "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" -m -both -o "$OUT"/results/variants/cohort.norm.vcf.gz -O z
# Index the vcf
tabix -p vcf "$OUT"/results/variants/cohort.norm.vcf.gz
#====================================================
# Variant recalibration
#====================================================
gatk VariantRecalibrator \
-R "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" \
-V "$OUT"/results/variants/cohort.norm.vcf.gz \
-AS \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/hapmap_3.3.hg38.vcf.gz" \
--resource:omni,known=false,training=true,truth=false,prior=12.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/1000G_omni2.5.hg38.vcf.gz" \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/1000G_phase1.snps.high_confidence.hg38.vcf.gz" \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.dbsnp138.vcf" \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O "$OUT"/results/variants/output.AS.recal \
--tranches-file "$OUT"/results/variants/output.AS.tranches \
--rscript-file "$OUT"/results/variants/output.plots.AS.R
And here's the error I see:
***********************************************************************
A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
***********************************************************************
one of those annotation doesn't change among all your samples.
OK thanks, how do I fix this and make it flexible enough to deal with other samples in the future? The problem is probably because I've only got one sample in the vcf, so how do I 'pad' my GVCF file with 1000G samples as the original posts suggest?
you would have to run something like
bcftools merge
or GATK combineVariants. BUT with a 1000G vcf that would have been processed with GATK and the same annotations....Thank you, do you know which 1000G VCF to download? There are lots on the site with various levels of annotation, but I can’t find a GVCF. I thought I’d need the raw data? I could then use gatk CombineGVCFs
I doubt such file exits.
Ok thank you. In which case is there somewhere to get ~30 1000G fastq files that would be appropriate to use in the above pipeline to ‘pad’ it out?