Question

Padding out a GVCF file with 1000G exomes to get gatk VariantRecalibrator working with a small sample

0

Entering edit mode

3.0 years ago

michael.flower.14 ▴ 200

I've got sequencing data for a small 500 bp amplicon from a few samples. GATK best principles suggest running VariantRecalibrator on the GVCF files I generate. I'm trying to get this working, but I get an error about "Found annotations with zero variances". Reading the gatk manual and other posts (eg1, eg2), I believe this is because I don't have enough samples in my vcf file. I don't know how to 'pad' my vcf file with data from 1000G or exac though! Any help much appreciated.

Here's my pipeline to get to this point, and the error I see at the bottom.

#====================================================
# Alignment with BWA mem
#====================================================
bwa mem -M -t 1 -R $(echo "@RG\tID:$ID\tSM:$ID"_"$SM\tLB:$ID"_"$SM\tPL:ILLUMINA") "$REF" "$R1" "$R2" > "$OUT"/results/alignment/${SN}_bwa.sam

samtools view -Sb "$OUT"/results/alignment/${SN}_bwa.sam > "$OUT"/results/alignment/${SN}_bwa.bam

#====================================================
# Sort and mark duplicated with picard
#====================================================

picard SortSam I="$OUT"/results/alignment/${SN}_bwa.bam O="$OUT"/results/alignment/${SN}_bwa_sorted.bam SORT_ORDER=coordinate USE_JDK_DEFLATER=true USE_JDK_INFLATER=true

picard MarkDuplicates I="$OUT"/results/alignment/${SN}_bwa_sorted.bam O="$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam M="$OUT"/results/alignment/marked_dup_metrics.txt USE_JDK_DEFLATER=true USE_JDK_INFLATER=true


#====================================================
# Run base-recalibrator and applyBSQR to recalibrate base qualities
#====================================================

# Base calibration
gatk --java-options "-Xmx4g" BaseRecalibrator -I "$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam -R "$REF" --known-sites "$knownINDELS" -O "$OUT"/results/alignment/recal_data.table

gatk --java-options "-Xmx4g" AnalyzeCovariates -bqsr "$OUT"/results/alignment/recal_data.table -plots "$OUT"/results/alignment/AnalyzeCovariates.pdf

gatk --java-options "-Xmx4g" ApplyBQSR -R "$REF" -I "$OUT"/results/alignment/${SN}_bwa_sorted_marked.bam --bqsr-recal-file "$OUT"/results/alignment/recal_data.table -O "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam

#====================================================
# Index the bam file
#====================================================

samtools index "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam


#====================================================
# Run HaplotypeCaller
#====================================================

gatk --java-options "-Xmx4g" HaplotypeCaller \
      --intervals "$INTERVALS" \
      -R "$REF" \
      -I "$OUT"/results/alignment/${SN}_bwa_sorted_marked_recalibrated.bam \
      -O "$OUT"/results/variants/${SN}_g.vcf.gz \
      -ERC GVCF




#====================================================
# Genotype GVCFs
#====================================================

gatk --java-options "-Xmx4g" GenotypeGVCFs  --allow-old-rms-mapping-quality-annotation-data  -R "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" -V "$OUT"/results/variants/cohort.g.vcf.gz -O "$OUT"/results/variants/cohort.vcf.gz



#====================================================
# Normalise and index the vcf
#====================================================

# Normalise indels
bcftools norm "$OUT"/results/variants/cohort.vcf.gz "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" -m -both -o "$OUT"/results/variants/cohort.norm.vcf.gz -O z

# Index the vcf
tabix -p vcf "$OUT"/results/variants/cohort.norm.vcf.gz


#====================================================
# Variant recalibration
#====================================================

gatk VariantRecalibrator \
   -R "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" \
   -V "$OUT"/results/variants/cohort.norm.vcf.gz \
   -AS \
   --resource:hapmap,known=false,training=true,truth=true,prior=15.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/hapmap_3.3.hg38.vcf.gz" \
   --resource:omni,known=false,training=true,truth=false,prior=12.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/1000G_omni2.5.hg38.vcf.gz" \
   --resource:1000G,known=false,training=true,truth=false,prior=10.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/1000G_phase1.snps.high_confidence.hg38.vcf.gz" \
   --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.dbsnp138.vcf" \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
   -mode SNP \
   -O "$OUT"/results/variants/output.AS.recal \
   --tranches-file "$OUT"/results/variants/output.AS.tranches \
   --rscript-file "$OUT"/results/variants/output.plots.AS.R

And here's the error I see:

***********************************************************************

A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.

***********************************************************************

variantrecalibrator gatk gvcf • 1.8k views

ADD COMMENT • link 3.0 years ago by michael.flower.14 ▴ 200

0

Entering edit mode

Found annotations with zero variance

 -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR

one of those annotation doesn't change among all your samples.

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

OK thanks, how do I fix this and make it flexible enough to deal with other samples in the future? The problem is probably because I've only got one sample in the vcf, so how do I 'pad' my GVCF file with 1000G samples as the original posts suggest?

ADD REPLY • link 3.0 years ago by michael.flower.14 ▴ 200

0

Entering edit mode

you would have to run something like bcftools merge or GATK combineVariants. BUT with a 1000G vcf that would have been processed with GATK and the same annotations....

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you, do you know which 1000G VCF to download? There are lots on the site with various levels of annotation, but I can’t find a GVCF. I thought I’d need the raw data? I could then use gatk CombineGVCFs

ADD REPLY • link 3.0 years ago by michael.flower.14 ▴ 200

0

Entering edit mode

I doubt such file exits.

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Ok thank you. In which case is there somewhere to get ~30 1000G fastq files that would be appropriate to use in the above pipeline to ‘pad’ it out?

ADD REPLY • link 3.0 years ago by michael.flower.14 ▴ 200