Using GATK method to call variants from RNASeq data, I have VCF files from 30 patients. There's a lot of missing genotype data and I need to impute missing information from Sanger imputation service. When I uploaded my vcf file to Sanger, I received this error:
--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: Reference allele mismatch at MT:150 .. REF_SEQ:'C' vs VCF:'T'
The error report further recommended that I try ref allele mismatch correction using bcftools.
Based on documentation on this page https://samtools.github.io/bcftools/howtos/plugin.fixref.html I tried a ref allele mismatch correction using : 'bcftools +fixref report.vcf -- -f ref.fa'
My output looks like this:
# SC, guessed strand convention
SC TOP-compatible 0
SC BOT-compatible 0
# ST, substitution types
A>C 1058 3.3%
ST A>G 5927 18.5%
ST A>T 726 2.3%
ST C>A 1069 3.3%
ST C>G 1512 4.7%
ST C>T 5513 17.3%
ST G>A 5551 17.4%
ST G>C 1509 4.7%
ST G>T 1107 3.5%
ST T>A 744 2.3%
ST T>C 6174 19.3%
ST T>G 1064 3.3%
# NS, Number of sites:
NS total 33609
NS ref match 31954 100.0%
NS ref mismatch 0 0.0%
NS skipped 1655
NS non-ACGT 0
NS non-SNP 1647
NS non-biallelic 8
In the vcf file itself, I manually found chrM 150 and it shows:
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 150 . T C 1536.77 SnpCluster AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.76;SOR=1.460
Could anyone please help me troubleshoot this step. Thanks!
are you using the very same REFerence genome that was used to map he reads ? MT sequences are not always the same.
Hi Pierre! Thanks so much for responding. Yes, in fact, to avoid the liftover procedure, I did the variant calling from scratch using GRCh37 since Sanger Imputation Service works on GRCh37.
In my very first impute attempt on Sanger, I got an error: faidx_fetch_seq failed at chr1:567242 and the error report further suggested that I change the UCSC-style chromosome names to Ensembl-style chromosome names which I did using:
as described on this page
EDIT: I'm wondering if it's relevant to point out that the variant calling was done on RNASeq data via GATk and the RNASeq is a 50bp SR dataset.