I am trying to impute a genotype dataset using the Sanger Imputation Service. The VCF file was created using the GATK RNA-seq variant calling method. I've moved past a lot of errors, but now I'm stuck at:
--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: [E::faidx_fetch_seq] The sequence "JH636052.4" not found
faidx_fetch_seq failed at JH636052.4:4111085
My RNA-seq data is mapped to GRCh37, and the VCF files are zipped and indexed. When I run +fixref on the VCF file,
$ bcftools +fixref input.vcf.gz -- -f /GRCh37.p13.genome.fa
I get the following report:
# SC, guessed strand convention
SC TOP-compatible 0
SC BOT-compatible 0
# ST, substitution types
ST A>C 601 3.2%
ST A>G 3641 19.2%
ST A>T 432 2.3%
ST C>A 620 3.3%
ST C>G 833 4.4%
ST C>T 3340 17.6%
ST G>A 3361 17.7%
ST G>C 871 4.6%
ST G>T 643 3.4%
ST T>A 423 2.2%
ST T>C 3560 18.8%
ST T>G 611 3.2%
# NS, Number of sites:
NS total 19864
NS ref match 18936 100.0%
NS ref mismatch 0 0.0%
NS skipped 928
NS non-ACGT 0
NS non-SNP 925
NS non-biallelic 3
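Since the sanity check complains about a sequence name rather than a base mismatch, one thing that might be worth checking is whether every contig in the VCF exists in the reference the check is run against. Below is a minimal sketch using only standard shell tools; `missing_contigs` is a hypothetical helper name, and it assumes a gzip- or bgzip-compressed VCF plus a `.fai` index (as produced by `samtools faidx`) for that reference.

```shell
# Print contig names that occur in a VCF but not in a reference .fai index.
# Assumptions: $1 is a gzip/bgzip-compressed VCF, $2 is a samtools-style .fai file.
# missing_contigs is a hypothetical helper name, not part of bcftools or samtools.
missing_contigs() {
    vcf_gz="$1"
    fai="$2"
    vcf_names=$(mktemp)
    ref_names=$(mktemp)
    # Column 1 of every non-header VCF record is the contig name.
    gzip -dc "$vcf_gz" | grep -v '^#' | cut -f1 | LC_ALL=C sort -u > "$vcf_names"
    # Column 1 of the .fai index is the sequence name in the FASTA.
    cut -f1 "$fai" | LC_ALL=C sort -u > "$ref_names"
    # comm -23 keeps lines unique to the first file: contigs the reference lacks.
    LC_ALL=C comm -23 "$vcf_names" "$ref_names"
    rm -f "$vcf_names" "$ref_names"
}
```

For example, `missing_contigs input.vcf.gz /GRCh37.p13.genome.fa.fai` (assuming the index sits next to the FASTA) would list any contig names present in the VCF but absent from that index. An empty result only says the names match that particular reference, which may not be the one the imputation service uses.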
Hi Micheal, thank you so much for replying. I am using GRCh37 because the Sanger Imputation Service requires that the coordinates are on GRCh37. I had everything mapped to GRCh38, but that took me down the path of using liftover files, which I didn't want to do.
Where can I find the final build for GRCh37? Would you recommend that I use GRCh37 (primary assembly) and the comprehensive gene annotation from Gencode?