Hi folks!
I've imputed genotype data using the Michigan Imputation Server (MIS) with the 1000 Genomes Phase 1 panel (MIS reported few errors). Before and after imputation, I ran checkVCF.py [https://github.com/zhanxw/checkVCF] as a sanity check, to make sure the ref/alt alleles in my data were consistent with the 1000G Phase 1 data. It reported several inconsistent reference sites when compared against this fasta file from 1000G Phase 1. On closer inspection, I noticed that the reference alleles for several of the "supposedly inconsistent" SNPs in my VCF were actually consistent with the data in the UCSC Browser, which suggests that I was using the wrong fasta file as the reference for checkVCF.py. I saw that this person had a similar issue, but I could not find an answer about which fasta file I should use as the reference for checkVCF.py (or for other tools, like "bcftools norm --check-ref").
I found this link, which says that 1000 Genomes doesn't provide fasta files containing variant information, so the file I used as the reference for checkVCF.py was not right in the first place. Any clues, anyone?
@rodd I am facing the same issue. Following @chrchang523's reply, I tried using a reference from "http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use", but the problem still persists.
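One quick thing to rule out when switching reference files is a contig-naming mismatch (for example "chr22" in the FASTA versus "22" in the VCF). The sketch below uses made-up toy files and hypothetical file names; it is only an illustration of the comparison, not what checkVCF.py does internally.

```shell
# Toy reference and VCF (made-up contents, hypothetical file names).
work=$(mktemp -d)
printf '>chr22 toy sequence\nACGTACGT\n>chrY toy sequence\nTTTT\n' > "$work/ref.fa"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t2\t.\tC\tT\t.\tPASS\t.\n' > "$work/toy.vcf"

# Contig names in the FASTA: text after ">" up to the first whitespace.
grep '^>' "$work/ref.fa" | sed 's/^>//' | awk '{print $1}' | sort -u > "$work/fasta_contigs.txt"

# Contig names used by the VCF records (column 1 of non-header lines).
grep -v '^#' "$work/toy.vcf" | cut -f1 | sort -u > "$work/vcf_contigs.txt"

# Names that appear in the VCF but not in the FASTA.
comm -13 "$work/fasta_contigs.txt" "$work/vcf_contigs.txt" > "$work/missing.txt"
cat "$work/missing.txt"
```

For a gzipped reference, the first `grep` would need to read from `gzip -dc` instead. Any name listed in `missing.txt` cannot be looked up in that FASTA under that name.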
I have some doubts about how to read the checkVCF.py report. Could anyone please help me understand what exactly is going wrong?
What I did using checkVCF.py
Files used
Commands used:
---1st command--
(The script was not working in python3, so I ran it with python2.7.)
python2.7 ../checkVCF.py -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Samples_datatstore_chr22.vcf
--------------- ACTION ITEM ---------------
Please use the following command to clean your VCF file and then re-run checkVCF.py
(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file
---2nd command--
grep ^"#" Samples_datatstore_chr22.vcf; grep -v ^"#" Samples_datatstore_chr22.vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n >Modified_chr22.vcf
Note: the VCF header (the meta lines plus the tab-separated column line with the 8 mandatory columns and the sample names) was absent from the generated Modified_chr22.vcf.
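A likely reason the header went missing: the suggested command wraps both `grep` calls in parentheses, so the redirect captures the output of the whole group, whereas without the parentheses `>` applies only to the last pipeline and the header lines go to the terminal. Here is a minimal sketch on a made-up toy VCF (hypothetical file names; `bgzip` is left out and a plain `sed 's/^chr//'` is used to keep it portable):

```shell
# Toy VCF with two header lines and three unsorted records (made-up data).
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmpdir/in.vcf"
printf 'chr22\t300\t.\tA\tG\t.\tPASS\t.\nchr22\t100\t.\tC\tT\t.\tPASS\t.\nchr22\t200\t.\tG\tA\t.\tPASS\t.\n' >> "$tmpdir/in.vcf"

# Without grouping: the header is printed to the terminal and only the
# sorted records are written to the file.
grep '^#' "$tmpdir/in.vcf"; grep -v '^#' "$tmpdir/in.vcf" | sed 's/^chr//' | sort -k1,1n -k2,2n > "$tmpdir/no_header.vcf"

# With grouping: header and sorted records both end up in the file.
(grep '^#' "$tmpdir/in.vcf"; grep -v '^#' "$tmpdir/in.vcf" | sed 's/^chr//' | sort -k1,1n -k2,2n) > "$tmpdir/with_header.vcf"

# Count header lines in each result.
awk '/^#/{n++} END{print n+0}' "$tmpdir/no_header.vcf"
awk '/^#/{n++} END{print n+0}' "$tmpdir/with_header.vcf"
```

In the real case the grouped output would then be piped to `bgzip -c`, as in the ACTION ITEM printed by checkVCF.py.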
---3rd command--
python2.7 ../checkVCF.py -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Modified_chr22.vcf
Output:
Line [ 58801 ] does not have correct column number, exiting!
Line [ 58802 ] does not have correct column number, exiting!
Line [ 58803 ] does not have correct column number, exiting!
Line [ 58804 ] does not have correct column number, exiting!
Line [ 265558 ] does not have correct column number, exiting!
--------------- REPORT ---------------
Total [ 265558 ] lines processed
01-09 13:56 Examine [ 0 ] VCF header lines, [ 265558 ] variant sites, [ 0 ] samples
01-09 13:56 [ 0 ] duplicated sites
01-09 13:56 [ 50050 ] NonSNP site are outputted to [ chr22/chr22_test.check.nonSnp ]
01-09 13:56 [ 215508 ] Inconsistent reference sites are outputted to [ chr22/chr22_test.check.ref ]
01-09 13:56 [ 0 ] Variant sites with invalid genotypes are outputted to [ chr22/chr22_test.check.geno ]
01-09 13:56 [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ chr22/chr22_test.check.af ]
01-09 13:56 [ 0 ] Monomorphic sites are outputted to [ chr22/chr22_test.check.mono ]
01-09 13:56 --------------- ACTION ITEM ---------------
01-09 13:56 * Read chr22/chr22_test.check.ref, for autosomal sites, make sure the you are using the forward strand
01-09 13:56 * Upload these files to the ftp: chr22/chr22_test.check.log chr22/chr22_test.check.dup chr22/chr22_test.check.noSnp chr22/chr22_test.check.ref chr22/chr22_test.check.geno chr22/chr22_test.check.af chr22/chr22_test.check.mono
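To look into the "does not have correct column number" messages, one option is to list the record lines that have fewer than the 8 mandatory tab-separated VCF columns. This is only a sketch on a made-up toy file with a hypothetical name; the exact rule checkVCF.py applies is not shown in the log, so the 8-column threshold here is an assumption based on the VCF format.

```shell
# Toy VCF where the last record is truncated (made-up data).
work2=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t100\t.\tC\tT\t.\tPASS\t.\n22\t200\t.\tG\n' > "$work2/toy.vcf"

# Report line number and field count for non-header lines with < 8 columns.
awk -F'\t' '!/^#/ && NF < 8 {print NR": "NF" columns"}' "$work2/toy.vcf" > "$work2/short_lines.txt"
cat "$work2/short_lines.txt"
```

Running the same `awk` line on the real file and comparing the reported line numbers with those in the log would show whether the flagged lines are actually truncated.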
Doubts:
Thank you for your time.