I've gotten back VCF files from the Michigan Imputation Server. My intent was to filter the VCFs by their info value; however, it seems the VCF files don't have that specific value listed in the header:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=2021.7.17
##contig=<ID=5>
##pipeline=michigan-imputationserver-1.5.7
##imputation=minimac4-1.0.2
##phasing=eagle-2.4
##r2Filter=0.0
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">
##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
Would anyone have any insights/explanation on what's going on? I'm a novice at bioinformatics so any help would be appreciated.
Edit - Further clarification
The 'info' value I am referring to is one of the many values within the INFO column of the dataset. The VCF has the typical columns CHROM, POS, REF, ALT, INFO, etc. (based on this explanation). Within the INFO column there are numerous values such as the R2 score (INFO/r2), P-value (INFO/p), MAF (INFO/maf), etc.
Looking through papers and other people's posts on Biostars, there seems to be an info value (INFO/info). I wanted to use this as a filter; however, it seems to be missing (both when looking at my header and when querying my data). So essentially I am asking: is there an explanation for the lack of the INFO/info value, and/or is there a way to get it?
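For reference, this is roughly how the header check can be done programmatically. It is only a minimal sketch: the function name `declared_info_ids` is made up for illustration, and the header lines are copied from the header pasted above.

```python
import re

def declared_info_ids(header_lines):
    """Return the IDs of all ##INFO header lines, in order of appearance."""
    pattern = re.compile(r"^##INFO=<ID=([^,>]+)")
    ids = []
    for line in header_lines:
        match = pattern.match(line)
        if match:
            ids.append(match.group(1))
    return ids

# INFO lines copied from the header in this post (other header lines omitted).
header = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">',
    '##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">',
    '##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">',
    '##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">',
    '##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">',
    '##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">',
    '##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">',
]

ids = declared_info_ids(header)
print(ids)
# Case-insensitive check for a field literally named "info".
print(any(i.lower() == "info" for i in ids))
```

For this header the function returns AF, MAF, R2, ER2, IMPUTED, TYPED and TYPED_ONLY, so no field named info is declared.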
Can you please elaborate on what you mean by a value in the header?
Sure! From what I've observed/seen in documentation, the INFO column carries numerous values/fields. For example, internationalgenome.org lists some of the possible values:
When I look into the INFO column of my vcf data, I receive values such as P-value (annotated as INFO/p), R2-score (INFO/r2), etc:
However, looking at some papers and other threads on Biostars, it seems that people are able to filter based on the INFO/info value, which seems to be missing from my VCF. Essentially, my question is: is there an explanation for why I don't have this field, and/or is there a way of getting it?
I don't have experience working with imputed VCFs. But I think that before imputation, you should first filter your VCF (which should have the expected INFO fields like DP, MAF, etc.) using vcftools. After imputation, you can further filter based on the Estimated Imputation Accuracy (R-square, the R2 field in your header) using bcftools. Please check similar posts like this and this for more ideas.
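To illustrate the post-imputation step, here is a rough Python sketch of filtering records on INFO/R2. It is not a replacement for bcftools: the function names, the 0.3 cutoff, and the three example records are all made up for illustration, and it assumes an uncompressed, tab-separated VCF with INFO in the eighth column.

```python
def parse_info(info_field):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    if info_field == ".":
        return fields
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def r2_value(info):
    """Return INFO/R2 as a float, or None if it is absent or not numeric."""
    raw = info.get("R2")
    if not isinstance(raw, str):
        return None
    try:
        return float(raw)
    except ValueError:
        return None

def filter_by_r2(lines, min_r2):
    """Yield header lines unchanged, plus records with INFO/R2 >= min_r2.

    Records without a usable R2 value are dropped (a choice made for this sketch).
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        r2 = r2_value(parse_info(columns[7]))
        if r2 is not None and r2 >= min_r2:
            yield line

# Hypothetical example records; positions, IDs and values are invented.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "5\t10000\t5:10000:A:G\tA\tG\t.\tPASS\tAF=0.12;MAF=0.12;R2=0.91;IMPUTED\tGT:DS\t0|0:0.01",
    "5\t10050\t5:10050:C:T\tC\tT\t.\tPASS\tAF=0.01;MAF=0.01;R2=0.12;IMPUTED\tGT:DS\t0|0:0.00",
    "5\t10100\t5:10100:G:A\tG\tA\t.\tPASS\tAF=0.30;MAF=0.30\tGT\t0|1",
]

kept = list(filter_by_r2(example, min_r2=0.3))
for line in kept:
    print(line)
```

With bcftools itself, the analogous filter would be an expression along the lines of `bcftools view -i 'INFO/R2>=0.3'` on the imputed file; the threshold is something you would need to choose for your own analysis.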