I have a SNP dataset in Plink for 419,102 SNPs.
I am trying to run them through ANNOVAR so I can find out which types of functional elements they fall in across the genome.
plink --bfile input --recode vcf-iid --out Ancestral_419k
convert2annovar.pl -format vcf4old Ancestral_419k.vcf -outfile Ancestral_419k.avinput
The resulting VCF file has all 419,102 SNPs (plus 28 header lines).
The ANNOVAR log file states the following:
NOTICE: Read 419130 lines and wrote 417600 different variants at 418216 genomic positions (418216 SNPs and 0 indels)
NOTICE: Among 418216 different variants at 418216 positions, 111601 are heterozygotes, 305999 are homozygotes
NOTICE: Among 418216 SNPs, 340143 are transitions, 78073 are transversions (ratio=4.36)
The avinput file has 418,216 SNPs. I am not sure why 886 SNPs are dropped during the conversion. Does anyone have an idea what is going on?
Show us one that's missing.
It appears that 886 SNPs are completely monomorphic within my dataset, so ANNOVAR is dropping them instead of including them.
Here are five examples: rs377583051 rs561224271 rs544889745 rs573338017 rs555351100
What do you mean by monomorphic? If there is an rsID, it should at least correspond to a SNP.
In my Plink dataset, every sample carries the same allele at those sites, so ANNOVAR ignores them instead of including them in the avinput file.
Unless I add more samples that have the derived versions of those SNPs, ANNOVAR will think they are monomorphic and ignore them. There must be some way to override this.
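One way to confirm which sites are affected is to list the VCF records where no sample genotype carries a non-reference allele. The following is only a minimal sketch: the function name monomorphic_ids and the toy records are made up for illustration, and it assumes a tab-delimited VCF whose genotype columns start at column 10 with GT as the first sub-field.

```python
def monomorphic_ids(vcf_lines):
    """Return IDs of VCF records where no sample carries a non-reference allele."""
    ids = []
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        carries_alt = False
        # Sample columns start at index 9 (column 10) in a VCF.
        for sample in fields[9:]:
            call = sample.split(":")[0]              # GT is assumed to be first
            alleles = call.replace("|", "/").split("/")
            # "0" is the reference allele and "." is a missing call.
            if any(a not in ("0", ".") for a in alleles):
                carries_alt = True
                break
        if not carries_alt:
            ids.append(fields[2])                    # column 3 holds the variant ID
    return ids


# Toy records (hypothetical IDs and positions, not taken from the real dataset).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\tsnp1\tA\tG\t.\t.\t.\tGT\t0/1\t0/0",
    "1\t200\tsnp2\tC\tT\t.\t.\t.\tGT\t0/0\t0/0",
    "1\t300\tsnp3\tG\tA\t.\t.\t.\tGT\t./.\t1/1",
]
print(monomorphic_ids(toy_vcf))
```

On the real file, the lines could be read with open("Ancestral_419k.vcf") and the length of the returned list compared with the 886 missing SNPs.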
I'm not sure if ANNOVAR looks at the sample annotation. AFAIR it derives the concordance by looking at chrom, pos, ref, and alt only.
Paste some lines into the main question where ANNOVAR is missing the annotation. Or better, upload part of the file somewhere and post a link here.
Then something might be getting lost when Plink converts from bed/bim/fam to VCF. The bim file clearly displays both alleles (even though the derived allele is absent for 886 sites).
I created a bim file for just those 886 sites, reformatted it to avinput format, and cat'ed it to the end of the actual avinput file. It's running through ANNOVAR now. Hopefully, it annotates those SNPs.
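For anyone wanting to do the same reformatting, here is a rough sketch of the bim-to-avinput step, not necessarily how it was done above. The function name bim_to_avinput and the example line are hypothetical. It assumes the six standard bim columns (chromosome, variant ID, cM, position, allele 1, allele 2) and treats allele 2 as the reference and allele 1 as the alternate, which should be checked against the real data. The avinput line starts with chr, start, end, ref, alt, and for a SNP start and end are the same position; the ID is appended as an extra column.

```python
def bim_to_avinput(bim_lines):
    """Convert Plink .bim lines for SNPs into avinput-style tab-delimited lines."""
    rows = []
    for line in bim_lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        chrom, snp_id, _cm, pos, a1, a2 = parts[:6]
        # Assumption: A2 is the reference allele and A1 the alternate allele.
        ref, alt = a2, a1
        # For a SNP the start and end coordinates are identical.
        rows.append("\t".join([chrom, pos, pos, ref, alt, snp_id]))
    return rows


# Hypothetical bim line, for illustration only.
example = ["1\tsnpX\t0\t12345\tA\tG"]
for row in bim_to_avinput(example):
    print(row)
```

If the allele orientation assumption is wrong for a given site, ref and alt will be swapped in the output, so it is worth spot-checking a few rows against the reference genome before annotating.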