ANNOVAR is losing 886 SNPs, and I can't figure out why
7.6 years ago
devenvyas ▴ 760

I have a PLINK dataset of 419,102 SNPs.

I am trying to run them through ANNOVAR so I can figure out which types of functional elements they fall in across the genome.

plink --bfile input --recode vcf-iid --out Ancestral_419k
convert2annovar.pl -format vcf4old Ancestral_419k.vcf -outfile Ancestral_419k.avinput

The resulting VCF file has all 419,102 SNPs (and 28 header lines)

The ANNOVAR log file states the following:

NOTICE: Read 419130 lines and wrote 417600 different variants at 418216 genomic positions (418216 SNPs and 0 indels)
NOTICE: Among 418216 different variants at 418216 positions, 111601 are heterozygotes, 305999 are homozygotes
NOTICE: Among 418216 SNPs, 340143 are transitions, 78073 are transversions (ratio=4.36)

The avinput file has 418,216 SNPs, so 886 of the 419,102 SNPs are being dropped in the conversion. Does anyone have an idea what is going on?

SNP • 2.8k views

Show us one that's missing.


It appears that 886 SNPs are completely monomorphic within my dataset, so ANNOVAR is dropping them instead of including them.

Here are five examples: rs377583051 rs561224271 rs544889745 rs573338017 rs555351100
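For anyone who wants to list the dropped sites themselves, here is a minimal sketch that compares the VCF against the avinput file by chromosome and position. The function name `missing_sites` is made up for illustration, and the sketch assumes both files use the same chromosome naming and that the first two avinput columns are chromosome and start position.

```python
def missing_sites(vcf_lines, avinput_lines):
    """Return (chrom, pos, id) for VCF records that have no avinput row.

    Assumption: avinput columns are chrom, start, end, ref, alt, ...
    and chromosome names match between the two files.
    """
    seen = set()
    for line in avinput_lines:
        fields = line.split()
        if len(fields) >= 2:
            seen.add((fields[0], fields[1]))

    missing = []
    for line in vcf_lines:
        # Skip VCF header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, var_id = line.rstrip("\n").split("\t")[:3]
        if (chrom, pos) not in seen:
            missing.append((chrom, pos, var_id))
    return missing
```

Both arguments can be open file handles, for example `missing_sites(open("Ancestral_419k.vcf"), open("Ancestral_419k.avinput"))`.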


What do you mean by monomorphic? If there is an rsID, it should correspond to at least one SNP.


In my PLINK dataset, every sample carries the same allele at those sites (they are monomorphic), so ANNOVAR drops them instead of including them in the avinput file.

Unless I add more samples that carry the derived alleles of those SNPs, ANNOVAR will treat them as monomorphic and ignore them. There must be some way to override this.


I'm not sure if ANNOVAR looks at the sample annotation. AFAIR it derives the concordance by looking at chrom, pos, ref and alt only.


Paste some lines into the main question where ANNOVAR is missing the annotation. Or better, upload part of the file somewhere and attach a link here.


Then something might be getting lost when PLINK converts from bed/bim/fam to VCF. The bim file clearly displays both alleles (even though the derived allele is absent at 886 sites).

I created a bim file for just those 886 sites, reformatted it to avinput format, and cat'ed it onto the end of the actual avinput file. It's running through ANNOVAR now. Hopefully, it annotates those SNPs.

7.6 years ago
devenvyas ▴ 760

So I figured out how to do this. I am writing it up as an answer so future users can refer to it if they have the same problem.

Basically, I identified the 886 SNPs that were monomorphic in my dataset (and thus getting lost in the VCF to avinput conversion).

I created a bim file for the 886 SNPs and converted it to avinput format in Excel. I tacked this on to the end of the original avinput file and ran it through ANNOVAR successfully. The only issue is that the SNPs are not completely sorted by coordinate, since I tacked on those SNPs to the end.
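If the sort order matters for anything downstream, the combined avinput file can be re-sorted by coordinate. Below is a minimal sketch; `chrom_key` and `sort_avinput` are hypothetical helper names, and the ordering (numeric chromosomes first, then names such as X, Y, MT alphabetically) is an assumption, not something ANNOVAR prescribes.

```python
def chrom_key(chrom):
    """Sort numeric chromosomes numerically, then other names alphabetically."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def sort_avinput(lines):
    """Return avinput rows sorted by chromosome, then by start position."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]

    def row_key(row):
        fields = row.split()
        return (chrom_key(fields[0]), int(fields[1]))

    return sorted(rows, key=row_key)
```

The result is a list of rows without trailing newlines, so writing it back out is a matter of joining with "\n".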

It may be easier in the future to just convert a bim directly into an avinput.
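A direct bim to avinput conversion could look roughly like the sketch below. It is not what I actually ran (I used Excel), and `bim_to_avinput` is a made-up name. It assumes the standard six-column bim layout (chromosome, variant ID, cM position, bp position, A1, A2) and treats A2 as the reference allele and A1 as the alternate, mirroring PLINK's default VCF export. A2 is not guaranteed to match the genome reference, so that is worth checking for your data.

```python
def bim_to_avinput(bim_lines):
    """Convert PLINK .bim rows to avinput rows (SNPs only).

    Output columns: chrom, start, end, ref, alt, variant ID.
    Assumption: A2 is used as ref and A1 as alt.
    """
    out = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, var_id, _cm, pos, a1, a2 = fields[:6]
        # Keep single-base alleles only; "0" marks a missing allele in PLINK.
        if len(a1) != 1 or len(a2) != 1 or "0" in (a1, a2):
            continue
        # For a SNP, start and end are the same position.
        out.append("\t".join([chrom, pos, pos, a2, a1, var_id]))
    return out
```

The variant ID in the sixth column is carried along as an extra column after the five required avinput fields.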


I was having a similar problem starting from VCF input. The key to keeping all variants seems to be calling convert2annovar.pl with the extra arguments -allsample -withfreq. With this modification, all variant positions found in the VCF are kept in the avinput file and in the multianno.txt generated by ANNOVAR. I know this is an old thread, but I hope someone finds it useful!
