Hi
I used VEP (command line) to annotate 147,204 SNPs identified in Bos taurus, using the --check_existing option, with the SNPs represented in the input by chromosome:position. The command is below.
vep --variant_class --format vcf --sift b --vcf_info_field ANN --offline --cache \
    --dir_cache /home/program2/bin/VEP/95.2/cache/ --species "bos_taurus" \
    --check_existing --stats_file vepstats_allrecodes.html --gene_phenotype \
    -i /home/recode.vcf -o snps_annotated_allrecode.vcf
The table at the beginning of the stats file (vepstats_allrecodes.html) says that there were 147,204 variants processed, 0 variants filtered out, 16,931 novel variants and 130,273 known variants.
But when I open the snps_annotated_allrecode.vcf file and remove the duplicated chromosome:position entries (because I just want to see which SNPs are new and I am not interested in the different transcripts for now), I get 147,204 SNPs as expected, but 130,130 SNPs with rsIDs (so known) and 17,075 SNPs without rsIDs (so new). That is 144 more SNPs without an rsID and 144 fewer with an rsID in the .vcf file than there should be based on the .html file.
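For anyone who wants to reproduce this kind of count, here is a minimal Python sketch of one way to do it. It is not the exact procedure used above. It assumes the annotations are in the ANN INFO field (as set by --vcf_info_field ANN), that the VCF header declares the field order after "Format:", and that --check_existing adds an Existing_variation field whose multiple IDs are joined with "&". The function name count_known_novel is made up for illustration. A site counts as "with rsID" if either the VCF ID column or Existing_variation contains an identifier starting with "rs".

```python
import re


def count_known_novel(vcf_lines, info_key="ANN"):
    """Count unique chrom:pos sites with and without an rsID.

    Hypothetical helper: assumes VEP VCF output where the INFO header for
    `info_key` lists the pipe-separated field names after 'Format: '.
    """
    fields = None
    has_rsid = {}  # (chrom, pos) -> True if any rsID was seen at that site

    for line in vcf_lines:
        line = line.rstrip("\n")
        # Read the field order from the header, e.g. "Format: Allele|Consequence|..."
        if line.startswith("##INFO=<ID=" + info_key + ","):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                fields = match.group(1).strip().split("|")
            continue
        if not line or line.startswith("#"):
            continue

        cols = line.split("\t")
        site = (cols[0], cols[1])
        ids = []

        # IDs already present in the VCF ID column
        if cols[2] != ".":
            ids.extend(cols[2].split(";"))

        # IDs reported by VEP in Existing_variation, for every transcript entry
        if fields and "Existing_variation" in fields:
            idx = fields.index("Existing_variation")
            for entry in cols[7].split(";"):
                if entry.startswith(info_key + "="):
                    for annotation in entry[len(info_key) + 1:].split(","):
                        parts = annotation.split("|")
                        if idx < len(parts) and parts[idx]:
                            ids.extend(parts[idx].split("&"))

        known = any(i.startswith("rs") for i in ids)
        # Duplicated chrom:pos lines collapse into a single site
        has_rsid[site] = has_rsid.get(site, False) or known

    with_rsid = sum(has_rsid.values())
    return {
        "sites": len(has_rsid),
        "with_rsid": with_rsid,
        "without_rsid": len(has_rsid) - with_rsid,
    }
```

It could be run on the output file with something like count_known_novel(open("snps_annotated_allrecode.vcf")). Note that this counts only identifiers beginning with "rs", so any site whose existing variant has a different kind of identifier would be counted as "without rsID" here. The stats file may define "known" differently.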
Another researcher is having the same problem with a different data set.
Which file is the right one?
Thank you