I'm trying to get a VCF file containing germline SNPs from NCBI's databases. This page says that I want the common_no_known_medical_impact.vcf.gz file and that I can find it at...
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/
However, that directory only lists the common_all.vcf.gz file. Where can I get this file? Finding anything on NCBI's FTP site seems like an exercise in futility.
Excellent! Thanks for the links and the wonderful explanation!
Very useful! Thanks!!
One more question. Are there redundant entries in dbSNP? I was trying to parse the common_no_known_medical_impact_20170905.vcf.gz file I downloaded from the links you posted, but this file has about 38 million entries. dbSNP is only supposed to have 13 million, right?
dbSNP is constantly being curated and there are discrepancies in it. I don't fully know the extent of this, though.
I cannot confirm this but, when you think about it, there are 4 possible bases at each position. So, genome-wide (roughly 3 billion positions), there are >10 billion possible bases to consider. For all dbSNP variants, the total would be ~50 million. I don't know if this logic explains the issue that you've found, though.
It says this about your file on the NCBI:
I don't know if that helps any further. All of this is a relatively novel area and I'm not sure how they can truly gauge pathogenic versus benign versus 'functional' with great confidence, given the current state of knowledge.
I do know that there are tools currently out there that attempt to assist in clinical exome variant filtering (and / or non-coding regulatory variants), such as:
This is an area of interest of mine right now, in fact.
Wait, the 13 million figure appears to have stuck in my head from before the release of 1000 Genomes Phase III. dbSNP has since amassed hundreds of millions of SNPs. Here are some release notes for dbSNP version 150: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=150
That explains your finding.
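If you still want to check for redundant entries yourself, here is a minimal Python sketch that counts the data records in a (optionally gzipped) VCF and reports any ID that appears on more than one line. The function name `summarize_vcf` is my own invention, and it assumes a standard tab-delimited VCF with the rsID in the third column; it also holds every ID in memory, so on a file with tens of millions of records it may need several GB of RAM.

```python
import gzip
from collections import Counter


def summarize_vcf(path):
    """Return (record_count, distinct_id_count, {id: n}) for IDs seen more than once.

    Assumes a tab-delimited VCF whose third column is the variant ID.
    """
    n_records = 0
    id_counts = Counter()
    # Use gzip for .gz files, plain open otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            n_records += 1
            if fields[2] != ".":  # "." means no ID was assigned
                id_counts[fields[2]] += 1
    duplicated = {vid: n for vid, n in id_counts.items() if n > 1}
    return n_records, len(id_counts), duplicated


# Example call (the file name is the one mentioned above):
# total, distinct, dups = summarize_vcf("common_no_known_medical_impact_20170905.vcf.gz")
# print(total, distinct, len(dups))
```

If the record count and the distinct ID count come out equal, the file has no repeated IDs and the 38 million figure simply reflects the size of the current build.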