Which database do you use to filter out common SNPs and find the rare ones that may be disease-causing? I have been using 1000 Genomes as well as the ESP (Exome Sequencing Project) database. (ESP is derived from exome data of about 6500 individuals, which is fairly large.) Both databases also contain MAF. I don't use dbSNP as an initial filter, because it simply contains everything, so filtering against it is less permissive.
But I found something interesting today: there are some SNPs, for example rs73979896: http://genome.ucsc.edu/cgi-bin/hgc?hgsid=308088757&c=chr17&o=21319207&t=21319208&g=snp135Common&i=rs73979896
This SNP is nonsynonymous and present in dbSNP 135 with a very high MAF of 49%, derived from around 2204 alleles; however, it is absent from both 1000 Genomes (April 2012) and ESP-6500 (the latest version, with exome data from 6500 individuals)! If this really is a true SNP with MAF=49%, how can it NOT be captured in ESP, with data from 6500 people? This is very confusing.
Thanks. I'm curious why this SNP was filtered out later. Because of low coverage? I also checked several BAM files of unrelated individuals NOT from 1000 Genomes, and this SNP does exist there. However, when checking BAM files from 1000G, for example NA12878 and NA12889, the SNP is not there. The problem is that this SNP has a MAF of about 50%; it's a common allele, not a rare one. Different groups should be consistent for common SNPs, right? That's where I'm confused.
Also, what's special about a genome patch in terms of calling variants? Are genome patches supposed to be regions holding many mutations?
Patched regions of the assembly are more likely to be regions with highly repetitive sequences or that are otherwise hard to assemble and thus to map to. That could be part of the issue here as well.
I do use dbSNP as well as the 1000G and ESP MAFs, but I tend to stick to older dbSNP versions. For newer versions I would want to go by the estimated MAF and not simple presence/absence, as I've seen entries in dbSNP with no population data at all that were seen in, say, only one individual.
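To make the "estimated MAF rather than presence/absence" idea concrete, here is a minimal Python sketch. It assumes the per-database MAFs have already been pulled out of your annotation into a dictionary; the function name `classify_variant`, the 1% cutoff, and the two `example_*` entries are made up for illustration. Only the rs73979896 values (49% in dbSNP, absent from 1000G and ESP) come from this thread.

```python
from typing import Dict, Optional

# Hypothetical cutoff: a variant at or above this MAF in any source counts as common.
MAF_CUTOFF = 0.01


def classify_variant(mafs: Dict[str, Optional[float]],
                     cutoff: float = MAF_CUTOFF) -> str:
    """Classify a variant as 'common', 'rare', or 'unknown'.

    `mafs` maps a database name to its estimated MAF, or to None when that
    database has no frequency estimate for the variant. A missing estimate
    is treated as "no information", not as evidence that the variant is rare.
    """
    known = [maf for maf in mafs.values() if maf is not None]
    if not known:
        return "unknown"      # present somewhere, but no population data at all
    if max(known) >= cutoff:
        return "common"       # at least one source reports it as frequent
    return "rare"


variants = {
    # Values as reported in this thread.
    "rs73979896": {"dbSNP": 0.49, "1000G": None, "ESP": None},
    # Illustrative, made-up entries.
    "example_rare": {"dbSNP": None, "1000G": 0.002, "ESP": 0.001},
    "example_no_data": {"dbSNP": None, "1000G": None, "ESP": None},
}

for name, mafs in variants.items():
    print(name, classify_variant(mafs))
```

With this rule, rs73979896 would be flagged as common on the strength of its dbSNP frequency even though 1000G and ESP have no entry for it, while an entry with no population data in any source is kept apart as "unknown" so it can be reviewed rather than silently filtered or retained.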