While working on a couple of exome projects, I've run into situations where, for some of the variants we are calling, the reference genome allele is annotated in dbSNP as the minor allele (MAF < 5%), so the variants are not that interesting to us.
I would like to filter those out, but cannot find a simple file with that information. I am using 00-All.vcf.gz from dbSNP to identify known variants; it indicates whether a variant has a low-frequency allele, but not which allele that is.
I would like to be able to flag these variants - does someone know where I can get that data from, ideally somewhere that gets updated with dbSNP versions?
Thank you so much!
and then invoke this parser for dbsnp to print the frequencies:
curl -s "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" |\
gunzip -c | java -cp . Biostar15272
rs783 0.388 G
rs805 0.3067 C
rs820 0.1901 G
rs1056 0.2984 G
rs1118 0.3528 C
rs1119 0.4817 G
rs1222 0.1161 C
rs1312 0.4685 C
rs1314 0.1229 G
rs1315 0.1499 C
rs1317 0.1499 T
rs1320 0.0297 A
rs1321 0.3441 C
rs1608 0.1024 T
rs1654 0.1559 G
rs1656 0.1878 G
rs1682 0.1522 C
rs1803 0.2399 T
rs1858 0.0347 T
rs1859 0.0119 T
rs1860 0.3259 A
rs1866 0.308 T
rs1878 0.3729 A
rs1900 0.1463 C
rs1916 0.3286 C
rs1947 0.0718 C
rs1953 0.1622 T
rs2387 0.3048 T
rs2393 0.0178 C
rs2481 0.2893 A
rs2564 0.1545 C
rs2731 0.3857 G
rs2782 0.362 T
rs2913 0.3921 G
rs3179 0.4735 C
rs3233 0.2715 C
(...)
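With a three-column table like the one above (rs ID, minor allele frequency, minor allele), flagging calls whose reference allele is the minor allele comes down to a lookup by rs ID. Here is a minimal Python sketch, assuming the table is whitespace-separated text and the calls are in VCF with the rs ID in the ID column. The function names, the 0.05 cutoff and the two VCF records are placeholders made up for illustration. It also assumes the dbSNP allele is reported on the same strand as the VCF REF allele, which is worth checking before trusting the flags.

```python
import io

def load_minor_alleles(handle):
    """Read lines of 'rsID MAF minor_allele' into a dict: rsID -> (maf, allele)."""
    table = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3 or not fields[0].startswith("rs"):
            continue  # skip blank lines, '(...)' and anything unexpected
        try:
            maf = float(fields[1])
        except ValueError:
            continue
        table[fields[0]] = (maf, fields[2].upper())
    return table

def ref_is_minor(vcf_line, table, max_maf=0.05):
    """True if REF is the listed minor allele and its frequency is below max_maf."""
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 5:
        return False
    ref = cols[3].upper()
    for rs_id in cols[2].split(";"):  # the ID column may hold several IDs
        if rs_id in table:
            maf, allele = table[rs_id]
            if allele == ref and maf < max_maf:
                return True
    return False

# Three rows taken from the output above.
freqs = load_minor_alleles(
    io.StringIO("rs1320 0.0297 A\nrs1858 0.0347 T\nrs783 0.388 G\n")
)

# Made-up VCF records (positions and alleles are NOT real data).
made_up_vcf = [
    "22\t100\trs1320\tA\tG\t50\tPASS\t.",  # REF is the rare allele
    "22\t200\trs783\tG\tA\t50\tPASS\t.",   # REF is minor, but MAF is 0.388
]
for record in made_up_vcf:
    print(record.split("\t")[2], ref_is_minor(record, freqs))
```

On the toy input only the first record is flagged, because the second one has a minor allele frequency above the cutoff.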
The key here is "for various populations", because what is a minor allele in one population could be major in another. The reference genome could represent some degree of admixture, for example, so that "minor" allele may be the major allele in the population contributing to the admixture.
Interesting paper Daniel - thanks!
I don't have population information for the current datasets... for now, I am planning to tag these particular alleles and use that in combination with the G5A annotation to de-prioritize these variants (i.e. reference genome is minor allele, and minor allele is <5% in all populations, then the sample allele is not that interesting to us).
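The tagging idea could look roughly like this in Python. The sketch assumes the dbSNP flags (such as G5A) have already been copied into the INFO column of the call set, and that a separate step has decided whether REF is the minor allele. `REF_MINOR` is a tag name made up for illustration, and the record is made up too. It is worth checking the INFO header of your dbSNP VCF for the exact definition of G5A before relying on it.

```python
def info_flags(info):
    """Return the set of keys present in a VCF INFO string."""
    if info in ("", "."):
        return set()
    return {entry.split("=", 1)[0] for entry in info.split(";") if entry}

def tag_record(vcf_line, ref_minor):
    """Add the hypothetical REF_MINOR tag when ref_minor is true.

    Returns (possibly modified line, deprioritize), where deprioritize is
    True only if the record carries both REF_MINOR and G5A.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    flags = info_flags(cols[7])
    if ref_minor and "REF_MINOR" not in flags:
        cols[7] = "REF_MINOR" if cols[7] in ("", ".") else cols[7] + ";REF_MINOR"
        flags.add("REF_MINOR")
    deprioritize = "REF_MINOR" in flags and "G5A" in flags
    return "\t".join(cols), deprioritize

# Made-up record for illustration only.
line = "22\t100\trs1320\tA\tG\t50\tPASS\tG5A;VC=SNV"
tagged, low_priority = tag_record(line, ref_minor=True)
print(tagged)
print(low_priority)
```

Keeping the tag in INFO rather than dropping the record means the variants stay in the file and can still be reviewed later if the filtering rule changes.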
Thank you so much Pierre. This is exactly what I need.