What is the best way to determine whether any variants in LD map to exons and cause coding changes? Is there a tool like PolyPhen through which they could be run in batches?
McLaren W, et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18.
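For batch use, the predictor described there is driven from the command line. A minimal sketch, assuming the script name, the -i/-o flags, and the whitespace-separated input format from the paper-era documentation (check the script's built-in help for your release, as these may differ):

    # variants.txt holds one variant per line as "chr start end alleles strand",
    # e.g. "1 1158631 1158631 A/G +"
    perl snp_effect_predictor.pl -i variants.txt -o consequences.txt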
On my side, last year I wrote a simple program using the UCSC 'knownGene' table to predict the consequences of mutations. An early version was described here.
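Not that program itself, but to illustrate the kind of lookup it performs: the UCSC public MySQL server can be queried directly for knownGene transcripts overlapping a position. A sketch using the server and account UCSC documents, against hg19 (swap in whichever assembly your coordinates are on); note that knownGene stores 0-based half-open coordinates:

    # Which knownGene transcripts overlap chr1:1158631 (1-based)?
    mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e \
      'SELECT name, strand, exonStarts, exonEnds FROM knownGene
       WHERE chrom = "chr1" AND txStart < 1158631 AND txEnd >= 1158631;'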
This is really good, Pierre. It appears that it takes input like the kind I could provide (e.g. chr1:1158631, rs11689281). Now I just need someone to hook me up with the elusive tables the authors do not seem to have provided.
The paper's supplementary material says:
"The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site."
The loss-of-function variants are all annotated as part of the standard paper data set, which can be found on the 1000 Genomes FTP site.
There are a lot of files to navigate, but as this represents a lot of data, it was felt this was the best way to distribute it.
As the code in a comment on another answer seems to have been a little mangled, here it is again: a quick one-liner with a bash for loop and our current.tree file, which gets you all the files based on a simple grep.
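Something along these lines, reconstructed from that description; the FTP base URL and the assumption that the file path sits in the first column of current.tree are worth verifying against the file itself:

    # Pull every LOF-related file listed in the current.tree index
    for file in $(grep -i lof current.tree | awk '{print $1}'); do
        wget "ftp://ftp.1000genomes.ebi.ac.uk/vol1/$file"
    done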
"The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site."
I suspect that these are the files named .LOF.txt.gz (or just .LOF.txt). They seem to be scattered through various directories at the FTP site.
For example, this FTP directory contains "README.2010_07.lof_variants", with LOF files in the exon/, low_coverage/ and trio/ sub-directories (and, in fact, more sub-directories therein, e.g. exon/snps, exon/indels). The directory for data from the paper seems to have a similar structure.
You may just have to navigate through the FTP site, taking notes and reading README files until you find what you want. Or, I guess, email the authors and ask for a direct link to the LOF data.
A good way to do it would be to parse the VCF files provided on the 1000 Genomes website and use the fields inside the files to filter according to your needs.
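For instance, a rough first pass over the INFO column, sketched with a placeholder file name and a guessed annotation string; check the ##INFO header lines of your particular file to see which tags it actually carries:

    # INFO is column 8 of a VCF; keep records whose annotation mentions
    # a non-synonymous consequence (the matched string is an assumption)
    zcat your_1000g_file.vcf.gz \
        | awk -F'\t' '!/^#/ && $8 ~ /NON_SYNONYMOUS/ {print $1, $2, $3, $8}'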
You wrote, "In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.” Presumably a table must exist to get such a number which has genome positions, rs#s, and predictions for which variants are functional. If anyone could find that table or that data on the 1000G website, that would solve this problem."
I would write to Daniel MacArthur and ask him for that table. He is the one who worked on the loss-of-function variants identified in the 1000G data and presented this at ASHG last week. He is on Twitter as dgmacarthur.
In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.”
Presumably a table must exist to get such a number, with genome positions, rs#s, and predictions for which variants are functional.
If anyone could find that table or that data on the 1000G website, that would solve this problem.