I am looking at a bunch of SNPs. Some of them are part of genes, but other are not. I am interested to look up +60KB or -60KB of those SNPs to get details about some nearby genes. Please share your experience in dealing with such a situation or thoughts on any methods that can do this. Thanks in advance.
~> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
K.proteinID,
K.name,
S.name,
S.avHet,
S.chrom,
S.chromStart,
K.txStart,
K.txEnd
from snp130 as S
left join knownGene as K on
(S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
where
S.name in ("rs25","rs100","rs75","rs9876","rs101")'
Pierre, this is great solution. I completely forgot about mysql server. However, I'd recommend to use refFlat table instead of knownGene and K.geneName instead of K.proteinID. In this case you will get HUGO gene name. Anyway, it's up to OP.
I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files. Bu picking 'selected fields from primary and related tables' I was able to just get the fields I wanted. I then used Toad for MySQL (Freeware on Windows) to import the flat file to a local db in a single step.
I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files for the snp131 and refFlat (per Yuri above) tables. By choosing 'selected fields from primary and related tables' from drop down menu I was able to get only the the fields I wanted, thereby saving time and disk space. I then used Toad for MySQL (Freeware on Windows) to import the flat files to a local db in a single step each.
'not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart)' means if the gene end +60KB is before the SNP or the gene start - 60Kb if after the SNP the DO NOT accept the SNP.
Sorry, could you please tell me exactly what I have to type on the command line? If I copy this mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A in MySQL 5.7 Command Line Client I get an error in SQL syntax...thank you
I imagine you have this sorted out by now, but I thought I'd add that assuming your SNPs are in BED/ROD or VCF format and your genes are in BED12 or GFF format, then you can easily do this with a single command in my BEDTools suite. The relevant command would be:
windowBed -a snps.bed -b genes.bed -w 60000 > snps.withGenesWithin60Kb.bed
Hi Khader, Just wondering if you managed to try it with BEDTools? I'm trying to do a similar thing but not very familiar with mySQL. Also, may I just ask on what basis/why you chose a +/- 60kb window size? Thanks!
I just checked it works with that size flanks. The result is a FASTA file.
On the other hand, if you have a SNP position it could be easier to simply search for the genes directly in the genome annotation. Get a genome annotation as a tab separated file from e.g. BioMart and search for the genes with start/stop positons within this range.
I am afraid this is not exactly I am looking for. I wanted to check if a particular SNP is not mapped to a gene, whether there will be any known genes in +/- 60KB of that SNP. This particular BioMart query is fetching the sequence not annotation. I have tried one with annotation and it works. : http://www.biomart.org/biomart/martview/c26efcafd362abccc4af83039c441762. But sometimes BioMart out put requires lot of post-processing. Also it will be nice if BioMart can have options to select fields multiple attributes.
I downloaded refFlat table from UCSC Genome Browser, which contains gene symbols and their genomic locations. Then mapped SNPs to genes. I did it in MATLAB, but can be done with any language.
If you just want gene names within 60kb of the SNP:
If you want their locations, also:
These are operations to retrieve genes within 60kb of a SNP.
If you want the very nearest gene to a SNP, you can use
closest-features
:This gives you, for each SNP, its nearest gene and the distance value in bases.
These and other genomic set operation tools are part of the BEDOPS toolkit.