How To Map A Snp To A Gene Around +/- 60Kb ?
5
22
Entering edit mode
14.7 years ago

I am looking at a bunch of SNPs. Some of them are part of genes, but other are not. I am interested to look up +60KB or -60KB of those SNPs to get details about some nearby genes. Please share your experience in dealing with such a situation or thoughts on any methods that can do this. Thanks in advance.

snp data human annotation • 26k views
ADD COMMENT
0
Entering edit mode

If you just want gene names within 60kb of the SNP:

$ bedmap --echo --echo-map-id-uniq --range 60000 <(vcf2bed snps.vcf) genes.bed

If you want their locations, also:

$ bedmap --echo --echo-map --range 60000 <(vcf2bed snps.vcf) genes.bed

These are operations to retrieve genes within 60kb of a SNP.

If you want the very nearest gene to a SNP, you can use closest-features:

$ closest-features --closest --dist <(vcf2bed snps.vcf) genes.bed > answer.bed

This gives you, for each SNP, its nearest gene and the distance value in bases.

These and other genomic set operation tools are part of the BEDOPS toolkit.

ADD REPLY
26
Entering edit mode
14.7 years ago

Using the UCSC mysql server:

~> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
 K.proteinID,
 K.name,
 S.name,
 S.avHet,
 S.chrom,
 S.chromStart,
 K.txStart,
 K.txEnd
from snp130 as S
left join knownGene as K on
 (S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
where
 S.name in ("rs25","rs100","rs75","rs9876","rs101")'

result:

+-----------+------------+--------+----------+-------+------------+----------+----------+
| proteinID | name       | name   | avHet    | chrom | chromStart | txStart  | txEnd    |
+-----------+------------+--------+----------+-------+------------+----------+----------+
| NULL      | NULL       | rs100  |        0 | chr7  |   24438348 |     NULL |     NULL |
| NULL      | NULL       | rs101  |        0 | chr7  |   24438147 |     NULL |     NULL |
| NP_056019 | uc003ssf.3 | rs25   | 0.499586 | chr7  |   11584141 | 11414172 | 11871824 |
| NP_056019 | uc003ssf.3 | rs75   | 0.241967 | chr7  |   11613691 | 11414172 | 11871824 |
| B2RNV1    | uc003tnv.2 | rs9876 | 0.426096 | chr7  |   47315290 | 47314752 | 47579199 |
| B2RNV1    | uc003tnw.2 | rs9876 | 0.426096 | chr7  |   47315290 | 47314752 | 47621742 |
+-----------+------------+--------+----------+-------+------------+----------+----------+
ADD COMMENT
4
Entering edit mode

Pierre, this is great solution. I completely forgot about mysql server. However, I'd recommend to use refFlat table instead of knownGene and K.geneName instead of K.proteinID. In this case you will get HUGO gene name. Anyway, it's up to OP.

ADD REPLY
1
Entering edit mode

Very nice solution - anything that takes us away from ad hoc Perl scripts is good!

ADD REPLY
1
Entering edit mode

Recently, a question similar to this question was asked in BioBB. URL : http://www.bioinformatics.org/pipermail/bbb/2010-April/004849.html

ADD REPLY
0
Entering edit mode

Excellent ! I was looking for something like this. Thanks a lot Pierre.

ADD REPLY
0
Entering edit mode

Ditto. Very nice solution, I didn't even know that server existed. looks like they may be getting a bit more traffic in the future.

ADD REPLY
0
Entering edit mode

I wanted to limit my results to items with a geneSymbol so I added an extra join:

select S.name as rsId, kgXref.geneSymbol from snp131 as S left join knownGene as K on (S.chrom=K.chrom and not(K.txEnd+60000

ADD REPLY
0
Entering edit mode

I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files. Bu picking 'selected fields from primary and related tables' I was able to just get the fields I wanted. I then used Toad for MySQL (Freeware on Windows) to import the flat file to a local db in a single step.

ADD REPLY
0
Entering edit mode

I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files for the snp131 and refFlat (per Yuri above) tables. By choosing 'selected fields from primary and related tables' from drop down menu I was able to get only the the fields I wanted, thereby saving time and disk space. I then used Toad for MySQL (Freeware on Windows) to import the flat files to a local db in a single step each.

ADD REPLY
0
Entering edit mode

quite old but still - doesn't "not(K.txEnd+60000<S.chromStart" also mean "all genes that end more than 60kb away from snp" to the right?

ADD REPLY
0
Entering edit mode

'not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart)' means if the gene end +60KB is before the SNP or the gene start - 60Kb if after the SNP the DO NOT accept the SNP.

ADD REPLY
0
Entering edit mode

Sorry, could you please tell me exactly what I have to type on the command line? If I copy this mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A in MySQL 5.7 Command Line Client I get an error in SQL syntax...thank you

ADD REPLY
0
Entering edit mode

2019-11-20: the sql server has changed to mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306

ADD REPLY
8
Entering edit mode
14.4 years ago

I imagine you have this sorted out by now, but I thought I'd add that assuming your SNPs are in BED/ROD or VCF format and your genes are in BED12 or GFF format, then you can easily do this with a single command in my BEDTools suite. The relevant command would be:

windowBed -a snps.bed -b genes.bed -w 60000 > snps.withGenesWithin60Kb.bed
ADD COMMENT
0
Entering edit mode

Thanks for adding your answer. Interesting to note the BEDTools. I will go through it.

ADD REPLY
0
Entering edit mode

Hi Khader, Just wondering if you managed to try it with BEDTools? I'm trying to do a similar thing but not very familiar with mySQL. Also, may I just ask on what basis/why you chose a +/- 60kb window size? Thanks!

ADD REPLY
5
Entering edit mode
14.7 years ago
Michael 55k

No need to program this. This can be done as a BioMart query in MartView:

rs8 with 60000 bases up/donw-flanks

I just checked it works with that size flanks. The result is a FASTA file.

On the other hand, if you have a SNP position it could be easier to simply search for the genes directly in the genome annotation. Get a genome annotation as a tab separated file from e.g. BioMart and search for the genes with start/stop positons within this range.

ADD COMMENT
0
Entering edit mode

Thanks Michael.

ADD REPLY
0
Entering edit mode

I am afraid this is not exactly I am looking for. I wanted to check if a particular SNP is not mapped to a gene, whether there will be any known genes in +/- 60KB of that SNP. This particular BioMart query is fetching the sequence not annotation. I have tried one with annotation and it works. : http://www.biomart.org/biomart/martview/c26efcafd362abccc4af83039c441762. But sometimes BioMart out put requires lot of post-processing. Also it will be nice if BioMart can have options to select fields multiple attributes.

ADD REPLY
0
Entering edit mode

Yes, sorry, I misunderstood your question. What Pierre described is much better here.

ADD REPLY
2
Entering edit mode
14.7 years ago
Yuri ★ 1.7k

I downloaded refFlat table from UCSC Genome Browser, which contains gene symbols and their genomic locations. Then mapped SNPs to genes. I did it in MATLAB, but can be done with any language.

You can also try a Perl script from here.

ADD COMMENT

Login before adding your answer.

Traffic: 2305 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6