Question

How To Find Nearest Gene To A Batch Of Snps

3

Entering edit mode

13.4 years ago

Hans ▴ 30

Hi, I'm looking for a tool that can help me find nearest gene to ~64K SNPs. It would be great if the tool allows me to define the number of kb to search up- and downstream of the SNP.

I appriciate any help, hints and tips!

snp gene gene • 20k views

ADD COMMENT • link updated 5.9 years ago by yupadhya • 0 • written 13.4 years ago by Hans ▴ 30

1

Entering edit mode

You might like to look in the related questions box (right sidebar) for this; it's been answered several times before.

ADD REPLY • link 13.4 years ago by Neilfws 49k

score 5 · Answer 1 · 2011-07-21

5

Entering edit mode

13.4 years ago

Sander Timmer ▴ 710

I'm not sure if any tools exist that can do this straight out of the box.... Though I know for sure that within a few lines of Perl you can write this using the Ensembl API. You could even think of just using the Ensembl Mysql and then select for transcripts with the start/stop position closest by your SNP location.

ADD COMMENT • link 13.4 years ago by Sander Timmer ▴ 710

3

Entering edit mode

In fact, the Feature object in the Ensembl Core API, and consequently also the VariationFeature object in the Variation API, has a method called get_nearest_Gene. But even without this is should be easy, I have done it myself in the past and might have still have the script lying around somewhere ....

ADD REPLY • link 13.4 years ago by Bert Overduin ★ 3.7k

Ram · Answer 2 · 2011-07-21

4

Entering edit mode

13.4 years ago

Pierre Lindenbaum 164k

You can use the UCSC mysql database to find the SNPs at a given distance of a gene. See my previous answer for a related question: How To Map A Snp To A Gene Around +/- 60Kb ?

This won't return THE closest gene, but it's a part of your solution.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.4 years ago by Pierre Lindenbaum 164k

score 3 · Answer 3 · 2013-05-01

If your data are in UCSC BED format — or can be converted to UCSC BED format with, say, BEDOPS conversion scripts — then BEDOPS closest-features is very efficient at solving this problem. I'll describe below how you might approach this using BEDOPS and standard UNIX tools.

To demonstrate, let's say we start with a set of SNPs in a VCF file, and we have a set of GENCODE genes in GTF format. We'll convert these two files to sorted BED files and ask some questions with them.

First, we convert the VCF file:

$ vcf2bed < mySNPs.vcf > mySNPs.bed

Next, let's assume we're working with human data and that we visit the GENCODE site to grab GENCODE v15 records in GTF format. We can use BEDOPS gtf2bed to convert this to a sorted UCSC BED-formatted file containing GENCODE genes:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > gencodeGenes.bed

(This doesn't take long to do, but since releases only come out once in a while, you probably shouldn't have to do this last step too often.)

Now that both genes and SNPs are in sorted BED files, we can find the nearest gene to a SNP with closest-features:

$ closest-features --closest --dist mySNPs.bed gencodeGenes.bed > answer.bed

The file answer.bed is a sorted BED file that contains each SNP from mySNPs.bed, the closest GENCODE gene to it from gencodeGenes.bed, and the distance between the SNP and the gene in bases. The abstract format of each line of the result in this file is as follows, delimited with pipe characters ("|"):

[ SNP-1 ] | [ gene-nearest-to-SNP-1 ] | [ distance-between-SNP-1-and-gene ]
[ SNP-2 ] | [ gene-nearest-to-SNP-2 ] | [ distance-between-SNP-2-and-gene ]
...
[ SNP-x ] | [ gene-nearest-to-SNP-x ] | [ distance-between-SNP-x-and-gene ]

We can use this distance value with awk to quickly filter elements on some criteria of interest (say, we want genes that are no further away from SNPs than n bases).

As an example, let's say we want all SNPs where the nearest gene is no more than 1000 bases away from its SNP:

$ awk -F "|" -v max_distance=1000 '{ \
    if ($3 <= max_distance) { \
        print $0; \
    } \
}' answer.bed > filteredAnswer.bed

As you can see in the examples above, BEDOPS is designed to work with standard input and output to make it easy to string commands together into a pipeline. So if you need to repeat experiments with different inputs, you might do everything just described in just a few lines of code and put it into a shell script:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > gencodeGenes.bed

$ vcf2bed < mySNPs.vcf \
    | closest-features --closest --dist - gencodeGenes.bed \
    | awk -F "|" -v max_distance=100 '{ \
        if ($3 <= max_distance) { \
            print $0; \
        } \
    }' - \
    > filteredAnswer.bed

Eliminating two intermediate files (mySNPs.bed and answer.bed) reduces file I/O. Along with other efficiency advancements introduced by BEDOPS, this pipeline is fairly efficient.

score 1 · Answer 4 · 2011-07-22

1

Entering edit mode

13.4 years ago

Rhys ▴ 10

NeibG (http://wuxbl.scripts.mit.edu/neibg/neibg.php) might do the job but it won't take SNP ids as input, they need to be converted to a locus first.

ADD COMMENT • link 13.4 years ago by Rhys ▴ 10

score 1 · Answer 5 · 2013-05-01

Are you lookink only for SNP flanking regions or genes ?

If you are only willing to get flanking regions of a SNP there is this method that can help you.

Otherwise, I suggest you do this to get the nearest gene :

1- extract your SNP in a bed format with a chr, start, end. 2- Get all refGenes from UCSC in a bed format as well 3- Use bedtools "closestBed" to get sens of the nearest gene to your SNP region

That's a suggestion that might be improved of course

score 0 · Answer 6 · 2011-07-22

Are those SNP from particular species? And do you have SNP coordinates, or just flanking sequence?

If you have coordinates, get your genome annotation (GFF or GTF) and simply parse it looking for nearest coding sequence. I don't know tools for that, but writting your own script wouldn't be that difficult:)

If you don't have SNP coordinates, you will need to map those on the genome first (f.e. BLAT).

score 0 · Answer 7 · 2019-01-18

0

Entering edit mode

5.9 years ago

yupadhya • 0

https://snp-nexus.org/index.html

I found this is a great webpage. You can query using text file or vcf file here.

ADD COMMENT • link 5.9 years ago by yupadhya • 0