Hi, I'm looking for a tool that can help me find nearest gene to ~64K SNPs. It would be great if the tool allows me to define the number of kb to search up- and downstream of the SNP.
I appriciate any help, hints and tips!
Hi, I'm looking for a tool that can help me find nearest gene to ~64K SNPs. It would be great if the tool allows me to define the number of kb to search up- and downstream of the SNP.
I appriciate any help, hints and tips!
I'm not sure if any tools exist that can do this straight out of the box.... Though I know for sure that within a few lines of Perl you can write this using the Ensembl API. You could even think of just using the Ensembl Mysql and then select for transcripts with the start/stop position closest by your SNP location.
In fact, the Feature object in the Ensembl Core API, and consequently also the VariationFeature object in the Variation API, has a method called get_nearest_Gene. But even without this is should be easy, I have done it myself in the past and might have still have the script lying around somewhere ....
You can use the UCSC mysql database to find the SNPs at a given distance of a gene. See my previous answer for a related question: How To Map A Snp To A Gene Around +/- 60Kb ?
This won't return THE closest gene, but it's a part of your solution.
If your data are in UCSC BED format — or can be converted to UCSC BED format with, say, BEDOPS conversion scripts — then BEDOPS closest-features
is very efficient at solving this problem. I'll describe below how you might approach this using BEDOPS and standard UNIX tools.
To demonstrate, let's say we start with a set of SNPs in a VCF file, and we have a set of GENCODE genes in GTF format. We'll convert these two files to sorted BED files and ask some questions with them.
First, we convert the VCF file:
$ vcf2bed < mySNPs.vcf > mySNPs.bed
Next, let's assume we're working with human data and that we visit the GENCODE site to grab GENCODE v15 records in GTF format. We can use BEDOPS gtf2bed
to convert this to a sorted UCSC BED-formatted file containing GENCODE genes:
$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
> gencodeGenes.bed
(This doesn't take long to do, but since releases only come out once in a while, you probably shouldn't have to do this last step too often.)
Now that both genes and SNPs are in sorted BED files, we can find the nearest gene to a SNP with closest-features
:
$ closest-features --closest --dist mySNPs.bed gencodeGenes.bed > answer.bed
The file answer.bed
is a sorted BED file that contains each SNP from mySNPs.bed
, the closest GENCODE gene to it from gencodeGenes.bed
, and the distance between the SNP and the gene in bases. The abstract format of each line of the result in this file is as follows, delimited with pipe characters ("|
"):
[ SNP-1 ] | [ gene-nearest-to-SNP-1 ] | [ distance-between-SNP-1-and-gene ]
[ SNP-2 ] | [ gene-nearest-to-SNP-2 ] | [ distance-between-SNP-2-and-gene ]
...
[ SNP-x ] | [ gene-nearest-to-SNP-x ] | [ distance-between-SNP-x-and-gene ]
We can use this distance value with awk
to quickly filter elements on some criteria of interest (say, we want genes that are no further away from SNPs than n
bases).
As an example, let's say we want all SNPs where the nearest gene is no more than 1000 bases away from its SNP:
$ awk -F "|" -v max_distance=1000 '{ \
if ($3 <= max_distance) { \
print $0; \
} \
}' answer.bed > filteredAnswer.bed
As you can see in the examples above, BEDOPS is designed to work with standard input and output to make it easy to string commands together into a pipeline. So if you need to repeat experiments with different inputs, you might do everything just described in just a few lines of code and put it into a shell script:
$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
> gencodeGenes.bed
$ vcf2bed < mySNPs.vcf \
| closest-features --closest --dist - gencodeGenes.bed \
| awk -F "|" -v max_distance=100 '{ \
if ($3 <= max_distance) { \
print $0; \
} \
}' - \
> filteredAnswer.bed
Eliminating two intermediate files (mySNPs.bed
and answer.bed
) reduces file I/O. Along with other efficiency advancements introduced by BEDOPS, this pipeline is fairly efficient.
NeibG (http://wuxbl.scripts.mit.edu/neibg/neibg.php) might do the job but it won't take SNP ids as input, they need to be converted to a locus first.
Are you lookink only for SNP flanking regions or genes ?
If you are only willing to get flanking regions of a SNP there is this method that can help you.
Otherwise, I suggest you do this to get the nearest gene :
1- extract your SNP in a bed format with a chr, start, end. 2- Get all refGenes from UCSC in a bed format as well 3- Use bedtools "closestBed" to get sens of the nearest gene to your SNP region
That's a suggestion that might be improved of course
Are those SNP from particular species? And do you have SNP coordinates, or just flanking sequence?
If you have coordinates, get your genome annotation (GFF or GTF) and simply parse it looking for nearest coding sequence. I don't know tools for that, but writting your own script wouldn't be that difficult:)
If you don't have SNP coordinates, you will need to map those on the genome first (f.e. BLAT).
https://snp-nexus.org/index.html
I found this is a great webpage. You can query using text file or vcf file here.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You might like to look in the related questions box (right sidebar) for this; it's been answered several times before.