Segregate out SNPs into two classes and process them into one BED file. Once you have them as a BED file, you can do set operations against gene tables (RefSeq, GENCODE, etc.).
This answer uses BEDOPS tools and conversion scripts, so have this suite installed before proceeding.
First, use grep
to extract all of your SNPs that start with rs
:
$ grep "^rs" SNPs.txt > rsSNPs.txt
Then use mysql
to grab the dbSNP snp138Common
table as a BED file:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'SELECT chrom, chromStart, chromEnd, name FROM snp138Common' | tail -n +2 > snp138Common.bed
Now use grep
to filter snp138Common.bed
for your rs*
SNPs of interest:
$ grep -wFf rsSNPs.txt snp138Common.bed | sort-bed - > rsSNPs.bed
Second, Illumina SNPs which have chromosome name and base positions in the annotation name could be converted to a BED file directly:
$ grep "^imm" SNPs.txt \
| awk ' \
BEGIN { \
OFS = "\t"; \
} \
{ \
split($0, elems, "_"); \
chr = "chr"elems[2]; \
start = elems[3] - 1; \
stop = elems[3]; \
id = $0; \
print chr, start, stop, id; \
} \
' - | sort-bed - > illuminaSNPs.bed
Third, take the union of the two BED files with the bedops
set operation tool:
$ bedops --everything rsSNPs.bed illuminaSNPs.bed > allSNPs.bed
You now have a sorted BED file called allSNPs.bed
, which you can query against RefSeq, GENCODE and other gene tables of interest.
As an example, let's grab GENCODE v19 records, filter them for genes, and convert the result to BED with the gtf2bed
conversion tool:
$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
| gunzip -c \
| grep -w "gene" \
| gtf2bed \
> gencode.v19.genes.bed
To demonstrate mapping/querying, we use the bedmap
map tool to look at a 1 kb window around each of your SNPs of interest, looking for any GENCODE v19 gene name annotations that fall within that window:
$ bedmap --range 500 --echo --echo-map-id-uniq allSNPs.bed gencode.v19.genes.bed > answer.bed
The file answer.bed
will be in the following format:
...
[ SNP-i ] | [ gene-id-1 ] ; [ gene-id-2 ] ; ... ; [ gene-id-n ]
...
In other words, each line has a SNP and the names of any GENCODE gene names that overlap a 1 kb window centered on the SNP.