How to obtain rs ID having genomic position
1
1
Entering edit mode
7.2 years ago
Adam ▴ 40

Hello,

is it possible easily to get rs id's having genomic locations? Ex: from this:

Chr1    158669597   158669597
Chr11   72946311    72946311

To this:

Chr1    158669597   158669597   rs141159720
Chr11   72946311    72946311    rs145119561

My idea was to dobnload all SNP's from UCSC or Ensembl and then compare two files by genomic coordinates in R it's simply:

mydata$rs <- allsnps$rs[match(mydata$loc,allsnps$loc)]

However list of all human SNP's is huge, reference file has anout 20gb. Do you know any easy solution to get result such like this?

Thanks.

R SNP • 4.8k views
ADD COMMENT
2
Entering edit mode

you want bedtools interesct

ADD REPLY
0
Entering edit mode

allsnps$rs[match(mydata$loc,allsnps$loc)]

i don't think it would work if the SNP length is not '1'

Chr1 158669597 158669597 rs141159720

be careful, this doesn't look like a BED interval because start==end. while the UCSC data are 0-based and use open-ended intervals.

ADD REPLY
0
Entering edit mode

Thank you for reply. Yes, I know that it isn't proper BED file and it complicating the problem. Is it possible to convert it into .vcf and then run some variant annotation tools such as SnpSift and SnpEff?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Yes I have but the main problem is to compare it into my 'pseudo' BED file

ADD REPLY
0
Entering edit mode

fix the bed and use bedtools intersect ?

ADD REPLY
4
Entering edit mode
7.2 years ago

If you're worried about disk usage, BEDOPS does set operations on compressed files in a format called Starch, which offers a higher compression ratio than bzip2 and, as mentioned, allows direct set operations, without any extraction to an intermediate file.

With BEDOPS installed, you could create a Starch file of SNPs like so:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp147.txt.gz \
  | gunzip -c \
  | awk -v OFS="\t" '{ print $$2,$$3,($$3+1),$$5 }' \
  | sort-bed - \
  | starch - \
  > hg19.snp147.starch

(Adjust as needed for the SNPs you need, for the reference genome you're working with.)

An uncompressed raw BED file of SNPs is roughly 5.5 GB. Compressed as a Starch file, as shown above, it should be about 5-10% of that, around 250-500 MB.

Given a sorted, tab-delimited genomic regions file regions.bed like this:

chr1    158669597   158669598
chr11   72946311    72946312

Then you could do a bedmap operation with the compressed SNP annotations like this:

$ bedmap --echo --echo-map-id --delim '\t' regions.bed hg19.snp147.starch
chr1    158669597       158669598    
chr11   72946311        72946312    rs765534001

In this example chr1:158669597-158669598 does not associate with a SNP ID, while chr11:72946311-72946312 associates with SNP rs765534001.

So there are data hygiene issues with your input you should fix before using this toolkit:

  1. Your regions should be half-open, zero-based. This means that the start and stop columns should not be equal. You can use awk to increment the stop field by 1, where start and stop are equal.
  2. Your regions should be sorted with BEDOPS sort-bed for set operations to run quickly and correctly.
  3. Your regions should start with chr and not Chr. This can be fixed with awk, as well.
ADD COMMENT

Login before adding your answer.

Traffic: 1578 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6