Question

How to obtain rs ID having genomic position

1

Entering edit mode

7.2 years ago

Adam ▴ 40

Hello,

is it possible easily to get rs id's having genomic locations? Ex: from this:

Chr1    158669597   158669597
Chr11   72946311    72946311

To this:

Chr1    158669597   158669597   rs141159720
Chr11   72946311    72946311    rs145119561

My idea was to dobnload all SNP's from UCSC or Ensembl and then compare two files by genomic coordinates in R it's simply:

mydata$rs <- allsnps$rs[match(mydata$loc,allsnps$loc)]

However list of all human SNP's is huge, reference file has anout 20gb. Do you know any easy solution to get result such like this?

Thanks.

R SNP • 4.8k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 36k • written 7.2 years ago by Adam ▴ 40

2

Entering edit mode

you want bedtools interesct

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

allsnps$rs[match(mydata$loc,allsnps$loc)]

i don't think it would work if the SNP length is not '1'

Chr1 158669597 158669597 rs141159720

be careful, this doesn't look like a BED interval because start==end. while the UCSC data are 0-based and use open-ended intervals.

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you for reply. Yes, I know that it isn't proper BED file and it complicating the problem. Is it possible to convert it into .vcf and then run some variant annotation tools such as SnpSift and SnpEff?

ADD REPLY • link 7.2 years ago by Adam ▴ 40

1

Entering edit mode

what about using dbsnp as VCF ? https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yes I have but the main problem is to compare it into my 'pseudo' BED file

ADD REPLY • link 7.2 years ago by Adam ▴ 40

0

Entering edit mode

fix the bed and use bedtools intersect ?

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

score 4 · Accepted Answer · 2017-09-15

If you're worried about disk usage, BEDOPS does set operations on compressed files in a format called Starch, which offers a higher compression ratio than bzip2 and, as mentioned, allows direct set operations, without any extraction to an intermediate file.

With BEDOPS installed, you could create a Starch file of SNPs like so:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp147.txt.gz \
  | gunzip -c \
  | awk -v OFS="\t" '{ print $$2,$$3,($$3+1),$$5 }' \
  | sort-bed - \
  | starch - \
  > hg19.snp147.starch

(Adjust as needed for the SNPs you need, for the reference genome you're working with.)

An uncompressed raw BED file of SNPs is roughly 5.5 GB. Compressed as a Starch file, as shown above, it should be about 5-10% of that, around 250-500 MB.

Given a sorted, tab-delimited genomic regions file regions.bed like this:

chr1    158669597   158669598
chr11   72946311    72946312

Then you could do a bedmap operation with the compressed SNP annotations like this:

$ bedmap --echo --echo-map-id --delim '\t' regions.bed hg19.snp147.starch
chr1    158669597       158669598    
chr11   72946311        72946312    rs765534001

In this example chr1:158669597-158669598 does not associate with a SNP ID, while chr11:72946311-72946312 associates with SNP rs765534001.

So there are data hygiene issues with your input you should fix before using this toolkit:

Your regions should be half-open, zero-based. This means that the start and stop columns should not be equal. You can use awk to increment the stop field by 1, where start and stop are equal.
Your regions should be sorted with BEDOPS sort-bed for set operations to run quickly and correctly.
Your regions should start with chr and not Chr. This can be fixed with awk, as well.