Drawing A Random List Of Snps
3
0
Entering edit mode
11.6 years ago
Ryan D ★ 3.4k

I would like to be able to draw a random list of N SNPs from dbSNP/UCSC. If I have a list of HapMap SNPs, for instance, in a bed file format, I can shuffle them and select 1000 at random. Since the placement of SNPs in HapMap is not necessarily representative of the totality of SNPs in the genome, I'd like to do this with dbSNP. Short of downloading a bedfile of all SNPs in the genome from which resampling might be computationally intensive, is there an easy method by which to draw some number of random SNPs from the genome and have them returned in BED file format?

To do this with a list of SNPs in a bed file, I currently use the shuf command like shown below. But to do this for the 56M SNPs currently in dbSNP in order to resample 10k random SNPS multiple times might be too intensive. Ideas? R perhaps? Anyway to do this from the unix prompt so I can use the output in bedtools?

cat file.bed | head

chr1    235638  235751  13.6663
chr1    237748  237784  6.35761
chr1    521484  521614  10.0359
chr1    565575  566082  7.19007
chr1    567523  567873  10.5674
chr1    568176  568545  5.7313
chr1    569748  570042  652.342
chr1    664708  664756  6.32348

shuf file.bed | head

chr3    138552319       138553474       56.8719
chr12   7695465 7695792 11.469
chr20   23312538        23312926        6.68979
chr14   87802700        87802821        6.09238
chr2    180293340       180293591       4.35159
chr18   60279291        60279551        7.28719
chr19   49068267        49068726        34.7679
chr12   60729653        60729899        20.4301
chr2    30458084        30458522        65.6261
chr12   63695225        63695404        4.89757
hg19 ucsc • 3.3k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
1
Entering edit mode
11.6 years ago

you could try to get the SNP by their index in the ucsc database using the 'LIMIT statement':

$ perl -e 'for($i=0;$i<10;$i++) {printf("select chrom,chromStart,chromEnd,name from snp137 as S limit %d,1;\n",1+rand(56248699));}' |\
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N 

chr13    37964205    37964206    rs181919048
chr1    72200614    72200615    rs200947728
chr16    48061386    48061387    rs190847942
chr18    58696395    58696396    rs185805122
chr3    22533963    22533964    rs111931723
(...)

however, it remains slow :-)

ADD COMMENT
0
Entering edit mode

This works, Pierre, but it is slow as you said. I want to re-sample at least 1000 times. So I'll look at matted's solution below.

ADD REPLY
1
Entering edit mode
11.6 years ago
matted 7.8k

It sounds like reservoir sampling will do what you want. The idea is to keep only the M=1000 hits you want, replacing them at random as you go through the N=56M lines. That way your algorithm is O(N) runtime with O(M) memory usage, instead of O(N log N) runtime with O(N) memory usage. There is a good discussion of this applied to fastq files Selecting Random Pairs From Fastq? and for general files here. Both threads have example code snippets.

ADD COMMENT
0
Entering edit mode

This sounds computationally exactly like what I want to do.Thanks to both of you.

ADD REPLY
1
Entering edit mode
11.6 years ago
lh3 33k

Randomly sample k lines from a text file with reservoir sampling:

cat file.txt|awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

You keep at most k lines in RAM. If you want to sample more than ~10% of a huge file, you can sample with fraction p:

cat file.txt|awk -v p=0.1 'rand()<p'

which is much more efficient, though does not give you the exact number of output lines.

ADD COMMENT
0
Entering edit mode

This does a great job. Thanks. But it seems it always gives the same list from top to bottom. For instance:

cat dbsnp137.bed | awk -v k=3 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

chr1    187631300       187631301       rs7549966       0       +
chr3    17116293        17116294        rs76328502      0       +
chr2    159274953       159274954       rs142896826     0       +

cat dbsnp137.bed | awk -v k=3 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

chr1    226110938       226110939       rs35269575      0       +
chr1    187631519       187631520       rs138424077     0       +
chr3    17116303        17116304        rs77268742      0       +
chr2    159274997       159274998       rs113076635     0       +
chr1    23741704        23741705        rs146723620     0       +

The first 3 entries are the same. And it is like that for additional draws. So I don't think this reservoir sampling. Or, if it is, it is not random.

ADD REPLY
1
Entering edit mode

Sorry, maybe I don't understand, but they don't look the same to me. You can see how to set the awk random seed here: http://stackoverflow.com/questions/4048378/random-numbers-generation-with-awk-in-bash-shell.

ADD REPLY

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6