How to extract a random set of SNPs from a SNP table

0

8.3 years ago

Ana ▴ 200

I have a SNP table (tab.table format) containing more than 10,000K SNPs (it's whole-genome data). Now I need to extract a random set of 10K SNPs, spaced approximately equally along the chromosomes (17 chromosomes). Could you please help me figure out how to do that? Thanks in advance for any suggestions.

HanXRQChr00c0001    68313   N   N   N   N   C   C   N   N   N   N   N   N   N   N   C   N   N   N   N   N   C   C   C   N   NN
HanXRQChr00c0001    68457   N   N   N   N   N   G   N   N   N   N   N   N   N   R   G   N   N   N   N   N   N   G   N   N   NN
HanXRQChr00c0001    68521   N   N   N   N   N   K   N   N   N   N   N   N   N   G   K   N   N   N   N   N   N   G   N   G   NN
HanXRQChr00c0001    68536   N   N   N   N   N   A   N   N   N   N   N   N   N   A   A   N   N   N   N   N   N   A   N   A   NN
HanXRQChr00c0001    68746   N   N   N   N   N   A   N   N   N   N   N   N   N   A   A   N   N   N   N   N   N   A

SNP random sample extraction • 4.0k views

ADD COMMENT • link updated 8.3 years ago by Alex Reynolds 36k • written 8.3 years ago by Ana ▴ 200

0

Do you need exactly 10k SNPs or approximately?

ADD REPLY • link 8.3 years ago by WouterDeCoster 48k

0

I need approximately 10K

ADD REPLY • link 8.3 years ago by Ana ▴ 200

0

If you know a bit of programming, this should be easy. The probability that a line is included in the final set is 10k divided by the total number of lines.

Pseudocode (no time for real code)

probability = 10k / total number of lines
for line in table:
    if random() < probability:
        print(line)

ADD REPLY • link 8.3 years ago by WouterDeCoster 48k

0

Thanks for your comment, but I don't quite follow it. Could you please explain a bit more? I'd appreciate that. I have more than 11 million lines! The probability that a particular line gets included is the same as for every other line.

ADD REPLY • link 8.3 years ago by Ana ▴ 200

0

Typing code on phone is hard, I'm traveling to a conference. Probability is 10k/11M.
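
To make the 10k/11M figure concrete, here is a small Python 3 sketch of the arithmetic. It assumes each line is kept independently with the same probability, so the number of kept lines follows a binomial distribution. The function name `inclusion_stats` is made up for illustration. With the numbers from this thread you should expect roughly 10,000 lines, give or take about 100.

```python
import math

def inclusion_stats(target, total_lines):
    """Return (probability, expected count, standard deviation) for
    independent per-line sampling with probability target / total_lines."""
    p = target / total_lines
    expected = total_lines * p                       # mean of a binomial(n, p)
    spread = math.sqrt(total_lines * p * (1 - p))    # its standard deviation
    return p, expected, spread

# Figures from this thread: ~10k SNPs wanted out of ~11 million lines
p, expected, spread = inclusion_stats(10_000, 11_000_000)
print(f"p = {p:.6f}, expected = {expected:.0f} lines, sd = {spread:.0f}")
```

This is why the approach suits an "approximately 10K" target: the count varies a little from run to run, but not by much.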

ADD REPLY • link 8.3 years ago by WouterDeCoster 48k

2

8.3 years ago

WouterDeCoster 48k

Hi Ana,

Sorry, it took a while to get back to this. Below is a bit of code with a very basic random line selector based on a predefined probability of inclusion (the variable odds):

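	from __future__ import print_function, division
	import random
	import sys

	# Probability of keeping any given line: ~10k wanted out of ~11M lines
	odds = 10000 / 11000000

	with open(sys.argv[1]) as table:
	    for line in table:
	        # random.random() is uniform on [0, 1), so this is true with probability odds
	        if random.random() < odds:
	            print(line, end="")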

If your file has a header this will need minor modification.
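
One possible way to handle a header, sketched in Python 3: always pass the first line through, then sample the rest. The function name `sample_lines`, the `has_header` flag, and the `CHROM/POS/S1` header in the demo are assumptions for illustration, not part of the original script or of your file.

```python
import random

def sample_lines(lines, probability, has_header=True, rng=random):
    """Yield the header (if any) and then each remaining line with the
    given probability. `lines` is any iterable of strings, e.g. an open file."""
    lines = iter(lines)
    if has_header:
        header = next(lines, None)   # None if the input is empty
        if header is not None:
            yield header             # always keep the header line
    for line in lines:
        if rng.random() < probability:
            yield line

# Tiny in-memory demo with a made-up header; with a real file,
# pass an open file handle instead of the list
demo = ["CHROM\tPOS\tS1\n"] + [f"chr1\t{i}\tN\n" for i in range(1, 1001)]
kept = list(sample_lines(demo, probability=0.01))
print("".join(kept), end="")
```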

The script is intended to be saved as lineSelector.py (for example) and executed as:

python lineSelector.py yourfile.txt > selectedfile.txt

Let me know if this doesn't work as expected.

ADD COMMENT • link 8.3 years ago by WouterDeCoster 48k

0

Thanks a lot, WouterDeCoster. Sorry, I was away and only just checked the code. It does not seem to work, and I have no experience with Python. I get this error message:

Traceback (most recent call last):
  File "./Lineselector.py", line 7, in <module>
    for line in open(sys.argv[1]):
NameError: name 'sys' is not defined

Can you help me fix it? Thank you so much.

ADD REPLY • link 8.3 years ago by Ana ▴ 200

0

Woop, made a tiny mistake :)
I edited the code, can you try again?

ADD REPLY • link 8.3 years ago by WouterDeCoster 48k

0

Many many thanks WouterDeCoster, it works perfectly :)

ADD REPLY • link 8.3 years ago by Ana ▴ 200

0

Good to hear. You can mark my answer as 'accepted' to mark this question as solved.

ADD REPLY • link 8.3 years ago by WouterDeCoster 48k

1

8.3 years ago

Alex Reynolds 36k

To sample without replacement from a text file, you can use sample:

$ N=10000
$ sample --sample-size=${N} foo.vcf > sample.${N}.vcf

This does a sample with uniform probability, so you will get a sample with 10K SNPs equally distributed over the input set (and so evenly distributed over the chromosomes).
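
If installing a separate tool is not an option, a comparable result (exactly N lines, uniform, without replacement) can be sketched in Python 3 with reservoir sampling. This is a standard technique and not necessarily how the sample tool works internally. The function name `reservoir_sample` and the demo data are made up for illustration.

```python
import random

def reservoir_sample(lines, k, rng=random):
    """Return k lines chosen uniformly without replacement (Algorithm R),
    in their original order. Returns everything if there are fewer than k lines."""
    reservoir = []                      # holds (line_number, line) pairs
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append((i, line))
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = (i, line)
    reservoir.sort()                    # restore input (genomic) order
    return [line for _, line in reservoir]

# Made-up demo input; with a real file, pass an open file handle
demo = [f"chr1\t{pos}\n" for pos in range(100)]
picked = reservoir_sample(demo, 10)
print("".join(picked), end="")
```

Only k lines are held in memory at any time, so this should also cope with an 11-million-line table.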

ADD COMMENT • link 8.3 years ago by Alex Reynolds 36k

0

8.3 years ago

Ram 45k

Given that you need random yet equally spaced, I'd recommend creating 10k equal sized bins along the genome (or a subset of the genome - whatever you're using) and then randomly picking within each bin. Address each step as you get to it and you should be able to solve this problem - it looks pretty straightforward.
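
A minimal Python 3 sketch of the binning idea, under stated assumptions: the table is whitespace-separated with chromosome and position in the first two columns (as in the sample rows in the question), there is no header line, and bins are fixed-width windows per chromosome. The function name `one_snp_per_bin`, the `bin_size` parameter, and the positions and the `HanXRQChr01` name in the demo are made up for illustration.

```python
import random

def one_snp_per_bin(lines, bin_size, rng=random):
    """Pick one SNP uniformly at random from each fixed-width bin.

    Assumes whitespace-separated lines whose first two columns are the
    chromosome name and the position, with no header line."""
    chosen = {}   # (chrom, bin index) -> [SNPs seen so far, currently kept line]
    for line in lines:
        if not line.strip():
            continue                    # skip blank lines
        chrom, pos = line.split(None, 2)[:2]
        key = (chrom, int(pos) // bin_size)
        slot = chosen.setdefault(key, [0, line])
        slot[0] += 1
        # Size-one reservoir: the n-th SNP in a bin replaces the kept one
        # with probability 1/n, so each SNP in the bin is equally likely
        if rng.randrange(slot[0]) == 0:
            slot[1] = line
    # Dicts keep insertion order (Python 3.7+), so bins come out in the
    # order they first appear in the input
    return [line for _, line in chosen.values()]

# Made-up demo rows in the same layout as the table in the question
demo = [
    "HanXRQChr00c0001\t68313\tN\tC\n",
    "HanXRQChr00c0001\t68457\tN\tG\n",
    "HanXRQChr00c0001\t168521\tN\tK\n",
    "HanXRQChr01\t500\tA\tA\n",
]
print("".join(one_snp_per_bin(demo, bin_size=100_000)), end="")
```

To aim for about 10k SNPs, one option is to set bin_size to roughly the total length of the chromosomes divided by 10,000. Bins that contain no SNPs contribute nothing, so the final count can come out somewhat lower.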

ADD COMMENT • link 8.3 years ago by Ram 45k
