How to extract a random set of SNPs from a SNP table
7.7 years ago
Ana ▴ 200

I have a SNP table (tab.table format) containing more than 10000K SNPs (it is whole-genome data). I need to extract a random set of 10k SNPs that are approximately equally spaced along the chromosomes (17 chromosomes). Could you please help me figure out how to do that? Thanks in advance for any suggestions.

HanXRQChr00c0001    68313   N   N   N   N   C   C   N   N   N   N   N   N   N   N   C   N   N   N   N   N   C   C   C   N   NN
HanXRQChr00c0001    68457   N   N   N   N   N   G   N   N   N   N   N   N   N   R   G   N   N   N   N   N   N   G   N   N   NN
HanXRQChr00c0001    68521   N   N   N   N   N   K   N   N   N   N   N   N   N   G   K   N   N   N   N   N   N   G   N   G   NN
HanXRQChr00c0001    68536   N   N   N   N   N   A   N   N   N   N   N   N   N   A   A   N   N   N   N   N   N   A   N   A   NN
HanXRQChr00c0001    68746   N   N   N   N   N   A   N   N   N   N   N   N   N   A   A   N   N   N   N   N   N   A
SNP random sample extraction • 3.5k views

Do you need exactly 10k SNPs or approximately?


I need approximately 10K


If you know a bit of programming this should be easy. The probability that any given line is included in the final set is 10k divided by the total number of lines.

Pseudocode (no time for real code)

for line in table:
    if random() < probability:  # random() is uniform between 0 and 1
        print(line)
    else:
        pass  # do nothing

Thanks for your comment, but I do not quite follow it. Could you please explain a bit more? I have more than 11 million lines, and the probability that a particular line gets included should be the same as for every other line.


Typing code on phone is hard, I'm traveling to a conference. Probability is 10k/11M.
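
To make that arithmetic concrete, here is a small sketch of the numbers involved. It assumes exactly 11,000,000 lines (the thread only says "more than 11 million") and an independent draw per line; the variable names are only illustrative.

```python
import math

TARGET = 10_000            # desired number of SNPs
TOTAL_LINES = 11_000_000   # assumed line count; replace with the real one (e.g. from wc -l)

# Per-line inclusion probability: 10k / 11M
probability = TARGET / TOTAL_LINES

# With independent draws the number of kept lines is binomial,
# so the final count fluctuates around the target.
expected = TOTAL_LINES * probability
std_dev = math.sqrt(TOTAL_LINES * probability * (1 - probability))

print(f"probability per line: {probability:.6f}")
print(f"expected lines kept:  {expected:.0f} (standard deviation about {std_dev:.0f})")
```

Under these assumptions the result is roughly 10,000 lines, give or take about a hundred, which fits the "approximately 10K" requirement.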

7.6 years ago

Hi Ana,

Sorry, took a while to get back to this. Below is a bit of code with very basic random line selector based on a predefined odds of inclusion:
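
The script itself does not appear in this copy of the thread. What follows is a minimal sketch of such a selector, not the original code. It assumes the per-line probability of roughly 10k/11M mentioned in the comments, and `select_lines` and `PROBABILITY` are illustrative names. The `open(sys.argv[1])` pattern is the one visible in the traceback quoted in the replies.

```python
import random
import sys

# Assumed odds of inclusion: about 10k lines out of about 11 million.
PROBABILITY = 10_000 / 11_000_000


def select_lines(lines, probability, rng=random):
    """Yield each line independently with the given probability."""
    for line in lines:
        # rng.random() is uniform on [0, 1), so this test succeeds
        # with probability `probability`.
        if rng.random() < probability:
            yield line


# Only run as a script when a file name was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as table:
        for line in select_lines(table, PROBABILITY):
            sys.stdout.write(line)
```

Lines are kept in file order, and the output size is only approximately 10k. For a file with a header, one option would be to write the first line unconditionally before the loop.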

If your file has a header this will need minor modification.

Script is intended to be saved as lineSelector.py (for example) and executed as:

python lineSelector.py yourfile.txt > selectedfile.txt

Let me know if this doesn't work as expected.


Thanks a lot WouterDeCoster, sorry I was away and only checked the code now. It does not seem to work, and I have no experience with Python. I get this error message:

Traceback (most recent call last):
  File "./Lineselector.py", line 7, in <module>
    for line in open(sys.argv[1]):
NameError: name 'sys' is not defined

Can you help me fix it? Thank you so much.


Woop, made a tiny mistake :)
I edited the code, can you try again?


Many many thanks WouterDeCoster, it works perfectly :)


Good to hear. You can mark my answer as 'accepted' to mark this question as solved.

7.6 years ago

To sample without replacement from a text file, you can use sample:

$ N=10000
$ sample --sample-size=${N} foo.vcf > sample.${N}.vcf

This samples with uniform probability, so you will get exactly 10K SNPs drawn evenly from the input lines (and therefore spread over the chromosomes roughly in proportion to how many SNPs each one has).
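
For anyone who prefers to stay in Python, sampling exactly N lines without replacement in a single pass can be sketched with reservoir sampling (Algorithm R). This is one standard technique and not necessarily what the `sample` tool does internally; `reservoir_sample` is a made-up name for illustration.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k lines chosen uniformly without replacement, in input order."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines
            reservoir.append((i, line))
        else:
            # Line i replaces a random slot with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (i, line)
    # Sort by original line index to restore file order
    reservoir.sort()
    return [line for _, line in reservoir]
```

It could be called as `reservoir_sample(open("foo.vcf"), 10000)`. Only the k kept lines are held in memory, so the 11 million line table does not need to be loaded at once.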

7.7 years ago
Ram 44k

Given that you need random yet equally spaced SNPs, I'd recommend creating 10k equal-sized bins along the genome (or a subset of the genome - whatever you're using) and then randomly picking one SNP within each bin. Address each step as you get to it and you should be able to solve this problem - it looks pretty straightforward.
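
The binning idea can be sketched roughly as follows. This is only one possible reading of it: it assumes the first two whitespace-separated columns are chromosome and position (as in the table in the question), and that `bin_size` is chosen by the user, for example total genome length divided by 10,000. The function name is hypothetical.

```python
import random


def pick_one_per_bin(lines, bin_size, rng=random):
    """Keep one randomly chosen SNP line per (chromosome, position bin)."""
    chosen = {}  # (chrom, bin index) -> [lines seen in this bin, kept line]
    for line in lines:
        chrom, pos = line.split(None, 2)[:2]
        key = (chrom, int(pos) // bin_size)
        slot = chosen.setdefault(key, [0, line])
        slot[0] += 1
        # Size-1 reservoir: the n-th line seen in a bin replaces the kept
        # line with probability 1/n, so each line in the bin is equally likely.
        if rng.randrange(slot[0]) == 0:
            slot[1] = line
    # Dicts preserve insertion order, so bins come out in the order first seen
    return [kept for _, kept in chosen.values()]
```

Bins that contain no SNPs contribute nothing, so the result has at most one line per bin and may come out somewhat below 10k.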
