I have a SNP table (tab.table format) containing more than 10,000K SNPs (it is whole-genome data). Now I need to extract a random set of 10k SNPs chosen approximately equally spaced along the chromosomes (17 chromosomes). Could you please help me figure out how to do that? Thanks in advance for any suggestions.
HanXRQChr00c0001 68313 N N N N C C N N N N N N N N C N N N N N C C C N NN
HanXRQChr00c0001 68457 N N N N N G N N N N N N N R G N N N N N N G N N NN
HanXRQChr00c0001 68521 N N N N N K N N N N N N N G K N N N N N N G N G NN
HanXRQChr00c0001 68536 N N N N N A N N N N N N N A A N N N N N N A N A NN
HanXRQChr00c0001 68746 N N N N N A N N N N N N N A A N N N N N N A
Do you need exactly 10k SNPs or approximately?
I need approximately 10k.
If you know a bit of programming, this should be easy. The probability that any given line is included in the final set is 10k divided by the total number of lines.
Pseudocode (no time for real code)
Thanks for your comment, but I do not quite follow. Could you please explain a bit more? I would appreciate it. I have more than 11 million lines! So the probability that a particular line gets included is the same as for every other line?
Typing code on a phone is hard; I'm traveling to a conference. The probability is 10k/11M, roughly 0.0009, and it is the same for every line.
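For anyone landing here later, here is a minimal Python sketch of the idea described in this thread: go through the file once and keep each line independently with probability 10k divided by the total number of lines. The function names, the `seed` parameter and the file names are my own illustration, not something from the posts above. It also assumes the table has no header line and that the total line count (about 11 million here) is known beforehand.

```python
import random


def sample_lines(lines, target=10_000, total=11_000_000, seed=None):
    """Yield each input line independently with probability target / total."""
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    p = target / total         # e.g. 10_000 / 11_000_000, about 0.0009
    for line in lines:
        if rng.random() < p:
            yield line


def subsample_file(in_path, out_path, target=10_000, total=11_000_000, seed=None):
    """Stream in_path line by line and write the sampled lines to out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        fout.writelines(sample_lines(fin, target, total, seed))
```

Usage would look something like `subsample_file("snps.tab", "snps_10k.tab")`, with both file names hypothetical. The output keeps the original file order and contains only approximately 10k lines, since every line is an independent draw. Note that the picked SNPs follow the SNP density in the file, so they are spread along the chromosomes but not strictly equally spaced.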