I need to make a list of 1000 human SNPs, all with a 1000 Genomes MAF above 5%. They should be truly random and not come from any particular chromosome or gene group. Any idea how to fetch such a set, other than using a random number generator and entering them one by one?
There are two relevant tags in dbSNP: G5 and G5A. G5 means MAF > 5% in at least one population; G5A means MAF > 5% in every population. When you say MAF > 5% in 1000 Genomes, do you mean in one or more populations (G5 in dbSNP) or in all populations (G5A)? In the first case, filter dbSNP by the G5 tag and then sample the VCF records. If you are still in doubt, filter by both the KG and G5/G5A tags.
Example code:
There is a similar post here on Biostars: "Picking random SNPs from 1000 Genomes using Vcftools". vcflib has a vcfrandomsample tool. It samples at a given rate rather than a fixed count, so work out what fraction of your records gives roughly 1000 variants. You need to use the dbSNP VCF for all chromosomes, not only chromosome 20 as in the example. Alex has code for randomly sampling a VCF: https://github.com/alexpreynolds/sample.
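If you would rather not compute a sampling rate, the filter-then-sample idea can also be sketched in plain Python with reservoir sampling, which draws exactly k records in a single pass without knowing the total count in advance. This is only a rough sketch, not the code from the posts linked above. It assumes the dbSNP VCF carries G5/G5A as flags in the INFO column (column 8), and the function names (`has_flag`, `filter_by_flag`, `reservoir_sample`, `sample_vcf`) are made up for illustration.

```python
import gzip
import random
from typing import Iterable, Iterator, List, Optional


def has_flag(info: str, flag: str) -> bool:
    """Return True if `flag` is one of the keys in a VCF INFO string."""
    return any(field.split("=", 1)[0] == flag for field in info.split(";"))


def filter_by_flag(lines: Iterable[str], flag: str = "G5") -> Iterator[str]:
    """Yield VCF data lines whose INFO column (8th column) carries `flag`."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 8 and has_flag(cols[7], flag):
            yield line


def reservoir_sample(records: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Uniformly sample up to k records in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = rec     # replace with probability k / (i + 1)
    return sample


def sample_vcf(path: str, flag: str = "G5", k: int = 1000,
               seed: Optional[int] = None) -> List[str]:
    """Filter a gzipped VCF by an INFO flag, then draw k random records."""
    with gzip.open(path, "rt") as vcf:
        return reservoir_sample(filter_by_flag(vcf, flag), k, seed)
```

Something like `sample_vcf("dbsnp_all.vcf.gz", flag="G5A", k=1000)` would then return 1000 records drawn across every chromosome in the file, where the file name is a placeholder for whatever dbSNP VCF you downloaded. The records come back in reservoir order, not genomic order, so sort them afterwards if position order matters.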