Hi, I'm using the sim1000G package to simulate genotype data: https://adimitromanolakis.github.io/sim1000G/inst/doc/SimulatingFamilyData.html
In their example, they use data from the CHR4 region:
However, it contains only 567 SNPs from 95 patients. Is there a way to get data for more regions, so that I would have more SNPs?
I tried following their manual :
I downloaded the data for chrY, just as an example because it is a small one, and it looks like this:
The problem is that the ID column is empty, which prevents me from using sim1000G, because its code uses the IDs of the variants:
vcf_file = file.path(examples_dir, "region.vcf.gz")
vcf = readVCF(vcf_file, maxNumberOfVariants = 400, min_maf = 0.01, max_maf = 1)
#@param subset A subset of individual IDs to use for simulation
So I was also wondering where I can get this data, but with the variant IDs. They used the IGSR database, so the variant IDs must be available there (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
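One possible workaround for the empty ID column, wherever the data come from, would be to fill in the IDs before reading the file. Below is a minimal sketch in Python, assuming a tab-delimited VCF (plain or gzipped) where a missing ID is written as "." or left empty. The function names fill_vcf_ids and fill_vcf_ids_file and the CHROM:POS:REF:ALT naming scheme are hypothetical choices for illustration, not something sim1000G or the 1000 Genomes files prescribe:

```python
import gzip


def fill_vcf_ids(lines):
    """Yield VCF lines, replacing a missing ID ('.' or empty) with CHROM:POS:REF:ALT.

    Header lines (starting with '#') are passed through unchanged.
    The ID scheme is an arbitrary, hypothetical choice for illustration.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: 0=CHROM, 1=POS, 2=ID, 3=REF, 4=ALT
        if len(fields) >= 5 and fields[2] in (".", ""):
            fields[2] = ":".join((fields[0], fields[1], fields[3], fields[4]))
        yield "\t".join(fields) + "\n"


def fill_vcf_ids_file(in_path, out_path):
    """Read a VCF (optionally .gz) and write a copy with the IDs filled in.

    Note: a .gz output here is ordinary gzip, not bgzip.
    """
    open_in = gzip.open if in_path.endswith(".gz") else open
    open_out = gzip.open if out_path.endswith(".gz") else open
    with open_in(in_path, "rt") as src, open_out(out_path, "wt") as dst:
        for out_line in fill_vcf_ids(src):
            dst.write(out_line)
```

Variants that already have an ID (for example an rs number) are left untouched, so only the empty entries get a generated name.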
Note that the plink 1.9 resource is from 1000 Genomes phase 1. 1000 Genomes phase 3 (https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg ) is more complete.
bk11, thank you. Just to clarify: sim1000G can simulate SNPs that are in linkage disequilibrium based on the input VCF file, which is what I want. Does the link you provided from plink give SNPs for multiple individuals, including SNPs that are in linkage disequilibrium?
Data from 1KG include ~2500 subjects from 26 populations. These data have all genotyped SNPs across the genome. And yes, you will find SNPs that are in linkage disequilibrium for sure. Just test them in the region that you are interested in.
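Testing LD in a region can be done in several ways. As a minimal sketch, assuming the genotypes of two SNPs have already been extracted as allele counts (0, 1 or 2) per individual, the squared Pearson correlation of those counts gives a commonly used r² estimate for unphased data. The function name genotype_r2 is hypothetical and is not part of sim1000G or plink:

```python
import math


def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs' allele counts (0/1/2).

    g1 and g2 are equal-length sequences, one entry per individual.
    Returns NaN if either SNP is monomorphic in the sample (zero variance).
    """
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("g1 and g2 must be non-empty and of equal length")
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return math.nan
    return (cov * cov) / (var1 * var2)


# Example with made-up genotypes for five individuals at two SNPs
snp_a = [0, 1, 2, 1, 0]
snp_b = [0, 1, 2, 2, 0]
r2 = genotype_r2(snp_a, snp_b)
```

Values of r² close to 1 indicate strong LD between the pair, and values close to 0 indicate little or none. For genome-scale data a dedicated tool would be far more practical than this loop.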