Hi, I'm working with the sim1000G package for genotype simulations: https://cran.r-project.org/web/packages/sim1000G/sim1000G.pdf. The package uses GRCh37 coordinates to create the genetic map. I want to run simulations on specific genes, for example NERG1 and SORCS3, and for that I need to download a VCF file for each gene with its SNPs. My problem is that I don't know the correct source for the gene coordinates, so I can't run the simulation without encountering this error:
Error in startSimulation(vcf, totalNumberOfIndividuals = 3010) :
Error: mismatch between chromosomes in genetic map and vcf
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This happens when I run, for example, this gene:
library(sim1000G)

# Read a gene-level VCF and simulate 3000 unrelated individuals from it
get_genotype = function(vcf_path) {
  vcf = readVCF(vcf_path, min_maf = NA, max_maf = NA, maxNumberOfVariants = 8000)
  startSimulation(vcf, totalNumberOfIndividuals = 3010)
  ids = generateUnrelatedIndividuals(3000)
  genotype = retrieveGenotypes(ids)
  return(genotype)
}
genotype_nerg = get_genotype("NERG1.vcf")
The coordinates I used for NERG1 are 1:71,861,626-72,748,222, but they are apparently not correct, since there is a mismatch error, and when I check the smallest POS in the NERG1 VCF it returns 71,861,826 instead of 71,861,626. What is the correct database from which I can get the gene coordinates?
Check the chromosome notation: 'chr1' vs '1'.
Why do you think there must be a variant at the very first position of the gene?
@Pierre Lindenbaum do you mean when downloading the VCF with tabix :
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 10:109517591-109871360 |cut -f1-5 |awk '!/##/' |head
@Pierre Lindenbaum In the VCF file it's '1'.
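One way to check both points raised in the comments (the chromosome notation and where the first variant falls) is to summarise the VCF directly. This is a minimal sketch, assuming a POSIX shell with awk and gzip available; `vcf_summary` is a hypothetical helper name and is not part of tabix or sim1000G.

```shell
# Summarise the CHROM column and POS range of a VCF (plain or gzip-compressed).
# vcf_summary is a hypothetical helper name, not part of tabix or sim1000G.
vcf_summary() {
  vcf="$1"
  # Decompress on the fly if the file name ends in .gz, otherwise read as-is
  case "$vcf" in
    *.gz) gzip -dc "$vcf" ;;
    *)    cat "$vcf" ;;
  esac | awk -F'\t' '
    /^#/ { next }   # skip meta-information and header lines
    {
      n[$1]++       # count records per chromosome name
      if (!($1 in min) || $2 + 0 < min[$1]) min[$1] = $2 + 0
      if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0
    }
    END {
      # one line per chromosome name: name, record count, smallest POS, largest POS
      for (c in n) printf "%s\t%d\t%d\t%d\n", c, n[c], min[c], max[c]
    }
  '
}
```

Running `vcf_summary NERG1.vcf` prints one tab-separated line per distinct chromosome name found in the file. More than one line, or a name such as 'chr1' where the genetic map expects '1', would point to the notation problem. The smallest POS only needs to lie inside the requested region: a gene does not necessarily have a variant at its very first base, so a first POS of 71,861,826 is consistent with a region starting at 71,861,626.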