I want to run a simulation analysis for my project, which requires benchmarking on a cohort of 200-300 exomes from healthy people. I tried to download such data from gnomAD and the 1000 Genomes Project, but each VCF contains thousands of samples and I do not need more than 200-300 exome VCFs. Any idea how I could access such files?
The only purpose of this part of my project is to spike in specific ClinVar pathogenic mutations associated with rare diseases into a background (exome) of an otherwise "healthy" individual or an individual that does not have a congenital rare disease. I just need exome VCFs mapped to hg19 of a cohort of 200-300 individuals for this.
Most publicly available resources will not give you VCFs of individuals, because this would reveal confidential information about those individuals. Almost all publicly available resources will instead give you variant frequencies within a population, which is usually suitable for most purposes.
If you really do need VCFs of individuals, you will have to apply for access to protected information at one of the big resources. Most big resources have a way to request such access. They will need evidence that you are a genuine researcher, that your computer systems are sufficiently secure to handle protected data, and that you have a good reason for wanting access to the data.
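Once you do have a multi-sample VCF you are allowed to work with, cutting it down to 200-300 individuals is the easy part. Below is a minimal Python sketch of that step, assuming an uncompressed, tab-delimited VCF with the standard nine fixed columns before the sample columns. The function names `pick_samples` and `subset_vcf_samples` are made up for illustration. Note that it does not recompute INFO fields such as AC/AN or drop sites that become non-variant in the subset; in practice a dedicated tool (for example `bcftools view` with its `-S` samples-file option) is the more usual choice.

```python
import random


def pick_samples(header_line, n, seed=0):
    """Randomly choose n sample names from a '#CHROM ...' header line."""
    names = header_line.rstrip("\n").split("\t")[9:]  # samples start at column 10
    return random.Random(seed).sample(names, n)


def subset_vcf_samples(lines, keep):
    """Yield VCF lines that retain only the sample columns named in `keep`.

    Assumption: plain-text VCF lines, tab-delimited, 9 fixed columns
    (CHROM through FORMAT) followed by one column per sample.
    """
    keep = set(keep)
    cols = None  # column indices to retain, known once '#CHROM' is seen
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            cols = list(range(9)) + [
                i for i, name in enumerate(fields) if i >= 9 and name in keep
            ]
        elif cols is None:
            raise ValueError("data line seen before the #CHROM header line")
        yield "\t".join(fields[i] for i in cols)


# Tiny made-up example; the records below are illustrative, not real data.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
]
subset = list(subset_vcf_samples(demo, ["S1", "S3"]))
```

For a real file you would pass an open file handle as `lines` and write each yielded line to the output, with `keep` coming from `pick_samples` or from your own list of sample IDs.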
How are you defining ‘healthy’?