We have a few in-house genomic samples in which we have validated a couple of hundred variants. To test the new tools that are constantly being published, we could really use a control data set with a large number (thousands?) of validated variants. Ideally, the variants would be validated in a publicly available cell line, so that we could run the sample on our own machines and use the control data to calibrate any new tools we put into the pipeline.
Is anyone aware of a good sample with validated variants that we could use to calibrate machines and bioinformatics tools?
I realize we could use simulated data for this, but right now I'm only interested in real data that we could generate with our own sequencers.
People often use NA12878 from the 1000 Genomes Project to validate their SNP-calling algorithms; it is probably the most extensively sequenced genome on the planet. It has been sequenced with multiple technologies, and many of its variant calls have been through large-scale Sanger validation. Daniel MacArthur, for example, used it and validated a large number of LOF (loss-of-function) variants. You might want to start there, as it is a commonly used sample and is publicly available.
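Once you have a validated call set for a sample like this, the comparison against your own pipeline's calls could be sketched roughly as follows. This is only a minimal illustration in Python: `load_variants` and `concordance` are hypothetical helper names, the records are toy data rather than real NA12878 calls, and variants are matched by exact (chrom, pos, ref, alt), which ignores indel normalization.

```python
import io

def load_variants(handle):
    """Collect (chrom, pos, ref, alt) keys from a minimal VCF-like stream."""
    variants = set()
    for line in handle:
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records so each ALT allele is compared separately.
        for alt in alts.split(","):
            variants.add((chrom, int(pos), ref, alt))
    return variants

def concordance(called, truth):
    """Count shared and unshared variants between a call set and a validated set."""
    tp = len(called & truth)   # called and validated
    fp = len(called - truth)   # called but not in the validated set
    fn = len(truth - called)   # validated but missed by the caller
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy records for illustration only (not real variants).
truth_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tG\n"
    "1\t200\t.\tC\tT\n"
    "2\t300\t.\tG\tA,C\n"
)
called_text = (
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tG\n"
    "2\t300\t.\tG\tA\n"
    "2\t400\t.\tT\tC\n"
)

truth = load_variants(io.StringIO(truth_text))
called = load_variants(io.StringIO(called_text))
stats = concordance(called, truth)
print(stats)
```

Keep in mind that a validated set is rarely exhaustive, so calls missing from it are not necessarily wrong; the recall figure is usually the more trustworthy of the two.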
I think you should take a look at the TCGA data. Go to Data Matrix and select the cancer type and the platforms you need.
For example, if you only need somatic variants, select Somatic Mutations under Data Type, set Availability to Available, and choose Tumor Matched or Normal Matched, depending on your needs. Then just select the files you want to download. I think the Somatic Mutations are in .maf format.
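If the download does turn out to be a .maf file, it is a tab-delimited table, so a first look at it could be sketched like this. This is a hedged example: `read_maf` is a hypothetical helper, the rows and gene names are made up, and I am assuming the usual MAF column names (`Hugo_Symbol`, `Variant_Classification`, and so on), which you should check against the header of your own file.

```python
import csv
import io
from collections import Counter

def read_maf(handle):
    """Read a MAF-style tab-delimited stream into a list of dicts."""
    # Drop blank lines and '#' comment lines (e.g. a version line) before parsing.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))

# Made-up rows for illustration; column names are assumed from the MAF layout.
maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tReference_Allele\t"
    "Tumor_Seq_Allele2\tVariant_Classification\tTumor_Sample_Barcode\n"
    "GENE_A\t1\t100\tA\tG\tMissense_Mutation\tSAMPLE-01\n"
    "GENE_B\t2\t200\tC\tT\tNonsense_Mutation\tSAMPLE-01\n"
    "GENE_A\t1\t150\tG\tA\tMissense_Mutation\tSAMPLE-02\n"
    "GENE_C\t3\t300\tT\tC\tSilent\tSAMPLE-02\n"
)

records = read_maf(io.StringIO(maf_text))

# Tally mutations by class and by sample as a quick sanity check.
by_class = Counter(r["Variant_Classification"] for r in records)
by_sample = Counter(r["Tumor_Sample_Barcode"] for r in records)
print(by_class)
print(by_sample)
```

For a real file, you would pass `open(path, newline="")` to `read_maf` instead of the in-memory string.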