I've been trying to delve into whole genome sequencing data, specifically by looking at the existing data in the 1000 Genomes Project and gnomAD, and I have a lot of questions. Does gnomAD contain the 1000 Genomes Project samples?
I've found many VCF files, including these:
- http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/
- http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/
- http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/
- https://gnomad.broadinstitute.org/downloads
(also what is the difference between the gnomAD variant and callset files and why are both so huge?)
Are the really huge VCFs generated from whole genome sequencing, whereas the smaller ones are from chip arrays?
I'm also looking to use this data to compare a sample against them. I'm most familiar with doing this by running a UMAP on the PCA and then clustering to see where the sample lies. I found this implementation https://github.com/diazale/umap_review/blob/master/code/umap_dev_experimentation.ipynb , but it seems to skip lines, and I think it only uses the chip-array-sized file.
I've seen that plink can run PCA on many samples. Is there a way to run plink on each of the huge per-chromosome callset files from gnomAD to get the PCA, then use that data to generate the UMAP clustering? I haven't been able to figure out the PCA in plink, or how to combine multiple PCAs from plink.
Lastly, is there an easy way to merge callsets? It's infeasible to redo the callset from the gnomAD data, but would it be possible to add a new VCF to the existing callsets for only the SNPs that are already in those callsets? I foresee this not losing much data, since the gnomAD callsets have plenty of SNPs that will probably match up, and the discordant SNPs can be discarded. Is there a program that does this?
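To make the site-matching idea above concrete, here is a minimal sketch in plain Python. It assumes both files are on the same reference build and that each record has already been reduced to a (chrom, pos, ref, alt, ...) tuple; the function names and the toy records are hypothetical, not taken from any real file. For real, indexed VCFs a dedicated tool such as bcftools (its `isec` and `merge` commands) is likely the more practical route.

```python
def normalize_chrom(chrom):
    """Strip a leading 'chr' so that '1' and 'chr1' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def site_key(chrom, pos, ref, alt):
    """Build a comparable key for one variant site."""
    return (normalize_chrom(chrom), int(pos), ref.upper(), alt.upper())


def shared_sites(panel_records, sample_records):
    """Return the sample records whose site also exists in the panel."""
    panel_keys = {site_key(*rec[:4]) for rec in panel_records}
    return [rec for rec in sample_records if site_key(*rec[:4]) in panel_keys]


# Toy data (made up for illustration): panel sites, then sample sites
# with a genotype string in the fifth position.
panel = [
    ("chr1", 10177, "A", "AC"),
    ("chr1", 10352, "T", "TA"),
    ("chr2", 10133, "C", "T"),
]
sample = [
    ("1", 10177, "A", "AC", "0/1"),
    ("1", 10352, "T", "G", "1/1"),  # same position, different ALT: discarded
    ("2", 10133, "C", "T", "0/0"),
]

kept = shared_sites(panel, sample)
print(kept)
```

Matching on position alone would keep the discordant second record, which is why the key includes REF and ALT as well.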
What are these callset files you're referring to?
Also, please read the gnomAD help pages - they have answers to at least a few of the questions you're asking us here.
On the gnomAD page there's a list of files called callsets, for example the HGDP + 1KG callset. Also, the gnomAD help pages have some information, but they don't seem to explain how the data were generated in a way a non-expert in the field can follow.
gnomAD VCFs usually don't have individual-level information. These files have a bunch of INFO fields, but no columns beyond the fixed set (see the VCF specification for the fixed fields and the genotype fields).
The callset files (which you should have specified as gnomAD v3 specific datasets) state that they contain "individual genotypes for all samples in the HGDP and 1KG datasets", so they must contain genotype fields for each sample. That adds a whole bunch of columns per row in the file, which should be the reason for the inflated size.
If you're interested in site level information only, you won't need the callset. You'll only need it if you want to look at the genotype fields for 1KG/HGDP samples.
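As a quick way to tell the two kinds of file apart, you can look at the `#CHROM` header line: per the VCF specification there are eight fixed columns, and a file with genotypes adds a FORMAT column followed by one column per sample. The sketch below assumes tab-separated header lines held in strings; the function name and the two example headers are made up for illustration.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line):
    """Return the sample IDs listed in a VCF '#CHROM' header line.

    An empty list means the file is sites-only (no genotype columns).
    """
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("not a VCF column header line")
    # Column 9 is FORMAT when genotypes are present; samples follow it.
    return columns[9:]


# Illustrative header lines, not copied from any real download.
sites_only = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
with_genotypes = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\n"
)

print(sample_names(sites_only))
print(sample_names(with_genotypes))
```

With thousands of samples, every variant row carries thousands of extra genotype entries, which is consistent with the size difference described above.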