Hello,
I am quite new to programming and bioinformatics. I am trying to access some VCF files from the 1000 Genomes Project and follow along with a YouTube tutorial (OMGenomics) to do some analysis. I would also like to learn how to use the Ensembl API later on, so I would like to have the rsIDs in the VCF file. The problem is that the current build is GRCh38, and I was not able to find a VCF file with both sample data and rsID values for the latest build.
I couldn't really check the high-coverage data (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/) because of its massive size.
I thought maybe I could use the GRCh37 VCF files, but with v5b (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) they removed the rsID values and said the mapping was available in Ensembl. I used Ensembl (http://ftp.ensembl.org/pub/grch37/release-105/variation/vcf/homo_sapiens/). There seem to be more SNPs there than in the v5b VCF file I got from the 1000 Genomes website, so I am not sure if I have the right data or if it has to be mapped differently. Although I could merge them so that the location and alleles match and add the IDs, I am not sure whether there would be some conflicts. I did manage to find a v5a (https://ftp.ncbi.nih.gov/1000genomes/ftp/release/20130502/) on the NCBI site, but when I tried to look up some rsIDs in the browser, it either found no matches or showed information that differed from the file, such as the position. I am guessing this is due to updates.
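A rough sketch of the merge I have in mind, in Python, assuming both files are on the same build and use the same chromosome naming (for example "21" in both). The function names (load_ids, annotate) and the matching rule (exact CHROM, POS, REF and ALT) are my own assumptions, not something documented by 1000 Genomes or Ensembl, and variants represented differently in the two files would simply stay without an ID:

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def load_ids(sites_vcf):
    """Map (chrom, pos, ref, alt) -> ID from a sites-only VCF."""
    ids = {}
    with open_text(sites_vcf) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, vid, ref, alts = line.rstrip("\n").split("\t")[:5]
            if vid == ".":
                continue
            # Multi-allelic records list their ALT alleles comma-separated.
            for alt in alts.split(","):
                # Keep the first ID seen for a given site and allele.
                ids.setdefault((chrom, pos, ref, alt), vid)
    return ids


def annotate(genotype_vcf, ids, out_path):
    """Copy genotype_vcf to out_path, filling the ID column where a match exists.

    Returns the number of records that received an ID.
    """
    matched = 0
    with open_text(genotype_vcf) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.startswith("#"):
                # Split off only the first five columns; the rest stays intact.
                fields = line.split("\t", 5)
                chrom, pos, _, ref, alts = fields[:5]
                keys = [(chrom, pos, ref, alt) for alt in alts.split(",")]
                hits = [ids[key] for key in keys if key in ids]
                if hits:
                    # De-duplicate while preserving order; VCF separates IDs with ';'.
                    fields[2] = ";".join(dict.fromkeys(hits))
                    matched += 1
                    line = "\t".join(fields)
            dst.write(line)
    return matched


# Hypothetical usage (file names are placeholders, not real downloads):
# ids = load_ids("ensembl_chr21_sites.vcf.gz")
# n = annotate("1000g_v5b_chr21_genotypes.vcf.gz", ids, "chr21_with_ids.vcf")
```

This loads every site into memory, which should be manageable for chr21 but probably not for a whole genome.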
What I ideally want is a VCF file for chr21 (since it's the smallest) with sample genotypes and rsID values, preferably using GRCh38 or the latest GRCh37 that matches Ensembl.
If I am making any mistakes when choosing the files, please let me know.
Thanks for the help
The file 1000GENOMES-phase_3.vcf.gz that contains variants from 1000Genomes is available in ENSEMBL at this link: https://ftp.ensembl.org/pub/release-105/variation/vcf/homo_sapiens/
However, if you are looking for variants from a particular chromosome, those VCF files contain variants from multiple resources including 1000Genomes. Since you are only interested in variants from 1000Genomes, you could just filter them to only keep lines that contain E_1000G in the INFO column.
Thank you,
Does the 1000GENOMES-phase_3.vcf.gz file contain the sample genotype data as well?
I tried the individual chromosome VCF file, and unfortunately it does not contain any sample data. I don't quite understand the versions of the different files, so I am not sure how to map to v5b from the FTP site (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), which does contain sample genotype data but is on reference build 37.
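For reference, the filter suggested in the answer above (keep only records whose INFO column contains E_1000G) could be sketched in Python roughly as follows. This is a minimal sketch, assuming the input is a gzip-compressed per-chromosome VCF and that E_1000G appears in INFO as a stand-alone flag among semicolon-separated entries; the function name keep_1000g and the file names in the usage comment are placeholders. As noted in the thread, the per-chromosome Ensembl files carry no sample columns, so the output is still sites-only.

```python
import gzip


def keep_1000g(in_path, out_path, tag="E_1000G"):
    """Write header lines plus records whose INFO column carries `tag`.

    Returns the number of variant records kept.
    """
    kept = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                # Always keep the header so the output remains a valid VCF.
                dst.write(line)
                continue
            # INFO is the eighth tab-separated column (index 7).
            info = line.rstrip("\n").split("\t")[7]
            # Compare whole entries so that a longer key merely containing
            # the tag as a substring does not match.
            if tag in info.split(";"):
                dst.write(line)
                kept += 1
    return kept


# Hypothetical usage (file names are placeholders):
# n = keep_1000g("homo_sapiens-chr21.vcf.gz", "chr21_1000g_only.vcf.gz")
```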