Hi,
I am new to bioinformatics. I have downloaded the file ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
and its tabix index ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi
from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/.
1) The above vcf.gz contains the chromosome 1 variant genotypes of 1092 individuals. How can I get 1092 separate sequences in FASTA format?
I read about the vcf-consensus script, but I am not sure how to use it here:
cat ref.fa | vcf-consensus file.vcf.gz > out.fa
Does the phase 1 release of the 1000 Genomes Project use the following reference genome (as mentioned in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/README.phase1_integrated_release_version3_20120430)? ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
2) How can I get the entire genome sequence in FASTA format for HG00096, HG00097, etc.?
Thanks in advance.
Thanks, vcf-subset will work for getting individual samples in VCF format. Once I have the individual samples, which sequence should I use as the reference to extract a FASTA sequence from each resulting VCF file? Is it the whole-genome reference mentioned for phase 1 of the 1000 Genomes Project, "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz", or does the project provide per-chromosome references?
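To see why the choice of reference matters: a VCF row stores only a position and the alleles at that position, so a sample's sequence can only be rebuilt by substituting those alleles into the exact reference the positions refer to. The following is a toy Python sketch of that idea, not a replacement for vcf-consensus. The function name `apply_snps` is hypothetical, and the sketch assumes phased genotypes such as `0|1` and handles single-base substitutions only, skipping indels and structural variants.

```python
def apply_snps(reference, vcf_lines, sample, haplotype):
    """Return `reference` with one haplotype's SNP alleles substituted in.

    reference : sequence of a single chromosome, as a string
    vcf_lines : iterable of VCF text lines for that same chromosome
    sample    : sample column name, e.g. "HG00096"
    haplotype : 0 or 1, the side of a phased genotype such as "0|1"
    """
    seq = list(reference)
    column = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            column = fields.index(sample)  # locate the sample's column
            continue
        if column is None:
            raise ValueError("#CHROM header line not seen before data lines")
        pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
        genotype = fields[column].split(":")[0]  # GT is the first FORMAT key
        if "|" not in genotype:
            continue  # unphased, haploid, or missing call: skipped in this sketch
        allele = genotype.split("|")[haplotype]
        if allele in ("0", "."):
            continue  # reference allele or missing: nothing to change
        alt = alts[int(allele) - 1]
        if len(ref) != 1 or len(alt) != 1:
            continue  # indels and symbolic alleles are skipped here
        if seq[pos - 1].upper() != ref.upper():
            raise ValueError(f"REF mismatch at position {pos}: wrong reference?")
        seq[pos - 1] = alt  # VCF positions are 1-based
    return "".join(seq)
```

The REF check in the loop is the practical point: if the FASTA is not the one the variants were called against, the REF column stops matching the sequence and the output is meaningless.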
human_g1k_v37.fasta.gz should be used, because the variants were called against that genome. If you are only interested in chr1, you can split the FASTA into individual chromosomes. There are several posts about splitting a FASTA file by chromosome.
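As one possible way to do the split, here is a minimal Python sketch that copies a single record out of a multi-record FASTA file. The function name `extract_record` and the output file name are hypothetical. It assumes the record ID is the first word after `>` and that chromosome 1 is named `1` rather than `chr1`; check the header lines of your copy of the reference before relying on that.

```python
import gzip


def extract_record(fasta_path, name, out_path):
    """Copy the FASTA record whose ID equals `name` to `out_path`.

    The ID is taken to be the first whitespace-delimited word after '>'.
    Returns True if the record was found, False otherwise.
    """
    # Read gzip-compressed input transparently when the name ends in .gz
    opener = gzip.open if fasta_path.endswith(".gz") else open
    found = False
    writing = False
    with opener(fasta_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if found:
                    break  # the wanted record has ended; stop reading
                header = line[1:].split()
                writing = bool(header) and header[0] == name
                found = writing
            if writing:
                dst.write(line)
    return found
```

For example, `extract_record("human_g1k_v37.fasta.gz", "1", "chr1.fa")` would write chromosome 1 to `chr1.fa`, which can then be used as the `ref.fa` input when building the chr1 consensus.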