Question

How can I get per-individual chromosome sequences in FASTA format from a 1000 Genomes Project vcf.gz file and its vcf.gz.tbi index?

1

8.9 years ago

David Simmons ▴ 10

Hi,

I am new to the field of bioinformatics. I have downloaded the file ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz and its tabix index ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/.

1) The above vcf.gz contains the compressed chromosome 1 genotypes of 1092 individuals. How can I get 1092 separate sequences in FASTA format?

I read about the vcf-consensus script, but I am unsure how to use it here:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

Does the phase 1 release of the 1000 Genomes Project use the following reference genome (as mentioned in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/README.phase1_integrated_release_version3_20120430): ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz?

2) How can I get the entire genome sequence in FASTA format for HG00096, HG00097, etc.?

Thank you in advance.

vcf DNA 1000genome sequence fasta • 3.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by David Simmons ▴ 10

Ram · Answer 1 · 2016-01-09

2

8.9 years ago

GouthamAtla 12k

You can get the VCF for each individual by:

vcf-subset --exclude-ref -c sample_ID in.vcf > sample_ID.vcf

To do this for every sample in parallel, something like:

parallel --jobs <int> "vcf-subset --exclude-ref -c {} in.vcf > {}.vcf" ::: `grep -m1 "^#CHROM" in.vcf | cut -f 10-`
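The sample list in the command above is read from the `#CHROM` header line of the VCF, where the sample IDs start at column 10. A minimal sketch of that step as a reusable shell function follows; `list_samples` is a hypothetical helper name, not part of vcftools, and it assumes the input is either a plain or a gzip/bgzip-compressed VCF.

```shell
# list_samples: print one sample ID per line from a VCF header.
# Hypothetical helper; accepts plain or gzip/bgzip-compressed VCF files.
list_samples() {
    vcf="$1"
    # gzip -cdf decompresses if needed and passes plain text through unchanged.
    gzip -cdf "$vcf" |
        awk -F '\t' '/^#CHROM/ {
            # Columns 1-9 are fixed VCF fields; samples start at column 10.
            for (i = 10; i <= NF; i++) print $i
            exit
        }'
}
```

The output could then replace the backtick expression, for example `::: $(list_samples in.vcf.gz)`, which also avoids decompressing the file first.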

Once you have the individual VCF files, you could use GATK's FastaAlternateReferenceMaker to get the alternate genome for each individual:

parallel --jobs <int> "java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R reference.fasta -o {}.fasta -V {}.vcf" ::: `grep -m1 "^#CHROM" in.vcf | cut -f 10-`

Hope this is what you are looking for.

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by GouthamAtla 12k

0

Thanks, vcf-subset will work for getting the individual samples in VCF format. After that, which sequence should I use as the reference to extract a FASTA sequence from each resulting VCF file? Is it the whole-genome reference listed for the phase 1 release of the 1000 Genomes Project, "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz", or does the project also provide per-chromosome references?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by David Simmons ▴ 10

0

Use human_g1k_v37.fasta.gz, since the variants were called against that genome. If you only need chr1, you can split the FASTA into individual chromosomes. There are several posts about splitting a FASTA file by chromosome.
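As one possible way to pull a single chromosome out of the multi-record reference, here is a minimal awk-based sketch. `extract_chrom` is a hypothetical helper name, and it assumes the record name is the first word after `>` in the header line (check the headers in your copy of the reference to see how chromosome 1 is named).

```shell
# extract_chrom: print the single FASTA record whose name matches NAME.
# Hypothetical helper; NAME is the first word after ">" in the header line.
extract_chrom() {
    name="$1"
    fasta="$2"
    # gzip -cdf decompresses if needed and passes plain text through unchanged.
    gzip -cdf "$fasta" |
        awk -v name="$name" '
            /^>/ {
                # Stop once the wanted record has been fully printed.
                if (printing) exit
                # Strip ">" and compare only the first whitespace-delimited word.
                id = substr($1, 2)
                printing = (id == name)
            }
            printing { print }
        '
}
```

Usage would look like `extract_chrom 1 human_g1k_v37.fasta.gz > chr1.fa`, with `1` replaced by whatever name the header actually uses.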

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by GouthamAtla 12k