Generate genotype specific FASTA sequences from VCF file and reference sequence
2
0
Entering edit mode
9.7 years ago
gtho123 ▴ 260

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

http://imgur.com/LRZHIcw

I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T SelectVariants  --variant ~/Path/to/complete/vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?

I will also post this on the GATK forum.

SNP sequence gatk • 4.9k views
ADD COMMENT
1
Entering edit mode
9.7 years ago
gtho123 ▴ 260

It turns out that when creating individual VCF files using the SelectVariants tool there is a optional flag -env which needs to be called too. The docs say that it is called to not "include loci found to be non-variant after the subsetting procedure". I find the interpretation of this confusing but it appears to give me the sequences I was trying to get.

ADD COMMENT
0
Entering edit mode
9.6 years ago
Len Trigg ★ 1.6k

RTG Core includes quite a few simulation tools, one of which is used to replay sample variants from a VCF into a reference sequence. The intention is to allow the generation of the full genome of that sample for subsequent read simulation (so when dealing with diploid chromosomes for example you end up with two copies of those chromosomes), but it'll work for your case. e.g:

rtg format -o ref.sdf ref.fasta

rtg samplereplay --reference ref.sdf --input genotype.vcf \
  --sample genotype --output genotype_genome.sdf

rtg sdfsubseq --input genotype_genome.sdf --fasta \
  chrX:XXXXXXXX-XXXXXXXX >genotype_chrX.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 1715 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6