I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.
Now I wish to generate genotype-specific FASTA sequences using these files and a reference sequence, i.e. a sequence for each genotype that is identical to the reference sequence except that the SNP alleles specific to that genotype replace the reference bases at those positions.
I know that the genotypes do differ from one another. Here is a picture visualizing three exemplar genotypes, which I generated by loading the VCF file into Geneious.
I then try to create individual VCF files for each genotype using this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete/vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype
I can't be sure this had the desired effect, as it is difficult to inspect a whole VCF file by eye. However, the header line now lists only the relevant genotype, so I assume it worked.
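A more direct check than reading the header would be to tally the genotype calls in the per-genotype file. The following is only a sketch: `count_genotypes` is a made-up helper name, and it assumes a single-sample VCF (sample in column 10) with GT as the first FORMAT field, which the VCF specification requires whenever GT is present.

```shell
# Hypothetical helper (not part of GATK): tally the GT values in a single-sample VCF.
# Assumes the sample is in column 10 and GT is the first colon-separated FORMAT field.
count_genotypes() {
    grep -v '^#' "$1" |   # drop header lines
        cut -f10 |        # keep the sample column
        cut -d: -f1 |     # keep only the GT field
        sort | uniq -c    # count each distinct genotype call
}

# Usage (path as in the command above):
# count_genotypes ~/Path/to/individual/genotype.vcf
```

The output is one line per distinct call (for example 0/0, 0/1, 1/1) with its count, which shows at a glance whether the file still contains sites where this genotype matches the reference.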
I then try to use this individual VCF file for each genotype like this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference/sequence/ref.fasta -T FastaAlternateReferenceMaker --variant ~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta
Here the Xs represent the location on the reference sequence of the regions of interest.
I did this in a loop and got identical sequences for every genotype. I then ran it individually for the 3 exemplar genotypes in the picture above, and again I get identical sequences for every genotype. Interestingly, they are not the reference sequence.
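To make clear what "identical" means here, the comparison can be done on sequence content alone, ignoring the FASTA header and line wrapping. This is a minimal sketch using standard shell tools; `seq_checksum` is a made-up helper name and the file names in the usage lines are placeholders.

```shell
# Hypothetical helper: checksum only the sequence content of a FASTA file.
# Header lines and line breaks are removed first, so differences in the
# record name or in line wrapping do not affect the result.
seq_checksum() {
    grep -v '^>' "$1" | tr -d '\n' | cksum | cut -d' ' -f1
}

# Usage (placeholder file names):
# seq_checksum genotypeA.fasta
# seq_checksum genotypeB.fasta
```

Two files that print the same checksum have the same sequence. Running the same helper on the matching region extracted from the reference shows whether the outputs differ from it.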
What am I doing wrong?
I will also post this on the GATK forum.