I have been trying to get FASTA sequences (with altered nucleotides) from a VCF file.
I have tried using FastaAlternateReferenceMaker (from GATK) but it only gives me a single FASTA file.
Any help will be greatly appreciated. Thank you!
A solution that worked for me:
I used vcf-consensus (from VCFtools) to generate the variant FASTA sequence. The haplotype can be specified using the -H parameter (as 1 or 2), so running it once for each value gives one sequence per haplotype. The resulting FASTA file can then be used to extract the gene of interest.
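To make the idea of choosing haplotype 1 or 2 concrete, here is a minimal Python sketch of what such a selection means for phased SNPs. It is not how vcf-consensus is implemented; the function name `apply_haplotype`, the toy reference and the variant tuples are all made up for illustration, and it only handles single-base, biallelic SNPs.

```python
def apply_haplotype(reference, variants, haplotype):
    """Return the reference with the ALT alleles of one haplotype substituted in.

    reference: sequence string
    variants:  list of (pos, ref, alt, gt) tuples; pos is 1-based as in VCF,
               gt is a phased genotype such as "0|1"
    haplotype: 1 or 2 (left or right side of the "|")
    """
    seq = list(reference)
    for pos, ref, alt, gt in variants:
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} at position {pos} is not phased")
        allele = gt.split("|")[haplotype - 1]
        if allele == "1":
            # Sanity check: the VCF REF allele should match the reference base.
            if seq[pos - 1] != ref:
                raise ValueError(f"REF mismatch at position {pos}")
            seq[pos - 1] = alt
    return "".join(seq)


# Hypothetical toy data, not taken from any real VCF.
reference = "ACGTACGTAC"
variants = [
    (2, "C", "T", "0|1"),  # ALT only on haplotype 2
    (5, "A", "G", "1|0"),  # ALT only on haplotype 1
    (9, "A", "C", "1|1"),  # ALT on both haplotypes
]

hap1 = apply_haplotype(reference, variants, 1)
hap2 = apply_haplotype(reference, variants, 2)
print(">haplotype1\n" + hap1)
print(">haplotype2\n" + hap2)
```

Indels would shift coordinates and need extra bookkeeping, which is one reason to prefer an existing tool for real data.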
FastaAlternateReferenceMaker performs one simple task.
It writes a FASTA file in which reference SNP alleles are substituted with alternative SNP alleles. So, if your haplotypes do not coincide with the reference, it will not help you very much.
Luckily, you can specify several intervals in which to perform the task. So you can first specify all the intervals (they can also be short, I think) in which haplotype A coincides with the reference, and obtain the FASTA of haplotype B. Then you can specify all the intervals in which haplotype B is equal to the reference, and you will obtain the FASTA of haplotype A.
The man page is here, and the intervals can be specified with -L interval.file.name. It's tricky, but should work!
Thank you for replying.
I am looking for FASTA files for both chromosomes (listed as 0|1 etc. in the VCF file) - and the tool only gives me one.
The intervals option is actually really helpful, and I am using it to extract my gene of interest from the reference sequence.
Thanks again!
I would appreciate any advice with regard to getting separate FASTA files.
This implies that you somehow phased your variants in two haplotypes?
The VCF file I have is phased (it has information about the presence/absence of the SNP on either of the chromosomes).
I may be wrong here or may have used incorrect terminology - I'm new to this sort of analysis and am hence asking for help.
Any help is greatly appreciated.
Phasing means that you determine which variants are on the same haplotype, hypothetical example with three heterozygous SNPs:
It's probably quite straightforward to determine whether SNP1 and SNP2 are on the same haplotype, since some reads will span the short distance between them (assumption: the technology is sequencing, not a SNP array). So if a read contains both the A and the G, we could say that haplotype A carries the A allele of SNP1 and the G allele of SNP2, and the other haplotype B carries the T allele of SNP1 and the T allele of SNP2. As such, we could take the reference genome and create two new FASTA files for these positions, with an A and a G in haplotypeA.fasta and a T and a T in haplotypeB.fasta.
However, the situation is problematic for SNP3 since no read (assuming Illumina sequencing) is going to span from SNP 1 and SNP2 to SNP3. There is no way based on this data to find out if the C allele of SNP3 is on haplotype A or haplotype B. So on which fasta do we put the C and on which fasta do we put the A allele?
Is that the same terminology for phasing you had in mind?
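The point about SNP3 can be illustrated with a small Python sketch. The reads below are hypothetical and simply mirror the alleles in the example above (A/T for SNP1, G/T for SNP2, C/A for SNP3); `linked_snp_pairs` is a made-up helper, not part of any phasing tool. Two SNPs can only be phased against each other from reads if at least one read covers both.

```python
from itertools import combinations


def linked_snp_pairs(reads):
    """Return the set of SNP pairs observed together on at least one read.

    reads: list of dicts mapping SNP name -> allele seen on that read
    """
    linked = set()
    for read in reads:
        # Every pair of SNPs covered by the same read is directly linked.
        for a, b in combinations(sorted(read), 2):
            linked.add((a, b))
    return linked


# Hypothetical short reads: SNP1 and SNP2 are close enough to share reads,
# SNP3 is too far away to appear on a read with either of them.
reads = [
    {"SNP1": "A", "SNP2": "G"},  # read from haplotype A
    {"SNP1": "T", "SNP2": "T"},  # read from haplotype B
    {"SNP3": "C"},               # haplotype unknown
    {"SNP3": "A"},               # haplotype unknown
]

pairs = linked_snp_pairs(reads)
print(pairs)
```

Since SNP3 never appears in a linked pair, nothing in this data says whether its C allele belongs with A-G or with T-T.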
Thank you for your detailed reply.
Yes, I had the same terminology for phasing in mind - but this makes it far clearer. Thanks!
As for my problem statement - I want to get altered FASTA files (like haplotypeA.fasta and haplotypeB.fasta from your example) and I have a VCF file and a reference sequence - however FastaAlternateReferenceMaker only gives me one output FASTA file.
Do you think splitting the VCF file based on the GT field will help?
Thanks again!
So that means you found a way to determine on which haplotype the alleles from SNP3 are?
I think FastaAlternateReferenceMaker would indeed work if you had correctly split VCF files, but I'm unsure whether there is a "correct" way to do that.
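For what it's worth, one possible way to split a phased VCF by haplotype could look like the Python sketch below. It is only an illustration under several assumptions: a single sample of interest, biallelic sites, and GT values such as 1|0, 0|1 or 1|1. The function `split_by_haplotype` and the example records are hypothetical. Whether a downstream tool then looks only at the ALT column and ignores the genotype is something to verify for that tool.

```python
def split_by_haplotype(vcf_lines, sample_index=0):
    """Split VCF lines into two lists, one per haplotype of a phased sample.

    A record goes into the first list if the left GT allele is 1 and into
    the second list if the right GT allele is 1, so a 1|1 site ends up in
    both. Header lines are copied to both lists.
    """
    hap1, hap2 = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            hap1.append(line)
            hap2.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")                    # FORMAT column
        sample = fields[9 + sample_index].split(":")  # chosen sample column
        gt = sample[fmt.index("GT")]
        if "|" not in gt:
            # Unphased genotype: cannot be assigned to a haplotype, skipped.
            continue
        left, right = gt.split("|")
        if left == "1":
            hap1.append(line)
        if right == "1":
            hap2.append(line)
    return hap1, hap2


# Hypothetical records for illustration only.
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t1|0",
    "chr1\t200\t.\tG\tT\t.\tPASS\t.\tGT\t0|1",
    "chr1\t300\t.\tC\tA\t.\tPASS\t.\tGT\t1|1",
    "chr1\t400\t.\tT\tC\t.\tPASS\t.\tGT\t0/1",
]

hap1, hap2 = split_by_haplotype(vcf_lines)
```

Note that homozygous-alternate sites (1|1) must go into both outputs; splitting only on 1|0 versus 0|1 would drop them from both haplotypes.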
I was thinking of an awk statement to split the file into records with the same GT value (1|0, 0|1).
Like: