VCF To FASTA
3
0
Entering edit mode
5.9 years ago
Myo Naung ▴ 10

Hi,

Do anyone know how to convert VCF files into FASTA files such as using R ?

Myo

next-gen genetic genome • 23k views
ADD COMMENT
1
Entering edit mode

Hello Myo Naung ,

what do you expect by this "conversion"? A vcf file contains information about differences compared to a given reference. What you can do is: Take your reference sequence and integrate your variants. This is called a consensus sequence. bcftools consensus is one tool that can do this (if this is what you want to do).

fin swimmer

ADD REPLY
0
Entering edit mode

Hello fin swimmer,

Thanks for the suggestions. What I would like to do is to calculate neutrality statistics like Tajima's D, and there I would like to see TD values of variant sites related to conserved regions. Not sure I can do directly from vcf.

Myo

ADD REPLY
0
Entering edit mode

Hi fin swimmer, creating consensus reference genome for a specific human population can be considered as a alternative for de novo genome assembly?

ADD REPLY
3
Entering edit mode
5.9 years ago
MatthewP ★ 1.4k

Hello, vcftools has a command vcf-consensus which can convert VCF to FASTA file if you got the reference. Command like:

cat ref.fa | vcf-consensus your_file.vcf.gz > out.fa
ADD COMMENT
1
Entering edit mode

Please do not use convert when dealing with data formats with non-equivalent information content. You cannot convert VCF to FASTA, you can combine alternate allele information from a VCF into a reference FASTA to get FASTA that incorporates a specific percentage of the ALT alleles into itself.

ADD REPLY
0
Entering edit mode

Thanks, @MattweP. I will have a look

ADD REPLY
0
Entering edit mode

I tried this command and I got: Applied 0 variants.

ADD REPLY
1
Entering edit mode
5 months ago
LauferVA 4.5k

Myo,

Tajima's D statistic is a function of the number of nucleotide differences (pi), the number of segregating sites in the sequences considered (S), and the population mutation rate (theta).

When considering a VCF and a FASTA generated from the same sample, pi and S should theoretically be identical between the two. And actually, theta is calculated directly from the number of segregating sites S (divided by a constant), the estimate of theta likewise will not differ. However, the variance of theta does depend on the total number of sites, including both polymorphic and invariant (monomorphic) sites.

As such, it is possible to generate a different value for Tajima's D based on a VCF versus a FASTA; again, the reason for this discrepancy is that VCF files typically do not include monomorphic sites (gVCFs do). Supplementation of the VCF data with reference data or another form of supporting data is therefore crucial as a QA measure in order to learn about monomorphic positions and so forth as your question rightly implies.

But, this also means the choice of reference is amplified importance in this context. If the reference sequence chosen does not accurately reflect the sequence in the samples, Tajima's D will be systematically miscalibrated. Similarly, incorrect variant calls, missing genotypes, imperfect multiple sequence alignments (MSA), and so forth can impact the accurate calculation of Tajima's D, but are handled differently by various software packages.

For these reasons, if the VCFs you have are derived from very high-quality data (high variant call accuracy, minimal or no missing data), and the MSA is free of alignment errors and gaps, and is well-represented by the chosen reference genome, then most packages will generate a similar value for Tajima's D starting from different file types.

In your case, what gives me pause is that, if you have ONLY a VCF and the FASTA is generated from that, then you can neither check how closely sample sequences truly reflect the reference chosen, nor examine missing genotypes or low-quality calls (beyond a certain limit).

To summarize, while a FASTA can certainly be generated from VCFs, and while it is better practice to use a FASTA as it includes both monomorphic and polymorphic sites, some caution in interpreting results is advisable in this case. If your goal centers on neutrality statistics, access to the FASTQ and BAM files may be preferable depending on sample quality indicators.

ADD COMMENT
0
Entering edit mode
5 months ago

From bcftools consensus man page:

# Apply variants present in sample "NA001", output IUPAC codes for hets
bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa

# Create consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to". Ignore samples and consider only the REF and ALT columns
samtools faidx ref.fa 8:11870-11890 | bcftools consensus -s - in.vcf.gz -o out.fa

# For more examples see http://samtools.github.io/bcftools/howtos/consensus-sequence.html
ADD COMMENT

Login before adding your answer.

Traffic: 1907 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6