Hi everyone,
I am analying a data set for a user interested in looking at variants in the MHC gene of coyotes. I have used Freebayes to generate a VCF file. The user would now like haploytypes, does anyone know how to decipher that info from a file such as this vcf:
https://docs.google.com/spreadsheets/d/1rH-6Q8559C70QeINoOOwCj8SCRA2U-AYr9NSb-Qefs4/edit?usp=sharing
UPDATE:
Hi again everyone,
Ram has asked that I provide more clarity on the purpose of of my exercise. So here goes:
The user is looking at SNPs that are linked on a single functional gene in the coyote genome and would like to decipher haploytpes of these SNPs because different haploytypes are indicative of different levels of resistance to things such as Mange.
If there are 4 SNPs across the amplicon my user needs to know how these SNPs are inherited in an allele form, meaning which SNPs were on the same strand of DNA.
Ummm, why did you give us a spreadsheet of a VCF file?
Ummm, because any VCF file can be viewed as a csv or xls file. I provided this format to allow for easy viewing. If it would help anyone to have the actual VCF file I can easily provide that upon request. I wanted to know if I can get haplotypes from the data presented in the spreadsheet view of the VCF file.
I believe I found my own answer. I need to re-run FreeBayes and enable the parameter -E --max-complex-gap. Otherwise Freebayes only calculates haplotypes within a 3 bp distance. I am trying to calculate haploypes over a 250 bp amplicon.
My question was, people here know what a VCF file looks like. You could have provided a code block or a gist if you wished to exhibit an excerpt. Why use a spreadsheet?
Also, "any VCF can be viewed as a spreadsheet": Nope, don't do that. Excel is not meant to handle that data volume or that kind of data.
So I should paste like this for next time?
Do you have any information on my actual question about haplotypes?
Presumably the customer wants a FASTA file with two sequences per individual (i.e., the haplotypes)? Am I right? And then later a GENEPOP file input or something like that?
There are two options:
You could phase the VCF file with a tool such as BEAGLE (https://faculty.washington.edu/browning/beagle/beagle.html) then use
bcftools consensus -H 1
(for the haplotype 1's allele) andbcftools consensus -H 2
(for haplotype 2's allele) and output as FASTA for each allele. You could also useReadBackedPhasing
from GATK (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php) note that it does not support indels such as the ones you have in your example VCF output.One could also directly output the FASTA sequence with ambiguity codes with
bcftools consensus -I
option then use a program such as PHASE implemented in DnaSP (http://www.ub.edu/dnasp/) to determine the alleles for each haplotype. If you have reference alleles, you can also give those to PHASE to possibly improve the phasing.Note that I've had issues with strange haplotypes sequences when I tried option 1, so I'd recommend option 2.
If you use PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/), it can take the input as FASTA files from either method 1 or 2 from above and it will tell you the allele numbers for the individuals in GENEPOP format if you convert FASTA to GENEPOP. Granted it won't know that a sequence such as
belong to the same individual, so you will have to manually convert the GENEPOP output from
to
You could also write a regular expression to convert the default GENEPOP output of PGDSpider to one line per individual.
Unfortunately, no. Maybe if you could explain the purpose behind this exercise, we could get some clarity.
The explanation provided in the initial comment is sufficient. Comments like these are not helpful. If you're not willing to help then don't comment. It is not necessary.
Thank you for your feedback.
Almost. I've corrected it so it's better. You can use the formatting bar (especially the
code
option) to present your post better.