Question

Getting Haplotypes from a VCF file

4

Entering edit mode

6.9 years ago

gkuffel22 ▴ 100

Hi everyone,

I am analying a data set for a user interested in looking at variants in the MHC gene of coyotes. I have used Freebayes to generate a VCF file. The user would now like haploytypes, does anyone know how to decipher that info from a file such as this vcf:

https://docs.google.com/spreadsheets/d/1rH-6Q8559C70QeINoOOwCj8SCRA2U-AYr9NSb-Qefs4/edit?usp=sharing

UPDATE:

Hi again everyone,

Ram has asked that I provide more clarity on the purpose of of my exercise. So here goes:

The user is looking at SNPs that are linked on a single functional gene in the coyote genome and would like to decipher haploytpes of these SNPs because different haploytypes are indicative of different levels of resistance to things such as Mange.

If there are 4 SNPs across the amplicon my user needs to know how these SNPs are inherited in an allele form, meaning which SNPs were on the same strand of DNA.

Freebayes VCF haplotypes • 10k views

ADD COMMENT • link updated 3.6 years ago by Ram 44k • written 6.9 years ago by gkuffel22 ▴ 100

0

Entering edit mode

Ummm, why did you give us a spreadsheet of a VCF file?

ADD REPLY • link 6.9 years ago by Ram 44k

1

Entering edit mode

Ummm, because any VCF file can be viewed as a csv or xls file. I provided this format to allow for easy viewing. If it would help anyone to have the actual VCF file I can easily provide that upon request. I wanted to know if I can get haplotypes from the data presented in the spreadsheet view of the VCF file.

I believe I found my own answer. I need to re-run FreeBayes and enable the parameter -E --max-complex-gap. Otherwise Freebayes only calculates haplotypes within a 3 bp distance. I am trying to calculate haploypes over a 250 bp amplicon.

ADD REPLY • link 6.9 years ago by gkuffel22 ▴ 100

0

Entering edit mode

My question was, people here know what a VCF file looks like. You could have provided a code block or a gist if you wished to exhibit an excerpt. Why use a spreadsheet?

Also, "any VCF can be viewed as a spreadsheet": Nope, don't do that. Excel is not meant to handle that data volume or that kind of data.

ADD REPLY • link 6.9 years ago by Ram 44k

0

Entering edit mode

So I should paste like this for next time?

#CHROM  POS     ID  REF     ALT     QUAL    FILTER  INFO    FORMAT  10  110_S24_L001    111_S25_L001    115_S26_L001    118_S27_L001    11_S5_L001  120_S28_L001    121_S29_L001    123_S30_L001    124_S31_L001    12$
gi|1209890|gb|U47338.1|CFMHCDRB02   7   .   C   G   0   .   AB=0.223249;ABP=867.188;AC=63;AF=0.203226;AN=310;AO=386;CIGAR=1X;DP=15213;DPB=15213;DPRA=3.99182;EPP=815.344;EPPR=26084.5;GTI=16;LEN=1;MEA$
gi|1209890|gb|U47338.1|CFMHCDRB02   28  .   TTGGA   GTGAG,GTGAA     9.25207e+06     .   AB=0.415709,0.037858;ABP=75093.2,2.2572e+06;AC=51,4;AF=0.132812,0.0104167;AN=384;AO=530334,47952;CIGAR=1X2M2X,1X2M1X1M;DP=$

ADD REPLY • link updated 6.9 years ago by Ram 44k • written 6.9 years ago by gkuffel22 ▴ 100

2

Entering edit mode

Do you have any information on my actual question about haplotypes?

ADD REPLY • link 6.9 years ago by gkuffel22 ▴ 100

1

Entering edit mode

Presumably the customer wants a FASTA file with two sequences per individual (i.e., the haplotypes)? Am I right? And then later a GENEPOP file input or something like that?

There are two options:

You could phase the VCF file with a tool such as BEAGLE (https://faculty.washington.edu/browning/beagle/beagle.html) then use bcftools consensus -H 1 (for the haplotype 1's allele) and bcftools consensus -H 2 (for haplotype 2's allele) and output as FASTA for each allele. You could also use ReadBackedPhasing from GATK (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php) note that it does not support indels such as the ones you have in your example VCF output.
One could also directly output the FASTA sequence with ambiguity codes with bcftools consensus -I option then use a program such as PHASE implemented in DnaSP (http://www.ub.edu/dnasp/) to determine the alleles for each haplotype. If you have reference alleles, you can also give those to PHASE to possibly improve the phasing.

Note that I've had issues with strange haplotypes sequences when I tried option 1, so I'd recommend option 2.

If you use PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/), it can take the input as FASTA files from either method 1 or 2 from above and it will tell you the allele numbers for the individuals in GENEPOP format if you convert FASTA to GENEPOP. Granted it won't know that a sequence such as

>Individual1-1
ACTG
>Individual1-2
ACTG

belong to the same individual, so you will have to manually convert the GENEPOP output from

Individual1-1 001
Individual1-2 001

to

Individual1 001001

You could also write a regular expression to convert the default GENEPOP output of PGDSpider to one line per individual.

ADD REPLY • link 6.8 years ago by jean.elbers ★ 1.7k

0

Entering edit mode

Unfortunately, no. Maybe if you could explain the purpose behind this exercise, we could get some clarity.

ADD REPLY • link 6.9 years ago by Ram 44k

0

Entering edit mode

The explanation provided in the initial comment is sufficient. Comments like these are not helpful. If you're not willing to help then don't comment. It is not necessary.

ADD REPLY • link 3.6 years ago by nataliagru1 ▴ 90

0

Entering edit mode

Thank you for your feedback.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Almost. I've corrected it so it's better. You can use the formatting bar (especially the code option) to present your post better.

ADD REPLY • link 6.9 years ago by Ram 44k