Question

freebayes haplotype calling (phased variants)

0

Entering edit mode

8.8 years ago

mark.rose ▴ 50

Hi All

I often use freebayes for variant calling and am pleased with the results. Now I want to use it to call haplotypes. The nature of my project is that I wish to phase variants in 400-450bp amplicons. These amplicons are generated from a mixed target source such that different targets within it will generate amplicons of slightly different sequence though different targets may share one or more specific variants. My plan was to subject these amplicons to 2X300 bp Illumina sequencing, merge read 1 and read 2, then align to a reference. The freebayes documentation indicates that the “-E” flag is required to call phased variants, and so I’ve included “-E 400” in my command line. I then tested this against simulated reads I made that contain a 2bp and 3bp deletion 11bp apart. While both deletions are called by freebayes I see nothing that indicates they are phased. I had read online that I might expect these two variants to be depicted by freebayes as a single "complex" variant bound by the 2 deletions. I do not. Moreover, in the header I see the following: "##phasing=none". Am I mistaken about how freebayes reports phased variants or is there something more to the proper command line that I am missing?

Thanks

Mark

Example:

freebayes version: v1.0.2-14-g9e98366

Code:

$ freebayes -F 0.1 -E 400 -f target.ref.fa -v outE400F0.10.new  out.ref.sort.bam

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  unknown
target  186     .       TAGAGGCTGCGTACGGGTAACACGTCGTAAGGAGCGATGACGCGGGCACTGCTCCGGCGTCGTCGA  TAGAGGCTGCGTACGGGTAACACGTCGTAAGGAGCGATGACGCGGGCACTGCTCCGGCGTCGTA      2403.38 .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=280;CIGAR=63M2D1M;DP=993;DPB=962.909;DPRA=0;EPP=4.53033;EPPR=0;GTI=0;LEN=2;MEANALT=452;MQM=17;MQMR=0;NS=1;NUMALT=1;ODDS=388.578;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=4320;QR=0;RO=0;RPL=280;RPP=611.023;RPPR=0;RPR=0;RUN=1;SAF=147;SAP=4.53033;SAR=133;SRF=0;SRP=0;SRR=0;TYPE=del     GT:DP:DPR:RO:QR:AO:QA:GL        1/1:993:993,280:0:0:280:4320:-327.573,-84.2884,0

target  261     .       TGAGGTGGAGGCGA  TGAGGTGGAGA     7515.8  .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=787;CIGAR=1M3D10M;DP=981;DPB=775.071;DPRA=0;EPP=5.02174;EPPR=0;GTI=0;LEN=3;MEANALT=37;MQM=17;MQMR=0;NS=1;NUMALT=1;ODDS=1091.43;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=13337;QR=0;RO=0;RPL=787;RPP=1711.96;RPPR=0;RPR=0;RUN=1;SAF=380;SAP=5.02174;SAR=407;SRF=0;SRP=0;SRR=0;TYPE=del     GT:DP:DPR:RO:QR:AO:QA:GL      1/1:981:981,787:0:0:787:13337:-991.985,-236.911,0

freebayes haplotype phase phased variant • 5.2k views

ADD COMMENT • link updated 4.7 years ago by mikemc ▴ 10 • written 8.8 years ago by mark.rose ▴ 50

0

Entering edit mode

Did you ever figure this out? I also am trying to get haploytpes from 450 bp amplicons.

ADD REPLY • link 6.9 years ago by gkuffel22 ▴ 100

0

Entering edit mode

As I recall I was able to get some haplotype calling but it was erratic and failed to accurately represent simulated data. Your objective seems much like mine was. What I wound up doing was to write a pipeline which would analyze read pairs from amplicons individually and then identify sets of variants that occur in phase on these individual amplicons above a set minimum frequency threshold, thus eliminating PCR/sequencing errors present in individual reads. This was more reliable.

ADD REPLY • link 6.9 years ago by mark.rose ▴ 50

0

Entering edit mode

Thank you for the reply. I have 250 bp paired end reads and my user (I work for a core) is interested in haplotypes that occurr in the MHC gene in coyotes. Would you be willing to share your script or provide any additional insight?

ADD REPLY • link 6.9 years ago by gkuffel22 ▴ 100

0

Entering edit mode

As I work for a private company, I cannot share my code with you. Sorry. If you can code or have access to someone who can, I could describe it to you a bit more and you could attempt to create your own version.

ADD REPLY • link 6.9 years ago by mark.rose ▴ 50

score 0 · Answer 1 · 2020-04-10

I know this question is old, but for future readers...if you are doing amplicon sequencing where the two ends of your paired ends overlap, then you can (and probably should) use a program specifically for analyzing amplicon sequencing data. Programs exist that can infer the exact amplicon sequences (also known as amplicon sequence variants). In this way you will have the full haplotype information for each variant. Available software includes DADA2 (https://benjjneb.github.io/dada2/tutorial.html), Deblur (https://msystems.asm.org/content/2/2/e00191-16), and Unoise (https://drive5.com/usearch/manual/pipe_otus.html).