Hi All
I often use freebayes for variant calling and am pleased with the results. Now I want to use it to call haplotypes. The nature of my project is that I wish to phase variants in 400-450bp amplicons. These amplicons are generated from a mixed target source such that different targets within it will generate amplicons of slightly different sequence though different targets may share one or more specific variants. My plan was to subject these amplicons to 2X300 bp Illumina sequencing, merge read 1 and read 2, then align to a reference. The freebayes documentation indicates that the “-E” flag is required to call phased variants, and so I’ve included “-E 400” in my command line. I then tested this against simulated reads I made that contain a 2bp and 3bp deletion 11bp apart. While both deletions are called by freebayes I see nothing that indicates they are phased. I had read online that I might expect these two variants to be depicted by freebayes as a single "complex" variant bound by the 2 deletions. I do not. Moreover, in the header I see the following: "##phasing=none". Am I mistaken about how freebayes reports phased variants or is there something more to the proper command line that I am missing?
Thanks
Mark
Example:
freebayes version: v1.0.2-14-g9e98366
Code:
$ freebayes -F 0.1 -E 400 -f target.ref.fa -v outE400F0.10.new out.ref.sort.bam
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown
target 186 . TAGAGGCTGCGTACGGGTAACACGTCGTAAGGAGCGATGACGCGGGCACTGCTCCGGCGTCGTCGA TAGAGGCTGCGTACGGGTAACACGTCGTAAGGAGCGATGACGCGGGCACTGCTCCGGCGTCGTA 2403.38 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=280;CIGAR=63M2D1M;DP=993;DPB=962.909;DPRA=0;EPP=4.53033;EPPR=0;GTI=0;LEN=2;MEANALT=452;MQM=17;MQMR=0;NS=1;NUMALT=1;ODDS=388.578;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=4320;QR=0;RO=0;RPL=280;RPP=611.023;RPPR=0;RPR=0;RUN=1;SAF=147;SAP=4.53033;SAR=133;SRF=0;SRP=0;SRR=0;TYPE=del GT:DP:DPR:RO:QR:AO:QA:GL 1/1:993:993,280:0:0:280:4320:-327.573,-84.2884,0
target 261 . TGAGGTGGAGGCGA TGAGGTGGAGA 7515.8 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=787;CIGAR=1M3D10M;DP=981;DPB=775.071;DPRA=0;EPP=5.02174;EPPR=0;GTI=0;LEN=3;MEANALT=37;MQM=17;MQMR=0;NS=1;NUMALT=1;ODDS=1091.43;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=13337;QR=0;RO=0;RPL=787;RPP=1711.96;RPPR=0;RPR=0;RUN=1;SAF=380;SAP=5.02174;SAR=407;SRF=0;SRP=0;SRR=0;TYPE=del GT:DP:DPR:RO:QR:AO:QA:GL 1/1:981:981,787:0:0:787:13337:-991.985,-236.911,0
Did you ever figure this out? I also am trying to get haploytpes from 450 bp amplicons.
As I recall I was able to get some haplotype calling but it was erratic and failed to accurately represent simulated data. Your objective seems much like mine was. What I wound up doing was to write a pipeline which would analyze read pairs from amplicons individually and then identify sets of variants that occur in phase on these individual amplicons above a set minimum frequency threshold, thus eliminating PCR/sequencing errors present in individual reads. This was more reliable.
Thank you for the reply. I have 250 bp paired end reads and my user (I work for a core) is interested in haplotypes that occurr in the MHC gene in coyotes. Would you be willing to share your script or provide any additional insight?
As I work for a private company, I cannot share my code with you. Sorry. If you can code or have access to someone who can, I could describe it to you a bit more and you could attempt to create your own version.