Hi All
I've aligned pacbio ccs reads to a small reference sequence using minimap2 and viewed the result in IGV. The alignments look good, with generally very few discrepancies with the reference. There is one region, however where more variations are observed and that is at a string of 13 Cs. This is not unexpected but when viewing it in IGV I'm actually impressed by how minimal it appears to be.
Note that the reads aligned in this snapshot are typical of the rest of the alignment as well and that the center line marks the beginning of the polyC string
Subsequently I attempted to call variants on this bam file using a variety of variant caller, callvariants (BBMAP), freebayes, and HaplotypeCaller (gatk). All call a 1bp deletion of a C at the poly C string.
callvariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MIR604-F2-M2-1A.ccs.10pct.fastq-vs-MIR604.5901-10938.fa.minimap2.sort
MIR604.5901-10938 1401 . Tc T 23.72 PASS SN=0;STA=1401;STO=1402;TYP=DEL;R1P=340;R1M=281;R2P=0;R2M=0;AD=621;DP=1000;MCOV=-1;PPC=0;AF=0.6210;RAF=0.6210;LS=3013780;MQS=37260;MQM=60;BQS=21508;BQM=41;EDS=960160;EDM=4471;IDS=619759;IDM=999;NVC=0;FLG=0;CED=601;HMP=6;SB=0.9997 GT:DP:AD:AF:RAF:NVC:FLG:SB:SC:PF 1:1000:621:0.6210:0.6210:0:0:0.9997:23.72:PASS
freebayes
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown
MIR604.5901-10938 1401 . TC T 20562.3 . AB=0.578755;ABP=61.8389;AC=2;AF=0.666667;AN=3;AO=632;CIGAR=1M1D;DP=1092;DPRA=0;EPP=10.9266;EPPR=20.4059;GTI=0;HWE=-0;LEN=1;MEANALT=20;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=371.527;PAIRED=0;PAIREDR=0;RO=364;RPP=626.539;RPPR=393.971;RUN=1;SAP=16.2178;SRP=20.4059;TYPE=del;XAI=0.00135563;XAM=0.00142991;XAS=7.42841e-05;XRI=0.00134239;XRM=0.00140067;XRS=5.82842e-05;BVAR GT:DP:RO:QR:AO:QA 0/1/1:1092:364:32027:632:23740
HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MIR604-F2-M2-1A
MIR604.5901-10938 1401 . TC T 2899.01 . AC=1;AF=1.00;AN=1;BaseQRankSum=0.922;DP=342;FS=5.456;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=8.48;ReadPosRankSum=-0.229;SOR=0.463 GT:AD:DP:GQ:PL 1:119,223:342:99:2909,0
All three variant callers call a 1 bp deletion at the homopolymer. What is curious though is the apparent frequency of the alternate allele each of these tools perceive compared to what is visible in IGV. IGV indicates that variants at this position are relatively rare (1-2%) where as each of these callers is reporting the variant is observed in the majority of reads (~60%). Any ideas why?
Thanks
Please use the code button (
101010
) to highlight code and data examples and make sure you use the image button to embed images. You have to post the full image link including the suffix, here that ishttps://i.ibb.co/DpTp5Vr/poly-C-variants.png
which needs to be pasted into the image button field. I did it for you this time. Thanks!Actually I tried doing both but they never appeared correctly in the preview window so I must be doing something wrong