I am using Freebayes (v1.3.6) for SNP calling analysis on eggplant population data. While scrutinizing the VCF file, it has come to my attention the heterozygous assignation in some cases. This was the command used to perform the SNP calling:
/usr/bin/freebayes -b all_parents_mapped_BWA_v3_20X_dedup.bam -f /data/users/vbarfon/Genomes/Solgenomics_2019_08_30/Eggplant_V3_Chromosomes.fa -v multithreadfreebayes/subfiles/all_parents_mapped_BWA_v3_20X_dedup_SMEL3Ch01_0_136534347.vcf -r SMEL3Ch01:0-136534347 --limit-coverage 800 --min-coverage 10 -m 20 -q 20
In this case, a homozygous genotype was assigned to sample5, supported by 11 reference reads and 1 alternate read, resulting in an alternate allele frequency of 1/12=0.083. On the other hand, sample8 exhibited a heterozygous genotype, supported by 9 reference reads and 1 alternate read, with an alternate allele frequency of 1/10=0.1. Despite the similarity in alternate allele frequencies between samples, Freebayes assigns different genotypes at the same loci.
SMEL3Ch01 96806 . TTTGCC CTGGCG 47.5168 . AB=0.185185;ABP=49.4959;AC=3;AF=0.1875;AN=16;AO=12;CIGAR=1X1M1X2M1X;DP=131;DPB=131;DPRA=1.18636;EPP=3.73412;EPPR=3.90444;GTI=1;LEN=6;MEANALT=1;MQM=35.5833;MQMR=60;NS=8;NUMALT=1;ODDS=4.88556;PAIRED=1;PAIREDR=0.915966;PAO=0;PQA=0;PQR=0;PRO=0;QA=465;QR=4692;RO=119;RPL=10;RPP=14.5915;RPPR=4.48836;RPR=2;RUN=1;SAF=9;SAP=9.52472;SAR=3;SRF=58;SRP=3.17453;SRR=61;TYPE=complex GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:31.4607:17:13,4:13:520:4:159:-6.03904,0,-42.0287 0/0:101.192:22:22,0:22:872:0:0:0,-6.62266,-78.7923 0/0:71.0896:12:12,0:12:460:0:0:0,-3.61236,-41.7443 0/0:63.2764:21:20,1:20:792:1:38:0,-2.73405,-68.0036 0/0:51.1134:12:11,1:11:437:1:40:0,-1.51775,-37.5857 0/1:41.2006:27:22,5:22:851:5:187:-7.81267,0,-68.7726 0/0:65.069:10:10,0:10:401:0:0:0,-3.0103,-36.4467 0/1:0.0326615:10:9,1:9:359:1:41:-0.735817,0,-29.6608
Moreover, at other loci, sample6, supported by 11 reference reads and 1 alternate read, exhibits a heterozygous genotype instead of a homozygous one, as exhibited in the previous case by sample5.
SMEL3Ch01 82135 . A G 885.116 . AB=0.0833333;ABP=21.1059;AC=5;AF=0.3125;AN=16;AO=30;CIGAR=1X;DP=107;DPB=107;DPRA=1.03535;EPP=10.2485;EPPR=11.1604;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=59.6104;NS=8;NUMALT=1;ODDS=0.82119;PAIRED=1;PAIREDR=0.922078;PAO=0;PQA=0;PQR=0;PRO=0;QA=1174;QR=2974;RO=77;RPL=19;RPP=7.64277;RPPR=9.35551;RPR=11;RUN=1;SAF=11;SAP=7.64277;SAR=19;SRF=35;SRP=4.39215;SRR=42;TYPE=snp GT:GQ:DP:AD:RO:QR:AO:QA:GL 1/1:30.4148:14:0,14:0:0:14:551:-49.9224,-4.21442,0 0/0:52.3018:15:15,0:15:570:0:0:0,-4.51545,-51.6258 1/1:33.4251:15:0,15:0:0:15:582:-52.7037,-4.51545,0 0/0:40.2606:11:11,0:11:418:0:0:0,-3.31133,-37.9553 0/0:40.2606:11:11,0:11:434:0:0:0,-3.31133,-39.4105 0/1:1.58028:12:11,1:11:398:1:41:-0.482207,0,-32.3809 0/0:58.3224:17:17,0:17:679:0:0:0,-5.11751,-61.4334 0/0:43.2709:12:12,0:12:475:0:0:0,-3.61236,-43.0969
This performance was also observed at higher coverages. For example, in this case, sample1, sample2, sample4, and sample7 carry a heterozygous genotype when their alternate allele frequencies are 4/45=0.088, 4/27=0.148, 6/45=0.133, 3/24=0.125, respectively.
SMEL3Ch01 96344 . T C 30.2059 . AB=0.120567;ABP=179.331;AC=4;AF=0.25;AN=16;AO=24;CIGAR=1X;DP=242;DPB=242;DPRA=0;EPP=4.45795;EPPR=3.82085;GTI=2;LEN=1;MEANALT=1.125;MQM=45.4583;MQMR=54.682;NS=8;NUMALT=1;ODDS=5.42383;PAIRED=1;PAIREDR=0.949309;PAO=0;PQA=0;PQR=0;PRO=0;QA=966;QR=8706;RO=217;RPL=21;RPP=32.3252;RPPR=3.10036;RPR=3;RUN=1;SAF=15;SAP=6.26751;SAR=9;SRF=114;SRP=4.22112;SRR=103;TYPE=snp GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:0.000458462:45:41,4:41:1610:4:164:-1.33939,0,-127.693 0/1:22.7339:27:23,4:23:927:4:164:-6.37746,0,-71.7496 0/0:82.5237:17:16,1:16:643:1:32:0,-1.94407,-52.6745 0/1:29.6502:45:38,6:38:1525:6:246:-8.28092,0,-119.131 0/0:64.3246:27:25,2:25:1013:2:82:0,-0.614318,-82.214 0/0:66.6283:32:29,3:29:1168:3:114:0,-0.956644,-93.4502 0/1:0.0107044:24:21,3:21:853:3:123:-2.78843,0,-68.7308 0/0:96.5417:25:24,1:24:967:1:41:0,-3.63818,-80.9723
Summarizing, those results showed that Freebayes assigns genotypes randomly (two different sites supported by the same number of reads have different genotypes) and it assigns heterozygous genotypes even when the alternate allele frequency is too low, expecting homozygous genotypes at those sites. How can I deal with this issue?
It was also asked for help in Github: https://github.com/freebayes/freebayes/issues/793
You are right. I have modified the entry by replacing the images with text, which makes it easier to view
but the number of reads is not enough. What about the genotype quality...
I have redone the SNP calling so that the GQ field is integrated into the resulting file. The quality of the assigned genotypes is indeed very low in the cases I have pointed out in the text, but my question is, why instead of assigning heterozygous genotypes with low quality does it not assign homozygous genotypes, which is what would be expected considering the coverage that supports the reference allele and the alternative allele? As it did in the first case with sample 5. Why has it not done this in the second case with sample 6?
I have visualized the alignments in IGV and have seen a possible reason why the genotype for sample 5 is 0/0 at position 96806 and the genotype for sample 6 is 0/1 at position 82135, both being supported by 11 reads for the reference allele and 1 for the alternative allele. As attached in the images, the mapping quality of the read from sample 5 is low (MAPQ 21), which is why this read is not considered when assigning the genotype.
On the other hand, the mapping quality of the read from sample 6 is high (MAPQ 60), so this read is taken into account, considering the heterozygous genotype.
Based on this, should I believe that indeed the genotype of sample 6 is heterozygous despite the difference in the number of reads supporting each allele being 10, favoring the reference allele? Shouldn't the number of reads also be considered when assigning genotypes? I know that the number of reads may not be enough, but there are studies where sequencing coverage is low (< 10X) and this methodology is used.