Hello,
I have mapped several Illumina paired-end libraries on the reference genome of a protist (unicellular eukaryote; 100Mb genome with few repeats but very low AT, high ploidie) using Bwa-mem. All libraries correspond to different lineages of the species of the reference genome.
After calling heterozygous and homozygous SNPs using samtools (only primary alignments: -F 260) and bcftools (only good quality alignments: -q 30), I obtained the following results:
- lineages nearly identical to the ref genome: 1K homo SNPs + 1K hetero SNPs
- lineages a bit more distant from the ref genome: 0.5M homo SNPs + 10K hetero SNPs
- lineages very distant from the ref genome: 1M homo SNPs + 50K hetero SNPs
nb: a SNP was called if the number of alternate reads (DP4) was at least 5 reads, and was called 'heterozygous' when the number of reference reads was below 5.
The fact that I observe more homozygous SNPs in distant lineages than in lineages close to the reference genome makes perfectly sense to me. But why the number of heterozygous SNPs also increases with the phylogenetic/evolutionary distance ? I would expect it to be constant across lineages. Does it suggest that there is a problem with my analyses ?
many thanks if you could help me making sense of these results.
I am not sure if your system to detect homozygous and heterozygous SNPs is common (using a five read threshold) but perhaps you want to try using GATK to call variants instead?
thanks Samuel. I'll give GATK or Freibades a try. But my problem will likely persist with these programs.