Hello everyone!
I mapped Forward and Reverse reads from one plant genome against a reference fasta sequence. As expected, there are varaints (SNP and indels). I am interested in retrieving the consensus sequence for downstream analysis. The reads come from a diploid plant genome and the gene of interest is likely in single copy. However, my plant genome is clearly not 100% homozygous. So, in the consensus sequence I should see homozygous variants and heterozygous variants coded with IUAPC letters. I ran this codes and everything worked well:
samtools mpileup -d8000 -A -uf ref.fasta sample.sorted.bam > sample.mpileup
bcftools call -mv -Oz -o sample.calls.vcf.gz sample.mpileup
tabix sample.calls.vcf.gz
cat ref.fasta | bcftools consensus sample.calls.vcf.gz > sample.consensus.fasta
The only problem is that I have no heterozygous site in the consensus sequence despite I clearly see some in the raw mapping alignment. How to get heterozygous site codes with IUAPC in the consensus sequence ?
Kevin
Ps: I tried the -i option in bcftools consensus but then all variant sites were codes heterozygous. Also the coverage is quite OK (20x)
Yes, that is this option I mentionned I my post scriptum but when I turn it on, then all the variant positions appear heterozygous. In fact that is not true: I checked the mapping output and some variant are homozygous and others are heterozygous between reads, that is because they reflect two different alleles... I think that the reason why I got all variant heterozygous is beacause the comparison is made between reads from my plant genome and the reference and not just between my reads despite they sometimes carry two alleles (diploid plant, highly heterozygous, single copy gene) in a 1:1 proportion
Did you find a solution to this? Please share if you did.