Hi all,
I am interested in generating consensus sequences from a bwa alignment. I generated the consensus sequences as follows:
samtools mpileup -uf reference.fasta sorted.bam -d 8000 | bcftools call -mv -Oz -o sorted.vcf.gz
tabix sorted.vcf.gz
cat reference.fasta | bcftools consensus sorted.vcf.gz > consensus.fasta
However, I noticed that even for sequences in my reference where none of the reads matched, I got a potential "consensus" sequence back. I assume this is because there were no entries in the VCF file that indicated SNPs, which could mean a perfect alignment.
However, this is misleading in my case, because if nothing mapped to a sequence in the reference set then I don't want to be confused with a consensus sequence. On the other hand, if a lot of reads mapped perfectly to a sequence in the reference set, then of course I'd like to see the correct consensus sequence.
How would you recommend I modify this approach?
Is it better to filter the bam somehow to only keep things that have a certain mapping depth? Or should I filter the VCF?
Thanks in advance!
Thank you for the suggestion, I will try that!