Hi,
I want to extract consensus sequences for a number of bam files with the intent to align sequences across individuals and build trees.
I am following the advice here https://github.com/samtools/bcftools/wiki/HOWTOs via vcf2fq: samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
via bcftools consensus: samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz tabix calls.vcf.gz cat ref.fa | bcftools consensus calls.vcf.gz > cns.fa
But I have some questions I cannot find the answer to:
How does bcftools consensus work differently from vcfutils.pl vcf2fq?
How does either deal with missing data? For example if a site is not covered by reads in the bam file, ideally I would like the consensus sequence to contain an N. Is this the case or is the allele in the reference sequence used?
Does the consensus callers filter the SNPs? And if not, how do you do this? is this achieved by running bcftools filter after bcftools calls?
Thanks!
I would say that inserting
N
might not be the right choice. If a region is not covered by reads it is probably not present - perhaps you might insert the reference sequence for that but definitely not Ns.