Question

How to do consensus calling from bam files

2

Entering edit mode

8.4 years ago

AW ▴ 350

Hi,

I want to extract consensus sequences for a number of bam files with the intent to align sequences across individuals and build trees.

I am following the advice here https://github.com/samtools/bcftools/wiki/HOWTOs via vcf2fq: samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

via bcftools consensus: samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz tabix calls.vcf.gz cat ref.fa | bcftools consensus calls.vcf.gz > cns.fa

But I have some questions I cannot find the answer to:

How does bcftools consensus work differently from vcfutils.pl vcf2fq?
How does either deal with missing data? For example if a site is not covered by reads in the bam file, ideally I would like the consensus sequence to contain an N. Is this the case or is the allele in the reference sequence used?
Does the consensus callers filter the SNPs? And if not, how do you do this? is this achieved by running bcftools filter after bcftools calls?

Thanks!

Consensus calling SNPs Samtools BCFtools • 4.5k views

ADD COMMENT • link updated 8.4 years ago by Brice Sarver ★ 3.8k • written 8.4 years ago by AW ▴ 350

0

Entering edit mode

I would say that inserting N might not be the right choice. If a region is not covered by reads it is probably not present - perhaps you might insert the reference sequence for that but definitely not Ns.

ADD REPLY • link 8.4 years ago by Istvan Albert 102k

score 1 · Answer 1 · 2016-09-14

1

Entering edit mode

8.4 years ago

Brice Sarver ★ 3.8k

Some consensus generators do strange things, especially if you're using the results for downstream analysis and not, say, as a crude sample-specific reference. Especially if you're doing phylogenetic analysis, I'd recommend a slightly different approach:

Get a set of high-quality, filtered SNPs based on whatever criteria you think is good for your system. Then, use code or the GATK's FastaAlternateReferenceMaker to inject that variation back into the reference. For regions that you are uncertain about, such as positions with low coverage or low quality, you can mask them using bedtools' maskfasta.

ADD COMMENT • link 8.4 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

This is sound like the a better approach - we've often tried to call consensus from mpileup outputs but that only works for the simplest of cases in general the situation is more complicated.

ADD REPLY • link 8.4 years ago by Istvan Albert 102k