generating a reasonable consensus with bcftools + vcf-consensus
0
4
Entering edit mode
7.6 years ago
mths_b ▴ 40

Hi,

I want to generate a consensus sequence using a bam alignment from a specific genome location. This consensus must include indels, and call IUPAC ambiguity codes for heterozygous sites. Ideally I would like to be able to call a highest quality (based on phred score), majority consensus sequence (i.e. most common bases, fewest ambiguities).

I've read many discussions about how to do that and the best solution I came up was to: 1) map reads to genome fasta file, get the bam alignment; 2) generate a reference fasta file in which regions with zero coverage are masked (using bedtools genomecov + bedtools maskfasta) - this allows me to deal with deleted regions in my sample, then:

samtools mpileup -uf reference_masked.fasta sample.sorted.bam -l coordinates.bed | bcftools call -vm -Oz - > output.vcf.gz
tabix -p vcf output.vcf.gz
samtools faidx reference_masked.fasta chr:start-end | vcf-consensus output.vcf.gz | bgzip > consensus.fasta.gz

It does give me a fair consensus sequence. Indeed, masking regions with no coverage before calling the SNPs seems to give a better consensus. However, when I examine the same bam alignment in Geneious, I realize that for many sites it is calling the REFERENCE base instead of my SAMPLE base with no apparent reason.

For example, in some sites, I have 4 reads calling a "G" and none calling a "C" (which is the base in the reference sequence), but the consensus give me a "C" . I thought that was a problem of coverage, but in other cases I also have 4 reads calling "T" and the consensus is really "T", although the reference is "C". Also, in other sites where I have 9 reads (which is a quite high coverage) calling "A", it calls "G" instead, which is the reference base. It does not seem to be a matter of quality, since these were all high quality (>20) bases.

In the example above I've used medium coverage datasets (~10x). However, I also want to be able to call consensus sequences from low coverage data (~1x). I suppose I would have to use different settings in this last case, and I am aware of issues with low coverage data.

Any suggestions for this issue? I really appreciate any help you guys can provide.

Thank you!

SNP consensus bam • 4.3k views
ADD COMMENT
1
Entering edit mode

I'm also having this issue. There doesn't seem to be any prebaked solution to converting an mpileup to a consensus sequence that will include both SNPs and INDELs. Even when I can see the deletion in the alignment in igv, and the pileup recognizes it, the resulting vcf going through bcftools gets rid of the indel!

ADD REPLY

Login before adding your answer.

Traffic: 1693 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6