using bcftools/samtools to generate consensus using reference + variants; output is a truncated reference
0
0
Entering edit mode
9.7 years ago
nirvan ▴ 10

I am using samtools/bcftools to generate a 'de novo' consensus assembly for use with rc454 to clean my sequencing data. The outputted consensus sequence is identical to the reference (see the BLAST alignment graphic at the bottom of the post) till it is truncated at ~5000 of ~7000 bp.

I don't understand why it's happening or what to do about it.

I've used

  1. samtools mpileup to generate .vcf
  2. bcftools consensus to (try to) generate a consensus assembly.

According to this guy, the consensus function is buggy. Does anyone else have similar issues? Any ideas of what could be going wrong or a viable workaround? I have to process ~150 files w/ 10,000-60,000 reads of lengths 450-720 bp.​

Details of commands, options, and outputs

### command to generate Variant calls
samtools mpileup -vf mnv1.fa -r mnv1:5473-5922 APL000000034.34.sorted.bam > APL000000034.34.amp2.vcf

[fai_load] build FASTA index.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
### consensus command and error message
bcftools consensus -f mnv1.fa APL000000034.34.amp2.vcf > scratch.fa
Symbolic alleles other than <DEL> are currently not supported: <X> at mnv1:5474
### first three lines of purported "consensus" sequence
head -3 scratch.fa
>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome
GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCA
AAAACAAGAAGGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTT
### first three lines of reference sequence
head -3 mnv1.fa
>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome
GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCAAAAACAAGAA
GGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTTCGGAGAGCGAAGACGAAATT

< image not found >

samtools bcftools mosaik consensus rc454 • 7.5k views
ADD COMMENT
0
Entering edit mode

Can you show the variant at mnv1:5474? I assume it's some sort of duplication or other large structural variant. The consensus command can't handle them, so you might need to filter them out. However, I think GATK has a similar tool and perhaps it can handle this better.

ADD REPLY
1
Entering edit mode

Also, I just tried generating a consensus sequence using bcftools call--A: error with vcfutils when building consensus sequence with samtools and bcftools , which you suggested to another user about a month ago. The result was basically the same. The output of the consensus function was the EXACT reference sequence, not truncated as in the original workflow

ADD REPLY
0
Entering edit mode

Ah, I'd forgotten that I'd written that, good catch.

ADD REPLY
0
Entering edit mode

Variant at mnv1:5474

grep 5474 APL000000034.34.amp2.vcf

mnv1    5474    .    A    <X>    0    .    DP=1321;I16=671,645,0,0,39501,1.24369e+06,0,0,44237,1.49124e+06,0,0,1358,1456,0,0;QS=1,0;MQSB=2.39844e-05;MQ0F=0    PL    0,255,255
ADD REPLY
0
Entering edit mode

Interesting, what version of bcftools is this? I would have expected the most recent ones to hand gVCFs.

ADD REPLY
0
Entering edit mode

This is bcftools1.2. I don't believe the earlier versions have "consensus" function.

ADD REPLY

Login before adding your answer.

Traffic: 1721 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6