I am using samtools/bcftools to generate a 'de novo' consensus assembly for use with rc454 to clean my sequencing data. The outputted consensus sequence is identical to the reference (see the BLAST alignment graphic at the bottom of the post) till it is truncated at ~5000 of ~7000 bp.
I don't understand why it's happening or what to do about it.
I've used
samtools mpileup
to generate .vcfbcftools consensus
to (try to) generate a consensus assembly.
According to this guy, the consensus function is buggy. Does anyone else have similar issues? Any ideas of what could be going wrong or a viable workaround? I have to process ~150 files w/ 10,000-60,000 reads of lengths 450-720 bp.
Details of commands, options, and outputs
### command to generate Variant calls
samtools mpileup -vf mnv1.fa -r mnv1:5473-5922 APL000000034.34.sorted.bam > APL000000034.34.amp2.vcf
[fai_load] build FASTA index.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
### consensus command and error message
bcftools consensus -f mnv1.fa APL000000034.34.amp2.vcf > scratch.fa
Symbolic alleles other than <DEL> are currently not supported: <X> at mnv1:5474
### first three lines of purported "consensus" sequence
head -3 scratch.fa
>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome
GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCA
AAAACAAGAAGGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTT
### first three lines of reference sequence
head -3 mnv1.fa
>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome
GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCAAAAACAAGAA
GGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTTCGGAGAGCGAAGACGAAATT
Can you show the variant at mnv1:5474? I assume it's some sort of duplication or other large structural variant. The consensus command can't handle them, so you might need to filter them out. However, I think GATK has a similar tool and perhaps it can handle this better.
Also, I just tried generating a consensus sequence using bcftools call--A: error with vcfutils when building consensus sequence with samtools and bcftools , which you suggested to another user about a month ago. The result was basically the same. The output of the consensus function was the EXACT reference sequence, not truncated as in the original workflow
Ah, I'd forgotten that I'd written that, good catch.
Variant at mnv1:5474
Interesting, what version of bcftools is this? I would have expected the most recent ones to hand gVCFs.
This is bcftools1.2. I don't believe the earlier versions have "consensus" function.