I am getting started with bioinformatics and experimenting with assembly tools.
I generated a random reference sequence of 10000bp, and then created a subject sequence from it by introducing 500 deletions. From the subject sequence, 5000 simulated reads with read length 50-200 was obtained in-silico, with all bases having quality of 30. The aim was to assemble these reads to re-create the subject sequence.
bowtie2-build reference.fasta.gz index
bowtie2 -x index -U reads.fastq -S alignment.sam
samtools sort alignment.sam -o alignment.bam
bcftools mpileup --fasta-ref reference.fasta.gz alignment.bam > subject.mpileup
bcftools call -mv --ploidy 1 subject.mpileup > subject.vcf
cat subject.vcf
The resultant VCF file only picked up 30 or so indels, out of the total 500. Part of the output as shown:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT alignment.bam
reference 581 . AC A 56 . INDEL;IDV=56;IMF=0.949153;DP=59;VDB=0.482463;SGB=-0.693147;MQ0F=0.0338983;AC=1;AN=1;DP4=13,0,46,0;MQ=16 GT:PL 1:83,0
reference 717 . CG C 52 . INDEL;IDV=61;IMF=0.924242;DP=66;VDB=0.967282;SGB=-0.693147;MQ0F=0;AC=1;AN=1;DP4=14,0,52,0;MQ=21 GT:PL 1:79,0
reference 1647 . CGG CG 43.3557 . INDEL;IDV=73;IMF=0.986486;DP=74;VDB=0.142911;SGB=-0.692717;MQ0F=0.135135;AC=1;AN=1;DP4=51,0,23,0;MQ=12 GT:PL 1:82,11
reference 1869 . TAG T 10.5041 . INDEL;IDV=3;IMF=0.046875;DP=64;VDB=0.963375;SGB=-0.676189;MQ0F=0.109375;AC=1;AN=1;DP4=53,0,11,0;MQ=9 GT:PL 1:38,0
Inspecting the alignment.bam file showed that besides indels at position 581 picked up, indels position 549, 634, 638 etc are all missed but they are very obvious from the reads.
On the other hand, de-novo assembly with spades.py perfectly reconstructed the exact subject seqeunce. Did I misconfigure bcftools somehow? Or is this an inherent limitation of variant calling from reference? Any help would be greatly appreciated.
Other intermediate files are found here: subject.mpileup
Thanks for the insight. Though picking up less than 10% of indels in simulated data despite coverage of 50x sounds really worrying. Is there an alternative tool that may work better for general purpose variant calling?