I'm having a problem with variant calling on some PacBio RS II data. Ive aligned my reads against different small reference (~15kb) using blasr and then ran the resulting bam file through PacBio's samtlink version 6 variantCaller as follows:
variantCaller --algorithm arrow --log-file variantCaller.log --annotateGFF --reportEffectiveCoverage --noEvidenceConsensusCall lowercasereference --minCoverage 40 --coverage 100 --minMapQV 40 --minConfidence 10 --minReadScore 0.75 --minSnr 3.75 --minZScore -3.5 --minAccuracy 0.82 --numWorkers 5 -r ref.fa -o 129C02-vs-ref.fa.sort.vcf -o 129C02-vs-ref.fa.sort.gff -o 129C02-vs-ref.fa.sort.fasta -o 129C02-vs-ref.fa.sort.fastq 129C02-vs-ref.fa.sort.bam
As you can see, among my outputs are a vcf and a fasta file. It is my understanding, perhaps incorrect, that the fasta file is the consensus sequence determined by the reads and that the differences between the original reference and it should be the variants identified in the vcf file (barring the presence of larger structural variants). This does not appear to always be the case. For one reference, the vcf indicates a one 1bp deletion but if I align the reference and the consensus fasta output, two, widely separate 1-bp deletions are obseved, only one being the deletion reported in the vcf.
Can anyone explain this discrepancy?