Hi, I'm looking for some tools to perform SNP calling on some of BAM alignment data. The data on which I am doing research come from Illumina technology (average coverage 300x) and PacBio (average coverage 30x). Does anyone know give me some suggestions on which are the best tools in the literature?
This is a paper that talk about almost all available tools and their comparison. http://www.nature.com/articles/srep17875. You can use samtools, GATK, freebayes. I also know you can use VarScan2.
As far as I can see, no common variant callers work on Pacbio Sequel data (bacterial, read length ~3-5 kbp, coverage 200-300x).
Supporting VCF4 is essential for the Pacbio software SMRT analysis. I currently have two Pacbio projects for resequencing (and also methylation detection). The resequencing part of the more recent SMRT Analysis tools only generates VCF 3.3 from memory. VCF 4.0 (at least) however is essential to work with downstream VCF analysis tools such as Snpeff, vt etc.
Current leading variant callers such as Freebayes do not work on Pacbio BAMs (I believe the MD tags were too long). I need to try Samtools next, but at the moment cannot recommend Pacbio for resequencing.
It would be great if anyone has any solutions to this issue. Even if the callers perform less than optimally it would be good to find one suitable.
BBMap's CallVariants will run fine on PacBio data, and produce output. It produces VCF 4.2 compliant files. As for the quality of the results... I'm not sure, since it's optimized for Illumina. Error-corrected PacBio reads should work well, though.
Thanks Brian. I haven't been successful with Samtools1.4 so far - though it produces output, the I16 fields are all 0 despite setting the --min-BQ lower as with this command.
So it turns out I can generate results using the current (bioconda) bbmap callvariants.sh. I had to play around reducing nearly all filters to really low (not 0, this outputs all base positions) for most parameters.
So progress has been made.
Command ( I have tried many variations of this already).
Great to hear; I'd certainly be interested in any additional feedback you have about what it gets wrong, or the best defaults for PacBio reads. But certainly, I expect CallVariants (or any variant-caller) to perform much better with error-corrected PacBio data.
I can't comment on general variant calling very much since this is just for high-depth mitochondria - but the next project is bacterial genomes with PacBio where SNVs should also be called, so maybe I can add more then.
Here is a very recent comparison of commonly used tools, using non-matched exome data.