For this (sorted) bam and reference file
http://saf.bio.caltech.edu/pub/pickup/scf7180000354095.bam
http://saf.bio.caltech.edu/pub/pickup/scf_7180000354095.fasta
(these will not be here forever, so don't expect the links to work if you come to this thread years from now) there are columns which freebayes, bcftools, and Platyplus all refuse to call which obviously contain variants. For instance, column 178 is all C or T, about 50:50. Doing it this way:
samtools index scf7180000354095.bam
samtools faindex scf_7180000354095.fasta
samtools mpileup -ugf scf_7180000354095.fasta scf7180000354095.bam \
| bcftools call -vmO z -o bcftools.vcf.gz
Platypus.py callVariants --refFile=scf_7180000354095.fasta \
--bamFiles=scf7180000354095.bam \
--nCPU=20 --minReads=1 --mergeClusteredVariants=1 --getVariantsFromBAMs=1 \
--output=platypus.vcf
freebayes -f scf_7180000354095.fasta -C 5 -p 2 -F 0.04 \
scf7180000354095.bam >scf7180000354095_freebayes.vcf
None of them call column 178. However....
cp scf_7180000354095.fasta scf_7180000354095_test.fasta
nedit scf_7180000354095_test.fasta
#change C->T at 178
samtools faidx scf_7180000354095_test.fasta
samtools mpileup -ugf scf_7180000354095_test.fasta scf7180000354095.bam \
| bcftools call -vmO z -o bcftoolsB.vcf.gz
Platypus.py callVariants --refFile=scf_7180000354095_test.fasta \
--bamFiles=scf7180000354095.bam \
--nCPU=20 --minReads=1 --mergeClusteredVariants=1 --getVariantsFromBAMs=1\
--output=platypusB.vcf
freebayes -f scf_7180000354095_test.fasta -C 5 -p 2 -F 0.04 \
scf7180000354095.bam >scf7180000354095_freebayesB.vcf
and now they all call column 178.
There must be some common logic about what is and isn't considered a variant that these share, and it somehow depends critically on the base in the column in the reference sequence. What is this logic, and why is it the default?
More to the point, how does one force each of these to emit all the columns which have this issue, without having to physically change the reference sequence?
This may or may not be relevant, but the sample is a bit peculiar in that it represent a region with what appears to be ~3 copies of a small family of ~4 similar sequences. The organism is diploid and all ~8 copies are about equal edit distance away from each other. The assembly is suspect and the reference sequence is definitely a mix, with each position being "most like" the sequence in that position, but not exactly like it. This makes the read positions suspect as well. The end result is that there are up to ~8 different types of reads mapped in various positions onto the reference. The VCF would help in sorting this out, but to do so all the variants need to be in it, and that isn't what is happening.
Thanks.
For testing, I ran BBMap's variant caller, and it also ignores that variant. Why? It only appears once in a read with mapq>3; reads with mapq 3 or under are by default ignored for variant-calling, since there is no confidence that they are mapped to the correct location. Also, all reads it appears in are on the plus strand, though that does not affect the scoring in this case since it's near the end of a contig so some strand bias is expected.
You can force it to appear with BBMap's variant-caller like this:
Out of curiosity, does BBMap act like the other three programs, calling column 178 if that one base is changed in the reference?
The mapped reads were corrected with Quorum. That was needed for the MaSuRCA assembly, perhaps it is a mistake to use the corrected reads, rather than the raw ones, in this subsequent step. It may be that some of these closely related reads have been miscorrected from one form to another in some columns.
Consider also column 1014, also not called. C or G. The G form is associated with (at least) these other changes in the reads I checked: 965:A, 967:G, 1018:C, 1029:G, 1060:T (1018 has A,C,T.) Platypus finds only 1060:T->:G (it also finds 1018:A->T, missing the C variant), bcftools finds 1018 A->C and 1060:T->G, and freebayes finds 1018 A->C,T and 1060:T->G.
If the base is changed in the reference, there will be 15/16 reads with mapq>3 aligned to that location indicating a non-ref allele, so yes, it should call a variation there if the ref is changed.
Error-correction is tricky with variant-calling on Illumina reads. It should be fine with unamplified high coverage in nonrepetitive parts of a genome, but once you get lots of copies, I think using the uncorrected reads is better - adapter-trimmed, quality-recalibrated, and deduplicated, for best results.
With one small tweak, that worked very well. Changed
to
to keep the spurious miscalls out of the variant file. Where are the various filters documented? Is there a percentage based equivalent of minreads? The desired acceptance condition for variants is "max(filter_fraction*number_of_reads_in_column, observed_variants_in_column) >= 5".
Thanks!
The only documentation so far is in the shellscript, which has a one-line description of each field, but I'll write more complete documentation soon. It does not have the power to do advanced filtering like you're describing, but you can adjust both minreads and minallelefraction (if you use both, only variants passing both filters will be retained). For a very specific filter you may want to generate the full VCF file with 'clearfilters' and postprocess it with custom code.
see also: Why Does Mpileup Skip My Mutation ?
Tried this
and it still skipped column 178. -E didn't help either. Changing m to c in bcftools also didn't do it. This is samtools 1-4, by the way.
just to be sure: I've plotted the BAM around the mutation:
Yes, that is the right region. (There are many others.)