Hi all!
We have a working NGS variant-calling pipeline for cancer, which we recently ran on a sample.
A certain mutation was found, specifically at chr3:178936095 (hg19). As part of the confirmation process, the lab ran Sanger sequencing on the mutation region and found that a 2-base deletion plus a 1-base insertion with respect to hg19 had occurred at a nearby position, chr3:178936116, and that the mutation we found appears to be a false positive.
Looking through the VCF files, I couldn't find the deletion+insertion at that position, nor could I see it when I checked the BAM file.
We're using:
- BWA MEM as the aligner. Our command is:
bwa mem -t 12 -R <read groups info> <ref genome> <fastq files> > aligned.sam
- Realigning around indels using GATK.
- Calling variants with FreeBayes, GATK and bcftools (a rough sketch of the realignment and calling commands is below this list).
- Currently there's no filter based on read quality other than the default one in bwa mem.
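For completeness, the downstream steps look roughly like the commands below (simplified: sorting/duplicate-marking is omitted, file names are placeholders, and the exact versions and arguments in our pipeline differ slightly):

# Realignment around indels (GATK 3.x style)
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I sample.bam -o realign.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I sample.bam -targetIntervals realign.intervals -o realigned.bam

# Variant calling with the three callers
freebayes -f ref.fa realigned.bam > freebayes.vcf
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -I realigned.bam -o gatk.vcf
bcftools mpileup -f ref.fa realigned.bam | bcftools call -mv -Ov -o bcftools.vcf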
All three callers agreed on the mutation we found in the NGS data.
This has led me to think that the reads supporting the deletion+insertion were never aligned from the .fastq files, either because they were too different from the reference or because they received a lower alignment score.
My question is: has something similar ever happened to you? Is there a parameter/argument I can pass to BWA MEM so that those missing reads get aligned? Otherwise, can you recommend a program/pipeline that would call this deletion+insertion successfully?
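For reference, this is roughly how I have been inspecting the region, and what a more permissive re-alignment might look like (the window coordinates and the relaxed penalty values are just guesses on my part, not settings I know to be right):

# Look at what bwa mem actually placed around the Sanger-confirmed indel,
# paying attention to soft clips in the CIGAR strings (sorted, indexed BAM).
samtools view aligned.sorted.bam chr3:178936080-178936150 | less -S

# Count reads that were written to the BAM but flagged as unmapped;
# these never contribute to variant calling.
samtools view -c -f 4 aligned.sorted.bam

# Re-align the same FASTQs with lower mismatch (-B), gap-open (-O) and
# clipping (-L) penalties and a lower output score threshold (-T), so that
# reads carrying the 2-base deletion + 1-base insertion are less likely
# to be dropped or heavily clipped.
bwa mem -t 12 -B 2 -O 3,3 -L 2,2 -T 20 -R <read groups info> <ref genome> <fastq files> > relaxed.sam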
Thank you very much in advance,
Alon
Was this a germline or somatic mutation?
We're not sure, the only sample was from the tumor.
You do not provide much detail, you would probably get more helpful replies if you provide a better description of your analysis. Do you perform realignment around indels? You are aligning with bwa mem, but how are you calling the variants? Do you perform quality filtering on your NGS variants?
You're right, thanks. I've edited my question, hopefully now the info is sufficient.
I developed an analysis pipeline that has 100% sensitivity with Sanger. It's currently being used in a laboratory of the National Health Service England.
100% agreement sounds suspicious (no offence intended). Is this pipeline publicly accessible? It would be great to learn from well-tested procedures that were validated on something other than the standard test datasets like NA12878.
EDIT: If it is not public, could you tell which variant caller it uses as a workhorse?
Yes, it is quite suspicious - when we presented the results at conferences (I presented in Manchester, England, and Dundee, Scotland), there was usually silence. Whilst it's got 100% agreement in the sense that it detects all Sanger-confirmed variants when the sample passes QC, it also calls variants not encountered by Sanger, which could be false positives or variants that Sanger missed.
We originally just used the GATK whilst validating/researching, but had to switch variant callers due to GATK licensing issues before going live. So now it's just open-source samtools. The key (on top of good QC) is simply the 'random' selection of reads from your final aligned BAM to produce multiple 'sub-BAMs', from each of which variants are then called independently. What you'll always find is a high-quality variant called in a lower-depth sub-BAM that was seemingly missed in the original BAM. Try it out.
By producing sub-BAMs in this way, you increase your chances of calling all genuine variants. Unfortunately, when you do this, you're also calling variants from lower read-depths, so QC is always critical. We never called anything from below total read depth 18 (~97% sensitivity to Sanger at this depth), with the optimum total depth being 30 (100% sensitivity to Sanger).
To 'randomly' select reads as we did/do, one can use Picard DownsampleSam, for example:
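A single sub-BAM round looks roughly like this (file names, the sampling fraction and the seed shown here are illustrative, not our production settings):

# Draw a 'random' subset of reads into one sub-BAM; repeat with different
# RANDOM_SEED values to produce several sub-BAMs.
java -jar picard.jar DownsampleSam I=final.bam O=sub1.bam P=0.7 RANDOM_SEED=1
samtools index sub1.bam

# Call variants on each sub-BAM independently with samtools/bcftools.
bcftools mpileup -f ref.fa sub1.bam | bcftools call -mv -Oz -o sub1.vcf.gz

# Apply the depth QC (we never called anything below total depth 18).
bcftools view -i 'INFO/DP>=18' sub1.vcf.gz > sub1.filtered.vcf

Any variant that passes QC in a sub-BAM but is absent from the calls on the full BAM is then worth following up.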
[Edit: I never properly researched how/why seemingly blatant variants were not being called, but I put it down to genotype prior probabilities, i.e., the prior probabilities of calling a variant at each position, which are influenced by read depth. When you produce sub-BAMs, you 'reshuffle' the reads and give yourself a new chance of calling.]
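In case it is useful, the generic relationship I have in mind (this is plain Bayes' rule per site, not the exact samtools or GATK model) is:

P(G \mid D) = \frac{P(D \mid G)\,P(G)}{\sum_{G'} P(D \mid G')\,P(G')}

where G is a candidate genotype, D is the pileup of reads at the site, P(G) is the prior and P(D \mid G) is the likelihood computed from the reads. Sub-sampling changes D, and hence the likelihood, so the posterior can clear the calling threshold in a sub-BAM even when it did not in the full BAM.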
I am a little confused about what you said in the EDIT.
The prior probability, P(Genotype), is defined from earlier studies such as 1KG (1000 Genomes) or dbSNP, and that probability is fixed (though of course its value depends on the read depth in that VCF file).
When using a sub-BAM instead of the entire BAM, the read depth / allele frequency of your current sample changes. That changes the likelihood P(Data|Genotype) and therefore the final posterior P(Genotype|Data).
Is this what we are talking about?
There are 2 kinds of prior / posterior probabilities here:
I am referring to the first type in my posting (above). As mentioned, I never had time to explore this further.
Yes, the first type is what I was talking about. It is a fixed probability depending on which reference you use.
I am not sure about the second type you mention. In my understanding, GATK also uses the first type as the prior. That is why the --dbsnp argument is used in HC / Mutect.