Would anyone care to share their experience with variant calling in cancer genomics, using tumor-normal pairs to separate somatic from germline variants, especially indels?
I have been getting an unbelievably high number of "coding" germline indels after running the GATK somatic indel detector on tumor-normal samples. Even after fairly strict coverage filters on both normal and tumor, we get ~20-30 somatic coding small indels (which I can digest) but about 600 coding germline indels, ~50% of them frameshift!
These look convincingly "germline" when you check the coverage in the "normal" samples (to confirm germline events). I know this cannot be right and am trying to work out the reasons. Could it be:
- Alignment issues
- Contamination of the normal (less likely, as it is blood vs. paraffin tumor)
- Annotation version issues (I have rechecked and eliminated this cause)
Any help is appreciated. Thanks!
Additional info:
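In the normal, the fraction of reads supporting the called indel (consensus indel reads / total reads) is either ~40-50% or ~90-100%, with an average of ~60% over all indels. That is roughly what heterozygous and homozygous calls would look like. The numbers for the tumor are similar, so these do seem to be true germline events.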
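To make the read-fraction check above concrete, here is a minimal Python sketch. It assumes you have already pulled the indel-supporting read count and the total read count per site out of the caller output. The function names and the het/hom thresholds are illustrative assumptions, not anything defined by GATK.

```python
def indel_allele_fraction(indel_reads, total_reads):
    """Fraction of reads at a site that support the called indel."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return indel_reads / total_reads


def genotype_bucket(fraction, het_range=(0.3, 0.7), hom_min=0.85):
    """Rough label for an allele fraction.

    The thresholds are assumptions chosen for illustration only;
    tune them to your own coverage and error profile.
    """
    if fraction >= hom_min:
        return "hom-like"
    if het_range[0] <= fraction <= het_range[1]:
        return "het-like"
    return "ambiguous"


# Hypothetical (indel_reads, total_reads) pairs from a normal sample.
sites = [(45, 100), (95, 100), (10, 100)]
labels = [genotype_bucket(indel_allele_fraction(i, t)) for i, t in sites]
print(labels)
```

Sites that land in the "ambiguous" bucket in the normal are the ones I would look at first for alignment artifacts.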
By any chance are a lot of these indels close to repetitive sequences?
@GWW - not really. There are a whole lot in the non-coding regions that are close to repetitive sequences, but the ones I am talking about are smack in the middle of well-meaning exons. About 50% are small 3n (in-frame) indels and the other 50% are frameshifts.
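For anyone wanting to reproduce the 3n vs. frameshift split, a small Python sketch follows. It assumes each indel is available as a VCF-style REF/ALT allele pair; the function name and the example alleles are hypothetical. It only looks at the length change, so it ignores splice sites and anything else an annotation tool would consider.

```python
from collections import Counter


def indel_frame_effect(ref, alt):
    """Classify an indel by whether its length change is a multiple of 3."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("REF and ALT have equal length; not an indel")
    return "in-frame" if length_change % 3 == 0 else "frameshift"


# Hypothetical REF/ALT pairs for coding indels.
indels = [("A", "ATTT"), ("ACG", "A"), ("GCTA", "G"), ("T", "TC")]
tally = Counter(indel_frame_effect(ref, alt) for ref, alt in indels)
print(tally)
```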
What do your quality metrics look like? If they don't have a high quality score and good coverage, it's probably junk. See how many you have left if you use SNP quality cutoffs of 50 or 75.
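A quick way to try the cutoffs suggested above is sketched here in Python. It assumes the calls have already been parsed into records with a quality score and a depth. The field names `qual` and `depth`, the helper name, and the example values are all made up for illustration, and your caller's output may not report quality in this form.

```python
def count_passing(calls, min_qual, min_depth=0):
    """Count calls whose quality and depth both meet the given minimums."""
    return sum(
        1
        for call in calls
        if call["qual"] >= min_qual and call["depth"] >= min_depth
    )


# Hypothetical parsed indel calls.
calls = [
    {"qual": 30, "depth": 40},
    {"qual": 60, "depth": 25},
    {"qual": 80, "depth": 50},
    {"qual": 90, "depth": 8},
]

for cutoff in (50, 75):
    remaining = count_passing(calls, min_qual=cutoff, min_depth=10)
    print(f"quality >= {cutoff}: {remaining} calls remain")
```

If the germline coding indel count collapses at the higher cutoff, that points toward low-quality calls rather than real variants.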
And you ran an "indel realignment" step on both the tumor and normal BAMs?
@aaron - yes both files were run through local indel realignment.