I am analyzing the specificity of Pindel's indel calls. I have matched tumor/normal WGS data, aligned with BWA. To filter out germline calls, I removed every original call that had supporting reads in the normal sample. I then tiered the remaining calls by whether or not they fall in coding regions, and the calls in coding regions were validated using an orthogonal sequencing method.
Is there a short list of easily determined metrics to check for correlation with false-positive calls? I am considering gathering a set of metrics on these calls and feeding them all into a machine learning tool like Weka to see if it finds anything, but I would like to include as many meaningful data points as possible.
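Before reaching for Weka, a quick first pass could be to rank each metric by its correlation with the validation outcome. The following is only a minimal sketch under stated assumptions: the metric names (`tumor_support_reads`, `indel_length`, `homopolymer_run`), the `false_positive` label key, and the values are all invented for illustration and do not come from Pindel's output or from my data. It computes the Pearson correlation between each numeric metric and a 0/1 false-positive label (the point-biserial correlation) and sorts by absolute value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_metrics(calls, label_key="false_positive"):
    """Rank metrics by |correlation| with the 0/1 label, strongest first."""
    metric_names = [key for key in calls[0] if key != label_key]
    labels = [1.0 if call[label_key] else 0.0 for call in calls]
    scores = {
        name: pearson([float(call[name]) for call in calls], labels)
        for name in metric_names
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy, made-up calls: one dict per validated indel call (hypothetical metrics).
calls = [
    {"tumor_support_reads": 12, "indel_length": 1, "homopolymer_run": 2, "false_positive": False},
    {"tumor_support_reads": 3, "indel_length": 2, "homopolymer_run": 8, "false_positive": True},
    {"tumor_support_reads": 15, "indel_length": 4, "homopolymer_run": 1, "false_positive": False},
    {"tumor_support_reads": 4, "indel_length": 1, "homopolymer_run": 7, "false_positive": True},
    {"tumor_support_reads": 9, "indel_length": 3, "homopolymer_run": 3, "false_positive": False},
    {"tumor_support_reads": 2, "indel_length": 2, "homopolymer_run": 9, "false_positive": True},
]

ranking = rank_metrics(calls)
for name, r in ranking:
    print(f"{name}\t{r:+.3f}")
```

A linear correlation like this only catches monotonic, single-metric effects; interactions between metrics would still need a classifier, so this is a screening step rather than a replacement for the Weka idea.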
Perhaps try bam-readcount, which can report several read-level metrics at each position: http://hpc.mskcc.org/compute-accounts/account-request/
On the machine learning side, perhaps try DeepVariant: https://github.com/google/deepvariant
I moved this to a comment as neither of the two suggestions is directly related to the question.