How do people usually further filter structural variants (SVs) after calling them with multiple tools (Pindel, CNVnator, Delly, BreakDancer, Genome STRiP... you name it)?
I'd like to further prioritize SVs supported by at least two tools (with 70% reciprocal overlap). Does that sound reasonable? Intuitively I think so, because SV identification is still very noisy and technologically very challenging compared to SNP calling. I also saw a paper that did the same (http://www.nature.com/articles/srep18501). But if I'm correct, the 1000 Genomes Project "merged" the output of 9 algorithms to come up with a union.
I proposed the above to my professors (they are geneticists, not bioinformaticians, and have basically zero knowledge of NGS-based SV calling), and they thought "supported by at least two tools" and "70% reciprocal overlap" are simply too arbitrary. How would I explain this and convince them? Personally I think many procedures in SV calling and filtering are indeed very arbitrary, given the complex nature of structural variation and the challenges of NGS-based calling.
Here are some thoughts about this:
(1) The most accurate consensus approach is probably merging across sequencing technologies, such as merging an Illumina and a PacBio call set.
(2) If you have only Illumina data available, then calls reported by multiple orthogonal methods should be better (e.g., merging a read-depth call set with a paired-end call set instead of merging 2 paired-end call sets).
(3) Assuming you are doing germline SV calling in a larger cohort then (a) create a separate population call set using each caller, (b) assess the FDR of each call set and (c) merge only call sets that pass a pre-defined FDR threshold.
(4) I actually do agree with your professors that current merging procedures are ad hoc and not satisfactory but still commonly applied because they appear to show an accuracy advantage (e.g. see MetaSV, PMID: 25861968). The merging itself can possibly be improved: (a) Require a reciprocal overlap threshold and a max. breakpoint offset (b) For common SVs, you can require in addition genotype concordance (c) You can try a repeat-aware merging of SVs (see PMID: 25979471).
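To make the constraints in (4a) concrete, here is a minimal Python sketch of matching two call sets by reciprocal overlap plus a maximum breakpoint offset. The `SV` record, the function names and the 500 bp offset are assumptions chosen for illustration (only the 70% reciprocal overlap comes from the question), and it is not how any particular merging tool implements this.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls do not overlap."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def same_sv(a: SV, b: SV, min_ro: float = 0.7, max_offset: int = 500) -> bool:
    """True if both calls have the same type, enough reciprocal overlap,
    and both breakpoints within max_offset bp (500 is an assumed value)."""
    return (
        a.svtype == b.svtype
        and reciprocal_overlap(a, b) >= min_ro
        and abs(a.start - b.start) <= max_offset
        and abs(a.end - b.end) <= max_offset
    )


def intersect(calls_a, calls_b, min_ro: float = 0.7, max_offset: int = 500):
    """Calls from calls_a that are matched by at least one call in calls_b."""
    return [
        a for a in calls_a
        if any(same_sv(a, b, min_ro, max_offset) for b in calls_b)
    ]


# Made-up example calls, not real data.
paired_end = [SV("chr1", 1000, 2000, "DEL"), SV("chr1", 5000, 5300, "DEL")]
read_depth = [SV("chr1", 1100, 2100, "DEL"), SV("chr1", 5000, 9000, "DEL")]
supported = intersect(paired_end, read_depth)
```

In this toy example only the first paired-end deletion is kept: the second one lies entirely inside a much larger read-depth call, so its reciprocal overlap is far below 70%. The all-against-all comparison is quadratic and only meant for small call sets.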
Many thanks!
(1) Merging an Illumina and a PacBio call set: by "merge", do you mean intersection or union?
(2) I only have Illumina data. Again, for "merging a read-depth call set with a paired-end call set", do you mean intersection or union? Also, I ran CNVnator as the read-depth caller, but I found that Delly/BreakDancer share very few calls with CNVnator. (I did pre-filter the CNVnator output using a p-value threshold, though.)
(3) I'm not doing germline SV calling in tumor, but similarly I'm trying to find rare/novel SVs for neurological disease by filtering out common deletions from the 1000 Genomes Project. By the way, how do I assess the FDR? Using well-sequenced samples like NA12878? Or do you mean validating SVs in my own samples using array/PCR?
For (1) and (2) I meant the intersection, using a reciprocal overlap threshold and a max. breakpoint offset as constraints. For (3), an FDR estimate using your cohort of samples is better than using "only" NA12878, but not always feasible, of course.
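As a rough illustration of the FDR step in (3), the sketch below assumes you have validated a random subset of calls per caller (e.g. by PCR or array) and recorded each call as confirmed or not. The caller names, the validation outcomes and the 10% cutoff are all made up for the example.

```python
def estimate_fdr(validated):
    """Fraction of validated calls that were NOT confirmed.

    `validated` is a list of booleans, True if the call was confirmed.
    """
    if not validated:
        raise ValueError("need at least one validated call")
    false_positives = sum(1 for confirmed in validated if not confirmed)
    return false_positives / len(validated)


def callers_passing(validation, max_fdr=0.1):
    """Names of callers whose estimated FDR is at or below max_fdr
    (0.1 is an assumed, pre-defined threshold)."""
    return sorted(
        name for name, outcomes in validation.items()
        if estimate_fdr(outcomes) <= max_fdr
    )


# Hypothetical validation outcomes, not real data.
validation = {
    "caller_A": [True] * 19 + [False],     # 1 of 20 not confirmed
    "caller_B": [True] * 6 + [False] * 4,  # 4 of 10 not confirmed
}
passing = callers_passing(validation)
```

Here only `caller_A` would go into the merge. With so few validated calls per caller the estimate is very noisy, so in practice you would want a larger validation set, or at least a confidence interval on the FDR.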