I have 20 whole-genome sequences for a certain complex disease and would like to look for rare or novel structural variants using different SV calling tools such as CNVnator, Pindel, and BreakDancer, as suggested by the 1000 Genomes Project. I'd also like to include at least another 20 control WGS samples from the 1000 Genomes Project, in order to remove background noise when calling SVs.
I'm familiar and experienced with each SV calling tool, but I'm now confused about how to build a pipeline that "integrates" these "multiple" tools for "multiple" samples.
The two pipelines I can think of are either tool-centric or sample-centric:
Pipeline 1 (tool-centric):
Run each SV tool separately, but on multiple samples at the same time (many programs now support multi-sample calling), which is good for increasing sensitivity. Ideally I would run the 20 patients + 20 controls together, but I don't think my disk space can hold that many big BAM files simultaneously. So my plan is to run three times, each time with 6 patients + 6 controls, and then merge the results.
Would such results be the same as running 20 patients + 20 controls together? Zev.Kronenberg said in another post of mine that most programs apply statistics on a per-library/sample basis, so it should be OK?
Once I get a VCF file containing information for multiple samples from each SV calling tool, how would I intersect or merge them to look for overlapping calls supported by multiple tools? Using vcf-merge? vcf-isec?
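One common way to decide whether two tools are reporting the same event is reciprocal overlap. The following is only a minimal sketch of that idea, not what vcf-merge or vcf-isec do: the `SV` record, the function names, the 50% threshold and the example coordinates are all made up for illustration, and it assumes each call has already been parsed into chromosome, start, end and type.

```python
from typing import NamedTuple


class SV(NamedTuple):
    # Hypothetical minimal record; real VCF entries carry far more fields.
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if no match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def supported_calls(calls_by_tool, min_ro=0.5, min_tools=2):
    """Keep calls seen by at least min_tools tools (the calling tool counts).

    Each tool's own copy of an event is reported; no deduplication is done.
    """
    kept = []
    for tool, calls in calls_by_tool.items():
        for call in calls:
            supporters = {tool}
            for other_tool, other_calls in calls_by_tool.items():
                if other_tool == tool:
                    continue
                if any(reciprocal_overlap(call, o) >= min_ro for o in other_calls):
                    supporters.add(other_tool)
            if len(supporters) >= min_tools:
                kept.append((call, sorted(supporters)))
    return kept


# Made-up example calls, one list per tool.
calls_by_tool = {
    "cnvnator": [SV("chr1", 1000, 2000, "DEL")],
    "pindel": [SV("chr1", 1100, 2050, "DEL")],
    "breakdancer": [SV("chr2", 500, 900, "DEL")],
}
result = supported_calls(calls_by_tool)
```

Here the chr1 deletion is kept for both CNVnator and Pindel, while the chr2 call with single-tool support is dropped. The threshold is a judgment call: read-depth callers tend to have fuzzier breakpoints than split-read callers, so a strict cutoff may miss true matches.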
Pipeline 2 (sample-centric):
Run each SV tool, but this time independently for each sample. For each sample, first prioritize a list of the most confident SV calls; then merge the different samples together.
In any case, I'm looking for high-confidence rare/novel SVs, which should be very few and need to pass very stringent filtering. So specificity is more important, at the cost of sensitivity.
But the question is: when merging high-confidence SV calls from each sample, I could very likely see:
Sample 1:
chr1 14657 DEL
Sample 2:
chr1 14569 DEL
They are the same call but with slightly different coordinates. How could I intersect them with all the VCF information retained? Using vcf-isec from vcftools?
I hope this is clear.
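One possible approach, sketched here under assumptions, is to cluster calls of the same type on the same chromosome whose positions fall within some tolerance, keeping every original record in the cluster so no per-sample information is lost. The function name `cluster_calls`, the tuple layout, the third sample and the 100 bp tolerance are all hypothetical choices for illustration.

```python
from collections import defaultdict


def cluster_calls(calls, max_dist=100):
    """Group (sample, chrom, pos, svtype) calls by chromosome and type, then
    chain calls whose position is within max_dist of the previous call."""
    groups = defaultdict(list)
    for call in calls:
        _sample, chrom, _pos, svtype = call
        groups[(chrom, svtype)].append(call)

    clusters = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda c: c[2])
        current = [ordered[0]]
        for call in ordered[1:]:
            if call[2] - current[-1][2] <= max_dist:
                current.append(call)  # close enough: treat as the same event
            else:
                clusters.append(current)
                current = [call]
        clusters.append(current)
    return clusters


calls = [
    ("Sample1", "chr1", 14657, "DEL"),
    ("Sample2", "chr1", 14569, "DEL"),
    ("Sample3", "chr1", 90000, "DEL"),  # made-up distant call for contrast
]
clusters = cluster_calls(calls)
```

With these inputs the Sample 1 and Sample 2 deletions land in one cluster (they are 88 bp apart) and the distant call stays on its own. Because each call is compared to the previous one in the chain, a long run of closely spaced calls can merge into one cluster, so the tolerance needs to be chosen with the callers' breakpoint precision in mind. Comparing only start positions is also a simplification; if end coordinates or lengths are available they should be checked too.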
Many thanks
Why are they the same?
Hi,
Very interesting question, as I myself am struggling to digest what is spit out by SV callers, especially when I want to find recurrent SVs. As you very well pointed out: how to intersect? I am not well versed in SVs and the way they 'behave'. Anyway, regarding the 'integration' effort, I came across this meta-caller for integration. I haven't given it a try though. I'm still trying to understand what Lumpy and Delly are saying.
Another issue is what to do when you find an SV near or over a repeat. Many of them are false positives, but some could be genuine. In this regard, Lumpy uses an exclusion list of regions with high coverage in CEPH samples, but I am not sure if this is enough to suppress potential false calls.
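For what it's worth, the same kind of exclusion can also be applied after calling. Below is a rough sketch, assuming the excluded regions are available as BED-style half-open intervals. The function names and all coordinates are invented for illustration, and this is not how Lumpy itself applies its list.

```python
def overlaps_excluded(chrom, start, end, excluded):
    """True if [start, end) overlaps any (chrom, start, end) excluded region."""
    return any(chrom == c and start < e and end > s for c, s, e in excluded)


def filter_calls(calls, excluded):
    """Split calls into those outside and those touching excluded regions."""
    kept, flagged = [], []
    for call in calls:
        chrom, start, end, _svtype = call
        if overlaps_excluded(chrom, start, end, excluded):
            flagged.append(call)
        else:
            kept.append(call)
    return kept, flagged


# Invented coordinates, for illustration only.
excluded = [("chr1", 10000, 20000)]
calls = [
    ("chr1", 14569, 15200, "DEL"),
    ("chr1", 50000, 52000, "DEL"),
]
kept, flagged = filter_calls(calls, excluded)
```

Flagging rather than discarding the overlapping calls keeps the possibly genuine ones available for manual review.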