Question

Pipeline of structural variation calling using multiple tools for multiple samples

4

8.8 years ago

michealsmith ▴ 800

I have 20 whole-genome sequences from patients with a complex disease and would like to look for rare or novel structural variants (SVs) using several SV callers, such as CNVnator, Pindel, and BreakDancer, as suggested by the 1000 Genomes Project. I'd also like to include at least 20 control WGS samples from the 1000 Genomes Project, in order to filter out background noise when calling SVs.

I'm familiar with and experienced in each SV calling tool, but I'm confused about the pipeline for "integrating" these "multiple" tools across "multiple" samples.

I can think of two pipelines, either tool-centric or sample-centric:

Pipeline 1 (tool-centric): Run each SV caller separately, but on multiple samples at once (many programs now support multi-sample calling), which is good for increasing sensitivity. Ideally I would run all 20 patients + 20 controls together, but I don't think my disk space can hold that many large BAM files at once. So my plan is to run three batches, each with 6 patients + 6 controls, and then merge the results.

Would the results be the same as running all 20 patients + 20 controls together? Zev.Kronenberg said in another post of mine that most programs apply statistics on a per-library/per-sample basis, so this should be OK?
Once I have a multi-sample VCF file from each SV caller, how do I intersect or merge them to find overlapping calls supported by multiple tools? With vcf-merge? vcf-isec?

Pipeline 2 (sample-centric): Run each SV caller, but this time independently for each sample. For each sample, first prioritize a list of the most confident SV calls, then merge the samples together.

Either way, I'm looking for high-confidence rare/novel SVs, which should be very few and must pass very stringent filtering. So specificity matters more, at the expense of sensitivity.

But the question is: when merging high-confidence SV calls from each sample, I will very likely see:

Sample 1:

chr1 14657 DEL

Sample 2:

chr1 14569 DEL

These are the same call but with slightly different coordinates. How can I intersect them while retaining all the VCF information? With vcf-isec from VCFtools?

I hope this is clear.
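To make the merging problem concrete, here is a minimal Python sketch of one possible approach: group calls of the same type whose positions fall within a tolerance window, and keep every original record in the group. The `cluster_calls` function, the 100 bp `window`, the dict layout, and the third sample are all assumptions for illustration; they do not come from any particular tool.

```python
from collections import defaultdict

def cluster_calls(calls, window=100):
    """Group calls with the same chromosome and SV type whose positions
    are within `window` bp of the previous call in the cluster.
    The 100 bp default is an arbitrary assumption, not a recommendation."""
    by_key = defaultdict(list)
    for call in calls:
        by_key[(call["chrom"], call["svtype"])].append(call)

    clusters = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda c: c["pos"])
        current = [group[0]]
        for call in group[1:]:
            if call["pos"] - current[-1]["pos"] <= window:
                current.append(call)      # close enough: same event
            else:
                clusters.append(current)  # too far: start a new event
                current = [call]
        clusters.append(current)
    return clusters

calls = [
    {"sample": "Sample1", "chrom": "chr1", "pos": 14657, "svtype": "DEL"},
    {"sample": "Sample2", "chrom": "chr1", "pos": 14569, "svtype": "DEL"},
    # Hypothetical third call, added only to show a separate cluster.
    {"sample": "Sample3", "chrom": "chr1", "pos": 90000, "svtype": "DEL"},
]

clusters = cluster_calls(calls)
for cluster in clusters:
    print([c["sample"] for c in cluster], [c["pos"] for c in cluster])
```

Because each cluster holds the original records, any per-sample fields carried in those dicts survive the merge. A real pipeline would also need to compare end coordinates or lengths, which this sketch ignores.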

Many thanks

structural variation 1000 genome project • 4.3k views

updated 6.2 years ago by Ram 44k • written 8.8 years ago by michealsmith ▴ 800

0

Why are they the same?

8.7 years ago by H.Hasani ▴ 990

0

Hi,

Very interesting question, as I am myself struggling to digest what SV callers spit out, especially when I want to find recurrent SVs. As you rightly pointed out: how to intersect? I am not well versed in SVs and the way they 'behave'. Anyway, regarding the 'integration' effort, I came across this meta-caller for integration. I haven't tried it yet, though; I am still working out what Lumpy and Delly are telling me.

Another issue is what to do when you find an SV near or over a repeat. Many of these are false positives, but some could be genuine. Lumpy handles this with an exclusion list of regions that show high coverage in CEPH samples, but I am not sure this is enough to suppress potential false calls.
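As a rough illustration of how an exclusion list works as a filter, here is a minimal Python sketch. The `drop_excluded` function, the tuple layout, and all coordinates are made up for the example; this is not how Lumpy itself applies its list.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a

def drop_excluded(calls, excluded):
    """Keep only calls that do not touch any excluded region.
    calls:    list of (chrom, start, end, svtype)
    excluded: list of (chrom, start, end), BED-style half-open intervals"""
    kept = []
    for chrom, start, end, svtype in calls:
        in_excluded = any(
            chrom == ex_chrom and overlaps(start, end, ex_start, ex_end)
            for ex_chrom, ex_start, ex_end in excluded
        )
        if not in_excluded:
            kept.append((chrom, start, end, svtype))
    return kept

# Hypothetical exclusion region and calls, for illustration only.
excluded = [("chr1", 10000, 20000)]
calls = [
    ("chr1", 14569, 15200, "DEL"),   # inside the excluded region
    ("chr1", 50000, 52000, "DUP"),   # same chromosome, outside
    ("chr2", 12000, 13000, "DEL"),   # same coordinates, other chromosome
]

kept = drop_excluded(calls, excluded)
print(kept)
```

Note that this kind of filter drops every call touching an excluded region, genuine or not, which is exactly the trade-off raised above.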

updated 6.2 years ago by Ram 44k • written 8.7 years ago by Amitm ★ 2.3k

score 8 · Answer 1 · 2016-05-10

This is the wild west of SV detection. There are too many callers, each with advantages and disadvantages, but very few SV prioritization programs.

Each group does prioritization a little differently. I can speak for our pipeline, which was recently described in this publication: Frequency and Complexity of De Novo Structural Mutation in Autism

Specifically, we tend to collapse CNV positions if they overlap reciprocally by at least 90%. I use an algorithm that finds the "median" CNV and, for each overlapping CNV in other samples, just uses the "median" CNV's positions.
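For readers who want to see the idea in code, here is a minimal Python sketch of reciprocal-overlap collapsing. It is not our actual pipeline code: the greedy clustering, the choice of median start and median end as the "median" CNV, and the names `reciprocal_overlap` and `collapse` are all assumptions made for illustration, as are the sample intervals.

```python
from statistics import median

def reciprocal_overlap(a, b):
    """Smaller of (overlap / len(a)) and (overlap / len(b)) for two
    (start, end) intervals; 0.0 if they do not overlap."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def collapse(cnvs, min_ro=0.9):
    """Greedy sketch: each CNV joins the first cluster whose current
    median interval it overlaps reciprocally by >= min_ro.
    Returns a list of ((median_start, median_end), members)."""
    clusters = []  # each cluster is a list of (start, end)
    for cnv in sorted(cnvs):
        for members in clusters:
            med = (median(s for s, _ in members), median(e for _, e in members))
            if reciprocal_overlap(cnv, med) >= min_ro:
                members.append(cnv)
                break
        else:
            clusters.append([cnv])
    return [
        ((median(s for s, _ in m), median(e for _, e in m)), m)
        for m in clusters
    ]

# Hypothetical CNVs (start, end) on one chromosome, one per sample.
cnvs = [(1000, 2000), (1010, 2005), (990, 1995), (5000, 6000)]
collapsed = collapse(cnvs)
for positions, members in collapsed:
    print(positions, len(members))
```

With these made-up intervals, the first three calls collapse to a single CNV at the median positions and the fourth stays on its own. The result of a greedy pass can depend on input order, so treat this only as a starting point.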

We prioritize positions based on the type of calling algorithm. Lumpy and Manta use discordant paired-ends, so their positions are more accurate than those from ForestSV, which uses coverage and a sliding window to call CNVs. (ForestSV has one of the best duplication sensitivities; I don't know why people don't use it more.)

We also used genotyping algorithms (SVtyper and gtCNV) to remove poorly genotyped overlapping calls (e.g. those with 50%-89% reciprocal overlap).

Generally speaking, when you report a final call set you should not have any overlapping CNVs within the same sample, though there may be overlapping alleles in your genotype matrix.
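A quick sanity check for that last point could look like the Python sketch below. The function name `within_sample_overlaps`, the tuple layout, and the example calls are hypothetical; the check flags CNVs that overlap within one sample while allowing overlaps between different samples.

```python
from collections import defaultdict

def within_sample_overlaps(calls):
    """Return (sample, chrom, earlier, later) for every CNV that overlaps
    an earlier CNV in the same sample and chromosome.
    calls: list of (sample, chrom, start, end), half-open intervals."""
    by_sample = defaultdict(list)
    for sample, chrom, start, end in calls:
        by_sample[(sample, chrom)].append((start, end))

    conflicts = []
    for (sample, chrom), intervals in sorted(by_sample.items()):
        intervals.sort()
        furthest = intervals[0]  # earlier interval reaching furthest right
        for current in intervals[1:]:
            if current[0] < furthest[1]:
                conflicts.append((sample, chrom, furthest, current))
            if current[1] > furthest[1]:
                furthest = current
    return conflicts

# Hypothetical final call set, for illustration only.
calls = [
    ("S1", "chr1", 100, 200),
    ("S2", "chr1", 150, 250),   # overlaps S1's call: fine, different sample
    ("S1", "chr1", 180, 300),   # overlaps S1's own call: should be flagged
]

conflicts = within_sample_overlaps(calls)
print(conflicts)
```

An empty list means the call set passes the check; anything returned still needs to be collapsed or resolved by genotyping.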