Question

Choosing controls for structural variation detection

9.4 years ago

michealsmith ▴ 800

I'm trying to detect structural variation (SV) using NGS data, specifically to find novel or rare SVs in disease samples. My research targets are common neurological diseases, not cancer, so there is no perfectly matched control to remove background. Read mapping and SV calling are also noisy, so when running SV callers I decided to include controls from the 1000 Genomes Project to remove as much background noise as possible before looking for rare or novel SVs.

I know the 1000 Genomes Project provides a list of high-confidence SVs, but that list has been through stringent filtering, and many more complex SVs go undetected. If I use only this list to filter for rare/novel variants, there will be many false positives.
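One way to picture this kind of control filtering is a reciprocal-overlap test: a patient call is dropped when a control call of the same type on the same chromosome overlaps it by at least some fraction of both intervals. The sketch below is only illustrative. The `SV` tuple, the `filter_against_controls` helper, the coordinates, and the 50% threshold are all assumptions made for the example, not something a particular caller or the 1000 Genomes release prescribes.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions (0.0 when the calls cannot match)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def filter_against_controls(patient: List[SV], controls: List[SV],
                            min_ro: float = 0.5) -> List[SV]:
    """Keep patient calls that no control call matches at >= min_ro."""
    return [p for p in patient
            if not any(reciprocal_overlap(p, c) >= min_ro for c in controls)]


# Made-up example calls.
patient_calls = [
    SV("chr1", 1000, 2000, "DEL"),
    SV("chr2", 5000, 9000, "DUP"),
]
control_calls = [
    SV("chr1", 1100, 2100, "DEL"),  # shares 90% with the first patient call
    SV("chr2", 5000, 5500, "DUP"),  # covers only 12.5% of the second call
]
candidates = filter_against_controls(patient_calls, control_calls)
```

Here only the chr2 duplication survives, because the small control call covers too little of it to count as the same event. Interval overlap suits deletions, duplications, and inversions; insertions and translocations would need breakpoint-distance matching instead.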

Questions:

1. How many controls should I use? Presumably the more the better? I have 20 patient whole-genome sequences to run. Given the BAM file sizes, I first tried only 10 CEU low-coverage WGS samples from the 1000 Genomes Project.
2. Many programs, such as BreakDancer or Pindel, support running on multiple files. But do these programs apply their statistics to all the files as a whole, or to each file separately and then merge the results?
3. Control BAM files from the 1000 Genomes Project may have different insert sizes and be mapped to a different version of the reference (hg19 vs. g1k_37). Would that matter when I include these BAMs together with my patient BAM files to call SVs?

NGS SV 1000 genome project • 2.8k views

updated 6.9 years ago by Ram 45k • written 9.4 years ago by michealsmith ▴ 800

Ram · Answer 1 · 2016-02-18

9.4 years ago

Zev.Kronenberg 12k

1. More is better. Check whether you can find the Human Genome Diversity Project data; see "Global diversity, population stratification, and selection of human copy number variation". It is also a good idea to run CHM1 or other genomes that have PacBio SV calls, which lets you threshold on accuracy.
2. I can't speak for all programs, but calling SVs across many people together usually increases both the sensitivity and the false discovery rate. I like to call individuals separately, merge the calls, and then joint genotype. This ensures that a single call has enough support within a single diploid genome, and joint genotyping mitigates missed calls. See my workflow.
3. Most tools model insert size on a per-library basis, so you don't need to worry about it.

Now for some unsolicited advice: I really recommend calling with multiple tools. We use LUMPY, WHAM, Genome STRiP and Delly. Using multiple callers and merging the callsets will really help.
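As a rough illustration of what merging callsets from several callers can look like, the sketch below greedily clusters calls by reciprocal overlap and keeps clusters supported by a minimum number of distinct callers. This is an assumption about one reasonable way to do it, not the workflow referred to above. The `merge_callsets` helper, the tuple layout, the coordinates, and the 50% overlap threshold are hypothetical.

```python
from typing import Dict, List, Tuple

# (chrom, start, end, svtype, caller); coordinates and labels are made up
Call = Tuple[str, int, int, str, str]


def same_event(a: Call, b: Call, min_ro: float = 0.5) -> bool:
    """True when chrom and type agree and reciprocal overlap is >= min_ro."""
    if a[0] != b[0] or a[3] != b[3]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1])) >= min_ro


def merge_callsets(calls: List[Call], min_support: int = 2) -> List[Dict]:
    """Greedily cluster calls; keep clusters seen by >= min_support callers."""
    clusters: List[List[Call]] = []
    for call in sorted(calls, key=lambda c: (c[0], c[3], c[1], c[2])):
        # Compare each call against the first (seed) call of every cluster.
        for cluster in clusters:
            if same_event(cluster[0], call):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    merged = []
    for cluster in clusters:
        callers = sorted({c[4] for c in cluster})
        if len(callers) >= min_support:
            merged.append({
                "chrom": cluster[0][0],
                "start": min(c[1] for c in cluster),
                "end": max(c[2] for c in cluster),
                "svtype": cluster[0][3],
                "callers": callers,
            })
    return merged


calls = [
    ("chr1", 1000, 2000, "DEL", "lumpy"),
    ("chr1", 1050, 1980, "DEL", "delly"),
    ("chr1", 1020, 2100, "DEL", "wham"),
    ("chr3", 7000, 7600, "DUP", "delly"),  # seen by one caller only
]
union = merge_callsets(calls, min_support=1)
consensus = merge_callsets(calls, min_support=2)
```

With `min_support=1` the result is the union of all callers (two sites here); raising it toward the number of callers moves toward a strict intersection (one site here). Values in between trade sensitivity against false positives.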

updated 6.9 years ago by Ram 45k • written 9.4 years ago by Zev.Kronenberg 12k

Many thanks, Zev. I'm writing another post, because my real question is not about controls but about the overall design of running multiple tools on multiple samples.

9.4 years ago by michealsmith ▴ 800

"Using multiple callers and merging the callsets will really help": by "merging", do you mean taking the overlap/intersection or the union? I guess the union? But each caller alone already produces many false positives, and the union would have even more...

9.3 years ago by michealsmith ▴ 800