I'm trying to detect structural variation (SV) from NGS data, specifically novel or rare SVs in disease samples. My research targets are common neurological diseases, not cancer, so there is no perfectly matched control to remove background. Because read mapping and SV calling are both noisy, I decided to include controls from the 1000 Genomes Project when running the SV callers, to remove as much background noise as possible before looking for rare or novel SVs.
I know 1000 Genomes provides a list of high-confidence SVs, but that list has been through stringent filtering and leaves many more complex SVs undetected; if I use only this list to filter for rare/novel variants, there will be many false positives.
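To make the filtering idea concrete, here is a minimal sketch of removing patient calls that are also seen in controls. It assumes calls have already been parsed into simple records; the `SV` tuple, the function names, and the 50% reciprocal-overlap threshold are my own illustrative choices, not something any particular caller prescribes.

```python
from collections import namedtuple

# Hypothetical minimal record for one SV call (not a format used by any caller).
SV = namedtuple("SV", ["chrom", "start", "end", "svtype"])

def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))

def filter_against_controls(patient_calls, control_calls, min_ro=0.5):
    """Keep patient calls that no control call matches at >= min_ro.

    min_ro=0.5 is an assumed threshold, chosen only for illustration.
    """
    return [
        p for p in patient_calls
        if not any(reciprocal_overlap(p, c) >= min_ro for c in control_calls)
    ]

# Toy data: the chr1 deletion is also present in a control, the chr2 duplication is not.
patient = [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 5000, 9000, "DUP")]
control = [SV("chr1", 1100, 2100, "DEL")]
rare = filter_against_controls(patient, control)
print(rare)
```

The all-against-all comparison is quadratic, so for real callsets an interval index or a sorted sweep would be needed; the sketch only shows the matching rule.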
Questions:
- How many controls should I use? Is it simply the more the better? I have 20 patient whole-genome sequences to run. Considering the BAM file sizes, I first tried only 10 low-coverage CEU WGS samples from 1000 Genomes.
- Many programs such as BreakDancer and Pindel support running on multiple files. Do these programs compute their statistics across all the files as a whole, or do they still compute statistics for each file separately and then merge the results?
- Control BAM files from 1000 Genomes could have different insert sizes and be mapped to a different version of the reference (hg19/g1k_37). Would that matter when I include these BAMs together with my patient BAM files for SV calling?
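For the insert-size point, one quick check I can run before mixing files is to compare the median and MAD of the template length per BAM. Below is a minimal sketch that reads SAM text lines (for example the output of `samtools view`); the function name and the `max_abs_tlen` cutoff are hypothetical, and this is only a sanity check, not what BreakDancer or Pindel do internally.

```python
import statistics

def insert_size_summary(sam_lines, max_abs_tlen=10000):
    """Return (median, MAD) of TLEN over properly paired reads, or None.

    max_abs_tlen is an assumed cutoff to ignore extreme outliers.
    """
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # SAM column 2: FLAG
        tlen = int(fields[8])             # SAM column 9: TLEN
        # 0x2 = properly paired; positive TLEN only, so each pair counts once.
        if flag & 0x2 and 0 < tlen <= max_abs_tlen:
            tlens.append(tlen)
    if not tlens:
        return None
    med = statistics.median(tlens)
    mad = statistics.median(abs(t - med) for t in tlens)
    return med, mad

def toy_pair(name, pos, tlen):
    """Build two toy SAM records (flags 99 and 147) for one proper pair."""
    mate_pos = pos + tlen - 4
    fwd = f"{name}\t99\tchr1\t{pos}\t60\t4M\t=\t{mate_pos}\t{tlen}\tACGT\tIIII"
    rev = f"{name}\t147\tchr1\t{mate_pos}\t60\t4M\t=\t{pos}\t{-tlen}\tACGT\tIIII"
    return [fwd, rev]

toy_sam = ["@HD\tVN:1.6"] + toy_pair("r1", 100, 300) + toy_pair("r2", 200, 310) + toy_pair("r3", 300, 290)
summary = insert_size_summary(toy_sam)
print(summary)
```

Running this per file (patients and controls) would at least show whether the libraries differ enough to be worth worrying about.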
Many thanks, Zev. I'm writing another post; my question is actually not about controls, but about the overall design of running multiple tools on multiple samples.
"Using multiple caller and merging callsets will really help" I'm wondering here by "merging" you mean overlap/intersect or do union? I guess do a union? But each caller alone will achieve high false positive, not to mention union...