Dear Biostars community,
A rather broad, theoretical question, so apologies in advance. I am a PhD student studying the genetic architecture of a rare disease using WGS data. As part of this I am looking at structural variation in my cohort of 1,500 patients with the disease and roughly 17,000 controls (all Europeans, as selected by PCA).
We have called the structural variants using Manta and Canvas, and for each patient there is a structural variant (SV) vcf.gz file that merges all the Manta and Canvas calls. These calls were made by the central consortium, although we also have access to the BAM files.
From my reading, it looks like one ultimately "calls" SVs by pruning and filtering down to the point where potential changes can be inspected in a genome viewer at a case-control level (I appreciate functional assays would then be needed to confirm any suspicions). To me this seems likely to miss potential biology, as well as being pretty tedious.
Question 1: what are acceptable filtering criteria for "rare" SVs? I was thinking of keeping calls with <0.001% allele frequency that pass basic QC, and taking it from there. In terms of merging "similar" calls, I was going to merge those that overlap by >50%.
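For what it's worth, here is a minimal sketch of the >50% overlap merging idea, assuming calls have already been reduced to simple (chrom, start, end, svtype) tuples rather than full VCF records (a real pipeline would parse those from the VCFs, e.g. with pysam, and would also want breakpoint-distance matching):

```python
# Minimal sketch of merging SV calls by reciprocal overlap.
# Assumes calls are hypothetical (chrom, start, end, svtype) tuples.

def reciprocal_overlap(a, b):
    """Fraction of overlap relative to the LONGER of the two intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    return (end - start) / max(a[1] - a[0], b[1] - b[0])

def merge_calls(calls, min_ro=0.5):
    """Greedy single-pass merge: collapse same-chromosome, same-type calls
    whose reciprocal overlap exceeds min_ro into one cluster."""
    clusters = []
    for chrom, start, end, svtype in sorted(calls):
        for c in clusters:
            if (c["chrom"] == chrom and c["svtype"] == svtype
                    and reciprocal_overlap((c["start"], c["end"]),
                                           (start, end)) > min_ro):
                c["start"] = min(c["start"], start)
                c["end"] = max(c["end"], end)
                c["n"] += 1  # count of calls supporting this cluster
                break
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end,
                             "svtype": svtype, "n": 1})
    return clusters
```

For example, two deletions at chr1:1000-2000 and chr1:1100-2100 have a reciprocal overlap of 0.9, so they collapse into one cluster, while a deletion at chr1:5000-6000 stays separate. The allele-frequency cut-off would then be applied to the merged clusters using the cohort counts.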
Question 2: are there non-visual methods to annotate and call SVs at a case-control level? SV-Int has been mooted, but it mainly focuses on non-coding regions (http://compbio.berkeley.edu/proj/svint/).
The sheer volume of SVs called at these patient numbers is vast, and a visual method seems a rather terrifying prospect.
Things I've tried so far:
- SURVIVOR (merges VCFs on nearby breakpoints - https://github.com/fritzsedlazeck/SURVIVOR) - doesn't work with zipped files, unfortunately
- SVtools - the merged VCFs with Manta and Canvas calls seem to upset it when using lmerge and lsort - will try to sort this out
Things I've looked into:
- SVE (https://github.com/TheJacksonLaboratory/SVE) - would need to run the BAMs from scratch, so I am keen to avoid it
- MAVIS (https://github.com/bcgsc/mavis) - seems promising, but I am not sure whether VCFs can be input into it
- This pipeline from the Hall group (https://github.com/hall-lab/sv-pipeline) - again seems promising, but a) it needs to start from scratch with BAM calls and b) the outputs would then be visualised at a case-control level
An approach that uses the existing VCFs (in zipped format) would be ideal.
Once again, if you've got this far, thank you for reading, and apologies for the long, rather theoretical question!
All the best
Omid
Dear Cameron,
Many thanks for the swift, honest and helpful reply. To answer one of your questions: 800 of the 1,500 definitely have a homogeneous genetic disease - this represents the largest such cohort ever assembled. The others potentially have a variant of it, or a phenotypically similar disease with a different genetic cause.
In regard to your other points, agreed! It already feels like one can spend a LOT of time exploring tools, glint in one's eye, seeking that code to solve the SV issue. It just feels very different coming from SNP data, where a P-value of <10^-(large number) is the way forward, and now it's: get it into BEDPE, have a look, and maybe that's a valid SV?
I think I'll start by coming up with a semi-sensible merge strategy and a way of pruning away noise (we have around 200 trios), and see what we find.
Many thanks again for taking the time to answer my questions.
All the best
Omid
I'll see what I can do with the VCFs I have (I see your lab has developed StructuralVariantAnnotation, which I will play around with).