What tools do you suggest for resolving large insertions?
7.5 years ago
novice ★ 1.1k

By large I mean 10 kbp or more. I have tried many of the popular SV tools, such as Breakdancer, Pindel, and Lumpy, and none of them can resolve large insertions. They also tend to miscount them as translocations: for example, an insertion of chr2 sequence into chr1 is reported as two translocations. I would appreciate any suggestions.

WGS SV • 2.8k views
7.5 years ago
d-cameron ★ 2.9k

Large insertion of sequence already in the reference, or large insertions of novel sequence (e.g. viral integration)?

Usually they falsely count them as translocations as well: for example, chr2 to chr1 insertion would be reported as two translocations.

I think you need to revise what you expect your SV callers to report. As the author of an SV caller (GRIDSS, which I highly recommend ;) ), I consider reporting such an event as two translocations more 'correct' than the equivalent insertion call, because it is the only way to report such an event in VCF without losing information. Using SVTYPE=INS loses the information that the inserted sequence comes from chr2.
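To make the "two translocations" representation concrete, here is a minimal Python sketch that writes out the breakend (BND) ALT strings for a forward-orientation insertion, following the breakend notation of the VCF specification. The function name, the coordinates, and the placeholder reference base "N" are illustrative assumptions; this is not the output of any particular caller.

```python
def insertion_as_breakends(target_chrom, target_pos,
                           source_chrom, source_start, source_end,
                           ref_base="N"):
    """Return (chrom, pos, alt) tuples for the four breakends that describe
    source_chrom:source_start-source_end inserted, in forward orientation,
    between target_pos and target_pos + 1 on target_chrom.

    ref_base is a placeholder; a real record carries the reference base.
    """
    return [
        # Junction 1: target sequence up to target_pos, then the source segment.
        (target_chrom, target_pos,
         f"{ref_base}[{source_chrom}:{source_start}["),
        (source_chrom, source_start,
         f"]{target_chrom}:{target_pos}]{ref_base}"),
        # Junction 2: end of the source segment, then the rest of the target.
        (source_chrom, source_end,
         f"{ref_base}[{target_chrom}:{target_pos + 1}["),
        (target_chrom, target_pos + 1,
         f"]{source_chrom}:{source_end}]{ref_base}"),
    ]


# Hypothetical example: 12 kbp of chr2 inserted after chr1:1000.
for chrom, pos, alt in insertion_as_breakends("chr1", 1000, "chr2", 5000, 17000):
    print(chrom, pos, alt, sep="\t")
```

Each junction appears as a pair of mated breakend records, so the source coordinates on chr2 are preserved, which a single SVTYPE=INS record would not do.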

Additionally, since your breakpoints are more than 10 kbp apart, an SV caller cannot phase the events together based purely on breakpoint support. Without phasing information, the actual event is ambiguous. It could be a duplication of 10 kbp of chr2 sequence now located on chr1, or it could be a balanced translocation between chr1 and chr2 with 10 kbp of sequence duplicated at the translocation site.

If all you've got is the breakpoints reported by the SV caller, chr1<->chr2(dup region)<->chr1 and chr1<->chr2(dup region)<->chr2 + chr2<->chr2(dup region)<->chr1 are both possible. Since both interpretations are consistent with the breakpoints detected, the 'correct' SV call is to report two translocations.


If you want to assume that all events that look like large insertions are indeed large insertions, then look for pairs of translocation events that occur at the same location on chrA with opposite orientations (technically, they will be 1 bp apart in VCF breakend notation, plus or minus any microhomology at the insertion site), and whose partner breakends lie near each other (in your example, 10+ kbp apart) with orientations indicating an insertion of that sequence.

If you're familiar with R, then my StructuralVariantAnnotation package can be used to convert a SV caller VCF to a GRanges object containing paired breakends. The logic to match pairs of translocations into putative insertion events can be done in a handful of lines of code.
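The pairing logic described above can be sketched in a few lines. The following is a Python illustration rather than R, and it does not use the StructuralVariantAnnotation API: the `Breakend` class, the strand convention ('+' meaning the retained reference sequence ends at the position, '-' meaning it starts there), and the `slop` and `max_len` defaults are all assumptions made for the example. It only handles forward-orientation insertions between different chromosomes.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Breakend:
    chrom: str
    pos: int
    strand: str  # '+': retained reference ends at pos; '-': it starts at pos


def putative_insertions(breakpoints, slop=10, max_len=1_000_000):
    """Pair breakpoints that together look like an insertion.

    breakpoints: iterable of (Breakend, Breakend) pairs.
    Returns tuples (site_chrom, site_pos, src_chrom, src_start, src_end).
    """
    # View every breakpoint from both of its ends.
    oriented = [(a, b) for a, b in breakpoints] + [(b, a) for a, b in breakpoints]
    calls = []
    for (site_l, src_l), (site_r, src_r) in permutations(oriented, 2):
        # Two breakends at (almost) the same site, facing each other.
        same_site = (
            site_l.chrom == site_r.chrom
            and site_l.strand == "+" and site_r.strand == "-"
            and abs(site_r.pos - site_l.pos - 1) <= slop  # allows microhomology
        )
        # Their partners delimit one segment on another chromosome.
        same_source = (
            src_l.chrom == src_r.chrom
            and src_l.chrom != site_l.chrom
            and src_l.strand == "-" and src_r.strand == "+"
            and 0 < src_r.pos - src_l.pos <= max_len
        )
        if same_site and same_source:
            calls.append((site_l.chrom, site_l.pos,
                          src_l.chrom, src_l.pos, src_r.pos))
    return calls


# Hypothetical calls: chr2:5000-17000 inserted after chr1:1000.
example = [
    (Breakend("chr1", 1000, "+"), Breakend("chr2", 5000, "-")),
    (Breakend("chr2", 17000, "+"), Breakend("chr1", 1001, "-")),
]
print(putative_insertions(example))  # [('chr1', 1000, 'chr2', 5000, 17000)]
```

Note that a match here still carries the ambiguity discussed in the answer: the same two breakpoints are also consistent with a balanced translocation plus duplication, so the result is a putative insertion, not a confirmed one.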


Fair points. The VCF information-loss issue is easily avoidable though by using a different format like BEDPE. If VCF should represent 1 breakpoint per line, maybe BEDPE can be restricted to 1 event (which can have 1 or more breakpoints) per line. I'm actually interested in both: accurately quantifying breakpoints but also categorizing them. With the tools I've tested, I usually end up with accurate estimates of deletions, inversions, and tandem duplications, but inflated estimates of translocations and ~0 insertions (as I'm sure you know, most tools rely on insert-size distribution, so their upper bound for size is much lower than 10kbp). I know these proportions because I test with simulated data. It's not that I'm assuming it's a 10kbp insertion from chr2 into chr1; I essentially cut and pasted the base pairs myself! That's why I get frustrated when tools repeatedly tell me it's two translocations. But I understand now why I should keep an open mind :)

And thank you for sharing your tool. Very excited to see an approach based on de novo assembly. I was actually trying to use local de novo assembly to resolve/ascertain SVs myself but kept removing true positives. I will be running GRIDSS shortly so expect more messages from me!


If you're expecting an insertion-style format, outputting BEDPE doesn't help: you would either have to report a pair of translocations (which VCF can already do losslessly) or resort to a custom file format.

If you're looking to detect novel sequence insertions, then you need a dedicated tool such as NovelSeq, or a fully de novo assembly-based method such as cortex-var, AsmVar, or LaSV.

I've done quite a bit of SV caller benchmarking and have simulation results that should give you an idea of how large a (novel) insertion event can be before various popular SV callers will stop calling it.


That is a very helpful benchmark, thank you. I have been looking for something similar. I ran GRIDSS on simulated data and quite enjoyed it (despite my reservations about Java). It is easy to use, fast, and pretty accurate. It initially produces many false positives, but they are easily filtered with quality and assembly conditions like the ones you suggest in the docs.


BTW, there's a fairly recent tool called SVelter that is capable of detecting large insertions (it labels the above example as a distant duplication). I haven't tested it extensively, but thought you might find it interesting.


inflated estimates of translocations

I've found that the majority of false positive translocations are due to sequence homology between the called locations. Looking at the extent of the homology around the putative translocation gives a good indication of whether the variant is a false positive. Note that this homology does not have to be exact: a few mismatched bases on either side are actually required for the aligner to systematically 'incorrectly' place the reads with non-zero mapq, which gives the SV caller a false positive signal for a putative translocation. GRIDSS reports the size of this inexact sequence homology in the custom IHOMLEN VCF INFO field, as well as the exact homology size in the standard HOMLEN VCF INFO field.
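As a rough illustration of using these fields, the sketch below flags calls whose IHOMLEN exceeds a cutoff. It parses a raw VCF INFO string by hand and assumes IHOMLEN holds a single integer; the function names and the 50 bp threshold are arbitrary assumptions for the example, not recommended values, and should be tuned against your own simulated data.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag fields map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def homology_length(info, key="IHOMLEN"):
    """Return the integer stored under key, or 0 if absent or not an integer."""
    value = parse_info(info).get(key)
    if isinstance(value, str) and value.isdigit():
        return int(value)
    return 0


def likely_homology_artifact(info, max_inexact_homology=50):
    """Flag a call whose inexact homology exceeds the (assumed) cutoff."""
    return homology_length(info) > max_inexact_homology


# Hypothetical INFO strings for two translocation calls.
print(likely_homology_artifact("SVTYPE=BND;HOMLEN=3;IHOMLEN=120"))
print(likely_homology_artifact("SVTYPE=BND;HOMLEN=3;IHOMLEN=4"))
```

Calls flagged this way are candidates for closer inspection rather than automatic removal, since true events can also occur in homologous sequence.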
