Hi guys,
I've developed a tool that identifies the points at which insertions are made. This basically means, for various BAM files, I have various lists of coordinates at which I think insertions were made with various degrees of confidence. I'm trying to detect indels with sizes from around 10bp to a few kb.
What I'd like to do is actually figure out _what_ was inserted (the actual nucleotide sequence and how long it is). My question has a few parts.
How complicated is this? How difficult is it to determine the content of the insertion? I imagine for small insertions at least, one could use local de novo assembly, then compare the assembled sequence to the reference. But for insertions around read-length or longer, the insertion can't be reconstructed just from the reads in the area. You need to also find the orphan reads that are all/mostly the insertion sequence, and use those too. Right? Is this something I can implement myself without spending months on it, and do a good job? If not...
What are my options in terms of existing tools? I know there are lots of indel callers. But some of them, for long inerstions at least, don't even tell you what the insertion sequence is. Also, I've already done the work of determining where insertions were made; I don't need these areas to be re-identified, just reconstructed.
Thanks!
Thanks for the quick response! GRIDSS looks pretty cool—I think I'll try it! Just to be clear, what information (out of position, length, nucleotide content) does it report on insertions, and for what sizes?
For small insertions (less than around half the library fragment size), it reports the exact indel position and sequence. For longer variants, it reports the position and the nucleotide sequence of as much of the insertion as it could assemble from the anchoring reads (soft-clipped reads + unmapped reads with read mate mapped nearby). A large insertion on chr1 between bases 12345 and 12346 would look like:
Notice the
.
which is part of the ALT allele. This notation is covered in VCFv4.3 Section 5.4.9 "Single breakends" and indicates that the AT at position 12345-12346 is now ACGTGCGTCGG?????????????????TGTCATGCATGCAT.Breakend support is currently only on the dev branch. I'll be making an official versioned release with the capability within a week.
Sounds good. Thanks! Have you heard of Pamir (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/), btw?
I hadn't. Looks like it's definitely worth trying out before attempting to put something together yourself.