Given a de novo assembly and a reference assembly, what methods have you tried / would you recommend for determining structural variations?
For the paper "An integrated map of structural variation in 2,504 human genomes", my small part was to validate complex structural variants in long-read TruSeq data.
That was about 3 years ago, so I'm not sure whether better approaches exist now. We assembled breakpoint contigs across putative SV breakpoints using Velvet.
Then I took the breakpoint contigs (Velvet generates many candidates) and used BLAT to align them to the reference genome.
Using the BLAT results I was able to parse out the precise breakpoints.
It's rather labor-intensive and I'm sure there's a better way of doing it, but this might be a good lead!
Probably too late to answer, but we developed SyRI, which identifies structural differences between two assemblies. It identifies structural rearrangements (inversions, transpositions, translocations, segmental (distal) duplications, tandem duplications) between assemblies. It also identifies syntenic (conserved) regions, as well as local variations (SNPs, indels, CNVs) in both rearranged and conserved regions, to provide a hierarchy of variations. You can read more about SyRI here and download the method from GitHub.
Hi Manish,
I suggest you make a Tool post about your tool, which should then include a description, use cases and maybe some example code, unless there is an extensive manual on GitHub that you can link.
That is probably a better way to make people aware of your tool than refreshing years-old threads.
Thank you for the answer. I really appreciate the detailed supplementary paper accompanying your paper. However, it doesn't seem to go into how BLAT was used. Could you please explain how you inferred breakpoints from the BLAT alignment?
It's been a long time and I have already used a combination of different methods for my purpose (similar to those of your collaborators), but I'm definitely interested in learning your method.
I attached this visual aid.
For the contig you generated across a breakpoint, you align it to the reference genome and look for alignments with high percent identity. You expect the sequence to match at close to 100%.
In this example the deletion on the right is evident because the breakpoint contig aligns to two non-contiguous parts of the reference. The number of reference base pairs between the end of the first aligned segment and the start of the second is the size of the deletion.
Using command line BLAT (download from UCSC genome browser under Tools) will give you output that makes parsing alignments easy.
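To make the deletion-size arithmetic concrete, here is a minimal Python sketch. It assumes BLAT's default PSL output (21 tab-separated columns, with `blockSizes`, `qStarts` and `tStarts` as comma-separated lists) and assumes the breakpoint contig is reported as a single hit with two blocks. The example line and the names `parse_psl_line` and `block_gaps` are made up for illustration; this is not the script used for the paper.

```python
from typing import List, NamedTuple, Tuple


class PslHit(NamedTuple):
    q_name: str
    t_name: str
    strand: str
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(field: str) -> List[int]:
    # PSL list columns end with a trailing comma, e.g. "100,100,"
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line: str) -> PslHit:
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    return PslHit(
        q_name=cols[9],
        t_name=cols[13],
        strand=cols[8],
        block_sizes=_int_list(cols[18]),
        q_starts=_int_list(cols[19]),
        t_starts=_int_list(cols[20]),
    )


def block_gaps(hit: PslHit) -> List[Tuple[int, int]]:
    """Return (reference_gap, contig_gap) between consecutive aligned blocks."""
    gaps = []
    for i in range(len(hit.block_sizes) - 1):
        size = hit.block_sizes[i]
        t_gap = hit.t_starts[i + 1] - (hit.t_starts[i] + size)
        q_gap = hit.q_starts[i + 1] - (hit.q_starts[i] + size)
        gaps.append((t_gap, q_gap))
    return gaps


# Hypothetical hit: a 200 bp contig whose two 100 bp halves align
# to chr1 at 1000-1100 and 1600-1700 (0-based PSL coordinates).
EXAMPLE_PSL_LINE = (
    "200\t0\t0\t0\t0\t0\t1\t500\t+\tcontig1\t200\t0\t200\t"
    "chr1\t50000\t1000\t1700\t2\t100,100,\t0,100,\t1000,1600,"
)

print(block_gaps(parse_psl_line(EXAMPLE_PSL_LINE)))  # [(500, 0)]
```

A large reference gap with little or no contig gap is what a deletion looks like here: 500 bp of reference is missing from the contig. If BLAT reports the two halves as separate hits instead of one gapped hit, the same subtraction applies to the end of one hit and the start of the next.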
Brilliant. Thank you for the explanation.
Quick question: could you use BLAST instead of BLAT? I'm wondering if there's a specific reason you choose BLAT.
I think BLAT works better for short sequences? You can also download a command-line version of BLAT. My PI uses it for primers, but I don't see why BLAST wouldn't work.
I think BLAT is faster too, no?
Also, BLAT output from the command line includes the number of aligned segments. A contig with exactly 1 segment is not an SV. Two segments is what a simple SV looks like, such as the deletion in the example above. More than 2 indicates a complex SV (DUP-INV-DUP, DEL-DUP, etc.). I found this really informative once I had written a script to parse the BLAT output (assuming you have a lot of breakpoints to test).
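A rough sketch of that kind of parsing script in Python, for anyone who wants a starting point. It assumes default PSL output, where column 1 is the match count, column 10 the query name and column 18 the block count, and it keeps only the hit with the most matches per contig. The labels and function names are my own, and treating exactly two blocks as a simple SV candidate is an assumption based on the deletion example rather than a rule from the original script.

```python
from typing import Dict, Iterable, Tuple


def classify_block_count(block_count: int) -> str:
    if block_count < 1:
        raise ValueError("PSL block count must be at least 1")
    if block_count == 1:
        return "no_sv"
    if block_count == 2:
        # Assumption: two blocks suggests a simple event such as a deletion.
        return "simple_sv_candidate"
    return "complex_sv_candidate"


def classify_best_hits(psl_lines: Iterable[str]) -> Dict[str, str]:
    """Label each query contig using the block count of its best hit."""
    best: Dict[str, Tuple[int, int]] = {}  # query name -> (matches, block count)
    for line in psl_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip the PSL header and anything that is not a 21-column data row.
        if len(cols) < 21 or not cols[0].isdigit():
            continue
        matches, q_name, block_count = int(cols[0]), cols[9], int(cols[17])
        if q_name not in best or matches > best[q_name][0]:
            best[q_name] = (matches, block_count)
    return {name: classify_block_count(bc) for name, (_, bc) in best.items()}
```

Usage would be something like `classify_best_hits(open("contigs.psl"))`, where `contigs.psl` is a placeholder file name. Small indels inside a contig also split an alignment into extra blocks, so in practice you would probably want to ignore gaps below some minimum size before counting segments.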
For my work (yeast) BLAST is actually much faster for some reason. But I didn't know that about alignment segments. When I tried BLAT out, I just formatted the output like BLAST (-out=blast8). Interesting!