I have an assembly of several scaffolds and a reference sequence. I'd like to align the scaffolds to the reference and determine the locations of the breakpoints (where indels, duplications, etc. happen). Ultimately, I'd like to have some table with regions and breakpoints, like 1-10k bp => 3 breakpoints; 10k-20k bp => 2 breakpoints; and so on. Is there a good software that would make this analysis easy for me?
Currently I'm trying to extract this data from a BLAST run– It's definitely a non-trivial task.
Thank you for the reply, Erik. The "count breakpoints in each bin" part doesn't seem to work. This is the error I got:
Although R is not my weapon of choice, this package seems useful. Though, I'm worried that you (and the package) say synteny, which is not really what I'm looking for. I'm interested in normal sequence alignments, like what BLAST gives. Finding where the endpoints of the scaffolds map to the reference is not what I'm looking for either, since those aren't necessarily breakpoints.
What I have right now is a script that parses BLAST's output and reports the endpoints of all the alignments, except the endpoints of the entire scaffold. I'm considering these coordinates the "breakpoints".
I appreciate your taking the time to answer so here's an upvote!
Edited the answer to fix your error with
hist()
. It sounds like I did not interpret your original question correctly. What you are doing is BLASTing a scaffold against a reference genome and then taking the endpoints of the BLAST alignment as the "breakpoints"?Yes, and I'm doing that for all scaffolds.
The script works now; thanks. Is there a way to plot this histogram for each chromosome (each sequence in the reference)?