Software for finding breakpoints along the genome
1
0
Entering edit mode
8.9 years ago
novice ★ 1.1k

I have an assembly of several scaffolds and a reference sequence. I'd like to align the scaffolds to the reference and determine the locations of the breakpoints (where indels, duplications, etc. happen). Ultimately, I'd like to have some table with regions and breakpoints, like 1-10k bp => 3 breakpoints; 10k-20k bp => 2 breakpoints; and so on. Is there a good software that would make this analysis easy for me?

Currently I'm trying to extract this data from a BLAST run– It's definitely a non-trivial task.

breakpoints SV alignment • 2.7k views
ADD COMMENT
2
Entering edit mode
8.8 years ago
Erik Wright ▴ 420

You can map syntenic regions using the R package DECIPHER. I will assume that your goal is to find the start and endpoints of each scaffold on a (closely related) reference genome. The process would look something like this:

library(DECIPHER)

# connect to an in-memory database
db <- dbConnect(SQLite(), ":memory:")

# load the sequences from each FASTA file
Seqs2DB("<<PATH TO SCAFFOLDS>>", "FASTA", db, "Scaffolds")
Seqs2DB("<<PATH TO REFERENCE GENOME>>", "FASTA", db, "Ref")

# find the syntenic regions w/o using 6-frame translations
synteny <- FindSynteny(db, useFrames = FALSE)

# look at the start and end on the reference:
synteny[2, 1][[1]][, c("start2", "end2")]

# count breakpoints in each bin
h <- hist(synteny[2, 1][[1]][, c("start2", "end2")],
          breaks = seq(from = 0,
                       to = max(synteny[1, 1][[1]],
                                synteny[2, 2][[1]]) + 1e4,
                       by = 1e4))
h$counts # number of endpoints per 10k nucleotides

dbDisconnect(db)

This will display the number of start and endpoints on the reference genome per 10k nucleotides. Hope that helps!

ADD COMMENT
0
Entering edit mode

Thank you for the reply, Erik. The "count breakpoints in each bin" part doesn't seem to work. This is the error I got:

Error in hist.default(synteny[2, 1][[1]][, c("start2", "end2")], breaks = seq(from = 0,  : 
  some 'x' not counted; maybe 'breaks' do not span range of 'x'

Although R is not my weapon of choice, this package seems useful. Though, I'm worried that you (and the package) say synteny, which is not really what I'm looking for. I'm interested in normal sequence alignments, like what BLAST gives. Finding where the endpoints of the scaffolds map to the reference is not what I'm looking for either, since those aren't necessarily breakpoints.

What I have right now is a script that parses BLAST's output and reports the endpoints of all the alignments, except the endpoints of the entire scaffold. I'm considering these coordinates the "breakpoints".

I appreciate your taking the time to answer so here's an upvote!

ADD REPLY
0
Entering edit mode

Edited the answer to fix your error with hist(). It sounds like I did not interpret your original question correctly. What you are doing is BLASTing a scaffold against a reference genome and then taking the endpoints of the BLAST alignment as the "breakpoints"?

ADD REPLY
0
Entering edit mode

Yes, and I'm doing that for all scaffolds.

The script works now; thanks. Is there a way to plot this histogram for each chromosome (each sequence in the reference)?

ADD REPLY

Login before adding your answer.

Traffic: 2215 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6