Finding and plotting gaps in a de novo assembly in R
6.6 years ago
alslonik ▴ 320

Hi,

I am working on a genome assembly and am currently trying to annotate and visualize the regions of consecutive Ns. I would like to see which regions of my newly assembled genome are gaps (runs of N, e.g. NNNNN).

I tried to do this with the letterFrequencyInSlidingView function from Biostrings, but it takes forever (more than half an hour) for a single scaffold, and I have many of them. Afterwards, I build a data frame from the output matrix and try to plot it in order to see the regions where Ns are consecutive. The plotting is just as slow.

The command I use is:

Freq_N_758 <- sapply(chromium.assembly["758"], letterFrequencyInSlidingView, 1, "N")

I am sure that I am missing an important point here and doing it the wrong way. What is the correct way to do this in R?

Many thanks, Alex

R genome assembly gaps • 1.7k views
6.6 years ago

Try something like this using the Biostrings Bioconductor package.

library(Biostrings)
x = DNAString("ACTGNNTTGGNNNNAACTGC")
y = maskMotif(x,'N')
z = as(gaps(y),"Views")
ranges(z)
as.data.frame(ranges(z))

The final output from above will be:

  start end width
1     5   6     2
2    11  14     4

This will run nearly instantaneously for pretty much arbitrary sizes of sequence.
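
To extend this to many scaffolds and get a quick plot, something along these lines might work. It is only a minimal sketch: the toy `assembly` object, the scaffold names, and the helper `find_gaps()` are made up for illustration and are not from the original post. A real assembly would presumably be loaded with `readDNAStringSet()` instead.

```r
library(Biostrings)

# Hypothetical toy assembly; replace with readDNAStringSet("your_assembly.fasta")
assembly <- DNAStringSet(c(
  scaf1 = "ACTGNNTTGGNNNNAACTGC",
  scaf2 = "NNNACGT",
  scaf3 = "ACGTACGT"
))

# Hypothetical helper: returns one row per run of consecutive Ns in a scaffold
find_gaps <- function(seq, scaffold) {
  # Scaffolds without any N get an empty table with the same columns
  if (sum(letterFrequency(seq, "N")) == 0) {
    return(data.frame(scaffold = character(0), start = integer(0),
                      end = integer(0), width = integer(0)))
  }
  # Same mask / gaps / Views steps as in the answer above
  runs <- as.data.frame(ranges(as(gaps(maskMotif(seq, "N")), "Views")))
  data.frame(scaffold = scaffold, runs[, c("start", "end", "width")])
}

# Collect the gaps of all scaffolds into a single data frame
gap_table <- do.call(rbind, lapply(names(assembly), function(nm) {
  find_gaps(assembly[[nm]], nm)
}))
print(gap_table)

# Base-graphics plot: one grey line per scaffold, red blocks where the gaps are
n <- length(assembly)
plot(NULL, xlim = c(1, max(width(assembly))), ylim = c(0.5, n + 0.5),
     xlab = "Position (bp)", ylab = "", yaxt = "n")
axis(2, at = seq_len(n), labels = names(assembly), las = 1)
segments(1, seq_len(n), width(assembly), seq_len(n), col = "grey")
y <- match(gap_table$scaffold, names(assembly))
rect(gap_table$start - 0.5, y - 0.2, gap_table$end + 0.5, y + 0.2,
     col = "red", border = NA)
```

Because only the start and end of each gap are stored, rather than one value per base, the table and the plot stay small even for large scaffolds.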

That's exactly what I needed. THANKS!


Feel free to "accept" the answer so that others know that the question is answered. : )
