R Loop In Biostrings
3
0
Entering edit mode
13.2 years ago
John Clark ▴ 30

Hi Guys

I could not find way to apply the following loop to the genome scaffolds:

peachgenome<- readFASTA(url("http://www.rosaceae.org/sites/www.rosaceae.org/files/Prunus_persica.main_genome.scaffolds.fasta"))
myobs <- paste("peachgenx$scaffold_", 1:8, sep = "")
TATA = "TATAAAAA"
require(Biostrings)
sapply(myobs, function (x) { countPattern(TATA, x, max.mismatch = 1) })

Gives me just zero:

peachgenx$scaffold_1 peachgenx$scaffold_2 peachgenx$scaffold_3 
                   0                    0                    0 
peachgenx$scaffold_4 peachgenx$scaffold_5 peachgenx$scaffold_6 
                   0                    0                    0 
peachgenx$scaffold_7 peachgenx$scaffold_8 
                   0                    0

However individual code works:

countPattern (TATA, peachgenx$scaffold_1, max.mismatch = 1)
r • 4.6k views
ADD COMMENT
11
Entering edit mode
13.2 years ago
Martin Morgan ★ 1.6k

Read the sequences into a DNAStringSet

library(Biostrings)
url <- "http://www.rosaceae.org/sites/www.rosaceae.org/files/Prunus_persica.main_genome.scaffolds.fasta"
peachgenome <- read.DNAStringSet(url)

then do a 'vectorized' count.

vcountPattern("TATAAAA", peachgenome)

This takes about 6 seconds for your 8 strings, or for all of them(!)

> system.time(x <- vcountPattern("TATAAAA", peachgenome[1:8], max.mismatch=1))
   user  system elapsed 
  6.139   0.004   6.148 
> x
[1] 250652 136713 115696 153960  95002 153177 117959 113009
> system.time(x <- vcountPattern("TATAAAA", peachgenome, max.mismatch=1))
   user  system elapsed 
  6.376   0.008   6.391
ADD COMMENT
5
Entering edit mode
13.2 years ago

You do not use sapply like you did above. The construct is much simpler:

sapply(peachgen,function(x) {countPattern(TATA,x,max.mismatch=1)})
ADD COMMENT
3
Entering edit mode
13.2 years ago
Neilfws 49k

This line:

myobs <- paste("peachgenx$scaffold_", 1:8, sep = "")

creates a vector of strings:

"peachgenx$scaffold_1" "peachgenx$scaffold_2" ... "peachgenx$scaffold_8"

You are then using sapply/countPattern to match TATA to those strings, not to the list of sequences. Unsurprisingly, the count = 0, since the string "peachgenx$scaffold_1" does not contain "TATAAAAA".

You need to run lapply or sapply on the list of sequences, stored in peachgenome:

lapply(peachgenome, function (x) { countPattern(TT, x[2]$seq, max.mismatch = 1) })
# OR
sapply(peachgenome, function (x) { countPattern(TT, x[2]$seq, max.mismatch = 1) })
ADD COMMENT

Login before adding your answer.

Traffic: 2734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6