Question

how to do vmatchPattern() for DNAStringSet [edited]

0

Entering edit mode

11.3 years ago

catherine ▴ 250

I think I'm using the right function,and there is no error message. But I didn't get the result.Here is what I did.

library(BSgenome)
bd<-DNAString("CAGGTAG")
dna <- readDNAStringSet("sequence.txt")
zld_bindingsite<-vmatchPattern(bd,dna, max.mismatch=0)

And my result is:

MIndex object of length 5050
$`dm3_cytoBand_101F1 range=chr4:69669-90134 5'pad=0 3'pad=0 strand=+repeatMasking=none`
IRanges of length 0

$`dm3_cytoBand_102A1 range=chr4:90135-128922 5'pad=0 3'pad=0 strand=+ repeatMasking=none`
IRanges of length 0

$`dm3_cytoBand_102A2 range=chr4:128923-130909 5'pad=0 3'pad=0 strand=+ repeatMasking=none`
IRanges of length 0

...
<5047 more elements>

But the DNAStringSet looks fine.

head(dna)
A DNAStringSet instance of length 6
width       seq                                                                                                            names               
[1] 20466 AATTGTCCTTATTAATTGACGATGCT...ATTATGCGCCTGATTTAAAGTACAAA dm3_cytoBand_101F...
[2] 38788 ATGTGTAAATATATCACCTTACCGTC...CATGCTAATCACTCGCTTTTAAATAG dm3_cytoBand_102A...
[3]  1987 CAACTTGGGCCACCTTCATCACCAAT...GACAATCCTGTATATCCTATAACACT dm3_cytoBand_102A...
[4] 24243 AGAGCCTTATGTAATCCACCTCATTT...TTCACTTGTAATTTTAAATTGTTCAG dm3_cytoBand_102A...
[5] 38787 AAATGTTCGCAAGTCAGAGCAAGGCT...AGCGGGAGCTTCTAAGACTTTGTTTT dm3_cytoBand_102A...
[6]  1987 TCGTGTTTTGCTCGTTCACAAAATCC...ATTATTAAAATACTTTCAAAAATAAA dm3_cytoBand_102

But I expect to see the results like this:

test<-BString("AATTGTCCTTATTAATTCAGGTAGGACGATGCTATATTATTAGTCTTTCCCAGGTAGGACATTACCCGCTAGTGAAATGTGATATTCTGAATCTTTCCGCTGTAAGCGCTTTAAGTTTGTTGCAATTA")
> bd<-DNAString("CAGGTAG")
> zld_bindingsite<-matchPattern(bd,test, max.mismatch=0)
> zld_bindingsite
  Views on a 128-letter BString subject
subject: AATTGTCCTTATTAATTCAGGTAGGACGATGCTATAT...TCTTTCCGCTGTAAGCGCTTTAAGTTTGTTGCAATTA
views:
    start end width
[1]    18  24     7 [CAGGTAG]
[2]    51  57     7 [CAGGTAG]

I actually can extract the seq by using toString(), but in this way I can only have one single string. Is there any function that can extract all the sequences from this StringSet?

R • 4.7k views

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.3 years ago by catherine ▴ 250

Ram · Accepted Answer · 2014-05-09

Haven't worked much with these functions but I'll give it a try.

Your object zld_bindingsite is an MIndex object that contains all the hits in your StringSet (the first 3 sequences don't seem to have a match).

As you are looking for an exact pattern you don't really need the sequences from the string set, only the positions.

unlist(zld_bindingsite) should give you an IRanges object with all positions together with the sequence names the hits are in.

as.matrix(unlist(zld_bindingsite)) gives you the output in matrix format with sequence IDs as rownames, motif start point in the first column and motif width in the second.

You might then also want to run the reverse complement of your motif in the same way and again extract all hits to get the full list.