Given a sequence my_seq
:
my_seq = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of 10-bp reads my_reads
:
my_reads = substring(my_seq,1:(nchar(my_seq)-9),10:nchar(my_seq))
which I want to map to a target sequence targetseq
:
targetseq = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in my_reads
as a wildcard, so all reads
map to targetseq
.
I want to do this kind of string matching in a large database of strings, and I found the Bioconductor
package Biostrings
be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here my_reads
) into a dictionary of class "pdict"
using the function PDidc()
. You can use this 'dictionary' later in functions like countPDict()
to count matches against a target sequence.
In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, PDict() will complain about the read 15
if you use a trusted band of minimum width = 1 like:
> PDict(my_reads[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band.
> PDict(my_reads[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
- with a head of width 4
- with a Trusted Band of width 1
- with a tail of width 5
Is there any way to circumvent this limitation imposed by Biostrings
? Or does anyone have a different idea? I'm almost giving up and using a following approach: generate reads and an index with the target sequence (saving both as fasta), use a mapper like GEM to map the reads to the index, then parse the output file to count the matches!
P.S.: I also have "N" in the target sequence sometimes, but I guess I can address that by setting the argument 'fixed' in countPDict and related functions to 'FALSE'.
I could solve that problem by creating different dictionaries of reads, taking all reads with N's in the trusted band and creating a separate dictionary with a trusted band with a different position. Then I could perform my match reads against target sequence with these different dictionary of reads and sum the matches across dictionaries. Far from elegant though.