Find all fragments of length n in a larger sequence
6.1 years ago
myrmex • 0

I am working on designing dsRNA and I want to check each smaller section of the sequence for possible off-target effects.

I have a 500 bp sequence. I want to write a script that extracts all possible 20 bp fragments within the longer sequence. I am interested in automating this rather than doing it manually because I may repeat the process several times.

Then I want to BLAST each of those 20 bp sequences against the honey bee genome (with the gene of interest, GOI, masked) to make sure no fragment aligns perfectly anywhere other than the GOI.

Any help is much appreciated!


Dear myrmex, you may be interested in our SEDA software (https://www.sing-group.org/seda/download.html). It provides several functions for filtering, transforming, and manipulating FASTA files, including operations for running batch BLAST queries (https://www.sing-group.org/seda/manual/index.html). Regards.

6.1 years ago
thomaskuilman ▴ 850

Here's a way of obtaining the 20-mers in R using the Biostrings package (Bioconductor):

> library(Biostrings)
> DNA_ALPHABET
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
> seq <- paste(sample(DNA_ALPHABET[1:4], size = 500, replace = TRUE), collapse = "")
> seq <- DNAString(seq)
> seq
  500-letter "DNAString" instance
seq: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGAAG...CTGCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
> views <- Views(seq, start = 1:481, end = 20:500)
> views
  Views on a 500-letter DNAString subject
subject: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGA...GCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
views:
      start end width
  [1]     1  20    20 [ATTCAAGTAGTAGTTACGGG]
  [2]     2  21    20 [TTCAAGTAGTAGTTACGGGA]
  [3]     3  22    20 [TCAAGTAGTAGTTACGGGAA]
  [4]     4  23    20 [CAAGTAGTAGTTACGGGAAT]
  [5]     5  24    20 [AAGTAGTAGTTACGGGAATG]
  ...   ... ...   ... ...
[477]   477 496    20 [CCATCCGAAGGACGCGCCCA]
[478]   478 497    20 [CATCCGAAGGACGCGCCCAG]
[479]   479 498    20 [ATCCGAAGGACGCGCCCAGA]
[480]   480 499    20 [TCCGAAGGACGCGCCCAGAA]
[481]   481 500    20 [CCGAAGGACGCGCCCAGAAA]
> twenty.mers <- DNAStringSet(views)
> twenty.mers
  A DNAStringSet instance of length 481
      width seq
  [1]    20 ATTCAAGTAGTAGTTACGGG
  [2]    20 TTCAAGTAGTAGTTACGGGA
  [3]    20 TCAAGTAGTAGTTACGGGAA
  [4]    20 CAAGTAGTAGTTACGGGAAT
  [5]    20 AAGTAGTAGTTACGGGAATG
  ...   ... ...
[477]    20 CCATCCGAAGGACGCGCCCA
[478]    20 CATCCGAAGGACGCGCCCAG
[479]    20 ATCCGAAGGACGCGCCCAGA
[480]    20 TCCGAAGGACGCGCCCAGAA
[481]    20 CCGAAGGACGCGCCCAGAAA
> twenty.mers[1]
  A DNAStringSet instance of length 1
    width seq
[1]    20 ATTCAAGTAGTAGTTACGGG
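As a possible next step, the 20-mers can be written to a FASTA file and passed to command-line BLAST. The following is only a sketch under stated assumptions: the random `target` stands in for your real 500 bp sequence, `amel_masked` is a hypothetical name for a nucleotide database you would build yourself with `makeblastdb` from the GOI-masked honey bee genome, and the BLAST call is disabled by default (`run_blast <- FALSE`) so the rest runs without BLAST+ installed.

```r
library(Biostrings)

# Hypothetical stand-in for the 500 bp target; replace with your own sequence.
set.seed(1)
target <- DNAString(paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE),
                          collapse = ""))

# All overlapping k-mers, using width instead of hard-coded end positions.
k <- 20
starts <- seq_len(length(target) - k + 1)
kmers <- DNAStringSet(Views(target, start = starts, width = k))
names(kmers) <- paste0("kmer_", starts)  # FASTA headers record the start position

# Write the k-mers as a multi-FASTA query file.
fasta <- file.path(tempdir(), "kmers.fasta")
writeXStringSet(kmers, filepath = fasta)

# Assumption: BLAST+ is on the PATH and "amel_masked" (hypothetical name) is a
# database made with makeblastdb from the masked genome.
run_blast <- FALSE
if (run_blast) {
  system2("blastn", c("-task", "blastn-short",   # settings suited to short queries
                      "-query", fasta,
                      "-db", "amel_masked",
                      "-perc_identity", "100",
                      "-outfmt", "6",             # tabular output
                      "-out", file.path(tempdir(), "kmers_hits.tsv")))
}
```

Note that `-perc_identity 100` alone does not guarantee a full-length match, so you would probably also want to keep only hits whose alignment length (column 4 of the tabular output) equals 20.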

To check for off-target matches, you could use matchPattern (Biostrings package) in place of BLAST; it searches a subject sequence for occurrences of a pattern rather than running a BLAST alignment. See the first answer at the following link.
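To illustrate the pattern-matching route, here is a minimal sketch that counts exact occurrences of each 20-mer on both strands of a subject sequence with `countPattern` and `matchPattern`. The `target` and `genome` strings are toy stand-ins (assumptions, not real data), and `count_hits` is a hypothetical helper name; for a real genome you would load the masked chromosomes instead.

```r
library(Biostrings)

# Toy stand-ins (assumptions): replace with your target and the masked genome.
target <- DNAString("ACGTACGTTAGCCGATAGCTAGGCTA")
genome <- DNAString("TTTTACGTACGTTAGCCGATAGCTATTTT")

# All overlapping 20-mers of the target.
k <- 20
starts <- seq_len(length(target) - k + 1)
kmers <- DNAStringSet(Views(target, start = starts, width = k))

# Hypothetical helper: exact hits on the forward strand plus hits of the
# reverse complement (a palindromic k-mer would be counted twice).
count_hits <- function(kmer, subject) {
  countPattern(kmer, subject) +
    countPattern(reverseComplement(kmer), subject)
}

hits <- vapply(seq_along(kmers),
               function(i) count_hits(kmers[[i]], genome),
               integer(1))

# k-mers with at least one perfect match in the subject.
off_target <- kmers[hits > 0]

# Positions of the forward-strand matches for the first k-mer.
first_match_starts <- start(matchPattern(kmers[[1]], genome))
```

Both functions also accept a `max.mismatch` argument if near-perfect matches should be flagged as well.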
