Find all fragments of length n in a larger sequence
I am working on designing dsRNA and I want to check each smaller section of the sequence for possible off-target effects.
I have a 500 bp sequence. I want to write a script that extracts all possible 20 bp fragments within the longer sequence. I am interested in automating this rather than doing it manually because I may repeat the process several times.
Then I want to BLAST each one of those 20 bp sequences against the honey bee genome (with the GOI masked) to make sure each fragment doesn't have perfect alignment anywhere other than the GOI.
Any help is much appreciated!
Here's a way of obtaining the 20-mers in R using the Biostrings package (Bioconductor):
> library(Biostrings)
> DNA_ALPHABET
[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
> seq <- paste(sample(DNA_ALPHABET[1:4], size = 500, replace = TRUE), collapse = "")
> seq <- DNAString(seq)
> seq
500-letter "DNAString" instance
seq: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGAAG...CTGCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
> views <- Views(seq, start = 1:481, end = 20:500)
> views
Views on a 500-letter DNAString subject
subject: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGA...GCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
views:
start end width
[1] 1 20 20 [ATTCAAGTAGTAGTTACGGG]
[2] 2 21 20 [TTCAAGTAGTAGTTACGGGA]
[3] 3 22 20 [TCAAGTAGTAGTTACGGGAA]
[4] 4 23 20 [CAAGTAGTAGTTACGGGAAT]
[5] 5 24 20 [AAGTAGTAGTTACGGGAATG]
... ... ... ... ...
[477] 477 496 20 [CCATCCGAAGGACGCGCCCA]
[478] 478 497 20 [CATCCGAAGGACGCGCCCAG]
[479] 479 498 20 [ATCCGAAGGACGCGCCCAGA]
[480] 480 499 20 [TCCGAAGGACGCGCCCAGAA]
[481] 481 500 20 [CCGAAGGACGCGCCCAGAAA]
> twenty.mers <- DNAStringSet(views)
> twenty.mers
A DNAStringSet instance of length 481
width seq
[1] 20 ATTCAAGTAGTAGTTACGGG
[2] 20 TTCAAGTAGTAGTTACGGGA
[3] 20 TCAAGTAGTAGTTACGGGAA
[4] 20 CAAGTAGTAGTTACGGGAAT
[5] 20 AAGTAGTAGTTACGGGAATG
... ... ...
[477] 20 CCATCCGAAGGACGCGCCCA
[478] 20 CATCCGAAGGACGCGCCCAG
[479] 20 ATCCGAAGGACGCGCCCAGA
[480] 20 TCCGAAGGACGCGCCCAGAA
[481] 20 CCGAAGGACGCGCCCAGAAA
> twenty.mers[1]
A DNAStringSet instance of length 1
width seq
[1] 20 ATTCAAGTAGTAGTTACGGG
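If you would rather not depend on Bioconductor, the same sliding window can be sketched in plain Python. This is only an illustration of the idea above, not a replacement for the Biostrings approach: the function names `kmers` and `to_fasta` and the `frag_<start>_<end>` header format are my own assumptions, chosen so that each fragment can be traced back to its position in the 500 bp sequence when written out as FASTA for a batch search.

```python
def kmers(seq, k=20):
    """Return every overlapping window of length k, in left-to-right order.

    A sequence of length L yields L - k + 1 windows (481 for L=500, k=20),
    or an empty list when the sequence is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def to_fasta(fragments, prefix="frag"):
    """Format consecutive windows as FASTA text.

    Headers carry 1-based start and end coordinates; this naming scheme is
    hypothetical and assumes the fragments come straight from kmers().
    """
    return "".join(
        f">{prefix}_{start}_{start + len(frag) - 1}\n{frag}\n"
        for start, frag in enumerate(fragments, start=1)
    )


if __name__ == "__main__":
    example = "ATTCAAGTAGTAGTTACGGGAATG"  # first 24 bases of the sequence above
    print(to_fasta(kmers(example, 20)), end="")
```

The coordinates in the headers match the start and end columns of the `Views` output, so results from either route can be compared directly.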
For the BLAST step, you could try the following method using the matchPattern function (Biostrings package), which searches a subject sequence for a given pattern, as per the following link (see first answer).
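Since you only care about perfect alignments, the core of that search is exact string matching on both strands. Here is a minimal Python sketch of that idea, under some loud assumptions: the genome (or one chromosome) fits in memory as a single uppercase string, it contains only A/C/G/T (plus N for masked regions, which simply never match), and the helper names `reverse_complement` and `find_exact_hits` are hypothetical. It is not BLAST and it does not handle masking the GOI for you; for a whole genome a real aligner will be far faster.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase A/C/G/T string (other letters pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact_hits(fragment, genome):
    """Return sorted (1-based start, strand) pairs for every perfect match.

    The '-' strand is searched by looking for the reverse complement of the
    fragment on the forward genome string. Overlapping matches are reported.
    """
    queries = [("+", fragment)]
    rc = reverse_complement(fragment)
    if rc != fragment:  # avoid double-counting palindromic fragments
        queries.append(("-", rc))

    hits = []
    for strand, query in queries:
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos + 1, strand))
            pos = genome.find(query, pos + 1)  # step by 1 to keep overlaps
    return sorted(hits)


if __name__ == "__main__":
    toy_genome = "TTTACGGATTTTATCCGTTT"  # made-up sequence for illustration
    print(find_exact_hits("ACGGAT", toy_genome))
```

Any fragment that returns a hit in the GOI-masked genome would be a candidate off-target site worth a closer look.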
Dear myrmex, you may be interested in having a look at our SEDA software (https://www.sing-group.org/seda/download.html). It contains several functions for filtering, transforming, and manipulating FASTA files, including operations to perform batch BLAST queries (https://www.sing-group.org/seda/manual/index.html). Regards.