I have a sequence stored in a FASTA file, let's say ATCGATCGGGTTTAC, and I want to create a matrix of 6 columns and 10 rows in which each row is a window of six bases, sliding along the sequence one base at a time. In other words, I want a matrix like this:
ATCGAT
TCGATC
CGATCG
GATCGG
ATCGGG
TCGGGT
CGGGTT
GGGTTT
GGTTTA
GTTTAC
Any thoughts on how I can do this using R?
Thanks in advance.
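One possible way to get the windows described above is base R's substring(), which is vectorised over its start and stop positions. This is only a minimal sketch for the 15-base example in the question; the variable names (seq_string, k, starts, kmers, kmer_matrix) are made up for illustration, and it assumes the sequence is already in memory as a single string.

```r
# Example sequence from the question (assumed to be loaded as one string)
seq_string <- "ATCGATCGGGTTTAC"
k <- 6  # window width

# Start position of every window, moving one base at a time
starts <- seq_len(nchar(seq_string) - k + 1)

# substring() is vectorised over start/stop, so this returns all windows at once
kmers <- substring(seq_string, starts, starts + k - 1)

# Split each window into single letters and stack them as rows of a matrix
kmer_matrix <- do.call(rbind, strsplit(kmers, ""))

print(kmers)
print(dim(kmer_matrix))
```

With a 15-base sequence and a window of 6 there are 15 - 6 + 1 = 10 windows, so kmer_matrix has 10 rows and 6 columns, one base per cell.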
Today I learned that R has a substring function.
Thank you so much for your answer, I tried it and it worked. Can I use this if I have a really long sequence stored in a FASTA file, let's say the entire genome of E. coli (more than 2500000 characters)? Also, I need the results to be stored in a matrix with dimensions 10x6 so I can process them further, if you have any suggestions :)
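For a sequence read from a file, something along these lines might work. It is a rough sketch, not a tuned solution: read_fasta_sequence and sliding_window_matrix are hypothetical helper names, the code assumes a FASTA file with a single record, and the tiny temporary file only stands in for a real genome. It fills the matrix one column at a time, so only k vectorised assignments are needed regardless of sequence length. For a genome of several million bases the resulting character matrix has millions of rows, so memory use is worth keeping an eye on.

```r
# Hypothetical helper: read a single-record FASTA file into one string
read_fasta_sequence <- function(path) {
  lines <- readLines(path)
  # Drop header lines (starting with ">") and join the wrapped sequence lines
  paste(lines[!startsWith(lines, ">")], collapse = "")
}

# Hypothetical helper: build the (n - k + 1) x k matrix of sliding windows
sliding_window_matrix <- function(seq_string, k = 6) {
  bases <- strsplit(seq_string, "")[[1]]
  n_windows <- length(bases) - k + 1
  stopifnot(n_windows >= 1)
  window_matrix <- matrix(NA_character_, nrow = n_windows, ncol = k)
  # Column j holds the j-th base of every window
  for (j in seq_len(k)) {
    window_matrix[, j] <- bases[j:(j + n_windows - 1)]
  }
  window_matrix
}

# Small stand-in file so the sketch runs on its own (assumption: one record)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">example", "ATCGATCG", "GGTTTAC"), fasta_path)

genome <- read_fasta_sequence(fasta_path)
window_matrix <- sliding_window_matrix(genome, k = 6)
print(dim(window_matrix))  # 10 rows, 6 columns for this example
```

The number of rows is always the sequence length minus 5 for a window of 6, so it is 10 only for the 15-base example; each row can be turned back into a string with paste(window_matrix[i, ], collapse = "").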
There are more efficient programs to generate k-mers from long sequences. Do you have to do this with R? Jellyfish (LINK) is one.
Yep, unfortunately I have to use R, it's part of an exercise...
If it's an exercise, aren't you supposed to work on it by yourself? At the very least you should have tagged it as an exercise.
I didn't know that. I will tag it now, thank you.