Let's say I have a FASTA file with a protein sequence:
>albumin
MKWVTFISLL FLFSSAYSRG ... ... ...
I want to split the sequence into all possible windows of 8 consecutive amino acids, in one direction only (amino -> carboxyl) and with no looping around the end (I don't know if that is the right expression), so that, say, GMKWVTFI would not be produced.
I need:
>fasta.albumin1
MKWVTFIS
>fasta.albumin2
KWVTFISL
>fasta.albumin3
WVTFISLL
...
>fasta.albumin13
FSSAYSRG
And I want to do this for all known human protein sequences.
How would I do it? I need the result as a FASTA file (or several FASTA files), and the IDs of the resulting 8-mer sequences need to be unique.
Have you tried to do this yourself? Please post any code you have already tried.
I am a biologist. I can use Google to search for news, but I can't write scripts and don't know any scripting language. I was even struggling with posting my question. Sorry.
That is fine. This forum is there to help with that. I have formatted your post yet again. Please do not change the formatting back to plain text.
You say 5 AA in the title but 8 in the body of the post. Which is it?
There are ~20K+ validated human proteins at UniProt, so the results file is going to be gigantic.
Note: Why do you keep changing back the formatting of the data in your post to plain text?
Yes, it's going to be 'big data', but I really want to try.
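To get a feel for how large the output would be: a linear sequence of length L yields L - k + 1 overlapping k-mers (none if L is shorter than k). Below is a minimal sketch of that count, using the 20 residues shown in the question. The function name `count_kmers` is made up for illustration, and real sequence lengths would have to come from the actual proteome file.

```python
def count_kmers(seq_length, k=8):
    """Number of overlapping k-mers in a linear (non-circular) sequence."""
    # A sequence shorter than k contributes no windows at all.
    return max(seq_length - k + 1, 0)


# The 20 residues shown in the question, with the space removed.
example = "MKWVTFISLLFLFSSAYSRG"
print(count_kmers(len(example)))  # 13, matching fasta.albumin13 in the question
```

Summing this count over every protein gives the number of FASTA records the final file would contain.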
Hey @genomax
There is some issue with the code formatter below. A '(' is missing in my code
print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))
below but looks okay when I check in the edit mode.
There is a known issue with python code formatting and biostars display of that code. I think the fix is to put a comment there to add the ( at the proper place. Not elegant but will work for now.
Edit: Best solution is to put your code in a gist and then link that in your post. Biostars then looks the code up via gist and formats it correctly.
I see! I will opt for the gist option next time. Thanks
If you know python/perl then this should be easy to do. Someone is bound to put a program up as answer shortly. @Pierre will likely have a one liner in awk.
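For reference, here is a minimal sketch of the sliding-window approach in plain Python, with no extra libraries. It is not the code from the answer discussed above. The function names (`read_fasta`, `kmer_records`, `write_kmers`) are made up for illustration, and the `|kmer_N` suffix simply follows the ID pattern in the fragment quoted above. It assumes the record IDs in the input file are themselves unique, so that ID plus window number is unique too.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            record_id = fields[0] if fields else "unnamed"
            chunks = []
        elif record_id is not None:
            # Drop internal spaces such as "MKWVTFISLL FLFSSAYSRG".
            chunks.append("".join(line.split()))
    if record_id is not None:
        yield record_id, "".join(chunks)


def kmer_records(record_id, sequence, k=8):
    """Yield (unique_id, kmer) for each window, amino to carboxyl, no wrap-around."""
    # range() is empty when the sequence is shorter than k.
    for start in range(len(sequence) - k + 1):
        yield f"{record_id}|kmer_{start + 1}", sequence[start:start + k]


def write_kmers(in_handle, out_handle, k=8):
    """Write every k-mer of every input record as its own FASTA record."""
    for record_id, sequence in read_fasta(in_handle):
        for kmer_id, kmer in kmer_records(record_id, sequence, k):
            out_handle.write(f">{kmer_id}\n{kmer}\n")


# Demo on the 20 residues shown in the question.
demo = io.StringIO(">albumin\nMKWVTFISLL FLFSSAYSRG\n")
out = io.StringIO()
write_kmers(demo, out)
print(out.getvalue())
```

For a real proteome, the same `write_kmers` call should work with handles from `open(...)` for the downloaded FASTA file and for the output file. Because records are written one at a time, the full set of k-mers never has to be held in memory.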