Hello,
I have a file (say seqid_len.txt) with UniProt IDs in the first column and signal peptide lengths (e.g. 25 below) in the second column. Example entries:
tr_GHAT8X 25
tr_GHAMNO 26
I want to extract the FASTA sequence for each of these UniProt IDs, which are a subset of the sequence IDs in a large FASTA file (say fasta_db.fasta), and write those sequences to a new file after removing the signal peptide (residues 1 to 25 in the example). How could I do this?
>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein
MJHJKAHKJHSKBABXNXNELRVYAISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein
MJHAHLFLAJFLJALFNNCLAN;CNALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH
CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP
Sample output file (residues 1 to 25 removed):
>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein
ISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
Please suggest a way to do this in Python or Perl.
Thanks
Using seqkit and bash.
Input fasta file:
$ cat test.fa
input ids:
code:
output:
Note: Clean up your FASTA file. It contains special characters (;) and spaces within the amino acid sequences.
Assumptions:
Thank you @cpad0112 for your help. I was hoping to use Python for this.
Oh okay. Moving the post to a comment.
This wouldn't be too hard in (bio)python, do you have any experience in programming (in python)?
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.
Please check if the displayed format is still accurate.
Thank you. Everything seems to be in the correct format. I recently started learning basic Python commands.
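Since you asked for Python, here is a minimal sketch in plain Python (no Biopython), not a definitive solution. It rests on a few assumptions that are mine, not from your post: an ID such as tr_GHAT8X corresponds to the second |-separated field of the FASTA header (GHAT8X), matching is case-insensitive (your example has GHAmno in the header but GHAMNO in the ID list), and the output is re-wrapped at 60 residues per line. The function names are made up for illustration.

```python
def read_lengths(lines):
    """Map accession (upper-cased) -> signal peptide length.

    Expects lines like 'tr_GHAT8X 25'. The accession is assumed to be
    the part after the first underscore.
    """
    lengths = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        seq_id, length = parts
        accession = seq_id.split("_", 1)[-1].upper()
        lengths[accession] = int(length)
    return lengths


def parse_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_signal_peptides(fasta_lines, lengths, width=60):
    """Return output lines for the listed records, signal peptide removed."""
    out = []
    for header, seq in parse_fasta(fasta_lines):
        fields = header.split("|")
        if len(fields) < 2:
            continue  # header is not in the assumed db|accession|name form
        accession = fields[1].upper()
        if accession not in lengths:
            continue  # record is not in the ID list
        mature = seq[lengths[accession]:]  # drop residues 1..length
        out.append(">" + header)
        out.extend(mature[i:i + width] for i in range(0, len(mature), width))
    return out


def main(ids_path, fasta_path, out_path):
    """Read the ID/length table and the FASTA file, write the trimmed records."""
    with open(ids_path) as ids_file:
        lengths = read_lengths(ids_file)
    with open(fasta_path) as fasta_file, open(out_path, "w") as out_file:
        for line in trim_signal_peptides(fasta_file, lengths):
            out_file.write(line + "\n")
```

To run it on your files you would call something like main("seqid_len.txt", "fasta_db.fasta", "trimmed.fasta"), where the output file name is just a placeholder. The sketch does not clean the sequences, so stray characters such as ; stay in the output unless you remove them first.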