I have 5000 FASTA sequences with UniProt IDs. I want to add a unique identifier at the beginning of each FASTA header. An example should make this clearer:
>sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------
>tr|B57384|ref|DRF_ERG ribosomal protein
-------------------------------------------------------------
And so on
I want to add ABC0001 to ABC5000 at the beginning of the FASTA headers, along with the corresponding gene name from my txt file:
gopA ABC0001 A12345
gopD ABC0002 B57384
........................
fotR ABC5000 C12345
Output:
>ABC0001|gopA|sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------
>ABC0002|gopD|tr|B57384|ref|DRF_ERG ribosomal protein
-------------------------------------------------------------
And so on
As I understand it, I have to match the UniProt IDs from the txt file against the FASTA sequence file, then grab the ABC ID (e.g. ABC0001) and the gene name (e.g. gopA) and add them at the beginning of the FASTA header.
What have you tried?
I have reformatted the txt file into a more convenient layout.
output>
Now I am trying to match the first field (-f1, e.g. P87546) and replace the UniProt ID (P87546) with "P87546|ABC0447|yohF" in the FASTA file. Later I can change the order of the IDs in the FASTA header to >ABC0447|yohF|tr|P87546.
I don't know of any one-liner that can do it. Should I try it in Perl, reading each ID and then looping through the FASTA sequences to match the ID against each header? So far I have only managed to add a non-unique identifier to each sequence, which is not what I want. I'm really close; I just need a bit more time, I guess.
"Should I try it in Perl, reading each ID and then looping through the FASTA sequences to match the ID against each header?" Yes, you will need to load your gene names first, then apply a substitution to each header. If you don't mind, I'll post a solution candidate...
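Here is a minimal sketch of that "load the names first, then substitute" idea, written in Python rather than Perl. It assumes the txt file has three whitespace-separated columns (gene name, ABC ID, UniProt accession) as in the question, and that the accession is the second "|"-separated field of each header. The function names load_mapping and relabel_fasta are made up for illustration.

```python
def load_mapping(lines):
    """Build {accession: (abc_id, gene)} from rows like 'gopA ABC0001 A12345'."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        gene, abc_id, accession = fields
        mapping[accession] = (abc_id, gene)
    return mapping


def relabel_fasta(lines, mapping):
    """Yield FASTA lines, prefixing matched headers with 'ABC id|gene|'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split("|")
            # Assumption: the accession is the second pipe-separated field.
            accession = parts[1] if len(parts) > 1 else None
            if accession in mapping:
                abc_id, gene = mapping[accession]
                line = f">{abc_id}|{gene}|{line[1:]}"
        yield line


# Example use with hypothetical file names:
# with open("genes.txt") as table, open("seqs.fasta") as fasta:
#     for out_line in relabel_fasta(fasta, load_mapping(table)):
#         print(out_line)
```

Headers whose accession is not in the table are passed through unchanged, so it is easy to check afterwards which sequences were missed.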
join could probably do this. General syntax:
join -1 1 -2 2 -o 1.3,2.1 file1 file2
This would join on column 1 of file1 and column 2 of file2, and output column 3 of file1 followed by column 1 of file2. Both files must be sorted on the join field. You might first have to extract the headers from the FASTA file. After building the new headers, you can replace the old ones with the new ones, again with join.
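For anyone who finds the column bookkeeping easier to follow in code, the same kind of column-based join can be sketched in Python. This is only an illustration under assumptions: the function name join_on and the sample rows are made up, columns are 0-based here, and unlike join the inputs do not need to be sorted.

```python
def join_on(rows1, key1, rows2, key2, out_fields):
    """Inner-join two tables (lists of field lists) on the given columns.

    out_fields is a list of (table_number, column) pairs, with table_number
    1 or 2 and 0-based columns, naming the fields to emit for each match.
    """
    # Index the second table by its join column.
    index = {}
    for row in rows2:
        index.setdefault(row[key2], []).append(row)
    # Look up each row of the first table and emit the requested fields.
    for row1 in rows1:
        for row2 in index.get(row1[key1], []):
            yield [(row1 if table == 1 else row2)[col] for table, col in out_fields]


# Illustrative rows: table 1 is keyed on column 0, table 2 on column 1.
table1 = [["A12345", "sp", "COG_REF"], ["B57384", "tr", "DRF_ERG"]]
table2 = [["gopD", "B57384"], ["gopA", "A12345"]]
joined = list(join_on(table1, 0, table2, 1, [(1, 2), (2, 0)]))
```

Rows without a partner in the other table are dropped, which matches join's default behaviour.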