I have FASTA headers with long annotation names, but the proteomics program I need to run the file through has a header limit of roughly 40 characters and crashes on anything longer.
The file starts off like this:
>TRINITY_DN0_c0_g1_i1.p1 - RecName: Full=E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE
>TRINITY_DN10003_c0_g1_i12.p1 - RecName: Full=Polycomb group protein FIE1; AltName: Full=Protein FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE
Ideally I want the FASTA to look like this:
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein
SEQUENCE
>DN10003_c0_g1_i12.p1 - Polycomb group
SEQUENCE
I used sed and seqkit to cut out the repetitive parts:
sed 's/>.*Y_/>/' proteome.fasta | seqkit replace -p " RecName: Full=" -r ' ' > proteome2.fasta
The fasta looks like this now:
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE
>DN10003_c0_g1_i12.p1 - Polycomb group protein FIE1; AltName: Full=Protein FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE
What can I do to limit the header length? Can I do it with seqkit?
How about this (starting with your last example file)
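A minimal awk sketch of this kind of trimming, assuming the limit is 40 characters including the leading ">" and that headers should be cut at a word boundary rather than mid-word. The function name trim_headers and the output file name are made up for illustration.

```shell
# Sketch: trim each FASTA header at a word boundary so it stays within a limit.
# trim_headers and the default limit of 40 are illustrative assumptions;
# the ">" is counted as part of the header length here.
trim_headers() {
    awk -v max="${2:-40}" '
        /^>/ {
            out = $1                      # always keep the sequence ID
            for (i = 2; i <= NF; i++) {
                # stop before the word that would push the header past max
                if (length(out) + 1 + length($i) > max) break
                out = out " " $i
            }
            print out
            next
        }
        { print }                         # sequence lines pass through unchanged
    ' "$1"
}

# Usage (output file name is hypothetical):
#   trim_headers proteome2.fasta > proteome_trimmed.fasta
```

On the two example headers this should stop after "E3 ubiquitin-protein" and "Polycomb group". An ID that is longer than the limit on its own is kept whole, so it would still need separate handling.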
OP also wants TRINITY_ removed, so a sed might be required before the awk. However, given that OP has already figured out the sed, maybe use proteome.fasta instead of fasta_file in your code so OP knows not to replace their entire code with your awk.
I used the last example that OP showed above. fasta_file is just a placeholder for the file name.
Only keep the sequence identifiers:
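One way to do this with seqkit is the -i (--only-id) flag of seqkit seq, sketched below. The wrapper name ids_only and the output file name are hypothetical, and seqkit is assumed to be installed and on PATH.

```shell
# ids_only is a hypothetical wrapper name; seqkit must be installed and on PATH.
ids_only() {
    # -i / --only-id prints just the ID (the header text up to the first whitespace)
    seqkit seq -i "$1"
}

# Usage (output file name is hypothetical):
#   ids_only proteome2.fasta > proteome_ids.fasta
```

Note that seqkit seq wraps sequence lines at 60 characters by default; adding -w 0 keeps each sequence on a single line.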
Where,
Or
But the number of the characters may exceed the limit.
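Since a trimmed header can still be too long, it may be worth listing any headers that exceed the limit before running the proteomics program. A small awk sketch, assuming a limit of 40 characters that includes the ">"; the function name long_headers is made up for illustration.

```shell
# long_headers is a hypothetical helper: print length and text of every
# header line longer than the limit (default 40, counting the ">").
long_headers() {
    awk -v max="${2:-40}" '/^>/ && length($0) > max { print length($0) "\t" $0 }' "$1"
}

# Usage: long_headers proteome2.fasta      (no output means all headers fit)
```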
This is a neat use of seqkit, I will definitely keep this in mind for future projects.
Perhaps you should also check that after trimming the names you don't get duplicate IDs.
These two commands should give the same output (not checked):
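For the duplicate-ID check suggested above, a minimal sketch using standard tools. It treats the first whitespace-delimited token of each header as the ID, which is an assumption about how the downstream program reads headers; the function name dup_ids is hypothetical.

```shell
# dup_ids is a hypothetical helper: print every sequence ID (first token of
# a header line) that occurs more than once in the file.
dup_ids() {
    awk '/^>/ { print $1 }' "$1" | sort | uniq -d
}

# Usage: dup_ids proteome2.fasta      (no output means all IDs are unique)
```

If the downstream program uses the whole header line as the identifier, printing $0 instead of $1 checks full headers.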