How to limit fasta header to 40 characters?
2
0
Entering edit mode
14 months ago

I have FASTA headers with long annotation names, but the program it will be run through for proteomics has a limit of roughly 40 characters or else it will crash.

The file starts off like this:

>TRINITY_DN0_c0_g1_i1.p1 - RecName: Full=E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>TRINITY_DN10003_c0_g1_i12.p1 - RecName: Full=Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

Ideally I want the FASTA to look like this:

>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein 
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group
SEQUENCE

I used sed and seqkit to cut out the repetitive parts

sed 's/>.*Y_/>/' proteome.fasta
seqkit replace -p " RecName: Full=" -r ' ' proteome.fasta > proteome2.fasta 

The fasta looks like this now:

>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

What can I do to limit the header length? Can I do it with seqkit?

unix seqkit fasta • 2.4k views
ADD COMMENT
3
Entering edit mode

How about this (starting with your last example file)

$ awk '(">" ~ $0); {print substr($0,0,40)}' fasta_file
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein 
SEQUENCE
SEQUENCE2


>DN10003_c0_g1_i12.p1 - Polycomb group p
SEQUENCE
ADD REPLY
0
Entering edit mode

OP also wants TRINITY_ removed so a sed might be required before the awk. However, given that OP has already figured out the sed, maybe use proteome.fasta instead of fasta_file in your code so OP knows not to replace their entire code with your awk.

ADD REPLY
0
Entering edit mode

I used the last example that OP showed above. fasta_file is just a place holder for file name.

ADD REPLY
0
Entering edit mode

Only keep the sequence identifiers:

seqkit seq -i proteome.fasta -o proteome2.fasta

Where,

-i, --only-id                   print IDs instead of full headers

Or

$ seqkit replace -p ' - RecName: Full=(.+?);.+' -r ' $1' proteome.fasta  
>TRINITY_DN0_c0_g1_i1.p1 E3 ubiquitin-protein ligase CIP8
SEQUENCE
>TRINITY_DN10003_c0_g1_i12.p1 Polycomb group protein FIE1
SEQUENCE

But the number of the characters may exceed the limit.

ADD REPLY
0
Entering edit mode

This is a neat use of seqkit, I will definitely keep this in mind for future projects.

ADD REPLY
0
Entering edit mode

Perhaps you should also check that after trimming the names you don't get duplicate IDs. These two commands should give the same output (not checked):

grep '>' before_trimming.fasta | sort | uniq | wc -l
grep '>' after_trimming.fasta | sort | uniq | wc -l
ADD REPLY
2
Entering edit mode
14 months ago
bk11 ★ 3.0k

You can do sth like this-

cat proteome.fasta
>TRINITY_DN0_c0_g1_i1.p1 - RecName: Full=E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>TRINITY_DN10003_c0_g1_i12.p1 - RecName: Full=Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

cut -d ' ' -f1-5 proteome.fasta |sed 's/RecName: Full=//g' |sed 's/TRINITY_//g'
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group
SEQUENCE
ADD COMMENT
0
Entering edit mode

This won't work if the non-blank-space separated words are too long. Given that the 40 char limit is the primary concern, the cut should be something like -c 1-40 and it should come after the sed operations.

ADD REPLY
3
Entering edit mode

Please check this too if it works.

sed 's/TRINITY_//g' proteome.fasta |sed 's/RecName: Full=//g' |awk '/^>/{print substr($0, 1, 39)} !/^>/{print}'
ADD REPLY
0
Entering edit mode

This is the most complete and concise answer. The two seds instead of using seqkit is great.

The awk for the character limit also works like a charm. If you make this the main post, I will accept it as the answer. Thanks for the nice solution!

ADD REPLY
0
Entering edit mode

Ok. I am making it as the main post. Please go ahead and accept it as answer. Thanks!

ADD REPLY
0
Entering edit mode
14 months ago
bk11 ★ 3.0k

Please check this, it will work-

sed 's/TRINITY_//g' proteome.fasta |sed 's/RecName: Full=//g' |awk '/^>/{print substr($0, 1, 39)} !/^>/{print}'
ADD COMMENT

Login before adding your answer.

Traffic: 2001 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6