Shortening protein FASTA file headers
3.5 years ago

Dear all,

I want to shorten the headers in my FASTA file, which look like the two examples below, while keeping all the sequences exactly as they are.

>lcl|VSMA01000001.1_prot_KAB5584702.1_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198,JGIDB:Conioc1_1165795] [protein=tetrahydroxynaphthalene reductase] [protein_id=KAB5584702.1] [location=join(1826..1931,1988..2458,2736..2863,2927..3064)] [gbkey=CDS] 
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM 

>lcl|VSMA01000001.1_prot_KAB5584703.1_2 [locus_tag=GE09DRAFT_1165796] [db_xref=InterPro:IPR021840,JGIDB:Conioc1_1165796] [protein=hypothetical protein] [protein_id=KAB5584703.1] [location=complement(join(3193..3215,3871..4374,4440..5628,5725..5886,5941..5989,6050..6066,6130..6234,6286..6495,6547..6561,6622..6728,6843..7103,7155..7719))] [gbkey=CDS] 
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD

And I only want to keep the header like this:

>GE09DRAFT_1165795 
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
>GE09DRAFT_1165796
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD

I would be very grateful for any help.

Thanks, Yanfang

header shorten fasta • 1.3k views

See if this works:

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < id | awk ' {if ($0 ~ /^>/) {split($0,a,"="); split(a[2],b,"]"); print ">"b[1]} {print $NF} }' | tr "\t" "\n"  | fold -w 80
>GE09DRAFT_1165795
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
>GE09DRAFT_1165796
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP
DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD
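
Since only the header lines need to change, a single awk pass can also rewrite them and pass the sequence lines through untouched, which avoids linearizing and re-folding. The following is a minimal sketch, not a tested drop-in: the file names sample.fa and short.fa and the two shortened records are made up for illustration, and it assumes every tag has the form [locus_tag=...] with no "]" inside the value.

```shell
# Hypothetical two-record sample in the same header layout as the question.
# Note that locus_tag is NOT the second field in the second record.
cat > sample.fa <<'EOF'
>lcl|X_prot_A_1 [locus_tag=GE09DRAFT_1165795] [protein=tetrahydroxynaphthalene reductase] [gbkey=CDS]
MPGLTTNTGK
YDQIPGPLGL
>lcl|X_prot_B_2 [protein=hypothetical protein] [locus_tag=GE09DRAFT_1165796] [gbkey=CDS]
MFHPSRRRAE
EOF

# On header lines, pull out the text between "[locus_tag=" and the next "]".
# Sequence lines (and headers without a locus_tag) are printed unchanged.
awk '
/^>/ && match($0, /\[locus_tag=[^]]+\]/) {
    # RSTART points at "[": skip the 11 characters of "[locus_tag="
    # and drop the closing "]" (12 wrapper characters in total).
    print ">" substr($0, RSTART + 11, RLENGTH - 12)
    next
}
{ print }
' sample.fa > short.fa

cat short.fa
```

Because the tag is located with match() rather than by field position, the header still shortens correctly when locus_tag is not the second field or when the protein name contains more than two words.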

Hey GenoMax,

Thank you so much. I managed to do this by adapting your code with a snippet I read somewhere else.

awk 'BEGIN{FS=" "}{if(NF>1) {split($2,a,"="); split(a[2],b,"]"); printf(">%s\n",b[1])}else{print $0}}' in.fasta > out.fasta

I posted it here, hope it can be useful for others.

Thanks so much for your help. Yanfang
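
For anyone reusing the command above, a quick sanity check can confirm that the sequences really are unchanged. This is only a sketch under assumptions: in.fasta here is a made-up two-record sample, and in.seq and out.seq are hypothetical scratch files. It compares the record counts, compares the sequence lines byte for byte, and lists any shortened ID that occurs more than once.

```shell
# Hypothetical sample input; replace with the real file when checking real data.
cat > in.fasta <<'EOF'
>lcl|X_prot_A_1 [locus_tag=GE09DRAFT_1165795] [gbkey=CDS]
MPGLTTNTGK
>lcl|X_prot_B_2 [locus_tag=GE09DRAFT_1165796] [gbkey=CDS]
MFHPSRRRAE
EOF

# The command from the post above, unchanged.
awk 'BEGIN{FS=" "}{if(NF>1) {split($2,a,"="); split(a[2],b,"]"); printf(">%s\n",b[1])}else{print $0}}' in.fasta > out.fasta

# 1. Same number of records before and after.
n_in=$(grep -c '^>' in.fasta)
n_out=$(grep -c '^>' out.fasta)

# 2. Sequence lines are byte-for-byte identical.
grep -v '^>' in.fasta > in.seq
grep -v '^>' out.fasta > out.seq
if cmp -s in.seq out.seq; then seq_status=identical; else seq_status=changed; fi

# 3. Shortened IDs that occur more than once (empty when all are unique).
dups=$(grep '^>' out.fasta | sort | uniq -d)

echo "records: $n_in -> $n_out, sequences: $seq_status, duplicate IDs: ${dups:-none}"
```

One caveat with the command being checked: it treats any line with more than one whitespace-separated field as a header, so a sequence line that contains spaces would be rewritten too. The second check is meant to catch exactly that.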

3.5 years ago

input (works with flattened fasta):

$ cat test.fa 
>lcl|VSMA01000001.1_prot_KAB5584702.1_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198,JGIDB:Conioc1_1165795] [protein=tetrahydroxynaphthalene reductase] [protein_id=KAB5584702.1] [location=join(1826..1931,1988..2458,2736..2863,2927..3064)] [gbkey=CDS]
atgc
>lcl|VSMA01000001.1_prot_KAB5584703.1_2 [locus_tag=GE09DRAFT_1165796] [db_xref=InterPro:IPR021840,JGIDB:Conioc1_1165796] [protein=hypothetical protein] [protein_id=KAB5584703.1] [location=complement(join(3193..3215,3871..4374,4440..5628,5725..5886,5941..5989,6050..6066,6130..6234,6286..6495,6547..6561,6622..6728,6843..7103,7155..7719))] [gbkey=CDS]
cagt

output:

$ awk -F 'locus_tag=|]' 'NR %2 == 1 {print ">"$2 }; NR % 2 == 0 {print}' test.fa

>GE09DRAFT_1165795
atgc
>GE09DRAFT_1165796
cagt

$ sed -r '/^>/ s/.*locus_tag=(.*_[0-9]+)\]\s\[db_xref.*/>\1/' test.fa
>GE09DRAFT_1165795
atgc
>GE09DRAFT_1165796
cagt
  • The sed command here expects both locus_tag and db_xref in each header, with db_xref immediately after locus_tag; headers that do not match are left unchanged.
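
Before relying on either command, it may help to list the headers that do not fit the expected layout. The sketch below is illustrative only: the three-record test.fa is made up (it overwrites any existing file of that name in the working directory), no_locus_tag.txt and no_db_xref.txt are hypothetical output names, and the fixed-string test for "] [db_xref" is only an approximation of what the sed pattern requires.

```shell
# Hypothetical sample: record 2 has no db_xref, record 3 has no locus_tag.
cat > test.fa <<'EOF'
>lcl|X_prot_A_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198] [gbkey=CDS]
atgc
>lcl|X_prot_B_2 [locus_tag=GE09DRAFT_1165796] [protein=hypothetical protein] [gbkey=CDS]
cagt
>lcl|X_prot_C_3 [protein=hypothetical protein] [gbkey=CDS]
ttga
EOF

# Headers with no locus_tag at all: neither the awk nor the sed command
# can produce a sensible ID for these.
grep '^>' test.fa | grep -v 'locus_tag=' > no_locus_tag.txt || true

# Headers that have a locus_tag but no db_xref right after a closing
# bracket: the sed pattern would leave these untouched.
grep '^>' test.fa | grep 'locus_tag=' | grep -vF '] [db_xref' > no_db_xref.txt || true

echo "without locus_tag: $(grep -c '' no_locus_tag.txt)"
echo "without db_xref:   $(grep -c '' no_db_xref.txt)"
```

If both counts are zero on the real file, every header has the layout the commands assume.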
3.5 years ago
Dunois ★ 2.8k

Here's something way more succinct in sed:

sed -E 's/^.*locus_tag\=([A-Za-z0-9_]+).*$/>\1/g' test.fa 
>GE09DRAFT_1165795
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM 

>GE09DRAFT_1165796
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD
$ sed -r '/^>/ s/.*locus_tag=([[:alnum:]_]+).*$/>\1/' test.fa