Make FASTA sequence names short
9.5 years ago
radha.jg • 0

Hi,

I'm a newbie so please be patient.

I have a fasta file like this:

>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>gi|820957452|pdb|4WZH|B Chain B, Dihydroorotate Dehydrogenase From Leishmania Viannia Braziliensis
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

How do I transform the names to have something like this:

>gb_AKG62099.1
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>pdb_4WZH
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

The idea is to keep just the GenBank ID or, if the header doesn't contain one, one of the other IDs together with the database it comes from.

Regards :)

9.5 years ago
awk 'BEGIN{FS="|"}{if(NF>1) {printf(">%s_%s\n", $3, $4)}else{print $0}}' foo.fa > fixed.fa

Excellent. Thank you very much. I seriously need to learn how to program in awk.


In the case of sp_, how do I use a unique identifier like gi? I assume that $1 and $2 will do.


You could check the number of fields (NF) in a more elaborate way and use $1 and $2 as needed.
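
A minimal sketch of what such an NF check might look like. The thread does not show an "sp" header, so the three-field layout used here (>sp|ACCESSION|NAME ...) is an assumption, and the function name shorten_headers is hypothetical. Headers with at least four fields keep $3 and $4 as in the original answer; shorter headers fall back to $1 and $2.

```shell
# Hypothetical helper; reads FASTA on stdin and writes it to stdout.
shorten_headers() {
  awk -F'|' '
    # gi|number|db|accession|... : keep db and accession, as in the answer above
    /^>/ && NF >= 4 { printf(">%s_%s\n", $3, $4); next }
    # Fewer fields (assumed layout: >sp|accession|name): fall back to $1 and $2.
    # $1 already starts with ">", and anything after a space in $2 is dropped.
    /^>/ && NF >= 2 { id = $2; sub(/[ \t].*$/, "", id); printf("%s_%s\n", $1, id); next }
    # Sequence lines and headers without "|" pass through unchanged
    { print }
  '
}

printf '%s\n' \
  '>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]' \
  'MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR' \
  '>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha' \
  'MVLSPADKTNVKAAWGKVGA' | shorten_headers
```

On a real file this would be run as shorten_headers < foo.fa > fixed.fa. Restricting the rewrite to lines that start with ">" also keeps sequence lines from being touched by accident.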


Hey Devon Ryan,

Could you please help me with some modification of your code for my problem?

I also want to shorten the sequence headers in my FASTA file, which look like this:

>lcl|VSMA01000001.1_prot_KAB5584702.1_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198,JGIDB:Conioc1_1165795] [protein=tetrahydroxynaphthalene reductase] [protein_id=KAB5584702.1] [location=join(1826..1931,1988..2458,2736..2863,2927..3064)] [gbkey=CDS]
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
>lcl|VSMA01000001.1_prot_KAB5584705.1_4 [locus_tag=GE09DRAFT_52] [db_xref=InterPro:IPR010730,JGIDB:Conioc1_52] [protein=heterokaryon incompatibility protein-domain-containing protein] [protein_id=KAB5584705.1] [location=10796..11233] [gbkey=CDS]
MPTRLLEIDPQANSRHIRLVSDTGILLKERYAALSHCWGKSPTNTTTKAVFVSHTQGIDILSLSKTFQHTIFVTRELGIR
YLWIDSLCIIQDDEDDWKREAENMADVFANAFVTIAASASTDGDGGLFYPRALETERSGTVRWTI

And I want the header to be this:

>GE09DRAFT_1165795
>GE09DRAFT_52

And I tried your code with

awk 'BEGIN{FS=" "}{if(NF>1) {printf(">%s\n", $2)}else{print $0}}' in.fasta > out.fasta

It gave me this result:

>[locus_tag=GE09DRAFT_1165795]
..

So how can I remove the [locus_tag= and the ]?

I would really appreciate any help.

Thanks!
Yanfang


I think I figured this out by combining two pieces of code.

awk 'BEGIN{FS=" "}{if(NF>1) {split($2,a,"="); split(a[2],b,"]"); printf(">%s\n",b[1])}else{print $0}}' in.fasta > out.fasta

Thanks for all the help!

Yanfang
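
The split-based command above relies on [locus_tag=...] being the second space-separated field. As a hedged alternative, the sketch below looks for the tag anywhere in the header using awk's match(), RSTART and RLENGTH. The function name locus_tag_headers is hypothetical, and the sample header is an abridged version of the one posted above.

```shell
# Hypothetical helper; reads FASTA on stdin and writes it to stdout.
locus_tag_headers() {
  awk '
    # Header containing a [locus_tag=...] block anywhere on the line
    /^>/ && match($0, /\[locus_tag=[^]]*\]/) {
      # Skip the 11 characters of "[locus_tag=" and drop the closing "]"
      print ">" substr($0, RSTART + 11, RLENGTH - 12)
      next
    }
    # Sequence lines, and headers without a locus_tag, pass through unchanged
    { print }
  '
}

printf '%s\n' \
  '>lcl|VSMA01000001.1_prot_KAB5584705.1_4 [locus_tag=GE09DRAFT_52] [gbkey=CDS]' \
  'MPTRLLEIDPQANSRHIRLVSDTGILLKERYAALSHCWGK' | locus_tag_headers
```

On a real file this would be run as locus_tag_headers < in.fasta > out.fasta. Headers that lack a locus_tag are left as they are rather than being replaced with an empty name.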
