9.5 years ago
radha.jg
Hi,
I'm a newbie so please be patient.
I have a fasta file like this:
>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>gi|820957452|pdb|4WZH|B Chain B, Dihydroorotate Dehydrogenase From Leishmania Viannia Braziliensis
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD
How do I transform the names to have something like this:
>gb_AKG62099.1
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>pdb_4WZH
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD
The idea is to keep just the GenBank ID or, if there isn't one in the name, one of the other IDs together with the database it comes from.
Regards :)
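For reference, here is a minimal Python sketch of the renaming described above. It is not the awk answer from this thread, just one possible approach. It assumes the header is a run of `db|id` pairs separated by `|` with an optional trailing description, prefers the `gb` pair when there is one, and otherwise falls back to the first pair that is not `gi`. The function names `short_name` and `rename_fasta` are made up for illustration.

```python
def short_name(header):
    """Shorten a header like '>gi|820716087|gb|AKG62099.1| ...' to '>gb_AKG62099.1'."""
    fields = header.lstrip(">").split("|")
    # Pair up (database, identifier); an unpaired trailing field is the description.
    pairs = [(db.strip(), ident.strip()) for db, ident in zip(fields[0::2], fields[1::2])]
    pairs = [(db, ident.split()[0]) for db, ident in pairs if db and ident]
    if not pairs:
        return header  # nothing recognisable, leave the header untouched
    # Preference order: GenBank first, then any database other than gi, then gi itself.
    for wanted in (lambda db: db == "gb", lambda db: db != "gi", lambda db: True):
        for db, ident in pairs:
            if wanted(db):
                return ">" + db + "_" + ident
    return header


def rename_fasta(lines):
    """Yield the lines of a FASTA file with every header shortened."""
    for line in lines:
        line = line.rstrip("\n")
        yield short_name(line) if line.startswith(">") else line


if __name__ == "__main__":
    import sys
    # Usage (assumed): python rename_fasta.py < input.fasta > output.fasta
    for out in rename_fasta(sys.stdin):
        print(out)
```

Because the pairs are counted rather than hard-coded, a shorter header such as `sp|...|...` (only one pair, no leading `gi`) should come out as `sp_` plus its identifier without any special casing.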
Excellent. Thank you very much. I seriously need to learn how to program in awk.
In the case of sp_, how do I use a unique identifier like gi? I assume that $1 and $2 will do.
You could check the number of fields (NF) in a more elaborate way and use $1 and $2 as needed.
Hey Devon Ryan,
Could you please help me with some modification of your code for my problem?
I also want to shorten the fasta file sequence header, which looks like this:
And I want the header to be this:
And I tried your code with
It gave me the results:
So how can I cut the [locus_tag= and ]? I would really appreciate any help.
Thanks!
Yanfang
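A rough Python sketch of stripping the `[locus_tag=` and `]` wrapper, assuming the goal is to keep only the tag value as the new header. The original headers are not shown in this thread, so the example header in the comment is invented purely for illustration, and `locus_tag_header` is a hypothetical helper name.

```python
import re

# Capture whatever sits between "[locus_tag=" and the next "]".
LOCUS_TAG = re.compile(r"\[locus_tag=([^\]]+)\]")


def locus_tag_header(header):
    """Return '>' plus the locus_tag value, or the header unchanged if none is found."""
    match = LOCUS_TAG.search(header)
    if match is None:
        return header
    return ">" + match.group(1)


# Invented example header (NOT taken from the thread):
# ">lcl|XX_000000.1_cds_1 [locus_tag=ABC_0001] [protein=hypothetical protein]"
# would become ">ABC_0001".
```

Headers without a locus_tag are passed through unchanged, so the function can be applied to every line that starts with `>`.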
I think I figured this out by combining two pieces of code.
Thanks for all the help!
Yanfang