Question

How to edit fasta header with underscores

0

5.7 years ago

imda ▴ 10

Hi everyone! I want to remove part of my FASTA headers. Could somebody help me, please?

I have this:

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

and I just want this part

 >CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

or

 >CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

In my same fasta file, I have other sequences which are not in the same format as the sequence above:

>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

But in general, I just want the last part:

>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

This is because the program that I am using adds the name of the species to the ID.

fasta • 2.1k views

ADD COMMENT • link 5.7 years ago by imda ▴ 10

1

I appreciate your answers. Your scripts worked well for some kinds of sequences but not for all, because the headers of my sequences are not uniform. I have thirteen kinds of sequences (from different species = different headers). I want to use the cleaned headers to retrieve the matching CDS from another FASTA file for selection analysis, so the headers need to match the headers in my CDS FASTA file. For some reason, a previous analysis adds the name of the species to the original sequences headers.

These are the thirteen kinds of headers that I have. For each one, I show the header that I need:

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTR

Required

>cvCM334_CA01g24260
--------------------------------------

input

>Capsicum_annuum_glabriusculum_Capang04g001871
SLSSSVEPIPIKKPCFNNGMSRVIWTEKEVERMKTTENLQYVVIGKFMDG
QILMNYESKFDTNVKRECQIGVLKNRHILMRFNSEEDFINITLKPSYYIL
SKDGYSYMMRTIIYDTKFNVKEVTTLAMAWISFLDLQPTFFVKESIFSIA
LDIEKP

Required

>Capang04g001871
---------------------------------------

input

>Datura_stramonium_Teo1_Datura_stramonium_Teo12749-RA gene=Datura_stramonium_Teo12749 name=Datura_stramonium_Teo12749 seq_id=opera_scaffold_353_pilon_pilon type=cds
MSPPPPETSTEDGNTQFPPLPTTQTQKTHQTQPPIDYGKLFTNSTTQTKPQIDPIPMKPV

Required

>Datura_stramonium_Teo12749-RA 
--------------------------------------------

input

>Datura_stramonium_Tic23_Datura31638-RA gene=Datura_stramonium30201 name=Datura_stramonium30201 seq_id=opera_scaffold_7375_pilon type=cds
MTRINVIENIQHAIVRKFSHDWPSLEELRALIPKQYDYSRNKHVLNRFKLMEDFSNIMSK
SSYHMHPLIYDAKFRTNEETTQAMEQ

Required

>Datura31638-RA
-------------------------------------------------

input

>Nicotiana_attenuata_NIATv7_g62846.t1   unknown
MEVGQSSFNPKPLPQIASNPNPIQNYAKLLQPQAFNAPMHVNSINLKPVELLHGEPMVRW
KKSEVKKSIIQQGFHLAVLGKFSYGKPVIQELRKAIPIQCELKGSCLVGLIEDSHVLIKL
SFMEDYIHLLSKPAFYLKAQGEF

Required

>NIATv7_g62846.t1 
------------------------------------------

input

>Nicotiana_sylvestris_mRNA_25148_cds mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
MNQIERLEFAVVGKFTYDWSDLEELRKIIPQQCGVKGGCQIGLFRSKHILIRLSLQEDFVNLVSKGAFYIT

Required

>mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
------------------------------------------------

input

>Nicotiana_tabacum_Nitab4.5_0003269g0070.1
MATMASGQLPANTRTPPQPPLNITQPCTTTINVPKTMDYANAVKPTTSTSTMQDRAAVVD
PIPPRQAQFFQGQPTCGIKADCNIGYLRDR

Required

>Nitab4.5_0003269g0070.1
---------------------------------------------------

input

>Nicotiana_tomentosiformis_mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc knuckle _CCHC-type_ family protein';id=Solyc03g071760.1.1_evalue=3e-30_annot$
MATNASPQPLVAGELIQNNVNPNPNPTLQTPYAATLKQQPTIQNLPISKLKPVEFVHGEPTLK

Required

>mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc
------------------------------------------------

input

>Petunia_inflata_Peinf101Ctg13805532g00002.1 Unknown protein
MKYDVWFDPLEETSIVVTWISFPGILPEFFVQETAIRKPLQFDIAPKSKTRPGGAKVKVEMDLLVNHPHH

Required

>Peinf101Ctg13805532g00002.1
-------------------------------------------------------

input

>Solanum_lycopersicum_Solyc02g030550.1.1 LOW QUALITY_MLP-like protein 423 _AHRD V3.3 --* AT1G24020.2_
MAKIDSPQPQAEKERPEKPSHATIPNPSTCIQK

Required

>Solyc02g030550.1.1
------------------------------------------------------------

input

>Solanum_pennellii_Sopen10g018820.1 hypothetical protein
MRNQSGEVMEKWIKIRYDYVPKDCKTCMIQGHNKEQCYVIHQELYPKEKTGHKEGQTQEHR

Required

>Sopen10g018820.1
------------------------------------------------------

input

>Solanum_pimpinellifolium_Sopim01g017000.0.1
MPMYCKNYNLQGHKESECFILHPELRMEEEKVDVSEEPRGNSPIDKDKNIGNDEMNTLIK
ILKFTERDNDVLP

Required

>Sopim01g017000.0.1
-----------------------------------------

input

>Solanum_tuberosum_Sotub01g015640.1.1 - [64]
MAVTTACGSSPPEDFPPLPNRSKPGATPIPSSPQTNQYANLLKPRSLLPQITKVLPKPVNIVHE

Required

>Sotub01g015640.1.1

ADD REPLY • link updated 5.7 years ago by cpad0112 21k • written 5.7 years ago by imda ▴ 10

0

Thanks for providing the detailed examples. This is not a trivial task, because there is no clean pattern for the names you would like to keep.

For some reason, a previous analysis adds the name of the species to the original sequences headers.

So you have a file with the "original sequence headers"? What do they look like there?

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Hi! I have a .fasta file with proteins from every species. They look like this:

>CA01g00010 Detected protein of unknown function
FRRNLELVRADRPNAFSN...
>CA01g00020 PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]
MLTSSTERFQKIQKGAPAEYQKYLV...

The program that I used to detect orthologues (OrthoFinder) also outputs all the protein sequences that belong to each orthogroup or gene family. I want to run some analyses with HyPhy, but that program requires CDS sequences. I also have all the CDS for each species, so I need to use the headers of the sequences in each gene family (from OrthoFinder) to retrieve the corresponding CDS.
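The lookup described here (take a list of cleaned IDs and pull the matching records out of a CDS FASTA file) could be sketched in awk roughly as follows. This is only a sketch: the function name `extract_by_id` and the file names in the usage comment are hypothetical, and it assumes that the first word of each CDS header is exactly the ID in the list.

```shell
# Print only the FASTA records whose ID (first word of the header) is listed
# in an ID file, one ID per line. extract_by_id is a hypothetical helper name.
# Usage: extract_by_id ids.txt < cds.fasta > family_cds.fasta
extract_by_id() {
  awk -v idfile="$1" '
    BEGIN {
      # Load the wanted IDs; a leading ">" and any description are ignored.
      while ((getline line < idfile) > 0) {
        sub(/^>/, "", line)
        split(line, f, " ")
        if (f[1] != "") want[f[1]] = 1
      }
    }
    /^>/ { keep = (substr($1, 2) in want) }   # decide once per record
    keep                                      # print header and sequence lines
  '
}
```

Because the decision is made on each header line and then applied to the sequence lines that follow, wrapped (multi-line) sequences are handled without linearizing first. If seqkit is installed, `seqkit grep -f ids.txt cds.fasta` should do the same job.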

ADD REPLY • link updated 5.7 years ago by finswimmer 16k • written 5.7 years ago by imda ▴ 10

0

You would need seqkit to linearize your FASTA file.

input:

$ cat test.fa                                                        
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

output:

$ seqkit seq -w0 test.fa | sed '/^>/ s/^>\w\+_/>/1'  |sed 's/\s.*//g'

>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIFIRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTKTFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKVLVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAKNQCRNL
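
If seqkit is not available, the linearization step (what `seqkit seq -w0` does above) can be approximated with plain awk. A minimal sketch, assuming a well-formed FASTA file on standard input; the function name `linearize_fasta` is made up for illustration:

```shell
# Join wrapped sequence lines so that each record becomes one header line
# followed by one sequence line. linearize_fasta is a hypothetical helper name.
# Usage: linearize_fasta < test.fa > test.linear.fa
linearize_fasta() {
  awk '
    /^>/ {
      if (seq != "") print seq   # flush the sequence of the previous record
      print                      # print the header unchanged
      seq = ""
      next
    }
    { seq = seq $0 }             # append one wrapped sequence line
    END { if (seq != "") print seq }
  '
}
```

Its output can be piped into the same sed commands in place of the seqkit call.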

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Dear cpad0112, could you help me with the questions that I pointed out below? Thank you.

ADD REPLY • link 5.7 years ago by imda ▴ 10

score 2 · Accepted Answer · 2019-03-22

2

5.7 years ago

finswimmer 16k

Try this:

$ awk -v FS=" " '/^>/ {n=split($1, id, "_"); $0=">"id[n]}1' input.fasta > output.fasta
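
The one-liner above keeps only the text after the last underscore in the first word of each header, so it cannot keep IDs that contain underscores themselves (for example `cvCM334_CA01g24260` or `Nitab4.5_0003269g0070.1`). For non-uniform headers like the thirteen listed in the comments, one alternative is to strip an explicit list of species prefixes instead. The following is a sketch under that assumption, not a tested solution for the full data set: the function name `strip_species_prefix` is hypothetical, and the prefix list has to be written by hand for the species in the file.

```shell
# Remove a known species prefix from the ID of each FASTA header and drop the
# description. strip_species_prefix is a hypothetical helper name.
# $1 is a space-separated list of prefixes; put longer prefixes first so that
# "Capsicum_annuum_glabriusculum_" is tried before "Capsicum_annuum_".
# Usage: strip_species_prefix "Capsicum_annuum_glabriusculum_ Capsicum_annuum_" < input.fasta
strip_species_prefix() {
  awk -v prefixes="$1" '
    BEGIN { n = split(prefixes, p, " ") }
    /^>/ {
      id = substr($1, 2)                       # first word without the leading ">"
      for (i = 1; i <= n; i++) {
        if (index(id, p[i]) == 1) {            # prefix found at the very start
          id = substr(id, length(p[i]) + 1)    # keep what follows the prefix
          break
        }
      }
      print ">" id
      next
    }
    { print }                                  # sequence lines pass through
  '
}
```

Headers whose ID starts with none of the listed prefixes are left with their ID unchanged, which makes it easy to spot species that still need an entry in the list. Note that this sketch always discards the description after the first space.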

ADD COMMENT • link 5.7 years ago by finswimmer 16k