5.8 years ago
fec2
Hi, I have a multi-FASTA file and I need to delete part of the header of every record. For example:
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA
I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".
I tried
cut -d '-' -f 1 your_file.fasta > new_file.fasta
and
awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta
but this is an alignment file, and both commands also cut the sequence lines at the "-" gap characters, which of course I don't want.
Thanks for your help!
Best regards,
Felix
Try the solutions in this thread (modify as needed): A: Fasta header trimming
There are multiple other threads about FASTA header manipulation. Please use Google to search Biostars for them.
Thanks. I am trying to use the "cut" command. However, if I use cut -d '-' -f1 your_file.fasta > new_file.fasta, it also removes the "-" in my sequences. Is there an option to make cut apply only to the FASTA headers?
Apologies. Did not realize that you have "-" elsewhere in your sequences.
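As far as I know, cut has no option to act only on certain lines, so one possible workaround is to let awk edit only the lines that start with ">" and print everything else unchanged. The sketch below assumes that the first "-" in each header always marks the start of the part to drop (a species or strain name that itself contains "-" would be truncated too). The function name trim_fasta_headers and the file names are placeholders, not something from the linked thread.

```shell
# Hypothetical helper: trim each FASTA header at its first "-".
# Only lines beginning with ">" are modified; sequence lines,
# including alignment gaps such as "------", are printed as-is.
trim_fasta_headers() {
    awk '/^>/ { sub(/-.*/, "") } { print }' "$@"
}

# Example usage (placeholder file names):
# trim_fasta_headers my_file.fasta > new_file.fasta
```

With the example records above, the headers should become ">Viridibacillus_arenosi_FSL_R5_0213" and ">Viridibacillus_arvi_DSM16317", while the sequence lines keep their gaps. An equivalent sed one-liner would be sed '/^>/ s/-.*//' my_file.fasta > new_file.fasta.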