How to rename headers in FASTA files, keeping part of the header and adding a name?
7.8 years ago
mirza ▴ 180

Hi,

I have several FASTA files. To simplify downstream analysis, I want to keep part of each header and add a name. The IDs in the files are not consecutive, so simply renumbering them in series with awk won't help. Some of my FASTA headers look like this (Augustus output file):

>g1134t1 geneg1134

I want to keep the header and just add the species_genus name after >

or better like this

>Species_genus gene1134

Similarly, for a file with headers like this:

>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I want to keep >AG1IA_00006

P.S. My OS is Ubuntu 16.04.

P.P.S. I couldn't find a suitable command in other similar posts. I also asked there but didn't get any help. It's a bit urgent.

Thanks in advance.

fasta • 5.3k views

On Ubuntu you can use the sed command to remove everything from the first space to the end of each line.

sed -e 's/ .*//g' test.fa
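
As a small check of what that command does, here is a sketch on a two-record test file. The second record and both sequences are made up for illustration, and the `/^>/` address is an addition of mine (not part of the command above) that restricts the substitution to header lines, in case a sequence line ever contains a space.

```shell
# Build a tiny example file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/test.fa" <<'EOF'
>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]
MKTLLVA
>AG1IA_00007 contig1:5000:6200:- [translate_table: standard]
MSSPQRT
EOF

# On header lines only (/^>/), delete from the first space to the end of the line.
sed -e '/^>/ s/ .*//' "$workdir/test.fa" > "$workdir/test.short.fa"

cat "$workdir/test.short.fa"
```

The result is written to a new file, so the original `test.fa` stays untouched.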

For future reference you can use this book to learn basic Unix and Perl.

http://korflab.ucdavis.edu/Unix_and_Perl/current.pdf


@Sej

Thank you very much for the document and the answer. Let me try the command.


"I want to keep the header and just add the species_genus name after > or better like this Species_genus gene1134"

That is not necessarily a good idea: a lot of tools need a unique sequence identifier. By the way, where do you get the species name from?


Well, we sequenced and assembled a few genomes, so for ease of identification I want to add the respective species_genus name. Right now I want to name the sequences this way for OrthoFinder and related analyses. It will make the orthologs/paralogs easier to visualize. I am keeping the original files for other analyses and tools.


I would try something like sed -e 's/>/>species_name_/g'. The > character is not supposed to occur anywhere else in a FASTA file, so that way you get both the species name and a unique ID.

>blah blubb
>species_name_blah blubb
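
A sketch of the same idea on the Augustus-style headers from the question. The `species` variable, the file names, the second record, and the sequences are placeholders for illustration. The pattern is anchored with `^` here, which is a slightly more defensive variant of the command above: it only touches a `>` at the start of a line.

```shell
# Build a tiny example file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/in.fa" <<'EOF'
>g1134t1 geneg1134
MKTLLVA
>g1135t1 geneg1135
MSSPQRT
EOF

# Placeholder label; it must not contain "/" or "&", which are special to sed.
species="Species_genus"

# Insert the label right after the leading ">" so the original ID stays unique.
sed -e "s/^>/>${species}_/" "$workdir/in.fa" > "$workdir/prefixed.fa"

cat "$workdir/prefixed.fa"
```

Double quotes around the sed expression let the shell expand `${species}`, so the same command could be reused per genome by changing only the variable and the file names.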

Thanks Michael. I'll try tomorrow and let you know.


@Michael

Hi, it worked, thank you. But what if I want to keep only one of the two terms? For

g1134t1 geneg1134, I want to keep

Species_genus g1134

and for

AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I just want to keep >AG1IA_00006

I did search for sed, including in the PDF that Sej sent above. But all I could find was that the 's' part of the sed command puts sed in 'substitute' mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (between the second and third forward slashes). I couldn't find an option to delete some parts selectively. I am a newbie and would be grateful for your help. Thanks.
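
For what it's worth, substitute mode can delete parts selectively: whatever the pattern matches but the replacement leaves out is dropped, and a capture group `( ... )` with `\1` keeps the piece you want. Below is a sketch under a few assumptions: the headers look exactly like the two examples above, `Species_genus` is a placeholder label, the sequences are made up, and the label is joined to the ID with an underscore rather than a space so that each identifier stays unique (the concern raised earlier in the thread).

```shell
# Build two tiny example files in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/augustus.fa" <<'EOF'
>g1134t1 geneg1134
MKTLLVA
EOF
cat > "$workdir/other.fa" <<'EOF'
>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]
MSSPQRT
EOF

# Case 1: capture the gene part (g + digits), drop the transcript suffix
# (t + digits) and the rest of the header, then add the label in front.
sed -E -e 's/^>(g[0-9]+)t[0-9]+( .*)?$/>Species_genus_\1/' \
    "$workdir/augustus.fa" > "$workdir/augustus.renamed.fa"

# Case 2: on header lines, delete everything from the first space onward.
sed -e '/^>/ s/ .*//' "$workdir/other.fa" > "$workdir/other.renamed.fa"

cat "$workdir/augustus.renamed.fa" "$workdir/other.renamed.fa"
```

One caveat with case 1: if a gene has several transcripts (t1, t2, ...), dropping the suffix would give them the same ID, so it may be safer to keep it. On older sed versions `-r` can be used in place of `-E`.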

7.8 years ago
mirza ▴ 180

I am writing up my answer in the hope that it helps newbies like me. I finally used the FASTA manipulation tools in Galaxy: I converted my files to tabular format with the FASTA-to-Tabular function, opened them in Excel, made the necessary changes, and converted them back to FASTA with the Tabular-to-FASTA function.
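
For anyone who prefers the command line, the same round trip can be approximated with awk. This is only a sketch of the idea, not what the Galaxy tools do internally. The file names, the `Species_genus` label, and the example records are placeholders, and the edit in step 2 stands in for the manual changes made in Excel.

```shell
# Build a tiny example file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/in.fa" <<'EOF'
>g1134t1 geneg1134
MKTLLVA
GHHWQ
>g1135t1 geneg1135
MSSPQRT
EOF

# 1. FASTA -> tab: one record per line as header<TAB>sequence
#    (wrapped sequence lines are joined).
awk '/^>/ { if (NR > 1) print id "\t" seq; id = substr($0, 2); seq = ""; next }
     { seq = seq $0 }
     END { if (NR > 0) print id "\t" seq }' "$workdir/in.fa" > "$workdir/in.tsv"

# 2. Edit the header column: drop everything after the first space,
#    then add the label in front.
awk -F'\t' -v OFS='\t' -v species="Species_genus" \
    '{ sub(/ .*/, "", $1); $1 = species "_" $1; print }' \
    "$workdir/in.tsv" > "$workdir/edited.tsv"

# 3. Tab -> FASTA: header line, then the sequence on a single line.
awk -F'\t' '{ print ">" $1; print $2 }' "$workdir/edited.tsv" > "$workdir/out.fa"

cat "$workdir/out.fa"
```

Note that the sequences come back unwrapped (one line per record), which most tools accept.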
