Question

How to rename headers in FASTA files, keeping part of each header and adding a name?

7.8 years ago

mirza ▴ 180

Hi,

I have several FASTA files. I want to keep part of each header and add a name to simplify downstream analysis. The IDs in the files are not consecutive, so simply renumbering them in series with awk won't help. Some of my FASTA headers look like this (Augustus output file):

>g1134t1 geneg1134

I want to keep the header and just add the species_genus name after >

or better like this

>Species_genus gene1134

Similarly, for file with headers like this,

>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I want to keep >AG1IA_00006

P.S. My OS is Ubuntu 16.04.

P.P.S. I couldn't find a suitable command in other similar posts; I also asked there but didn't get any help. It's a bit urgent.

Thanks in advance.

fasta • 5.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 7.8 years ago by mirza ▴ 180

On Ubuntu you can use the sed command to remove everything from the first space onward.

sed -e 's/ .*//g' test.fa

ADD REPLY • link 7.8 years ago by Sej Modha 5.3k
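
A small variant of the command above, as a sketch: restricting the substitution to header lines (those starting with >) guards against touching sequence lines by accident. The function name strip_description is made up for illustration.

```shell
# strip_description (hypothetical helper): keep only the ID, i.e. the text
# before the first space, on header lines. Reads FASTA on stdin and writes
# FASTA on stdout; sequence lines pass through unchanged.
strip_description() {
    sed -e '/^>/ s/ .*//'
}

# Example with one of the headers from the question.
printf '>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]\nMKT\n' | strip_description
# prints:
# >AG1IA_00006
# MKT
```

To run it on a file, redirect the input and write to a new file, for example strip_description < test.fa > test.renamed.fa, so the original stays untouched.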

For future reference you can use this book to learn basic Unix and Perl.

http://korflab.ucdavis.edu/Unix_and_Perl/current.pdf

ADD REPLY • link 7.8 years ago by Sej Modha 5.3k

@Sej

Thank you very much for the document and the answer. Let me try the command.

ADD REPLY • link 7.8 years ago by mirza ▴ 180

I want to keep the header and just add the species_genus name after > or better like this Species_genus gene1134

That is not necessarily a good idea: a lot of tools need a unique sequence identifier. By the way, where do you get the species name from?

ADD REPLY • link 7.8 years ago by Michael 55k

Well, we sequenced and assembled a few genomes, so for ease of identification I want to add the respective species_genus name. Right now I want to name the sequences this way for OrthoFinder and related analyses. It will be easier to visualize the orthologs/paralogs. I am keeping the original files for other analyses and tools.

ADD REPLY • link 7.8 years ago by mirza ▴ 180

I would try something like sed -e 's/>/>species_name_/g'. The > character is not supposed to occur anywhere else in a FASTA file, so this way you get both the species name and a unique ID. For example, the first header below becomes the second:

>blah blubb
>species_name_blah blubb

ADD REPLY • link 7.8 years ago by Michael 55k
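
The same idea can be wrapped so that the species name is a parameter, which is handy with several genomes. This is only a sketch: add_prefix is a made-up name, and the prefix is assumed to contain no /, & or backslash, since those are special in a sed replacement. Anchoring the pattern with ^ means only a > at the start of a line is touched.

```shell
# add_prefix PREFIX (hypothetical helper): insert "PREFIX_" right after the
# leading '>' of every header line. Reads FASTA on stdin, writes to stdout.
# Assumes PREFIX has no '/', '&' or backslash characters.
add_prefix() {
    sed -e "s/^>/>${1}_/"
}

# Example with the Augustus-style header from the question.
printf '>g1134t1 geneg1134\nATG\n' | add_prefix Species_genus
# prints:
# >Species_genus_g1134t1 geneg1134
# ATG
```

For several files, something like for f in *.fa; do add_prefix "${f%.fa}" < "$f" > "${f%.fa}.renamed.fa"; done would use each file name as the prefix, assuming the files are named after the species.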

Thanks Michael. I'll try tomorrow and let you know.

ADD REPLY • link 7.8 years ago by mirza ▴ 180

@Michael

Hi, it did work, thank you. But what if I want to keep only one of the two terms here? For

g1134t1 geneg1134, I want to keep

Species_genus g1134

and for

AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I just want to keep >AG1IA_00006

I did search for sed, including in the PDF sent above by Sej. But all I could find was that the ‘s’ part of the sed command puts sed in ‘substitute’ mode, where you specify one pattern (between the first two forward slashes) to be replaced by another (between the second and third forward slashes). I couldn't find an option to delete some parts selectively. I am a newbie and would be grateful if you could help. Thanks.

ADD REPLY • link 7.8 years ago by mirza ▴ 180
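
One way to delete parts selectively with sed's substitute mode is to capture the piece you want to keep in \( ... \) and refer to it as \1 in the replacement; everything else the pattern matches is dropped. The sketch below assumes Augustus-style headers of the form >g<number>t<number> followed by anything, and rename_augustus is a made-up name. It joins the prefix and the gene ID with an underscore rather than a space so that the first word of the header stays unique, which is the concern Michael raised above.

```shell
# rename_augustus PREFIX (hypothetical helper): turn headers such as
# ">g1134t1 geneg1134" into ">PREFIX_g1134".
#   \(g[0-9][0-9]*\)  captures the gene ID (g plus digits) as \1
#   t[0-9][0-9]*      matches the transcript suffix, which is discarded
#   .*                matches the rest of the header, also discarded
# Lines that do not match the pattern are left unchanged.
rename_augustus() {
    sed -e "s/^>\(g[0-9][0-9]*\)t[0-9][0-9]*.*/>${1}_\1/"
}

printf '>g1134t1 geneg1134\nATG\n' | rename_augustus Species_genus
# prints:
# >Species_genus_g1134
# ATG
```

Be aware that if a gene has more than one transcript (t1, t2, ...), dropping the suffix produces duplicate IDs. For the second kind of header, the command 's/ .*//' given earlier in the thread already keeps just >AG1IA_00006.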

score 0 · Answer 1 · 2017-02-06

7.8 years ago

mirza ▴ 180

I am writing my answer in the hope that it will help newbies like me. I finally used the FASTA manipulation tools in Galaxy: I used the FASTA-to-Tabular function to convert my files to tabular format, opened them in Excel, made the necessary changes, and converted them back to FASTA with the Tabular-to-FASTA function!

ADD COMMENT • link 7.8 years ago by mirza ▴ 180
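
For anyone who prefers the command line, the same round trip (FASTA to table, edit the header column, table back to FASTA) can be sketched with awk. This is not how the Galaxy tools are implemented; fasta_to_tab and tab_to_fasta are made-up names, and the sketch assumes headers contain no tab characters. Sequences are joined onto a single line and are not re-wrapped on the way back.

```shell
# fasta_to_tab (hypothetical helper): one record per line, as
# "header<TAB>sequence", with the leading '>' removed and multi-line
# sequences joined together.
fasta_to_tab() {
    awk '
        /^>/ { if (seen) print hdr "\t" seq
               hdr = substr($0, 2); seq = ""; seen = 1; next }
             { seq = seq $0 }
        END  { if (seen) print hdr "\t" seq }
    '
}

# tab_to_fasta (hypothetical helper): the reverse, writing each sequence
# on a single line.
tab_to_fasta() {
    awk -F '\t' '{ print ">" $1; print $2 }'
}

# Example: edit only the header column (here: drop everything after the
# first space) between the two conversions.
printf '>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]\nMKT\nLLV\n' \
    | fasta_to_tab \
    | awk -F '\t' -v OFS='\t' '{ sub(/ .*/, "", $1); print }' \
    | tab_to_fasta
# prints:
# >AG1IA_00006
# MKTLLV
```

The middle awk step is where any header edit goes; because it only touches the first column, the sequence itself cannot be changed by mistake.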