Question

Renaming Entries In A Fasta File

13

12.6 years ago

thiago84naka ▴ 130

Hello,

I have never made a script in my life.

The problem is how to change the FASTA names in an input file like this:

>Glyma04g14800|Glyma04g14800.3
MMLETVAAVPGMVAGMLLHCKSLRRFEHSGGWIKALLEEAENERMHLMTFMEVAKPKWYE
>Glyma05g24460|Glyma05g24460.1
SNVSIDLTKHHVPKNFLDKVAYRTVKLLRIPTDLFFKRRYGCRAMMLETVAAVPGMVGGM

so that the output file looks like this (original names replaced by numbers in ascending order, starting with 1):

>1
MMLETVAAVPGMVAGMLLHCKSLRRFEHSGGWIKALLEEAENERMHLMTFMEVAKPKWYE
>2
SNVSIDLTKHHVPKNFLDKVAYRTVKLLRIPTDLFFKRRYGCRAMMLETVAAVPGMVGGM

I'd be very grateful for any help. Regards, Naka

fasta • 56k views

ADD COMMENT • link updated 9.0 years ago by noirot.celine ▴ 50 • written 12.6 years ago by thiago84naka ▴ 130

8

12.6 years ago

Istvan Albert 102k

The Fastx Renamer tool can do this as well: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_renamer_usage

$ more test.fa 
>GS6SIDE04J1T1R xy=4004_1485
CATAGTAGTGAGAGTTGATCATGGCTCAGCCATCTCATCCAGCAGCCGCGGTAATCACTACTAT
>GS6SIDE04J0352 xy=3996_712
ACGAGTGCGTAGAGTTGATCATGGCTCAGCAGCCTCCTCGTGCCAGCAGCCGCGGTAATACGCACTCG
>GS6SIDE04JM7EM xy=3837_2988
AGCACTGTAGAGAGTTGATCCTGGCTCAGGGATAGGCCAGCAGCCGCGGTAATCTACAGTGC

$ ~/Downloads/bin/fastx_renamer -i test.fa -n COUNT 
>1
CATAGTAGTGAGAGTTGATCATGGCTCAGCCATCTCATCCAGCAGCCGCGGTAATCACTACTAT
>2
ACGAGTGCGTAGAGTTGATCATGGCTCAGCAGCCTCCTCGTGCCAGCAGCCGCGGTAATACGCACTCG
>3
AGCACTGTAGAGAGTTGATCCTGGCTCAGGGATAGGCCAGCAGCCGCGGTAATCTACAGTGC

ADD COMMENT • link 12.6 years ago by Istvan Albert 102k

6

12.6 years ago

David Langenberger 11k

Try this:

cat yourFile.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > yourFile_new.fa

ADD COMMENT • link 12.6 years ago by David Langenberger 11k

9

Instead of the useless use of cat, try:

perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' yourFile.fa > yourFile_new.fa

ADD REPLY • link 12.6 years ago by Matt Shirley 10k

5

9.0 years ago

noirot.celine ▴ 50

Here is a generic way to convert NCBI headers to simple headers, from

>gi|1002620271|ref|NC_029525.1| Coturnix japonica isolate 7356 chromosome 10, Coturnix japonica 2.0, whole genome shotgun sequence
TACTCCCCAAGAA

to

>NC_029525.1
TACTCCCCAAGAA

With sed:

sed 's/^[^ ]*[|]\([^|]*\)[|] .*$/>\1/' Coturnix_japonica.fasta > Coturnix_japonica_rename.fasta

ADD COMMENT • link 9.0 years ago by noirot.celine ▴ 50
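As a quick check of the sed expression above, here is a minimal sketch that runs it on the example record. The file names ncbi_demo.fasta and ncbi_demo_renamed.fasta are placeholders of mine, not from the original post.

```shell
# Build a one-record test file using the header from the example above.
cat > ncbi_demo.fasta <<'EOF'
>gi|1002620271|ref|NC_029525.1| Coturnix japonica isolate 7356 chromosome 10, Coturnix japonica 2.0, whole genome shotgun sequence
TACTCCCCAAGAA
EOF

# Keep only the field between the last two "|" that precede the first space.
# Sequence lines contain no "|", so they pass through unchanged.
sed 's/^[^ ]*[|]\([^|]*\)[|] .*$/>\1/' ncbi_demo.fasta > ncbi_demo_renamed.fasta

cat ncbi_demo_renamed.fasta
```

Note that the pattern expects a "|" followed by a space and a description; headers in another layout will most likely be left untouched rather than renamed.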

0

@noirot.celine Hi, I have a similar problem and your command above didn't work for me (I am really new to the Linux environment). I have several different FASTA files. Some of my FASTA headers look like this (Augustus output file):

>g1134t1 geneg1134

I want to keep the header and just add the species_genus name after >, or better, have it like this:

>Species_genus gene1134

Similarly, for files with headers like this:

>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I want to keep only >AG1IA_00006. Since the IDs in the files are not consecutive, simply renumbering them in series won't help.

P.S. My OS is Ubuntu 16.04.

ADD REPLY • link 8.2 years ago by mirza ▴ 180
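For the two cases in the reply above, a minimal sketch might look like the following. It assumes each header really starts with ">" in the file, and the file names and the label "Species_genus" are placeholders.

```shell
# Case 1: hypothetical Augustus-style input.
cat > augustus_demo.fa <<'EOF'
>g1134t1 geneg1134
MKT
EOF

# Insert a label right after ">" and keep the rest of the header.
sed 's/^>/>Species_genus /' augustus_demo.fa > augustus_demo_labelled.fa

# Case 2: keep only the first whitespace-delimited word of each header.
cat > contig_demo.fa <<'EOF'
>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]
MKT
EOF

# $1 is the first field of the line, i.e. ">AG1IA_00006".
awk '/^>/{print $1; next}{print}' contig_demo.fa > contig_demo_short.fa
```

Neither command renumbers anything, so non-consecutive IDs stay exactly as they are.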

1

12.6 years ago

AGS ▴ 250

I'd use faSimplify

ADD COMMENT • link 12.6 years ago by AGS ▴ 250

Ram · Accepted Answer · 2012-09-19

36

12.6 years ago

Pierre Lindenbaum 166k

awk '/^>/{print ">" ++i; next}{print}' < file.fasta

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 12.6 years ago by Pierre Lindenbaum 166k
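For readers new to awk, here is the same one-liner spread over several lines with comments, and with the output redirected to a new file. This is only an annotated sketch; the file names awk_demo.fasta and awk_demo_renamed.fasta are made up for the example.

```shell
# Small test input using the headers from the question (sequences shortened).
cat > awk_demo.fasta <<'EOF'
>Glyma04g14800|Glyma04g14800.3
MMLETVAAVPGMVAGM
>Glyma05g24460|Glyma05g24460.1
SNVSIDLTKHHVPKNF
EOF

awk '
  /^>/ {              # header lines start with ">"
    print ">" ++i     # increment the counter i, then print ">" followed by it
    next              # skip the rule below for this line
  }
  { print }           # every other line (sequence) is printed unchanged
' awk_demo.fasta > awk_demo_renamed.fasta

cat awk_demo_renamed.fasta
```

The counter i starts out empty (treated as 0), so the first header becomes >1.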

0

Sir, can the awk command above be modified so that, instead of printing

>1
>2
....

it prints

>chromosome1
>chromosome2
...

Where and how do I put the text "chromosome" for that purpose?

Please help me out.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Raghav ▴ 100

2

@Raghav: If you wanted to add chromosome in the header with the counter, simply add it in the ">" portion of the one-liner.

awk '/^>/{print ">chromosome" ++i; next}{print}' < file.fasta

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 10.7 years ago by st.ph.n ★ 2.7k

0

How can we add "chr" just after >? I don't want to change anything else. For example:

>2L should become >chr2L

ADD REPLY • link 8.8 years ago by saswati.s2010 • 0
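One possible way to do this is a plain sed substitution on the leading ">". A minimal sketch, where the file names and the second record (2R) are invented for the example:

```shell
cat > chr_demo.fa <<'EOF'
>2L
ACGT
>2R
TTGA
EOF

# Insert "chr" right after the leading ">"; everything else is left as is.
sed 's/^>/>chr/' chr_demo.fa > chr_demo_renamed.fa

cat chr_demo_renamed.fa
```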

0

Hello there, (I already solved this)

I am trying to understand your one-liner so that I can modify it. Is this what the script is saying?

For every line ('/) where you find a > (^>/), print the > and then add (+) a counter (+), then next, print what follows.

In my case the names are like:

> M02137:143:000000000-APU54:1:1101:21985:13014 1:N:0:10
> M02137:143:000000000-APU54:1:1112:18691:9995 1:N:0:10

etc.

I want to keep only the part that differs between headers, with something like:

awk '/^>/{print ">" remove "M02137:143:000000000-APU54:1:"; next}{print}' < file.fasta

And can I do this over SSH? (I don't think I have awk installed.)

Many thanks in advance for your time,

Caro

PS: I am new to HTS/NGS and don't know much about programming

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 8.7 years ago by cdiaza • 0
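awk has no "remove" keyword, so the attempt above will not do what is intended. One way to drop a fixed prefix is sed; the following is a sketch that assumes every header shares exactly the prefix shown, with made-up file names and toy sequences.

```shell
cat > reads_demo.fasta <<'EOF'
> M02137:143:000000000-APU54:1:1101:21985:13014 1:N:0:10
ACGT
> M02137:143:000000000-APU54:1:1112:18691:9995 1:N:0:10
TTGA
EOF

# Remove the shared prefix (and any spaces after ">") from header lines only.
sed 's/^> *M02137:143:000000000-APU54:1:/>/' reads_demo.fasta > reads_demo_short.fasta

cat reads_demo_short.fasta
```

As for SSH: the command runs on whichever machine the shell is on, and sed and awk are standard Unix tools, so they are very likely already installed on a Linux server.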

0

Doesn't this fail if a read spans multiple lines?

ADD REPLY • link 8.5 years ago by Picasa ▴ 680
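As far as I can tell, wrapped sequences are not a problem, because the renaming rule only fires on lines that start with ">" and all other lines are printed as they are. A small sketch to check this, with invented records and file names:

```shell
# Two records whose sequences are wrapped over two lines each.
cat > wrapped_demo.fasta <<'EOF'
>seqA some description
ACGTACGT
ACGTAC
>seqB
TTGACCA
TTG
EOF

awk '/^>/{print ">" ++i; next}{print}' wrapped_demo.fasta > wrapped_demo_renamed.fasta

cat wrapped_demo_renamed.fasta
```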

0

@Pierre Lindenbaum

Hi,

How can I modify this command to add the genus_species name after > in every entry and yet keep most of the information in the header?

That is, my entries look like this:

>lcl|HF546977.1_cds_CCO27433.1_1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=CCO27433.1] [location.......]

and I want the entry names to look like this:

>genus_species HF546977.1_cds_CCO27433.1_1 [gene=cox1] [protein=cytochrome c oxidase subunit 1]

By using

awk '/^>/{print ">genus_species gene." ++i; next}{print}' < file.fa

I got

>genus_species gene.1

and so on.

Also, how can I specify an output file on the command line?

Having the genus_species name at the beginning is required because I'll be comparing different species, and I don't want to lose the IDs and protein names, for ease of downstream analysis.

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 8.3 years ago by mirza ▴ 180
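A sketch of one way to keep the header and only swap the "lcl|" prefix for a label. It assumes every header starts with ">lcl|"; the label "genus_species" and the file names are placeholders, and the output file is simply given after ">" on the command line.

```shell
cat > cds_demo.fa <<'EOF'
>lcl|HF546977.1_cds_CCO27433.1_1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=CCO27433.1]
ATGACC
EOF

# Replace the leading ">lcl|" with ">genus_species " and keep the rest of the header.
sed 's/^>lcl|/>genus_species /' cds_demo.fa > cds_demo_labelled.fa

cat cds_demo_labelled.fa
```

This keeps all bracketed fields; trimming the trailing ones would need an extra substitution.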

0

Hello Pierre, thank you for your useful code. May I ask how I can modify it to keep everything else as it is and just add the sample name in front, and to do that for a batch of files?

E.g., my file looks like this:

>M03691:51:000000000-BD94Y:1:1101:14841:1381 1:N:0:1
ACTGGGTGTAAAGGGCGTGTAGGCGGAGAAGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTTGAAACTGTTTCCCTTGAGTATCGGAGAGGCAGGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGT
>M03691:51:000000000-BD94Y:1:1101:15960:1389 1:N:0:1
TACTGGGGTATCTAATCCTATTTGCTCCCCACGCTTTCGGGACTGAGCGTCAGTTATGCGCCAGATCGTCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACGATCCTCTCACACACTCTAGCTCTACGGTTTCCATGGCTTACCGAAGTTAAGCTTCGATCTTTCACCACAGACCCTTAGTGCCGCCTGCTCCCTCTTTACACCCAGT
>M03691:51:000000000-BD94Y:1:1101:15662:1415 1:N:0:1
ACTGGGTGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCCGTA

Now I want to add the sample name after > and keep everything else as it is.

I want to do this for a batch of files. Any help would be really great. Thanks, Mitra

ADD REPLY • link 7.6 years ago by Mitra • 0
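For a batch of files, a shell loop around awk is one option. The sketch below assumes the sample name can be taken from the file name (sampleA.fasta gives sampleA) and that an underscore is an acceptable separator; the directory batch_demo and the toy sequences are invented for the example.

```shell
# Two hypothetical per-sample files.
mkdir -p batch_demo/renamed
printf '>M03691:51:000000000-BD94Y:1:1101:14841:1381 1:N:0:1\nACTG\n' > batch_demo/sampleA.fasta
printf '>M03691:51:000000000-BD94Y:1:1101:15960:1389 1:N:0:1\nTACT\n' > batch_demo/sampleB.fasta

for f in batch_demo/*.fasta; do
  sample=$(basename "$f" .fasta)
  # Prefix each header with "<sample>_" (substr drops the original ">")
  # and write the result to a separate output directory.
  awk -v s="$sample" '/^>/{print ">" s "_" substr($0, 2); next}{print}' "$f" \
    > "batch_demo/renamed/$sample.fasta"
done

head -n 1 batch_demo/renamed/*.fasta
```

Writing into a separate directory keeps the original files untouched, which makes it easy to rerun the loop if the naming scheme changes.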

0

Hello Pierre, when I use awk '/^>/{print ">" ++i; next}{print}' < file.fasta, the changes are made but not saved. I want to distinguish between two numbered contig.fasta files (the entries in each are numbered 'contig 00001', 'contig00002', etc.). I want to name the entries of the first contig.fasta 'shorter_contig 00001', 'shorter_contig 00002', and those of the second contig.fasta 'longer_contig 00001', 'longer_contig 00002'. Is there a way to make the header modifications permanent? Thanks

ADD REPLY • link 6.7 years ago by Audrey • 0

1

"the changes are made but not saved."

http://wiki.bash-hackers.org/howto/redirection_tutorial

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k
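To spell out what the redirection tutorial leads to: awk only prints to the terminal unless its output is redirected into a file. A minimal sketch, assuming headers of the form >contig00001 and using made-up file names:

```shell
cat > contigs_demo.fasta <<'EOF'
>contig00001
ACGT
>contig00002
TTGA
EOF

# Write the result to a NEW file with ">"; substr($0, 2) is the header without ">".
awk '/^>/{print ">shorter_" substr($0, 2); next}{print}' contigs_demo.fasta > shorter_contigs_demo.fasta

# Once the new file looks right, optionally replace the original with it.
# Never redirect into the file being read: the shell truncates it before awk reads it.
mv shorter_contigs_demo.fasta contigs_demo.fasta

cat contigs_demo.fasta
```

The same command with ">longer_" would label the second file.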

0

Hello, I want to change the FASTA names in this input file:

M04631:312:000000000-C6V6K:1:2107:11495:1734 1:N:0:ACTGAGCG+TTATGCGA

M04631:312:000000000-C6V6K:1:2107:13059:1785 1:N:0:ACTGAGCG+TTATGCGA

into these FASTA names:

adjH001

adjH002

adjH....

adjH099

adjH100

What script should I use?

ADD REPLY • link 6.2 years ago by kari_vo3 • 0
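For zero-padded names like these, awk's printf can format the counter. A sketch, assuming the headers start with ">" in the actual file; the file names and toy sequences are invented.

```shell
cat > adj_demo.fasta <<'EOF'
>M04631:312:000000000-C6V6K:1:2107:11495:1734 1:N:0:ACTGAGCG+TTATGCGA
ACGT
>M04631:312:000000000-C6V6K:1:2107:13059:1785 1:N:0:ACTGAGCG+TTATGCGA
TTGA
EOF

# %03d pads the counter with zeros to three digits: adjH001, adjH002, ...
awk '/^>/{printf(">adjH%03d\n", ++i); next}{print}' adj_demo.fasta > adj_demo_renamed.fasta

cat adj_demo_renamed.fasta
```

With more than 999 records the width simply grows (adjH1000), so a wider format such as %04d may be preferable for large files.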

1

With respect, if you are already stuck at this most simple task, better spend some quality time on Unix and NGS basics before diving into any analysis. In the end, you as the analyst have to stand up for your analysis.

ADD REPLY • link 6.2 years ago by ATpoint 87k

0

A similar question and answer were posted above. What are you missing?

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 166k

0

My question was resolved, thank you. I needed it in this form because I have another script that only works with these FASTA names. Thanks.

ADD REPLY • link 6.2 years ago by kari_vo3 • 0