How To Rename FASTA Headers
4
4
Entering edit mode
8.0 years ago
mollysil ▴ 40

Hi,

I have very little experience with scripts. I want to change my FASTA sequence headers (I have 100's of FASTA sequences per file) from very long headers to headers with the sample name (CM1) and then ascending numbers. Basically, I want to go from this:

>M00505:63:000000000AWRE0:1:1101:11224:1105_1:N:0:NCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1101:11224:1105_2:N:0:NCTACGCT+CTAAGCCT;size=290797;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>M00505:63:000000000AWRE0:1:1107:23836:6960_1:N:0:GCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1107:23836:6960_2:N:0:GCTACGCT+CTAAGCCT;size=2;
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>M00505:63:000000000AWRE0:1:1103:16981:11028_1:N:0:GATACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1103:16981:11028_2:N:0:GATACGCT+CTAAGCCT;size=1;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT

To this:

>CM1_1
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>CM1_2
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>CM1_3
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT

I am able to do this with two separate scripts using:

sed 's/>.*/&CM1/' file.fa > output.fa  
cat output.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > output2.fa

But I would like to do it all in one step. Any ideas?
THANK YOU!! Molly

FASTA Header • 22k views
ADD COMMENT
7
Entering edit mode
8.0 years ago
venu 7.1k

You can do following

sed '/^>/d' file.fa | wc -l # say you got 100

Then

for i in {1..100}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa) | sed -e 's/^/>/' -e 's/\t/\n/' > new_file.fa

Update: Note that this assumes each sequences in the file is single line without any newline characters.

cat file.fa

>chr1
AGGATGACAGA
>chr2
GACTTTAAGAC

number of sequences in file.fa

sed '/^>/d' file.fa | wc -l
# 2

for loop to generate headers like CM1_1, CM1_2

for i in {1..2}; do echo CM1_$i; done
# CM1_1
# CM1_2

Paste these new headers to sequences whose headers are removed

for i in {1..2}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa)

# CM1_1   AGGATGACAGA
# CM1_2   GACTTTAAGAC

Finally arrange it to a standard fasta format

for i in {1..2}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa) |sed -e 's/^/>/' -e 's/\t/\n/'

# >CM1_1
# AGGATGACAGA
# >CM1_2
# GACTTTAAGAC
ADD COMMENT
0
Entering edit mode

im bit new to scripting so can you explain what does the for loop after initialization does from (1..100) i understood since the total no of lines perhaps , so what after that loop does if you can explain it would be really good....

ADD REPLY
0
Entering edit mode

Check the updated answer.

ADD REPLY
0
Entering edit mode

This was a cool approach! When I did your suggestion, it got the header name correct (finally!) but it cut huge portions of my sequences... How do I get it to not do that? (I first linearized all the fasta files and then ran your scripts...but I still get this issue of cutting sequences). Now it looks like:

CM1_1 GGGTCAGCAGGTTGGTCGTGCCACTGGTATGAACTGGCCTTGCTGATTCCTCCTTCTTGAAGAACCGTGATGCCATTAAT CM1_2 TTGGTGTTGCGGGGAAACAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCTAACGCTTGAATACATTAGCA CM1_3 TGGAATAATAAAATAGGACGTTTAATCCTATTTTGTTGGTTTCTAGGGTTGACGTA

But my sequences are supposed to be a lot longer than that. Thanks so much for your input! Any suggestion for this problem?

ADD REPLY
4
Entering edit mode
8.0 years ago

Here's a solution using SeqKit.

It provides executable binary files for Windows/Linux/Mac OS, so just download and run, no any dependencies :)

for f in *.fasta *.fa; do
    seqkit replace -p '.+' -r 'CM1_{nr}' $f > $f.rename.fa
done
ADD COMMENT
0
Entering edit mode

Can you provide me with the code for downloading SeqKit on the Linux platform?

ADD REPLY
0
Entering edit mode

well

wget https://github.com/shenwei356/seqkit/releases/download/v0.4.0/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
mkdir -p ~/bin
mv seqkit ~/bin
ADD REPLY
3
Entering edit mode
8.0 years ago

You can do that with the BBMap package:

rename.sh in=x.fasta out=y.fasta prefix=CM1
ADD COMMENT
0
Entering edit mode

I am using Linux. rename doesn't work. Is there a linux version of this?

ADD REPLY
1
Entering edit mode

It does work in linux. Could you post how you've used?

ADD REPLY
1
Entering edit mode

Its only dependency is having Java installed.

ADD REPLY
1
Entering edit mode

I think @mollysil 's confusion is that BBMap package must be installed in order for this command to work. It's true that rename.sh is not a standard Unix tool...

ADD REPLY
1
Entering edit mode

True! Unfortunate, but true :)

I'm kidding, of course. BBMap is irrelevant to at least 99.9% of Linux users.

ADD REPLY
0
Entering edit mode

Can you provide me with step by step instructions on how to get BBMap? I have java but the tar script on the bbmap website didn't work. (BBMap_35.74.tar.gz: Cannot open: No such file or directory). My apologies for my ignorance of this stuff. Thanks for all your help!!

ADD REPLY
0
Entering edit mode

If you are in the same directory as BBMap_35.74.tar.gz (which is not the latest version, by the way; I recommend using the latest version, which is 36.67), you would execute:

tar -xzf BBMap_35.74.tar.gz

It sounds like you are in a different directory. What do you get when you run "ls"?

ADD REPLY
0
Entering edit mode

Yes you are correct. I hadn't downloaded the tar.gz file into my working directory. Did that. Now your script ran just fine. After installing with that code, do I still need to download certain tools within BBMap? Because the rename script doesn't work still. If so, what is the code for that?

ADD REPLY
0
Entering edit mode

As long as Java is installed and you are using the bash shell, everything should work fine... Can you post the exact command and screen output?

ADD REPLY
0
Entering edit mode

Also, will your original rename script do both of the two tasks I want (adding the CM1 sample number AND add ascending OTU numbers, in that order)? Or will it just put the CM1 in the header? Which part of the code is for adding the ascending OTU numbers?

ADD REPLY
0
Entering edit mode

The command I posted will rename the sequences like this:

CM1_1

CM1_2

...etc. The default is to use ascending numbers. You can get more information by running rename.sh with no parameters or just looking at it in a text editor.

ADD REPLY
0
Entering edit mode

I got it all to work. Thank you so much for your assistance!!

ADD REPLY
0
Entering edit mode
2.9 years ago

I think the best way to do it, it's to to seqtk tool. like this:

seqtk rename file.fa chr_

ADD COMMENT

Login before adding your answer.

Traffic: 2551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6