Question

Help for changing fasta seq ID

1

9.2 years ago

vtefnfqp ▴ 10

Hi, everyone! I want to change the sequence IDs in a FASTA file as follows.

I have 300 sequences like this:

>CDM35446 CDM35446.1 Acyl-CoA N-acyltransferase [Penicillium roqueforti FM164].
MASSSIFPFHVGEASNER.................

I want to change the seq ID to:

>CDM35446|Penicillium_roqueforti
MASSSIFPFHVGEASNER.................

like this: ID|species_name.

I know a simple Perl script would fix this, but writing scripts is really not easy for me. I would really appreciate it if anyone could help.

perl script fasta • 2.9k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by vtefnfqp ▴ 10

0

Please do not open with a request to take the discussion off the forum. I have removed that part of your post.

ADD REPLY • link 9.2 years ago by Ram 44k

Ram · Answer 1 · 2015-11-12

5

9.2 years ago

george.ry ★ 1.2k

If the ID lines are always in the same format and contain two-word 'genus species' Latin names, as in the example:

cat yourfile.fa | sed 's/^>\([[:alnum:]]*\).*\[\([[:alpha:]]*\) \([[:alpha:]]*\).*/>\1|\2_\3/' > yournewfile.fa

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 9.2 years ago by george.ry ★ 1.2k

0

Great, although I think you should have some word boundaries in there...

ADD REPLY • link 9.2 years ago by Matt Shirley 10k
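
One possible way to tighten the pattern is sketched below. It assumes headers look like the example in the question: an ID followed by whitespace, and a square-bracketed block whose first two words are the genus and species. The function name rename_headers is hypothetical. It requires at least one character for each captured word and consumes the rest of the bracketed block, which is only one reading of the comment above.

```shell
# Hypothetical helper: rewrite FASTA headers to ">ID|Genus_species".
# Reads FASTA on stdin and writes the result to stdout.
# Lines that do not match the pattern (sequence lines, odd headers) pass through unchanged.
rename_headers() {
    sed 's/^>\([[:alnum:]_.]\{1,\}\)[[:space:]].*\[\([[:alpha:]]\{1,\}\) \([[:alpha:]]\{1,\}\)[^]]*].*$/>\1|\2_\3/'
}
```

Usage would be along the lines of: rename_headers < yourfile.fa > yournewfile.fa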

score 0 · Answer 2 · 2015-11-12

You need to find a common pattern in the header lines to do what you are asking.

For example, if you want the first word (CDM35446), which is followed by a tab or a space, and then the next two words (genus and species) taken from what is contained in the first set of square brackets [], it can be done.

But if the genus and species are not always enclosed in brackets, or the information in your FASTA headers is not laid out in consistent columns, this is a hard task to accomplish.
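
Before renaming anything, it may help to check whether every header actually follows the pattern. The sketch below lists header lines that do not contain a bracketed block starting with two alphabetic words. The function name unmatched_headers is hypothetical, and the check assumes the species always appears as "[Genus species ...]" as in the question's example.

```shell
# Hypothetical helper: print FASTA header lines that lack a
# "[Genus species ...]" block. Reads FASTA on stdin.
# Empty output means every header matches the assumed pattern.
unmatched_headers() {
    sed -n '/^>/{
        /\[[[:alpha:]]\{1,\} [[:alpha:]]\{1,\}[^]]*]/!p
    }'
}
```

For example, unmatched_headers < yourfile.fa would show the headers that need manual attention.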

Ram · Answer 3 · 2015-11-12

Dear vtefnfqp,

The one-liner above is great and probably works. Here is my commented version. I tested the script on a file and it works. Save the script below in a .pl file and run it in the same directory as your FASTA file. Change the file names or extensions in the script (for example from .txt to .fa) as needed:

#!/usr/bin/perl
use strict;
use warnings;

# Open the input file, which is in the same location as the script.
open(my $fastafile, '<', "./sampleFile.txt") or die "Cannot open ./sampleFile.txt: $!";

# Initialise an empty array that will hold each line of the file.
my @fasta_array;

# Read the file line by line, strip the trailing newline, and push each line onto the array.
while (my $line = <$fastafile>) {
    chomp($line);
    push(@fasta_array, $line);
}
close($fastafile);

# If a line starts with '>', keep the ID and the first two words inside
# a square-bracketed block (genus and species), joined by '_'.
for (my $i = 0; $i < scalar(@fasta_array); $i++) {
    if ($fasta_array[$i] =~ /^>(\S+)\s.*\[([A-Za-z]+)\s+([A-Za-z]+)[^\]]*\]/) {
        $fasta_array[$i] = ">" . $1 . "|" . $2 . "_" . $3;
    }
}

# Join the array elements with newlines and store them in a variable.
my $result = join("\n", @fasta_array) . "\n";

# Open a file handle for the output file.
open(my $resultfile, '>', "./resultFile.txt") or die "Cannot open ./resultFile.txt: $!";

# Write the result to the file.
print $resultfile $result;
close($resultfile);

I hope this is helpful,

Good luck with your research,