Make multi-fasta using perl
3
0
Entering edit mode
9.6 years ago
sun.nation ▴ 140

Header and sequence are coming in loop:

$header = "seq1";

$sequence = "GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAGCATCCTATGGCTATCTAAGTAGGAAGACGATGAGCTATGGAA";

Sequence is single line without newline.

I want output multi-fasta like this below. Sequence should have newline after each 60 bases.

>seq1
GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAG
CATCCTATGGCTATCTAAGTAGGAAGACGATGA
GCTATGGAA

>seq2
AAGTGGCCTCAGCCCATTTCCAGCATCCTATGGC
TATCTAAGTAGGAAGACGATGAGCTATGGAA

Thanks

perl multi-fasta fasta • 4.6k views
ADD COMMENT
1
Entering edit mode

This was asked before here and the solution (pre-v5.10) is:

$sequence =~ s/(.{60})/$1\n/gs;

That formats the sequence, then you just need to join the id and sequence with a newline when you print.

ADD REPLY
0
Entering edit mode

Thanks a lot SES.

It was a fantastic solution with a line code. It worked good when I added this in my script.

ADD REPLY
0
Entering edit mode

Can you please give me some hint to create loop in perl? I need that at the end of my perl code.

Thanks Ram and Dylan

ADD REPLY
0
Entering edit mode

Look at the Text::Wrap module if you need a perl specific and easy method.

ADD REPLY
0
Entering edit mode

Thanks. I will try that

ADD REPLY
0
Entering edit mode

I think the following lines could be an alternative solution.

print ">$header\n";
$seq_len = length($sequence);

for ($i = 0; $i < $seq_len; $i + 60) {
    $segment = substr($sequence, $i, 60);
    print "$segment\n";
}
ADD REPLY
0
Entering edit mode

Yes, the loop is a good basic alternative, but the RegEx is better, this being Perl.

ADD REPLY
3
Entering edit mode

The problem with the regex is that it is limited to a max sequence length of ~32k, which gives you trouble with contigs, etc. Here is a loop solution that should be faster than regex.

my $single_line_seq = "ATTGC"x1000;
my $line_width = 80;
print $_,"\n" for unpack("(A$line_width)*", $single_line_seq);
ADD REPLY
3
Entering edit mode
9.6 years ago
SES 8.6k

Here is a comparison of the two methods:

Here are the results:

$ perl biostars140255.pl 2>/dev/null
Benchmark: timing 10 iterations of regex, unpack...
     regex: 17 wallclock secs (15.48 usr +  1.44 sys = 16.92 CPU) @  0.59/s (n=10)
    unpack: 66 wallclock secs (29.63 usr + 34.99 sys = 64.62 CPU) @  0.15/s (n=10)
       s/iter unpack  regex
unpack   6.46     --   -74%
regex    1.69   282%     --

The benchmark shows that regex is about 6X faster than unpack. Also, there is no length limit (the input in the script is a 100 Mb string). Are you referring to a 32-bit system with those limits?

ADD COMMENT
1
Entering edit mode

How I see SES now:

image: dwight meme

ADD REPLY
0
Entering edit mode

I was looking for a mic drop gif but just gave up myself.

ADD REPLY
0
Entering edit mode

For your viewing pleasure: http://giphy.com/search/mic-drop/3

EDIT: Removing GIF so I don't encourage image uploads just for the sake of uploads.

ADD REPLY
0
Entering edit mode

Ha! I might have been a bit terse but I honestly wasn't sure until I did the benchmark so I wanted to share. Also, I use 5.20 in the script because there is an awesome optimization (called "copy on write") that is lightening fast for copying strings. Here is an example script:

my $x = "A" x 1_000_000;
my $y = $x for 1...1_000_000;

That makes a long string and copies it a bunch of times in memory. Super useful for genome analysis. With Perl version 5.10.1 that takes 250.8 seconds on my machine. With Perl 5.20.2, that takes 0.15 seconds!

ADD REPLY
0
Entering edit mode

I had more than 200k in a single fasta, regex worked good. Thanks

ADD REPLY
0
Entering edit mode

Okay, my bad - I shouldn't have mentioned speed in the first place. unpack can be faster, especially if I'm not too lazy to put print outside the loop. I am also sure, one could optimize things further...

'unpack' => sub {
    say STDERR join("\n", unpack("(A60)*", $seq));
}

         Rate  regex unpack
regex  1.74/s     --   -47%
unpack 3.27/s    88%     --

Regarding the limit of 32766, I got confused there. It only applies to a maximum size of the capture group and hasn't any effect here, unless you want your lines to be longer than 32 kbp ;)

ADD REPLY
0
Entering edit mode
9.6 years ago
Ram 44k

Either use a loop and introduce a new line characters after every 60 characters in the sequence, or use Picard's NormalizeFasta if you'd like a tried and tested method.

ADD COMMENT
0
Entering edit mode

You can also try cat *.fasta | fold > all.fasta if you know that the headers are less than 60 bp.

ADD REPLY
0
Entering edit mode
9.6 years ago
Prakki Rama ★ 2.7k
fasta_formatter -i input.fasta -o output.fasta -w 60

fasta_formatter: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fasta_formatter_usage

ADD COMMENT

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6