Hi All,
I'm using Perl's Text::Wrap module to break a long DNA string into lines of 60 columns. The code is below. Although the results look right, I'm a bit sceptical about whether I'm doing it correctly, as I haven't used this module before.
use Text::Wrap;
$Text::Wrap::columns = 60;
my $str_60 =Text::Wrap::fill( '', '', join '', uc($longdna_string) );
print $str_60;
There may be a reason to use this module, for example to make your code more portable. Still, it seems a little silly to use a module for such a simple task, so I wanted to provide another solution. Here is a simple one (borrowing from Kenosis's answer) that does not use any modules.
#!/usr/bin/env perl
use strict;
use warnings;
my $longdna_string = <<END;
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
END
$longdna_string =~ s/(.{60})/$1\n/gs;
print $longdna_string;
Yours is a nice solution! However, consider the following minor modification. Instead of substituting entire strings, using the \K (keep) assertion (Perl v5.10+) allows for inserting only the newline at every 60th position, and then finally applying uc to the results, as the OP did on the original string.
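The code that accompanied this comment did not survive in the thread, so what follows is only a minimal sketch of the \K idea, not the commenter's original snippet. The variable names and the short repeated test sequence are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical stand-in for the long DNA string: 160 characters,
# lowercase, with no trailing newline.
my $seq = 'acgt' x 40;

# Copy the sequence, then match 60 characters at a time. \K discards
# what was matched so far from the part being replaced, so each
# substitution only inserts "\n" after the 60th character.
(my $wrapped = $seq) =~ s/.{60}\K/\n/g;

# Uppercase after the substitution, as suggested in the comment.
print uc($wrapped), "\n";
```

Note that when the sequence length is an exact multiple of 60, this also inserts a newline after the last character.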
+1 Thanks for the suggestion, that is a more elegant solution. By the way, why would you want to uc the string after the substitution? It's not clear to me how it would get modified.
You're most welcome! It seems that the OP wanted to ensure an uppercase sequence, so a uc'ed string was passed as a parameter to wrap. The print uc $longdna_string notation just uppercases the string after the substitution, yet produces the same final result.
We can see how Perl parses it by executing the following at the command line:
perl -MO=Deparse,-p -e 'print uc $longdna_string'
Output:
print(uc($longdna_string));
-e syntax OK
You'll note the nesting, where the results of uc are passed to print.
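The claim that uppercasing before or after the substitution gives the same final result can be checked with a small sketch. The test sequence and the variable names $before and $after are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical lowercase input, 160 characters long.
my $seq = 'acgt' x 40;

# Uppercase first, then insert a newline after every 60 characters.
(my $before = uc $seq) =~ s/(.{60})/$1\n/g;

# Insert the newlines first, then uppercase the wrapped string.
(my $after = $seq) =~ s/(.{60})/$1\n/g;
$after = uc $after;

# uc leaves newlines untouched, so the order does not matter.
print $before eq $after ? "same\n" : "different\n";
```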
I understand what it does, I was curious why you think it's necessary. In other words, why uppercase an uppercase string? This would be a more general solution but I can't figure out why it's necessary here. It would seem pretty unsettling if Perl was randomly changing case because, of course, that is very important in many different contexts.
Ah, my apologies for misunderstanding your question. The only reason I used uc was because the OP did, presumably for good reason. Certainly, if the original sequence were all uppercase, uc wouldn't be necessary.
FWIW, historically nucleotide sequences have been represented using lower case letters, and protein sequences using upper case. This provides a hint regarding the sequence type and helps avoid handling a sequence as the wrong type. This convention can be seen in the major databases:
Thank you, Hamish, for providing excellent context (with references) for the OP's original use of uc.
You can do the following to wrap the long DNA string at 60 columns using Text::Wrap:
use strict;
use warnings;
use Text::Wrap;
$Text::Wrap::columns = 61;    # lines are at most $columns - 1 characters, so 61 gives 60-column lines
my $longdna_string = <<END;
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
END
print wrap('', '', uc $longdna_string);
Written 11.2 years ago by Kenosis; updated 5.0 years ago by Ram.
Strangely, this does not produce the correct result. Those lines are 59 columns (not 60), but I can't say why that would be. That may be a bug in the module or something funny with the input (though I copied and pasted your code and it worked for me).
You're quite correct. Excellent eye! This is my error, as the module's documentation clearly says, "...every resulting line will have length of no more than $columns - 1." Thus, $Text::Wrap::columns in the code above should have been set to the desired width plus one (61, not the original 60), and this has been corrected.
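The $columns - 1 rule quoted above can be seen with a small sketch. The helper name line_lengths and the 160-character test sequence are assumptions made for illustration; only wrap and $Text::Wrap::columns come from the module itself.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap $seq with the given $columns setting and
# return the length of each resulting line.
sub line_lengths {
    my ($columns, $seq) = @_;
    local $Text::Wrap::columns = $columns;    # restored when the sub returns
    return map { length } split /\n/, wrap('', '', $seq);
}

# Hypothetical 160-character sequence with no whitespace or newline.
my $seq = 'ACGT' x 40;

my @with_60 = line_lengths(60, $seq);
my @with_61 = line_lengths(61, $seq);

print "columns = 60: @with_60\n";
print "columns = 61: @with_61\n";
```

With $columns set to 60 the full lines come out at 59 characters, which matches what was observed in the comment above; setting it to 61 gives full lines of 60.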
Interesting, I guess we should call that a 'feature' and not a bug then :). I would bet a lot of people have unknowingly done the same thing, since it is not exactly obvious. This is unrelated, but that module's copyright belongs to Google, which is not something I've seen a lot in the Perl world.
And here's that Copyright line:
Copyright (C) 1996-2009 David Muir Sharnoff. Copyright (C) 2012 Google, Inc.
David's been at Google since January 2011.