The sequences may not be on a single line, so the script needs to accommodate that. Given this, consider the following:
use strict;
use warnings;

my $seq;
while (<>) {
    if (/^>/) {                              # header line: print the sequence built so far
        print "$seq\n" if $seq;
        undef $seq;
        next;
    }
    chomp;
    $seq .= $_;                              # append this (possibly wrapped) sequence line
    print "$seq\n" if $seq and eof ARGV;     # print the last sequence at end of file
}
Usage: perl script.pl inFile [>outFile]
The optional redirection at the end sends the output to a file instead of the screen.
Output on your dataset:
ATGCTGTAGCGGAGGCATTATCGGCTA
TAGCGTATTTTAG
If a header line is detected, the current sequence is printed (if there is one), the sequence variable is undefined, and the next line is read. For non-header lines, each is chomped and added to the sequence variable. Finally, the sequence variable is printed if the next file read would encounter the end of the file.
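For illustration, an input along the following lines (a hypothetical reconstruction, since your actual dataset isn't reproduced here) would yield the output shown above:

>seq1
ATGCTGTAGCGGAGG
CATTATCGGCTA
>seq2
TAGCGTATTTTAG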
It's unclear to me what you mean by "save the sequences in different variables." However, you could push each sequence onto an array (after, for example, declaring my @seqArray): push @seqArray, $seq in both places where $seq is printed, and then process that array of sequences outside the while loop.
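For example, a minimal sketch of that approach (the name @seqArray is just an illustration):

use strict;
use warnings;

my ($seq, @seqArray);
while (<>) {
    if (/^>/) {
        push @seqArray, $seq if $seq;               # save the completed sequence
        undef $seq;
        next;
    }
    chomp;
    $seq .= $_;
    push @seqArray, $seq if $seq and eof ARGV;      # save the final sequence
}

# process the collected sequences outside the while loop
print "$_\n" for @seqArray;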
Hope this helps!
Is this a homework question?
Why not BioPerl? That would be the easiest way to solve this problem.
Don't use BioPerl when it is easy to write plain Perl. Read my answer to "How to efficiently parse a huge fastq file?". On FASTQ parsing, BioPerl can be tens of times slower than plain Perl.
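For context, a stripped-down plain-Perl FASTQ reader might look like the sketch below. It assumes well-formed four-line records (real FASTQ can wrap lines) and is not the code from the linked answer:

use strict;
use warnings;

while (my $header = <>) {    # '@'-prefixed read name
    my $seq  = <>;           # sequence line
    my $plus = <>;           # '+' separator line
    my $qual = <>;           # quality string
    chomp($header, $seq, $plus, $qual);
    # ... work with $header, $seq, and $qual here ...
}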
Yes--you're quite right! It's best to use Bio::SeqIO or the like in this situation (a rough sketch follows below), as it makes little sense to reinvent the wheel. Although many do simply enjoy reinventing the wheel... for some reason... (?) However, in this case the OP excluded BioPerl, so the wheel was reinvented. One nice result, though, is that the OP may have to wait around only ~10 milliseconds for the 'raw' Perl code to execute instead of a lengthy ~200 milliseconds for BioPerl. Oh, my...

Real-world data are frequently at the scale of GB in bioinfo. The difference between a fast C program and BioPerl is like 0.5 min vs. 1 hour, or occasionally 1 day vs. half a year. When you learn to program in bioinfo, you have to learn the right solution for large-scale data at the beginning. In addition, you have to reinvent simple wheels to learn to solve complex real problems. If you cannot parse FASTA with plain Perl, you will not be able to parse a lot of customized data formats not supported by BioPerl. If this is a homework question, the teacher is right in all ways to disallow BioPerl.
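For reference, the Bio::SeqIO approach mentioned above would look roughly like this sketch (assuming BioPerl is installed; variable names are illustrative):

use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
while (my $record = $in->next_seq) {
    print $record->seq, "\n";    # the sequence, already joined across wrapped lines
}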
The OP's case is simply not an issue of "real-world data," so an appeal to (eventually) using C for this purpose doesn't work. The choice of a programming language and algorithms is best done case-by-case, and shouldn't be informed, a priori, by a single construct (e.g., "speed"). One can't cogently (or rationally) argue for using a race car because of its "speed" if the current need is for a vehicle for city driving. Think case-by-case.
You said, "If this is a homework question, the teacher is right in all ways to disallow bioperl." This categorical statement expresses an alarming fundamentalism which effectively thwarts efforts to thoughtfully choose the best tool for the job at hand. Cultivating this attitude of a priori exclusion would inextricably hobble a student's ability to find the best programmatic solution(s) for a particular job, so should be avoided--not merely for the student's benefit, but employers greatly appreciate cognitive openness/adaptiveness. Additionally, if this particular problem arose within a learning environment, a cognitivist's pedagogy would support an instructional design which moves from simpler, module-based programming solutions (e.g., BioPerl) to more complex, individually-written solutions as an effective means toward developing a programmer's cognitive schema and programming-language proficiencies within students.
Over 20 years ago, I wrote different program segments in different languages: some parts in a high-level language, others in assembly because--in a particular case for a particular routine--assembly was the best choice. Think case-by-case.
As a side note, having to process GBs of data has nothing inherently to do with choosing a specific programming language for that job. Perl may be just as efficient as C for processing GBs of data, depending on how the data need to be processed, as disk I/O may actually render both languages equally 'fast' at the job. Think case-by-case. Perseverating on a single construct (e.g., "speed") is just too constraining.
If you're so interested in the execution speed of code, why are you not evangelizing for Fortran or Objective Caml? Perhaps the following would be an interesting read for you: The 'C is Efficient' Language Fallacy.
The best to you...