Another option is to use Bio::SeqIO with a hash that maps each id to its longest description; the hash then acts as a filter when printing. This approach makes only two passes through the FASTA file and needs no sorting:
use strict;
use warnings;
use Bio::SeqIO;

my $file = shift or die "Usage: perl $0 inFile [>outFile]\n";
my ( %hash, %seen );

# Pass 0 records the longest description per id; pass 1 prints.
for my $i ( 0 .. 1 ) {
    my $in = Bio::SeqIO->new( -file => $file, -format => 'Fasta' );
    while ( my $seq = $in->next_seq() ) {
        my $id   = $seq->id;
        my $desc = $seq->desc // '';    # fall back to '' if no description is set
        if ( !$i ) {
            $hash{$id} = $desc
              if !defined $hash{$id} or length $desc > length $hash{$id};
        }
        else {
            # Print only the first record for each id, with its longest description.
            print ">$id $hash{$id}\n", $seq->seq, "\n" if !$seen{$id}++;
        }
    }
}
Usage: perl script.pl inFile [>outFile]
The optional >outFile is a shell redirection that sends the output to a file; without it, the records are printed to standard output.
Output on your dataset:
>Seq_1 this_is_the_description_of_seq1
ATGACCAAGAGATAGATAACG
>Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA
If you don't have access to the Bio::SeqIO module, here's a solution that produces the same output:
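use strict;
use warnings;

my $file = shift or die "Usage: perl $0 inFile [>outFile]\n";
my ( %hash, %seen );

local $/ = '>';    # read one FASTA record at a time

# Pass 0 records the longest description per id; pass 1 prints.
for my $i ( 0 .. 1 ) {
    push @ARGV, $file;    # re-queue the file so <> reads it again
    while (<>) {
        chomp;            # strip the trailing '>' record separator
        # Capture id, description, and the (possibly multi-line) sequence.
        next unless my ( $id, $desc, $seq ) = /(\S+)\s+([^\n]+)\s+(.+)/s;
        if ( !$i ) {
            $hash{$id} = $desc if !defined $hash{$id} or length $desc > length $hash{$id};
        }
        else {
            print ">$id $hash{$id}\n$seq" if !$seen{$id}++;
        }
    }
}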
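Although your example shows each sequence on a single line, both scripts also handle multi-line sequences, in case your actual dataset contains them. The Bio::SeqIO version prints each sequence joined on one line, while the second script keeps the original line breaks and expects every header to include a description.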
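If it helps to see the core idea on its own, here is a minimal sketch of the same two-pass logic applied to in-memory records instead of a file. The helper name dedupe_records and the sample records are made up for illustration (they are not from your dataset), and the sketch assumes each record is an [id, description, sequence] triple:

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [id, description, sequence] records and
# returns FASTA text with one record per id (first sequence, longest description).
sub dedupe_records {
    my @records = @_;
    my ( %longest, %seen );

    # Pass 1: remember the longest description seen for each id.
    for my $rec (@records) {
        my ( $id, $desc ) = @$rec;
        $longest{$id} = $desc
          if !defined $longest{$id} or length $desc > length $longest{$id};
    }

    # Pass 2: emit the first record of each id with that description.
    my $out = '';
    for my $rec (@records) {
        my ( $id, undef, $seq ) = @$rec;
        next if $seen{$id}++;
        $out .= ">$id $longest{$id}\n$seq\n";
    }
    return $out;
}

# Made-up records, not the dataset from the question.
print dedupe_records(
    [ 'Seq_A', 'short',                'ACGT' ],
    [ 'Seq_B', 'only description',     'TTTT' ],
    [ 'Seq_A', 'a longer description', 'ACGT' ],
);
```

This should print Seq_A once, with 'a longer description', followed by Seq_B. The two scripts above do the same thing, except that they re-read the file for the second pass rather than holding all records in memory.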
Hope this helps!
Are they always ordered like in your example and/or do they always have the same name (but a possibly different description)?