Hi everyone,
I have 27000 fasta files (out_1 to out_27000), each of which contains only two sequences. One sequence ID starts with >AT (Arabidopsis) while the other doesn't start with >AT (another species). For all files, I need to keep Arabidopsis sequence (>AT... ) on top followed by non-Arabidopsis sequence (non >AT...).
I tried to write a perl script, but failed (no error message, but output files are uncorrect). I am really a perl beginner. My script is attached here. Could anyone please help to correct my script or maybe a Linux command for batch running? Thank you very much!! Note: these files are in multi-line sequence format.
#!/usr/bin/perl
use strict;
use warnings;
my $i;
for ($i=1; $i<=27000; $i++){
open my $file1, '<', "out_$i.afa"; #read files;
open my $file2, '>', "newout_$i.afa"; #output to new files;
my %h; #the following stores sequences into a hash;
local $/ = '>';
while (<$file1>) {
chomp;
/(.+?)\n(.+)/s and $h{$1} = $2 or next;
}
my $k=''; #the following runs hash to produce k1 k2 v1 v2;
my $k1='';
my $k2='';
my $v1='';
my $v2='';
foreach $k (keys %h){
if ($k =~/^\>AT/){
$k1=$k; $v1=$h{$k}; # k1 v1 are Arabidopsis;
}else{
$k2=$k; $v2 = $h{$k}; # k2 v2 are another species;
}
}
print $file2 ">$k1\n$v1\n>$k2\n$v2";
close $file1;
close $file2;
}
Why do you need this?
Pgibas, Thanks for your attention. I finally sorted out the problem myself. The condition
$k =~/^\>AT/
should be changed to$k =~/^AT/.
Actually these 27000 files are orthologous gene seqeunces of Arabidopsis thaliana and Arabidopsis lyrata. This is only an intermediate step. As the next steps, I need to merge all 27000 files into one file, which should be well ordered, i.e., each A.thaliana gene followed by its orthologous A.lyrata gene, and then subjected to miRNA target prediction. The input format is compulsory, that's why I need to make Athaliana sequence on the top followed by Alyrata sequence in each of these 27000 files.