Hi guys! I am a newbie to Perl and I need help finding the FASTA sequences that have a specific 3-letter code in the header. The question is explained below.
Scenario: I have a FASTA file which contains a list of sequences as follows (test.fa):
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761
GCAGCAGGGUGAAACUGACACA
>cfa-miR-764 MIMAT0009937 Canis familiaris miR-764
GGUGCUCACUUGUCCUCCU
>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c
UGAAGCUGCCAGCAUGAUCU
I have a set of organism codes in a separate file (codes.txt):
lus
cel
cfa
...
What I want to do is search through test.fa for the codes and print only the sequences that have one of those codes in the header. The example above contains only a few sequences (not the entire file). So far I have managed to create a hash that stores the headers as keys and the sequences as values. I also read through codes.txt and stored the codes in an array.
The problem arises when I use 'exists' to check whether each code in the array exists in the hash. Here is my code:
#!/usr/bin/perl
use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;
my $filename = 'report.fa';
# Reading the first file and store it into a hash
#the sequence header is stored in the hash key and sequence is stored in the value
my $FastaFile1 = Bio::SeqIO->new(-file => "test.fa", -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from \n";
my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
    unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
        my $k = $seqFile1->display_id."\t".$seqFile1->desc;
        $k =~ s/^\s+|\s+$//g;
        $fastaH1{$k} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.
    }
}
# printing the fasta headers
print "stored fasta headers:\n";
foreach my $key (keys %fastaH1){
    print "$key\n";
}
# reading the codes.txt file and creating the array
open my $file, '<', "codes.txt";
chomp(my @lines = <$file>);
close $file;
print "stored organism codes\n";
foreach (@lines) {
    print " $_\n";
}
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
# if each code is found on the hash print match is found
foreach my $line (@lines) {
    chomp $line;
    if( exists $fastaH1{$line} ){
        print $fh "match found\n";
        print $fh ">$line\n$fastaH1{$line}\n";
    }
}
close $fh;
The code should print each matching header with its sequence to a separate file, but the output file it produces is empty.
For example, since lus is in the code set, it should output
>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c
UGAAGCUGCCAGCAUGAUCU
in the output. Can you help me?
should work provided that each FASTA sequence is contained on a single line, which is true in your case since you are working with mature miRNA sequences. But I would still recommend that you keep working on the Perl code to improve your programming skills. Thanks.
I think your approach is better than OP's. Perl would need to pass the file multiple times for each query pattern, and that can never be as efficient as grep. Also, OP's subject file is single-line fasta. Couldn't get any UNIX friendlier!
Just a comment. Storing every sequence in memory, with its ID as the key and the sequence as the value, will not scale well if you have millions of relatively long sequences. I suggest reading your IDs first and storing them in a hash, then parsing your FASTA file and printing the ID and sequence whenever the ID exists in the hash.
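The suggestion above can be sketched roughly as follows. Note that the 'exists' test in the original script finds nothing because the hash keys are whole header lines while the codes are only their 3-letter prefixes, so the two never compare equal. This is a minimal sketch, not a tested drop-in replacement: the sub name filter_fasta_by_code is hypothetical, it uses plain Perl instead of Bio::SeqIO, and it assumes the organism code is the part of the header between '>' and the first hyphen, as in the headers shown in test.fa.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the FASTA records whose header starts with one of the given codes.
# $fasta_fh : filehandle open on a FASTA file (single- or multi-line records)
# $codes    : array reference of organism codes, e.g. ['lus', 'cel']
sub filter_fasta_by_code {
    my ($fasta_fh, $codes) = @_;

    # Hash of wanted codes, so each header needs only one lookup.
    my %wanted = map { $_ => 1 } grep { length } @$codes;

    my $out  = '';
    my $keep = 0;
    while (my $line = <$fasta_fh>) {
        if ($line =~ /^>/) {
            # Assumption: the code is the text between '>' and the first '-'.
            my ($code) = $line =~ /^>([^-\s]+)-/;
            $keep = defined $code && exists $wanted{$code};
        }
        # Sequence lines inherit the decision made for their header.
        $out .= $line if $keep;
    }
    return $out;
}

# Demo on an in-memory copy of two records from test.fa.
my $demo = ">cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p\n"
         . "UGAGGUAGUAGGUUGUAUAGUU\n"
         . ">lus-miR167c MIMAT0027158 Linum usitatissimum miR167c\n"
         . "UGAAGCUGCCAGCAUGAUCU\n";
open my $fh, '<', \$demo or die "Cannot open in-memory FASTA: $!";
print filter_fasta_by_code($fh, ['lus']);
close $fh;
```

With real files, the codes would be read from codes.txt (chomped) and the filehandle opened on test.fa. Only the codes are held in memory; the sequences are read one line at a time.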
I wonder why OP stores them in the first place. Isn't this a simple print-if-found scenario? Why use a hash at all?
Still, JC's solution "stores" them in a string and he needs to search linearly through the IDs. A hash provides lookups in constant time on average, at a reasonable memory cost.
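To make the two options being compared concrete, here is a rough Perl sketch of each. It is not JC's actual code, which is not shown in this thread; the sub names matches_by_regex and matches_by_hash are hypothetical, and the header layout (code, then a hyphen) is assumed from test.fa. Which one is faster depends on the number of codes and would need benchmarking on real data.

```perl
use strict;
use warnings;

my @codes = qw(lus cel cfa);

# Option A: one alternation regex built from the code list.
# quotemeta guards against regex metacharacters inside a code.
my $alternation = join '|', map { quotemeta($_) } @codes;
my $code_re     = qr/^>(?:$alternation)-/;

# Option B: a hash of codes plus one capture per header.
my %wanted = map { $_ => 1 } @codes;

# Returns 1 if the header starts with any listed code, else 0.
sub matches_by_regex {
    my ($header) = @_;
    return $header =~ $code_re ? 1 : 0;
}

# Extracts the code before the first hyphen and looks it up in the hash.
sub matches_by_hash {
    my ($header) = @_;
    my ($code) = $header =~ /^>([^-\s]+)-/;
    return defined $code && exists $wanted{$code} ? 1 : 0;
}

for my $header ('>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761',
                '>hsa-let-7a-5p (hypothetical header, code not in the list)') {
    printf "%s regex=%d hash=%d\n",
        substr($header, 1, 3), matches_by_regex($header), matches_by_hash($header);
}
```

Both subs give the same answer for headers of this shape; the difference is only in how the code list is consulted.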
But we were discussing a case where the sequences number in the millions and/or the sequences are quite long, no? In which case JC's solution is better.
The size of the input was unspecified. On average, using a more efficient algorithm from the get-go saves more time than modifying inefficient code later :-)
I agree. I have the solution you suggest implemented in my GitHub repo, and it is my go-to solution. It is not really efficient in this case though, with pattern matching involved. We'd need one hash for the IDs and one for the patterns.
I think Python is better with its
if X in array
syntax in this case.
Jumping on the discussion: my solution is intended for when your "list" of elements is short; storing a long list as a string will slow the regex search a lot. Of course you can store the elements in a hash and iterate per element, testing for a matching pattern, but I'm not sure that will be faster than Python's
if X in array
even when it is equivalent.
Just to improve on Ashutosh's comment: assuming your FASTA sequences are in single-line FASTA format, you can try:
should work faster than normal grep.
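On the 'if X in array' point raised earlier in the thread: Perl has no identical syntax, but an exact membership test can be written in a couple of ways. The sketch below is illustrative only; the sub name in_array is hypothetical, and it tests whole-string equality, so it applies to bare codes rather than to full header lines.

```perl
use strict;
use warnings;

my @codes = qw(lus cel cfa);

# Linear scan: grep in scalar context counts the matching elements.
sub in_array {
    my ($item, @list) = @_;
    return scalar grep { $_ eq $item } @list;
}

# Hash lookup: build the hash once, then each test is a single exists.
my %is_code = map { $_ => 1 } @codes;

print "cfa via grep: ", (in_array('cfa', @codes) ? "yes" : "no"), "\n";
print "hsa via hash: ", (exists $is_code{hsa} ? "yes" : "no"), "\n";
```

The grep form scans the whole list on every test, while the hash form pays the setup cost once, which matters mainly when the same list is tested many times.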