Question

Comparing fasta headers to set of 3 letter codes that is found in the header

0

Entering edit mode

10.3 years ago

pasanfernando87 • 0

Hi guys! I am a newbie to perl and I need help with finding list of fasta seuences which has specefic 3 letter code in the header. The question is explained below

Scenario: I have a fasta file which contains list of sequences as follows (test.fa)

>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761
GCAGCAGGGUGAAACUGACACA
>cfa-miR-764 MIMAT0009937 Canis familiaris miR-764
GGUGCUCACUUGUCCUCCU
>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c
UGAAGCUGCCAGCAUGAUCU

I have set of organism codes in a separate file (codes.txt)

lus
cel
cfa
...

what I want to do is search through test.fa for the codes and only print out the sequences which has that particular code in the header The example above contains only few sequences (not the entire file) So far I managed to create a hash and store the headers into keys and sequences into values. And I read through codes.txt and stored the codes in an array

The problem is when I use 'if exists' function to find whether each code in array exists in the hash here is my code,

#!/usr/bin/perl 

use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;
my $filename = 'report.fa';
# Reading the first file and store it into a hash
#the sequence header is stored in the hash key and sequence is stored in the value

my $FastaFile1 = Bio::SeqIO->new(-file => "test.fa", -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from \n";

my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
    unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
        my $k = $seqFile1->display_id."\t".$seqFile1->desc;
        $k =~ s/^\s+|\s+$//g;
        $fastaH1{$k} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.
    }
}

# printing the fasta headers
print "stored fasta headers:\n";
foreach my $key (keys %fastaH1){

        print  "$key\n";

    }

# reading the codes.txt file and creating the array
open my $file, '<', "codes.txt";
chomp(my @lines = <$file>);
close $file;

print "stored organism codes\n";
foreach (@lines) {
     print " $_\n";
 }
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
# if each code is found on the hash print match is found
 foreach my $line (@lines) {

    chomp $line;
   if(  exists $fastaH1{$line} ){
        print $fh"match found\n";
        print $fh">$line\n$fastaH1{$line}\n";
    }
}

close $fh;

The code should be printing found match with its sequences in a separate file. But it gives an empty file as the output

For example since lus is in the code set, it should output

>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c
UGAAGCUGCCAGCAUGAUCU

in the output. can you help me?

perl • 4.4k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by pasanfernando87 • 0

2

Entering edit mode

grep -A2 -E 'lus-|cel-|cfa-|any_other_code' test.fa > new_test.fa

should work provided that fasta sequences are contained within a single line which are in your case as you are working with mature miRNA sequences. But I would still recommend you to keep working on the perl code to enhance your programming skills. Thanks.

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.3 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I think your approach is better than OP's. Perl would need to pass the file multiple times for each query pattern, and that can never be as efficient as grep. Also, OP's subject file is single-line fasta. Couldn't get any UNIX friendlier!

ADD REPLY • link 3.8 years ago by Ram 44k

1

Entering edit mode

Just a comment. Storing in memory every sequence using its ID as key and sequence as value will not scale properly if you have millions of sequences that are relatively long. I suggest, read your IDs first, store them in a hash and parse your fasta file and if it exists in the hash, print the ID+sequence.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

I wonder why OP stores them in the first place. Isn't this a simple print-if-found scenario? Why use a hash at all?

ADD REPLY • link 2.9 years ago by Ram 44k

0

Entering edit mode

Still, JC's solution "stores" them in a string and he needs to search linearly in the IDs. A hash provides a logarithmic solution at a reasonable memory case.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

But we were discussing a case where the sequences number in the millions and/or the sequences are quite long, no? In which case JC's solution is better.

ADD REPLY • link 2.9 years ago by Ram 44k

0

Entering edit mode

The size of the input was unspecified. On average, using a more efficient algorithm from the get-go saves more time than modifying inefficient code later :-)

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

I agree. I have the solution you suggest implemented on my github repo - it is my go to solution. It is not really efficient in this case though, with pattern matching involved. We'd need a hash for the IDs and one for the patterns.

I think Python is better with its if X in array syntax in this case.

ADD REPLY • link 3.8 years ago by Ram 44k

0

Entering edit mode

Jumping on the discussion, my solution is intended is your "list" of elements is short, storing a long list as a string will slow a lot the RegEx search. Of course you can store the elements in a hash and do iteration per element testing for a matching pattern, but I'm not sure if that will be faster than Python's if X in array even when is equivalent.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by JC 13k

0

Entering edit mode

Just to improve on Ashutosh comment, assuming your fasta sequences are single line fasta format, you can try:

LC_ALL=C fgrep -A 1 -f codes.txt test.fa >new_test.fa

should work faster than normal grep.

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.3 years ago by Prakki Rama ★ 2.7k

Ram · Accepted Answer · 2014-10-14

2

Entering edit mode

10.2 years ago

JC 13k

Here is my solution without BioPerl and using RegEx-fu:

#!/usr/bin/perl
use strict;
use warnings;

$ARGV[2] or die "use extractMiRNAs.pl LIST FASTA OUTPUT\n";
my $list   = shift @ARGV;
my $fasta  = shift @ARGV;
my $out    = shift @ARGV;
my $search = '';

open (my $lh, "<", $list) or die;
while (<$lh>) {
    chomp;
    $search .= "$_|";
}
close $lh;
$search =~ s/\|$//; # delete the last pipe

open (my $oh, ">", $out) or die;
open (my $fh, "<", $fasta) or die;
$/ = "\n>"; # Fasta sequence slurp-mode
while (<$fh>) {
    s/>//g;
    print $oh ">$_" if (m/^($search)/);
}
close $fh;
close $oh;

ADD COMMENT • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by JC 13k

0

Entering edit mode

This is a good solution, but took me a while to understand the inner working. It's always the inevitable tradeoff between brilliant implementations and readability, no?

ADD REPLY • link 10.2 years ago by Ram 44k

2

Entering edit mode

I don't remember the reference exactly, but you can see a Perl code encrypted and decrypted and see no differences. ;)

ADD REPLY • link 10.2 years ago by JC 13k