It would be nice if you could try out suggested solutions and let us know which one performed best? I would be interested to see time comparison of solutions offered by Poe and Pierre Lindenbaum. Thanks
Put the FASTA sequences into a hash table, and print out rows from the matrix file if the motif field element is defined in the hash table:
#!/usr/bin/env perl
use strict;
use warnings;
my $fastaFn = $ARGV[0];
my $masterFn = $ARGV[1];
my $seqsRef;
my $header;
my $sequence;
open FASTA, "< $fastaFn" or die "could not open FASTA file\n";
while (<FASTA>) {
chomp;
if (/>/) {
$header = $_;
$header =~ s/^>//;
}
else {
$sequence = $_;
$seqsRef->{$sequence} = $header;
}
}
close FASTA;
open MASTER, "< $masterFn" or die "could not open master file for filtering\n";
my $ln = <MASTER>;
print STDOUT "$ln\n";
while (<MASTER>) {
chomp;
my @elems = split("\t", $_);
my $motif = $elems[0];
if (defined $seqsRef->{$motif}) {
print STDOUT "$_\n";
}
}
close MASTER;
The file myQuerySeqs.fa is your FASTA file. The myDataMatrix.mtx file is the "master" matrix file that you want to filter on sequences from the FASTA file. Output is sent to myFilteredMatrix.mtx.
This should be fairly fast, if memory-intensive, because hash table lookups are in constant time.
Not the most elegant, but this should do it (unless your Fasta file is too big for memory?)
from Bio import SeqIO
tags=[]
for seq_record in SeqIO.parse("in.fas", "fasta"):
if str(seq_record.seq) not in tags:
tags.append(str(seq_record.seq))
for tag in tags:
with open("2.txt","rU") as f:
for line in f:
line=line.rstrip()
if tag in line:
print line
It would be nice if you could try out suggested solutions and let us know which one performed best? I would be interested to see time comparison of solutions offered by Poe and Pierre Lindenbaum. Thanks