I wanted to extract fasta sequences using the wild card search
for example
grep "oxidase" input.fasta
The above statement provides all headers which match with oxidase and i had to extract the sequences as well.
For now how i was approaching is that i am writing above statements in to file and using some awk statements to remove unwanted characteres but with this approach, it becoming tiresome to execute all these commands as format keep changing. And using the file, i use below statement to extract sequences
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file
direct solution means that any script which takes wild card as input and provides output with sequences..
">testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein "
That was one of the header i have but the script works with fasta id only which is "testcomp2344c0_seq1 " . so i have to perform preprocessing to remove rest of it.
This limits the OP to searching a single file at a time.
You can match directly on $seq->id and $seq->description.
It's best to quotemeta on $string.
Given the above, consider the following:
use strict;
use warnings;
use Bio::SeqIO;
my $string = pop;
my $seqout = Bio::SeqIO->new( -format => 'Fasta', -fh => \*STDOUT );
for my $inFile (@ARGV) {
my $seqin = Bio::SeqIO->new( -format => 'Fasta', -file => $inFile );
while ( my $seq = $seqin->next_seq() ) {
if ( $seq->id =~ m/\Q$string\E/i
or $seq->description =~ m/\Q$string\E/i )
{
$seqout->write_seq($seq);
}
}
}
Since the search term is always the last item sent to the script, you can pop it off @ARGV (implied if no argument is passed to pop). File names are left in @ARGV, so the above iterates through each. Depending upon the file system, one could perl script.pl * oxidase on a dir of fasta files--or just write out the separate files.
Yes I assumed he would add the use strict/warnings and the shebang line. I was just trying to give the OP an idea of how to do it, to be adjusted to his exact wishes. Thanks for the observations though.
What do you mean by "direct solution"? (and what formats keep changing?)
direct solution means that any script which takes wild card as input and provides output with sequences..
">testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein "
That was one of the header i have but the script works with fasta id only which is "testcomp2344c0_seq1 " . so i have to perform preprocessing to remove rest of it.